ICLR 2026 Orals

MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi

LLMs & Reasoning · Fri, Apr 24 · 4:15 PM–4:25 PM · Room 203 A/B · Avg rating: 6.50 (range 4–8)
Author-provided TL;DR

MedAgentGym is a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in LLM agents.

Abstract

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating that MedAgentGym is an effective training ground and establishing Med-Copilot as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (e.g., gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
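The "multi-threaded and multi-turn trajectory sampling" described above can be pictured as a propose–execute–retry loop run in parallel across tasks. The sketch below is purely illustrative: `toy_sandbox`, `toy_agent`, and `sample_trajectory` are hypothetical names, not the MedAgentGym API, and the toy agent trivially "converges" after one round of feedback.

```python
import concurrent.futures

# Hypothetical minimal sketch (not the MedAgentGym API): a sandbox executes a
# candidate solution against verifiable ground truth, and an agent proposes a
# new candidate each turn using the accumulated execution feedback.

def toy_sandbox(task, code):
    """Run candidate 'code' for 'task'; return (success, feedback)."""
    success = code == task["answer"]
    feedback = "passed" if success else f"attempt failed for task {task['id']}"
    return success, feedback

def toy_agent(task, feedback_history):
    """Propose a candidate; here it 'learns' the answer after any feedback."""
    return task["answer"] if feedback_history else "draft-attempt"

def sample_trajectory(task, max_turns=3):
    """Multi-turn rollout: propose -> execute in sandbox -> observe -> retry."""
    trajectory, feedback_history = [], []
    for turn in range(max_turns):
        code = toy_agent(task, feedback_history)
        success, feedback = toy_sandbox(task, code)
        trajectory.append(
            {"turn": turn, "code": code, "success": success, "feedback": feedback}
        )
        if success:
            break
        feedback_history.append(feedback)
    return trajectory

def sample_all(tasks, workers=4):
    """Multi-threaded sampling: roll out trajectories across tasks in parallel."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sample_trajectory, tasks))

tasks = [{"id": i, "answer": f"solution-{i}"} for i in range(8)]
trajectories = sample_all(tasks)
# Execution-verified successful trajectories are the kind of data that could
# feed offline RL / fine-tuning; failed ones still carry feedback signal.
solved = [t for t in trajectories if t[-1]["success"]]
```

The key design point the paper's setup relies on is that the sandbox provides a verifiable success signal per turn, so trajectories can be filtered or rewarded automatically without human annotation.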

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

MedAgentGym provides a scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with reinforcement learning.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces MedAgentGym with 72,413 task instances across 129 categories from 12 real-world biomedical scenarios
  • Provides executable sandbox environments with detailed specifications, feedback mechanisms, and ground truth annotations
  • Benchmarks 29 LLMs revealing performance disparities between commercial and open-source models
  • Med-Copilot achieves 43% offline and 45% online RL gains, competitive with proprietary LLMs
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Reinforcement learning
  • Multi-turn trajectory sampling
  • Sandbox environments
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • MIMIC-III
  • eICU
  • Biomedical datasets
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Requires substantial computational resources for trajectory sampling and model fine-tuning
  • Dataset size and trajectory collection constrained by computational budget rather than data availability
  • Primarily supports text and structured data modalities; multimodal extensions needed
  • Evaluation is large-scale and execution-verified but single-run; more extensive uncertainty analyses are needed
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Extend to multimodal biomedical data including medical imaging, EEG, audio, and video signals
  • Conduct more extensive uncertainty analyses and multi-run evaluations with sufficient resources

Author keywords

  • Medical Reasoning
  • LLM Agent
  • Code Generation
