MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science
Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi
MedAgentGym is a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in LLM agents.
Abstract
We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
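The abstract describes tasks wrapped in executable sandboxes that return interactive execution feedback and score agent code against verifiable ground truth. A minimal sketch of such a gym-style interaction loop is below; the `SandboxEnv`/`StepResult` names and the in-process `exec` are illustrative assumptions, not the actual MedAgentGym API (a real sandbox would use process isolation and resource limits).

```python
# Hypothetical sketch of the sandbox interaction loop described in the
# abstract: the agent submits code, the environment executes it, returns
# feedback, and emits a verifiable reward from ground-truth annotations.
from dataclasses import dataclass

@dataclass
class StepResult:
    stdout: str    # execution feedback shown back to the agent
    reward: float  # 1.0 if the output matches ground truth, else 0.0
    done: bool

class SandboxEnv:
    def __init__(self, task_spec: str, ground_truth: str):
        self.task_spec = task_spec
        self.ground_truth = ground_truth

    def reset(self) -> str:
        # Detailed task specification is the initial observation.
        return self.task_spec

    def step(self, code: str) -> StepResult:
        try:
            # Execute agent code in an isolated namespace; errors become
            # feedback for the next turn rather than terminating the run.
            scope: dict = {}
            exec(code, scope)
            out = str(scope.get("answer", ""))
        except Exception as e:
            return StepResult(stdout=f"Error: {e}", reward=0.0, done=False)
        ok = out == self.ground_truth
        return StepResult(stdout=out, reward=float(ok), done=ok)

env = SandboxEnv("Compute the mean age of the cohort [62, 70, 58], "
                 "rounded to 3 decimals.", "63.333")
obs = env.reset()
result = env.step("answer = round(sum([62, 70, 58]) / 3, 3)")
```

Multi-turn trajectories for offline or online RL would then be collected by looping `step` until `done`, logging each (observation, code, feedback, reward) tuple.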
MedAgentGym provides a scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with reinforcement learning.
- Introduces MedAgentGym with 72,413 task instances across 129 categories from 12 real-world biomedical scenarios
- Provides executable sandbox environments with detailed specifications, feedback mechanisms, and ground truth annotations
- Benchmarks 29 LLMs revealing performance disparities between commercial and open-source models
- Med-Copilot achieves performance gains of +43.02% (offline RL) and +45.28% (online RL), competitive with proprietary LLMs such as gpt-4o
- Reinforcement learning
- Multi-turn trajectory sampling
- Sandbox environments
- MIMIC-III
- eICU
- Biomedical datasets
Limitations
- Requires substantial computational resources for trajectory sampling and model fine-tuning
- Dataset size and trajectory collection constrained by computational budget rather than data availability
- Primarily supports text and structured data modalities; multimodal extensions needed
- Large-scale execution-verified single-run evaluation; more extensive uncertainty analyses needed

Future work
- Extend to multimodal biomedical data including medical imaging, EEG, audio, and video signals
- Conduct more extensive uncertainty analyses and multi-run evaluations with sufficient resources
Author keywords
- Medical Reasoning
- LLM Agent
- Code Generation
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.