MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science
Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi
MedAgentGym is a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in LLM agents.
Abstract
We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
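The abstract describes tasks wrapped in executable sandboxes that return interactive execution feedback and score agent code against verifiable ground truth. A minimal sketch of such a gym-style interaction loop is below; the `SandboxEnv`/`StepResult` names and the in-process `exec` are illustrative assumptions, not the actual MedAgentGym API (a real sandbox would use process isolation and resource limits).

```python
# Hypothetical sketch of the sandbox interaction loop described in the
# abstract: the agent submits code, the environment executes it, returns
# feedback, and emits a verifiable reward from ground-truth annotations.
from dataclasses import dataclass

@dataclass
class StepResult:
    stdout: str    # execution feedback shown back to the agent
    reward: float  # 1.0 if the output matches ground truth, else 0.0
    done: bool

class SandboxEnv:
    def __init__(self, task_spec: str, ground_truth: str):
        self.task_spec = task_spec
        self.ground_truth = ground_truth

    def reset(self) -> str:
        # Detailed task specification is the initial observation.
        return self.task_spec

    def step(self, code: str) -> StepResult:
        try:
            # Execute agent code in an isolated namespace; errors become
            # feedback for the next turn rather than terminating the run.
            scope: dict = {}
            exec(code, scope)
            out = str(scope.get("answer", ""))
        except Exception as e:
            return StepResult(stdout=f"Error: {e}", reward=0.0, done=False)
        ok = out == self.ground_truth
        return StepResult(stdout=out, reward=float(ok), done=ok)

env = SandboxEnv("Compute the mean age of the cohort [62, 70, 58], "
                 "rounded to 3 decimals.", "63.333")
obs = env.reset()
result = env.step("answer = round(sum([62, 70, 58]) / 3, 3)")
```

Multi-turn trajectories for offline or online RL would then be collected by looping `step` until `done`, logging each (observation, code, feedback, reward) tuple.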
MedAgentGym provides a scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with reinforcement learning.
- Introduces MedAgentGym with 72,413 task instances across 129 categories from 12 real-world biomedical scenarios
- Provides executable sandbox environments with detailed specifications, feedback mechanisms, and ground truth annotations
- Benchmarks 29 LLMs revealing performance disparities between commercial and open-source models
- Med-Copilot achieves performance gains of +43.02% (offline RL) and +45.28% (online RL), competitive with proprietary LLMs such as gpt-4o
- Reinforcement learning
- Multi-turn trajectory sampling
- Sandbox environments
- MIMIC-III
- eICU
- Biomedical datasets
Limitations
- Requires substantial computational resources for trajectory sampling and model fine-tuning
- Dataset size and trajectory collection constrained by computational budget rather than data availability
- Primarily supports text and structured data modalities; multimodal extensions needed
- Large-scale execution-verified single-run evaluation; more extensive uncertainty analyses needed

Future work
- Extend to multimodal biomedical data including medical imaging, EEG, audio, and video signals
- Conduct more extensive uncertainty analyses and multi-run evaluations with sufficient resources
Author keywords
- Medical Reasoning
- LLM Agent
- Code Generation
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.