AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi Mishra, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefanus Candra, Jason Dunkelberger, Daniel Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Ngoc-Uyen Nguyen, Smita Rao, Amber Tanaka, Brooke Vlahos, Peter Clark, Doug Downey, Yoav Goldberg, Ashish Sabharwal, Daniel S Weld
We present principles and tooling for rigorous AI agent benchmarking, instantiated in AstaBench—the first holistic measure of agentic ability for scientific research—plus experiments showing AI remains far from solving research assistance.
Abstract
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.
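As an illustration of point (3) above, a standardized interface for agent prototyping and evaluation could be as small as the following Python sketch. Every name here (`Task`, `ToolCall`, `Agent`, `evaluate`) is hypothetical and does not reflect AstaBench's actual API; the point is only that all agents share one contract and that tool costs are logged for later accounting.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a standardized agent-evaluation interface of the
# kind the paper argues for; these names are illustrative, not AstaBench's API.

@dataclass
class ToolCall:
    tool: str        # e.g. "paper_search"
    args: dict
    cost_usd: float  # cost attributed to this call, for confounder accounting

@dataclass
class Task:
    task_id: str
    prompt: str
    grade: Callable[[str], float]  # maps an answer to a score in [0, 1]

class Agent:
    """Minimal contract: consume a task prompt plus a fixed tool set,
    return an answer, and log every tool call for cost accounting."""
    def run(self, task: Task, tools: dict[str, Callable]) -> tuple[str, list[ToolCall]]:
        raise NotImplementedError

def evaluate(agent: Agent, tasks: list[Task], tools: dict[str, Callable]) -> dict:
    """Run one agent over a benchmark and report score alongside cost."""
    scores, total_cost = [], 0.0
    for task in tasks:
        answer, calls = agent.run(task, tools)
        scores.append(task.grade(answer))
        total_cost += sum(c.cost_usd for c in calls)
    return {"mean_score": sum(scores) / len(tasks), "total_cost_usd": total_cost}
```

Because every agent implements the same `run` signature and every tool call is priced, any two agents can be placed on a common score-versus-cost frontier rather than compared on accuracy alone.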
Presents AstaBench, a comprehensive benchmark suite with production-grade tools for rigorous evaluation of AI agents on scientific research tasks.
- First major agent benchmark with standardized environment and realistic, controlled search tools enabling reproducible agent comparison
- Asta Environment, the first scientific research environment with production-grade search tools and a controlled document corpus
- Comprehensive suite of 9 science-optimized agent classes and baselines; evaluation of 57 agents across 22 architectural classes
- Benchmark design
- Agent evaluation
- Search tools
- LLM-as-judge grading
Limitations
- Problems are primarily drawn from the computer science domain and are weighted toward literature-understanding tasks
- The standard tools in the benchmark may not fully reflect the diversity of specialized research domains
- Using the suite requires defining a benchmark suite and evaluating agents against it; time-invariant cost accounting is important for fair comparison (see the sketch below)
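The cost-accounting point can be made concrete with a small sketch. This is an illustrative assumption about how time-invariant accounting might work, not the paper's actual method: token usage is logged per model and priced against a frozen table, so comparisons remain stable even if provider prices later change. All model names and prices below are placeholders.

```python
# Hypothetical sketch of time-invariant cost accounting: record raw token
# usage per model, then price runs against a frozen price table so that
# agent comparisons stay valid even after providers change their prices.
# The figures below are placeholders, not real rates.

FROZEN_PRICES_USD_PER_1M_TOKENS = {
    # model name: (input price, output price) -- illustrative values only
    "model-a": (3.00, 15.00),
    "model-b": (0.25, 1.25),
}

def run_cost_usd(usage: list[tuple[str, int, int]]) -> float:
    """usage: one (model, input_tokens, output_tokens) triple per LLM call."""
    total = 0.0
    for model, tok_in, tok_out in usage:
        p_in, p_out = FROZEN_PRICES_USD_PER_1M_TOKENS[model]
        total += tok_in * p_in / 1e6 + tok_out * p_out / 1e6
    return total

# Two agents solving the same task can now be compared on a stable
# score-versus-cost frontier:
print(run_cost_usd([("model-a", 12_000, 800), ("model-b", 40_000, 2_000)]))
```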
Future directions
- Actively pushing performance-cost frontiers and closing the gap for truly open agents through new agent techniques and tools
- Enhancing agents' abilities to manage complex context and handle long-duration tasks in complex research projects
- Refining LLM-as-a-judge grading procedures, especially for challenging scientific discovery tasks (a minimal sketch follows this list)
- Developing fresh benchmark problems that use the latest scientific knowledge, are contamination-resistant, and postdate model training cutoffs
- Building benchmarks that test collaboration with humans and deepening coverage in impactful fields such as biomedicine
- Continuing to measure the latest advances through new LLMs and additional agent architectures
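As a hedged illustration of the LLM-as-a-judge grading mentioned above, the sketch below shows one common pattern: prompt a judge model for an integer score and normalize it. `call_llm` is a placeholder for any chat-completion client, not a real library call, and this is not necessarily AstaBench's grading procedure.

```python
# Hypothetical LLM-as-judge grading step; `call_llm` stands in for any
# chat-completion client and is not a real library function.

JUDGE_PROMPT = """You are grading a research-assistant answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer score from 0 (wrong) to 10 (fully correct)."""

def judge_score(call_llm, question: str, reference: str, candidate: str) -> float:
    """Return a normalized score in [0, 1]; fall back to 0.0 when the
    judge's reply cannot be parsed as an integer."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    try:
        return max(0, min(10, int(reply.strip()))) / 10.0
    except ValueError:
        return 0.0
```

Open challenges with this pattern, which motivate the refinement work above, include judge bias toward verbose answers and unreliable scoring on open-ended discovery tasks.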
Author keywords
- Agents
- evaluation
- benchmarks
- scientific research