EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue
We propose a new benchmark for evaluating an LLM's ability to perform code edits. Our data is gathered from in-the-wild code edits, leading to more realistic problems.
Abstract
Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EditBench comprises 545 problems spanning multiple natural and programming languages and a diverse set of real-world use cases, ranging from resolving errors to adding features. EditBench introduces context-dependent problems that require the model to understand the code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EditBench is a challenging set of problems on which only 3 models score over 60%. We find that model performance varies across different categories of user instructions. Further, varying the level of contextual information greatly affects task success rates, with performance shifting by up to 11%, indicating the importance of evaluating with realistic context.
Introduces EditBench benchmark for real-world LLM code editing with 545 problems from actual developer usage.
- Creates EditBench, a benchmark grounded in real-world user instructions and code contexts collected in the wild
- Introduces context-dependent problems requiring understanding of code context, highlighting, and cursor position
- Evaluates 40 diverse LLMs showing only 3 score over 60%, indicating benchmark difficulty
- Demonstrates performance varies across edit categories and contextual information greatly affects success
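As a rough illustration of the context-dependent problems described above, one could imagine each benchmark instance bundling the user instruction with the surrounding editor state. The sketch below is a hypothetical representation; the class name `EditProblem`, its fields, and the offset-based highlight encoding are assumptions for illustration, not the paper's actual data schema.

```python
from dataclasses import dataclass


@dataclass
class EditProblem:
    """One instructed-edit problem: a model must rewrite `code`
    according to `instruction`, given the editor state below.
    (Hypothetical schema, not EditBench's actual format.)"""
    code: str                    # full file contents shown to the model
    instruction: str             # natural-language edit request
    highlight: tuple[int, int]   # (start, end) character offsets of the user's selection
    cursor: int                  # cursor position as a character offset
    language: str                # programming language of the file

    def highlighted_code(self) -> str:
        """Return the span the user had selected when issuing the instruction."""
        start, end = self.highlight
        return self.code[start:end]


# Example: the user selects a buggy expression and asks for a fix.
problem = EditProblem(
    code="def area(r):\n    return 3.14 * r\n",
    instruction="Use math.pi and square the radius",
    highlight=(24, 32),
    cursor=32,
    language="python",
)
```

Under this framing, ablating context amounts to dropping fields (e.g., hiding `highlight` or `cursor`) before prompting the model, which is one plausible way to produce the performance variation the abstract reports.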
Limitations (from the paper)
- Despite diversity efforts, it is unclear whether the problems encapsulate all real-world use cases
- Limited to existing languages; needs to expand to additional common programming languages

Future work (from the paper)
- Continue collecting data using the VS Code extension to increase examples in existing languages
- Expand to additional common programming languages
- Continue updating the leaderboard as new models are released
- Explore automatic workflows for translating real-world data into benchmark problems
Author keywords
- code
- real-world
- llm
- code edit
- edit
Related orals
On the Wasserstein Geodesic Principal Component Analysis of probability measures
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
TabStruct: Measuring Structural Fidelity of Tabular Data
TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions using global utility metric without ground-truth causal structures.
Monocular Normal Estimation via Shading Sequence Estimation
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
TTSDS2 metric robustly correlates with human judgments for TTS evaluation across diverse speech domains maintaining >0.5 Spearman correlation.
World-In-World: World Models in a Closed-Loop World
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.