WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, Han Hu
A meta-evaluation benchmark for assessing LLM-as-a-judge in the context of web development.
Abstract
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains largely unexplored. To bridge this gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation in a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate a range of evaluators, including LLMs, MLLMs, and agentic workflows, and systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates that this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward more reliable and capable automated evaluators for complicated scenarios.
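The meta-evaluation protocol the abstract describes can be illustrated with a minimal sketch: given human preference labels over paired web implementations and a judge's verdicts on the same pairs, score the judge by its agreement with the humans. The function and label names below are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of WebDevJudge-style meta-evaluation: compare a judge's
# pairwise verdicts against human preference labels. Names are illustrative,
# not the benchmark's actual interface.

def judge_agreement(human_labels, judge_verdicts):
    """Fraction of paired comparisons where the judge matches the human label.

    Each label/verdict is 'A', 'B', or 'tie', indicating which of two web
    implementations better satisfies the query.
    """
    if len(human_labels) != len(judge_verdicts):
        raise ValueError("label lists must be the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_verdicts))
    return matches / len(human_labels)

# Example: a judge agreeing on 3 of 4 pairs scores 0.75.
humans = ["A", "B", "tie", "A"]
judge = ["A", "B", "A", "A"]
print(judge_agreement(humans, judge))  # 0.75
```

In practice a benchmark like this would also stratify agreement by evaluation mode (static observation vs. interactive) and by rubric category, but the core quantity is this per-pair agreement with human ground truth.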
Author keywords
- large language models
- evaluation
- LLM-as-a-judge
- benchmark
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes a Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower-variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes the T3 algorithm, which detects belief deviation in LLM agents and truncates deviating trajectories to improve reinforcement learning on active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.