WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, Han Hu
A meta-evaluation benchmark for assessing LLM-as-a-judge in the context of web development.
Abstract
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains largely unexplored. To bridge this gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation in a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate a range of evaluators, including LLMs, MLLMs, and agentic workflows, and systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates that this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward more reliable and capable automated evaluators for complicated scenarios.
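The meta-evaluation protocol the abstract describes can be illustrated with a minimal sketch: given human preference labels over paired web implementations and a judge's verdicts on the same pairs, score the judge by its agreement with the humans. The function and label names below are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of WebDevJudge-style meta-evaluation: compare a judge's
# pairwise verdicts against human preference labels. Names are illustrative,
# not the benchmark's actual interface.

def judge_agreement(human_labels, judge_verdicts):
    """Fraction of paired comparisons where the judge matches the human label.

    Each label/verdict is 'A', 'B', or 'tie', indicating which of two web
    implementations better satisfies the query.
    """
    if len(human_labels) != len(judge_verdicts):
        raise ValueError("label lists must be the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_verdicts))
    return matches / len(human_labels)

# Example: a judge agreeing on 3 of 4 pairs scores 0.75.
humans = ["A", "B", "tie", "A"]
judge = ["A", "B", "A", "A"]
print(judge_agreement(humans, judge))  # 0.75
```

In practice a benchmark like this would also stratify agreement by evaluation mode (static observation vs. interactive) and by rubric category, but the core quantity is this per-pair agreement with human ground truth.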
Author keywords
- large language models
- evaluation
- LLM-as-a-judge
- benchmark
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes a Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower-variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes the T3 algorithm, which detects belief deviation in LLM agents and truncates deviating trajectories to improve reinforcement learning on active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.