FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, Yixin Cao

LLMs & Reasoning · Fri, Apr 24 · 3:27 PM–3:37 PM · 204 A/B · Avg rating: 5.50 (4–8)

Abstract

Evaluating the open-ended outputs of Multimodal Large Language Models (MLLMs) has become a bottleneck as model capabilities, task diversity, and modalities rapidly expand. Existing "MLLM-as-a-Judge" evaluators, though promising, remain constrained to specific tasks and aspects (i.e., specific evaluation criteria, such as fluency for text and image quality for images). In this paper, we argue that, on the one hand, because criteria are interconnected, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual criteria and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) we first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the four tasks. (2) Based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations. (3) Finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and that jointly learning to assess diverse visual tasks and aspects can lead to substantial mutual benefits.
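
To make the idea of aspect-level pairwise labels concrete, here is a minimal sketch in Python. It is not FRABench's actual format: the `PairwiseSample` schema, its field names, the "A"/"B"/"tie" label values, the "coherence" aspect, and the `aspect_agreement` helper are all hypothetical and chosen only for illustration. The sketch shows one pairwise sample carrying several aspect labels, and how an evaluator's predictions could be compared against reference labels aspect by aspect.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical schema: field names and label values are assumptions,
# not the dataset's real format.
@dataclass
class PairwiseSample:
    task: str          # e.g. "NLG" or "Image Understanding"
    prompt: str
    response_a: str
    response_b: str
    # aspect name -> preferred response ("A", "B", or "tie")
    labels: Dict[str, str] = field(default_factory=dict)


def aspect_agreement(
    samples: List[PairwiseSample],
    predictions: List[Dict[str, str]],
) -> Dict[str, float]:
    """Per aspect, the fraction of reference labels that the predictions match."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        for aspect, reference in sample.labels.items():
            totals[aspect] += 1
            if predicted.get(aspect) == reference:
                hits[aspect] += 1
    return {aspect: hits[aspect] / totals[aspect] for aspect in totals}


# Toy data, invented for this example.
samples = [
    PairwiseSample("NLG", "Summarize the article.", "summary A", "summary B",
                   {"fluency": "A", "coherence": "B"}),
    PairwiseSample("NLG", "Write a haiku.", "haiku A", "haiku B",
                   {"fluency": "B"}),
]
predictions = [{"fluency": "A", "coherence": "A"}, {"fluency": "B"}]
scores = aspect_agreement(samples, predictions)
```

Each sample can carry more than one aspect label, which is consistent with the reported counts: 325k labels over 60.4k pairwise samples works out to roughly 5.4 labels per sample on average.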

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.

Contributions · Auto-generated by claude-haiku-4-5-20251001

FRABench: 60.4k pairwise samples with 325k labels across 112 aspects and 4 evaluation tasks
Hierarchical aspect taxonomy for NLG, image understanding, image generation, and interleaved text-image generation
UFEval: a unified evaluator whose aspect and task generalization enables evaluation on unseen aspects
Automatic high-quality preference pair dataset construction for DPO alignment training

Methods used · Auto-generated by claude-haiku-4-5-20251001

Multi-modal evaluation
Aspect-level assessment
Task generalization
MLLM-as-judge

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

UFEval has limited performance on image generation tasks due to insufficient visual semantic understanding
Benchmark covers four tasks but additional task types may be beneficial

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Incorporate video understanding and generation tasks into evaluation system
Add corresponding aspects to aspect tree

Author keywords

Aspect-level Evaluation Dataset
Unified Fine-grained Evaluation

