ICLR 2026 Orals

FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong, Jiahao Ying, Haiyuan Liang, Mengdi Zhang, Jun Kuang, Jiazheng Zhang, Yixin Cao

LLMs & Reasoning Fri, Apr 24 · 3:27 PM–3:37 PM · 204 A/B Avg rating: 5.50 (4–8)

Abstract

Evaluating open-ended outputs of Multimodal Large Language Models (MLLMs) has become a bottleneck as model capabilities, task diversity, and modalities rapidly expand. Existing "MLLM-as-a-Judge" evaluators, though promising, remain constrained to specific tasks and aspects (i.e., specific evaluation criteria such as fluency for text and image quality for images). In this paper, we argue that, on the one hand, because evaluation criteria are interconnected, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual criteria and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization, covering four evaluation tasks: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) we first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the four tasks above; (2) based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations; (3) finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and that jointly learning to assess diverse visual tasks and aspects yields substantial mutual benefits.
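To make the data format concrete, the sketch below shows what one aspect-level pairwise sample from the abstract (a prompt, two candidate outputs, one aspect, one preference label) might look like. All field and class names here are illustrative assumptions, not the released FRABench schema.

```python
from dataclasses import dataclass

# Hypothetical record for one aspect-level pairwise comparison.
# Field names are assumptions for illustration, not the paper's actual schema.
@dataclass
class PairwiseSample:
    task: str        # one of the four tasks, e.g. "NLG" or "ImageGeneration"
    aspect: str      # one of the 112 taxonomy aspects, e.g. "fluency"
    prompt: str      # instruction given to the evaluated model
    response_a: str  # first candidate output (text, or a reference to an image)
    response_b: str  # second candidate output
    label: str       # preference under this aspect: "A", "B", or "tie"

sample = PairwiseSample(
    task="NLG",
    aspect="fluency",
    prompt="Summarize the article in two sentences.",
    response_a="The article argue that the method work well.",
    response_b="The article argues that the method works well.",
    label="B",  # response_b is more fluent under this aspect
)
print(sample.aspect, sample.label)
```

Because each sample is judged per aspect, one pairwise comparison can carry several labels (hence 325k labels over 60.4k pairs), which is why the aspect is part of the record rather than implied by the task.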

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • FRABench: 60.4k pairwise samples with 325k labels across 112 aspects and 4 evaluation tasks
  • Hierarchical aspect taxonomy for NLG, image understanding, image generation, and interleaved text-image generation
  • UFEval demonstrating aspect and task generalization enabling evaluation on unseen aspects
  • Automatic high-quality preference pair dataset construction for DPO alignment training
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Multi-modal evaluation
  • Aspect-level assessment
  • Task generalization
  • MLLM-as-judge
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • UFEval has limited performance on image generation tasks due to insufficient visual semantic understanding
  • Benchmark covers four tasks but additional task types may be beneficial
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Incorporate video understanding and generation tasks into the evaluation system
  • Add corresponding aspects to the aspect tree

Author keywords

  • Aspect-level Evaluation Dataset
  • Unified Fine-grained Evaluation
