Generative Universal Verifier as Multimodal Meta-Reasoner
Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang, Guang Shi
Abstract
We introduce *Generative Universal Verifier*, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build **ViVerBench**, a comprehensive benchmark spanning $16$ categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train **OmniVerifier-7B**, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+$8.3$). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose **OmniVerifier-TTS**, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+$3.7$) and GenEval++ (+$4.3$), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
OmniVerifier provides universal visual verification for multimodal reasoning and introduces sequential test-time scaling for image generation and editing.
- Introduces ViVerBench, a comprehensive benchmark spanning 16 categories for evaluating visual outcomes in multimodal reasoning
- Designs OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, achieving notable gains on ViVerBench (+8.3)
- Proposes OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to enhance generative ability through iterative refinement
- Automated visual verification pipelines
- Test-time scaling
- Generative verification
- T2I-ReasonBench
- GenEval++
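The sequential test-time scaling loop behind OmniVerifier-TTS (generate once, then verify and edit iteratively rather than sampling many candidates in parallel as in Best-of-N) can be sketched as follows. This is a minimal illustration only: `generate`, `verify`, and `edit` are hypothetical stand-ins for the unified model's generation/editing calls and the OmniVerifier check, not APIs from the paper.

```python
# Hedged sketch of sequential test-time scaling: generate an image,
# verify it, and iteratively edit until the verifier accepts or the
# step budget runs out. All model calls are toy stubs for illustration.

def generate(prompt):
    # Stand-in for a unified multimodal model's text-to-image call.
    return {"prompt": prompt, "quality": 0.4}

def verify(prompt, image):
    # Stand-in for the generative verifier: returns (passed, feedback).
    passed = image["quality"] >= 0.9
    feedback = None if passed else "refine fine-grained details"
    return passed, feedback

def edit(image, feedback):
    # Stand-in for the unified model's editing call, guided by feedback.
    return {**image, "quality": image["quality"] + 0.2}

def sequential_tts(prompt, max_steps=5):
    """Generate once, then verify-and-edit sequentially until the
    verifier accepts the image or the step budget is exhausted."""
    image = generate(prompt)
    for _ in range(max_steps):
        passed, feedback = verify(prompt, image)
        if passed:
            break
        image = edit(image, feedback)
    return image

result = sequential_tts("a red cube to the left of a blue sphere")
```

The key design choice is that each edit targets the specific failure the verifier reports, so refinement is fine-grained and sequential, unlike parallel Best-of-N, which regenerates whole candidates and picks the best one.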
Limitations
- Certain tasks with a large domain gap (e.g., maze) generalize less effectively and require task-specific data
- OmniVerifier-TTS is sensitive to the distribution of images generated or edited by unified multimodal models; some models exhibit unusual behaviors under multi-step self-refinement
- Style artifacts are observed in some models (e.g., a yellowish color after iterative edits), though these do not compromise verification performance
Future work
- Improve OmniVerifier's generalization across diverse tasks through enhanced training and data construction strategies
- Enhance style consistency under multi-step self-refinement in unified multimodal models
Author keywords
- Multimodal Large Language Models
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.