Generative Universal Verifier as Multimodal Meta-Reasoner
Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang, Guang Shi
Abstract
We introduce *Generative Universal Verifier*, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build **ViVerBench**, a comprehensive benchmark spanning $16$ categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train **OmniVerifier-7B**, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+$8.3$). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose **OmniVerifier-TTS**, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+$3.7$) and GenEval++ (+$4.3$), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
OmniVerifier provides universal visual verification for multimodal reasoning and introduces sequential test-time scaling for image generation and editing.
- Introduces ViVerBench, a comprehensive benchmark spanning 16 categories for evaluating visual outcomes in multimodal reasoning
- Designs OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, achieving notable gains on ViVerBench (+8.3)
- Proposes OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to enhance generative ability through iterative refinement
- Automated visual verification pipelines
- Test-time scaling
- Generative verification
- T2I-ReasonBench
- GenEval++
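The sequential test-time scaling loop behind OmniVerifier-TTS (generate once, then verify and edit iteratively rather than sampling many candidates in parallel as in Best-of-N) can be sketched as follows. This is a minimal illustration only: `generate`, `verify`, and `edit` are hypothetical stand-ins for the unified model's generation/editing calls and the OmniVerifier check, not APIs from the paper.

```python
# Hedged sketch of sequential test-time scaling: generate an image,
# verify it, and iteratively edit until the verifier accepts or the
# step budget runs out. All model calls are toy stubs for illustration.

def generate(prompt):
    # Stand-in for a unified multimodal model's text-to-image call.
    return {"prompt": prompt, "quality": 0.4}

def verify(prompt, image):
    # Stand-in for the generative verifier: returns (passed, feedback).
    passed = image["quality"] >= 0.9
    feedback = None if passed else "refine fine-grained details"
    return passed, feedback

def edit(image, feedback):
    # Stand-in for the unified model's editing call, guided by feedback.
    return {**image, "quality": image["quality"] + 0.2}

def sequential_tts(prompt, max_steps=5):
    """Generate once, then verify-and-edit sequentially until the
    verifier accepts the image or the step budget is exhausted."""
    image = generate(prompt)
    for _ in range(max_steps):
        passed, feedback = verify(prompt, image)
        if passed:
            break
        image = edit(image, feedback)
    return image

result = sequential_tts("a red cube to the left of a blue sphere")
```

The key design choice is that each edit targets the specific failure the verifier reports, so refinement is fine-grained and sequential, unlike parallel Best-of-N, which regenerates whole candidates and picks the best one.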
Limitations
- Certain tasks with a large domain gap (e.g., maze) generalize less effectively and require task-specific data
- OmniVerifier-TTS is sensitive to the distribution of images generated or edited by unified multimodal models; some models exhibit unusual behaviors under multi-step self-refinement
- Style artifacts are observed in some models (e.g., a yellowish color after iterative edits), though these do not compromise verification performance
Future work
- Improve OmniVerifier's generalization across diverse tasks through enhanced training and data construction strategies
- Enhance style consistency under multi-step self-refinement in unified multimodal models
Author keywords
- Multimodal Large Language Models
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.