Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye

LLMs & Reasoning · Fri, Apr 24 · 3:39 PM–3:49 PM · 202 A/B · Avg rating: 6.40 (6–8)

Abstract

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely than when given a single VQA sample. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised fine-tuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models fine-tuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset, and trained models will be released upon acceptance.
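
The abstract mentions curating contrastive pairs "according to multi-modal similarity" without giving details here. The following is a minimal sketch of one way such a filter could look, not the authors' implementation: it assumes each VQA sample already carries an image embedding and a question embedding, and it keeps a pair only when both cosine similarities clear a threshold. The function name `curate_contrastive_pairs`, the field names, and the 0.8 thresholds are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def curate_contrastive_pairs(samples, image_threshold=0.8, question_threshold=0.8):
    """Return id pairs whose images AND questions are both similar.

    Assumption: `samples` is a list of dicts with hypothetical keys
    "id", "image_emb", and "question_emb". The thresholds are
    illustrative values, not taken from the paper.
    """
    pairs = []
    for a, b in combinations(samples, 2):
        image_sim = cosine(a["image_emb"], b["image_emb"])
        question_sim = cosine(a["question_emb"], b["question_emb"])
        if image_sim >= image_threshold and question_sim >= question_threshold:
            pairs.append((a["id"], b["id"]))
    return pairs


# Toy embeddings: "a" and "b" are close in both modalities; "c" shares a
# similar question but its image embedding points elsewhere.
samples = [
    {"id": "a", "image_emb": [1.0, 0.0], "question_emb": [0.9, 0.1]},
    {"id": "b", "image_emb": [0.9, 0.1], "question_emb": [1.0, 0.0]},
    {"id": "c", "image_emb": [0.0, 1.0], "question_emb": [1.0, 0.0]},
]

pairs = curate_contrastive_pairs(samples)
print(pairs)
```

On the toy data only the pair of "a" and "b" passes both checks. The all-pairs loop is quadratic in the number of samples, so a real pipeline over a large VQA collection would likely need an approximate nearest-neighbor index instead.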

One-sentence summary (auto-generated by claude-haiku-4-5-20251001)

VC-STaR mitigates visual hallucinations through contrastive VQA pairs for self-improving visual reasoning.

Contributions (auto-generated by claude-haiku-4-5-20251001)

Observation that VLMs identify visual cues more precisely when presented with contrastive VQA pairs
Self-improving framework leveraging visual contrast to mitigate hallucinations in model-generated rationales
VisCoR-55K dataset curated with contrastive pairs according to multi-modal similarity
Outperforms self-improving baselines and models trained on SoTA visual reasoning datasets

Methods used (auto-generated by claude-haiku-4-5-20251001)

Self-improving learning
Contrastive visual reasoning
VQA pairs
Supervised fine-tuning

Datasets used (auto-generated by claude-haiku-4-5-20251001)

VisCoR-55K

Limitations (author-stated; auto-generated by claude-haiku-4-5-20251001)

Authors did not state explicit limitations.

Future work (author-stated; auto-generated by claude-haiku-4-5-20251001)

Authors did not state explicit future directions.

Author keywords

Reasoning
Vision-Language Models
Contrasting

