ICLR 2026 Orals

Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye

LLMs & Reasoning · Fri, Apr 24, 3:39 PM–3:49 PM · 202 A/B · Avg rating: 6.40 (range 6–8)

Abstract

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent fine-tuning. However, extending these language-based self-improving approaches to vision-language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely than when given a single VQA sample. Motivated by this observation, we propose the Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised fine-tuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models fine-tuned on SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. The code, dataset, and trained models will be released upon acceptance.

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

VC-STaR mitigates visual hallucinations through contrastive VQA pairs for self-improving visual reasoning.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Observation that VLMs identify visual cues more precisely when presented with contrastive VQA pairs
  • Self-improving framework leveraging visual contrast to mitigate hallucinations in model-generated rationales
  • VisCoR-55K dataset curated with contrastive pairs according to multi-modal similarity
  • Outperforms self-improving baselines and models trained on SoTA visual reasoning datasets
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Self-improving learning
  • Contrastive visual reasoning
  • VQA pairs
  • Supervised fine-tuning
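The listing does not specify how contrastive pairs are curated beyond "multi-modal similarity" (two visually similar images with synonymous questions). As a minimal sketch, assuming pairs are selected greedily by cosine similarity of precomputed image and question embeddings with thresholds on both modalities (the function name, thresholds, and greedy matching are illustrative assumptions, not the authors' method):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def curate_contrastive_pairs(image_embs, question_embs, img_thresh=0.9, q_thresh=0.9):
    """Hypothetical curation step: pair VQA samples whose image AND question
    embeddings are both highly similar, so each pair holds two visually close
    images with near-synonymous questions.

    image_embs, question_embs: lists of 1-D numpy arrays, one per VQA sample.
    Returns a list of index pairs (i, j); each sample is used at most once
    (greedy nearest-match assignment).
    """
    n = len(image_embs)
    used = set()
    pairs = []
    for i in range(n):
        if i in used:
            continue
        best_j, best_score = None, -1.0
        for j in range(i + 1, n):
            if j in used:
                continue
            img_sim = cosine(image_embs[i], image_embs[j])
            q_sim = cosine(question_embs[i], question_embs[j])
            # Require similarity in BOTH modalities: visually similar images
            # and synonymous questions, per the paper's definition of a pair.
            if img_sim >= img_thresh and q_sim >= q_thresh:
                score = img_sim + q_sim
                if score > best_score:
                    best_j, best_score = j, score
        if best_j is not None:
            used.update({i, best_j})
            pairs.append((i, best_j))
    return pairs
```

In this sketch, the resulting pairs would then be fed to the VLM together so that its contrastive ability surfaces the visual cues used to generate rationales; the embedding model and thresholds are left unspecified here because the listing does not name them.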
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • VisCoR-55K
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Reasoning
  • Vision-Language Models
  • Contrasting
