Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, Jian Zhang
Abstract
Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in certain scenarios. Through extensive experiments, this paper verifies that RL training leads MLLMs to leverage their reasoning capability to convert redundant visual representations into compact, cross-domain-aligned text representations, and that this conversion is precisely the source of the generalization exhibited by reasoning-based IQA models. Building on this insight, we propose a novel algorithm, RALI, which employs contrastive learning to align images directly with the generalizable text representations learned through RL. This approach eliminates the reliance on a reasoning process and even obviates the need to load an LLM. For the quality scoring task, the framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
The RALI framework aligns images to the text representations of reasoning MLLMs via contrastive learning, achieving comparable image quality assessment performance with <5% of the parameters.
- Finding that reasoning MLLM generalization in IQA stems from compressing visual information into descriptive text
- RACT framework for addressing divergent data distributions through text-image alignment
- RALI lightweight framework matching reasoning MLLM performance with 0.3B parameters and no LLM loading
- Contrastive learning
- Image-text alignment
- PCA
- K-means clustering
- Reinforcement learning
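The image-text alignment at the core of RALI can be sketched as a symmetric InfoNCE objective, the standard CLIP-style contrastive formulation: each image embedding is pulled toward its paired text embedding and pushed away from the others in the batch. This is a minimal generic sketch, not the paper's exact implementation; the function name and toy data are illustrative.

```python
import numpy as np

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th image should match the i-th text."""
    # L2-normalize both embedding batches so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, sharpened by the temperature
    logits = img @ txt.T / temperature          # shape (N, N)

    def cross_entropy_to_diagonal(l):
        # Log-softmax over each row, numerically stabilized
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # Correct match for row i is column i
        return -np.mean(np.diag(logp))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_to_diagonal(logits)
                  + cross_entropy_to_diagonal(logits.T))

# Toy check: correctly paired embeddings score a lower loss than shuffled pairs
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = contrastive_alignment_loss(emb, emb)
shuffled = contrastive_alignment_loss(emb, emb[::-1])
print(aligned < shuffled)
```

In a CLIP-style setup the image embeddings come from a trainable vision encoder while, per the paper's idea, the text side would be the fixed quality descriptions distilled from the reasoning MLLM.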
Limitations
- Performance ceiling constrained by the representational and reasoning capacity of the CLIP vision encoder
- Experiments primarily target natural-image IQA, though the approach is extensible to video and AIGC quality assessment
Future work
- Explore stronger CLIP variants for improved performance
- Extend the reasoning-aligned lightweight approach to video and AIGC quality assessment
Author keywords
- Image Quality Assessment
- Low Level Vision
- Multimodal Large Language Model
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.