Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, Jian Zhang
Abstract
Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in certain scenarios. Through extensive experiments, this paper verifies that RL training leads MLLMs to leverage their reasoning capability to convert redundant visual representations into compact, cross-domain-aligned text representations, and that this conversion is precisely the source of the generalization exhibited by reasoning-based IQA models. Building on this insight, we propose a novel algorithm, RALI, which employs contrastive learning to align images directly with the generalizable text representations learned through RL. This approach eliminates the reliance on a reasoning process and even obviates the need to load an LLM. For the quality scoring task, the framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
The RALI framework aligns images to the text representations of reasoning MLLMs via contrastive learning, achieving comparable image quality assessment performance with <5% of the parameters.
- Finding that reasoning MLLM generalization in IQA stems from compressing visual information into descriptive text
- RACT framework for addressing divergent data distributions through text-image alignment
- RALI lightweight framework matching reasoning MLLM performance with 0.3B parameters and no LLM loading
- Contrastive learning
- Image-text alignment
- PCA
- K-means clustering
- Reinforcement learning
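The image-text alignment at the core of RALI can be sketched as a symmetric InfoNCE objective, the standard CLIP-style contrastive formulation: each image embedding is pulled toward its paired text embedding and pushed away from the others in the batch. This is a minimal generic sketch, not the paper's exact implementation; the function name and toy data are illustrative.

```python
import numpy as np

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th image should match the i-th text."""
    # L2-normalize both embedding batches so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, sharpened by the temperature
    logits = img @ txt.T / temperature          # shape (N, N)

    def cross_entropy_to_diagonal(l):
        # Log-softmax over each row, numerically stabilized
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # Correct match for row i is column i
        return -np.mean(np.diag(logp))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_to_diagonal(logits)
                  + cross_entropy_to_diagonal(logits.T))

# Toy check: correctly paired embeddings score a lower loss than shuffled pairs
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = contrastive_alignment_loss(emb, emb)
shuffled = contrastive_alignment_loss(emb, emb[::-1])
print(aligned < shuffled)
```

In a CLIP-style setup the image embeddings come from a trainable vision encoder while, per the paper's idea, the text side would be the fixed quality descriptions distilled from the reasoning MLLM.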
Limitations
- Performance ceiling constrained by the representational and reasoning capacity of the CLIP vision encoder
- Experiments primarily target natural-image IQA, though the approach is extensible to video and AIGC quality assessment
Future work
- Explore stronger CLIP variants for improved performance
- Extend the reasoning-aligned lightweight approach to video and AIGC quality assessment
Author keywords
- Image Quality Assessment
- Low Level Vision
- Multimodal Large Language Model
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.