Visual symbolic mechanisms: Emergent symbol processing in Vision Language Models
Rim Assouel, Declan Iain Campbell, Yoshua Bengio, Taylor Whittington Webb
We describe a set of symbol-like mechanisms that VLMs use to bind features to visual entities in context
Abstract
To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this ‘binding problem’ via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by Vision Language Models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a previously unknown set of emergent symbolic mechanisms that support binding specifically in VLMs, via a content-independent, spatial indexing scheme. Moreover, we find that binding errors, when they occur, can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for reducing the number of binding failures exhibited by these models.
VLMs employ position IDs as content-independent spatial indices to solve visual binding across object features.
- Identification of emergent symbolic mechanisms for visual binding in VLMs using position IDs
- Three-stage architecture: ID retrieval heads, ID selection heads, and feature retrieval heads
- Architecture consistent across multiple model families, suggesting a fundamental solution to binding
- Binding failures directly traced to failures in spatial indexing mechanisms
- Representational analysis
- Causal mediation
- Intervention analysis
- Attention head analysis
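The three-stage mechanism above (ID retrieval, ID selection, feature retrieval) can be illustrated with a toy sketch. This is a hypothetical analogy, not the paper's code: objects are bound to content-independent indices derived from their spatial order, and a query about one feature (shape) retrieves another (color) purely through the shared index, never through feature co-occurrence.

```python
# Toy sketch (assumed illustration, not the paper's implementation):
# binding features via content-independent spatial indices.

def bind_scene(objects):
    """Assign each object an index from its spatial position (left-to-right
    order), then store features keyed by that index, not by content."""
    ordered = sorted(objects, key=lambda o: o["x"])
    return {idx: (o["color"], o["shape"]) for idx, o in enumerate(ordered)}

def query_color_of(bindings, shape):
    """Three-stage lookup loosely mirroring the head architecture:
    retrieve the index bound to the queried shape, select that index,
    then retrieve the color stored at the same index."""
    idx = next(i for i, (_, s) in bindings.items() if s == shape)  # ID retrieval + selection
    return bindings[idx][0]                                        # feature retrieval

scene_a = [{"color": "red", "shape": "square", "x": 0},
           {"color": "blue", "shape": "circle", "x": 1}]
scene_b = [{"color": "blue", "shape": "square", "x": 0},
           {"color": "red", "shape": "circle", "x": 1}]

assert query_color_of(bind_scene(scene_a), "square") == "red"
assert query_color_of(bind_scene(scene_b), "square") == "blue"
```

Because the index carries no feature content, the two scenes from the abstract (same features, different pairings) yield distinct answers, which is exactly what correct binding requires.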
- Open question whether the identified mechanisms are truly emergent or driven by architectural inductive biases
- Investigation of whether emergence is driven by position embeddings or by the training data distribution
- Develop architectural innovations that better support spatial indexing, such as object-centric architectures
- Strengthen symbolic mechanisms through training strategies such as spatial pointing tasks
Author keywords
- visual object binding
- vision-language model
- symbolic reasoning
- interpretability
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact the effectiveness of protection.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.