Visual symbolic mechanisms: Emergent symbol processing in Vision Language Models
Rim Assouel, Declan Iain Campbell, Yoshua Bengio, Taylor Whittington Webb
We describe a set of symbol-like mechanisms that VLMs use to bind features to visual entities in context
Abstract
To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this ‘binding problem’ via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by Vision Language Models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a previously unknown set of emergent symbolic mechanisms that support binding specifically in VLMs, via a content-independent, spatial indexing scheme. Moreover, we find that binding errors, when they occur, can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for reducing the number of binding failures exhibited by these models.
VLMs employ position IDs as content-independent spatial indices to solve visual binding across object features.
- Identification of emergent symbolic mechanisms for visual binding in VLMs using position IDs
- Three-stage architecture: ID retrieval heads, ID selection heads, and feature retrieval heads
- Architecture consistent across multiple model families, suggesting a fundamental solution to binding
- Binding failures directly traced to failures in spatial indexing mechanisms
- Representational analysis
- Causal mediation
- Intervention analysis
- Attention head analysis
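The three-stage mechanism above (ID retrieval, ID selection, feature retrieval) can be illustrated with a toy sketch. This is a hypothetical analogy, not the paper's code: objects are bound to content-independent indices derived from their spatial order, and a query about one feature (shape) retrieves another (color) purely through the shared index, never through feature co-occurrence.

```python
# Toy sketch (assumed illustration, not the paper's implementation):
# binding features via content-independent spatial indices.

def bind_scene(objects):
    """Assign each object an index from its spatial position (left-to-right
    order), then store features keyed by that index, not by content."""
    ordered = sorted(objects, key=lambda o: o["x"])
    return {idx: (o["color"], o["shape"]) for idx, o in enumerate(ordered)}

def query_color_of(bindings, shape):
    """Three-stage lookup loosely mirroring the head architecture:
    retrieve the index bound to the queried shape, select that index,
    then retrieve the color stored at the same index."""
    idx = next(i for i, (_, s) in bindings.items() if s == shape)  # ID retrieval + selection
    return bindings[idx][0]                                        # feature retrieval

scene_a = [{"color": "red", "shape": "square", "x": 0},
           {"color": "blue", "shape": "circle", "x": 1}]
scene_b = [{"color": "blue", "shape": "square", "x": 0},
           {"color": "red", "shape": "circle", "x": 1}]

assert query_color_of(bind_scene(scene_a), "square") == "red"
assert query_color_of(bind_scene(scene_b), "square") == "blue"
```

Because the index carries no feature content, the two scenes from the abstract (same features, different pairings) yield distinct answers, which is exactly what correct binding requires.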
- Open question whether the identified mechanisms are truly emergent or driven by architectural inductive biases
- Investigation of whether emergence is driven by position embeddings or by the training data distribution
- Develop architectural innovations that better support spatial indexing, such as object-centric architectures
- Strengthen symbolic mechanisms through training strategies such as spatial pointing tasks
Author keywords
- visual object binding
- vision-language model
- symbolic reasoning
- interpretability
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact the effectiveness of protection.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.