ICLR 2026 Orals

Visual symbolic mechanisms: Emergent symbol processing in Vision Language Models

Rim Assouel, Declan Iain Campbell, Yoshua Bengio, Taylor Whittington Webb

LLMs & Reasoning · Fri, Apr 24 · 4:27 PM–4:37 PM · 202 A/B · Avg rating: 6.50 (6–8)
Author-provided TL;DR

We describe a set of symbolic-like mechanisms that VLMs use to bind to visual entities in context

Abstract

To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this ‘binding problem’ via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by Vision Language Models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a previously unknown set of emergent symbolic mechanisms that support binding specifically in VLMs, via a content-independent, spatial indexing scheme. Moreover, we find that binding errors, when they occur, can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for reducing the number of binding failures exhibited by these models.

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

VLMs employ position IDs as content-independent spatial indices to solve visual binding across object features.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Identification of emergent symbolic mechanisms for visual binding in VLMs using position IDs
  • Three-stage architecture: ID retrieval heads, ID selection heads, and feature retrieval heads
  • Architecture consistent across multiple model families, suggesting a fundamental solution to binding
  • Binding failures directly traced to failures in spatial indexing mechanisms
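The three-stage architecture above can be illustrated with a toy, hypothetical sketch: objects carry content-independent position IDs, and a query is answered by retrieving the ID bound to the queried feature, selecting one ID, and reading out the target feature at that ID. All names and data here are illustrative assumptions, not the authors' implementation.

```python
# Toy scene: each object is tagged with a content-independent position ID.
scene = [
    {"pos_id": 0, "shape": "square", "color": "red"},
    {"pos_id": 1, "shape": "circle", "color": "blue"},
]

def id_retrieval(scene, shape):
    """Stage 1 (ID retrieval heads): find position IDs bound to the queried shape."""
    return [obj["pos_id"] for obj in scene if obj["shape"] == shape]

def id_selection(ids):
    """Stage 2 (ID selection heads): pick a single candidate ID."""
    return ids[0]

def feature_retrieval(scene, pos_id, feature):
    """Stage 3 (feature retrieval heads): read out the feature at the selected ID."""
    for obj in scene:
        if obj["pos_id"] == pos_id:
            return obj[feature]

# "What color is the circle?" resolves via the spatial index, not via content.
ids = id_retrieval(scene, "circle")
answer = feature_retrieval(scene, id_selection(ids), "color")
print(answer)  # -> blue
```

On this picture, a binding error corresponds to a failure in one of these stages, e.g. stage 2 selecting the wrong ID, which would make the model report the square's color for the circle.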
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Representational analysis
  • Causal mediation
  • Intervention analysis
  • Attention head analysis
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Open question whether identified mechanisms are truly emergent or driven by architectural inductive biases
  • Investigation of whether emergence is driven by position embeddings or training data distribution
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Develop architectural innovations, such as object-centric architectures, that better support spatial indexing
  • Strengthen symbolic mechanisms through training strategies, such as spatial pointing tasks

Author keywords

  • visual object binding
  • vision-language model
  • symbolic reasoning
  • interpretability
