ICLR 2026 Orals

Verifying Chain-of-Thought Reasoning via Its Computational Graph

Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda

Interpretability & Mechanistic Understanding · Thu, Apr 23 · 11:06 AM–11:16 AM · Amphitheater · Avg rating: 6.50 (4–8)
Author-provided TL;DR

We introduce CRV, a white-box methodology that treats attribution graphs as execution traces, and use it to provide evidence that flawed reasoning has a verifiable computational structure.

Abstract

Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
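To make the pipeline in the abstract concrete (extract aggregate structural features from an attribution graph, then train a binary classifier on them), here is a minimal pure-Python sketch. The graph encoding, the feature set, and the logistic classifier are illustrative stand-ins, not the authors' implementation.

```python
import math
from collections import defaultdict

def structural_features(edges, n_nodes):
    """Aggregate statistics of a directed attribution graph.

    The feature set (edge count, density, mean/max degree) is a
    hypothetical stand-in for CRV's structural fingerprints.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_edges = len(edges)
    density = n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    mean_deg = sum(degree.values()) / n_nodes if n_nodes else 0.0
    max_deg = max(degree.values(), default=0)
    return [n_edges, density, mean_deg, max_deg]

def train_logistic(X, y, lr=0.05, epochs=2000):
    """Plain SGD logistic regression; label 1 = step contains an error."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(b + sum(wj * xj for wj, xj in zip(w, x)) > 0)
```

A verifier built this way would score each CoT step by running `predict` on the features of that step's attribution graph.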

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning with white-box mechanistic analysis of computation failures.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces Circuit-based Reasoning Verification (CRV), a white-box method using attribution graphs to verify chain-of-thought correctness
  • Shows structural signatures of reasoning errors are highly predictive and domain-specific
  • Demonstrates causal relationship between computational patterns and reasoning errors via targeted transcoder interventions
  • Provides mechanistic understanding of why language models fail to reason correctly
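The intervention named in the third bullet (correcting faulty reasoning by acting on individual transcoder features) amounts to clamping one coordinate of a sparse feature vector before it is decoded back into the residual stream. A minimal sketch, assuming a hypothetical linear decoder and feature indexing; none of these names come from the paper:

```python
def decode(features, decoder_rows):
    """Linear transcoder decoder: reconstruction = sum_i f_i * d_i."""
    dim = len(decoder_rows[0])
    out = [0.0] * dim
    for f, row in zip(features, decoder_rows):
        for j in range(dim):
            out[j] += f * row[j]
    return out

def intervene(features, target_idx, new_value=0.0):
    """Clamp one feature, e.g. suppress a feature implicated in the error."""
    patched = list(features)
    patched[target_idx] = new_value
    return patched
```

In practice the patched reconstruction would replace the transcoder's output at the chosen layer during the forward pass, and the model's subsequent reasoning step would be re-generated.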
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Attribution graphs
  • Sparse autoencoders
  • Circuit analysis
  • Transcoders
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • Chain-of-thought reasoning tasks
  • Mathematical reasoning
  • Commonsense reasoning
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Orders of magnitude more resource-intensive than black-box or gray-box methods
  • Features used are primarily aggregative, capturing statistical and topological properties rather than semantic content
  • Analysis is based on a single model family (Llama 3.1) at 8B scale; generalization to larger models and other architectures is uncertain
  • Validity is contingent on the quality of underlying interpretability tools such as sparse autoencoders and attribution methods
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Develop sophisticated classifiers or rule-based systems operating on semantic properties of disentangled features
  • Improve interpretability tools such as more faithful sparse autoencoders and more precise attribution methods

Author keywords

  • Mechanistic Interpretability
  • Chain-of-Thought Reasoning
  • Attribution Graphs
