Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
We introduce CRV, a white-box methodology that treats attribution graphs as execution traces, and use it to provide evidence that flawed reasoning has a verifiable computational structure.
Abstract
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into *why* a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as *execution traces* of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
CRV treats attribution graphs as execution traces, enabling white-box verification of chain-of-thought reasoning and mechanistic analysis of computational failures.
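The verification pipeline described in the abstract (extract an attribution graph for each CoT step, summarize it with aggregate structural features, then train a classifier on those feature vectors) can be sketched as below. The edge-list encoding, the specific feature set, and the toy graphs are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def graph_features(edges, n_nodes):
    """Aggregate structural features of a weighted attribution graph.

    `edges` is a list of (src, dst, weight) triples. The features below
    (counts, density, weight statistics, max in-degree) are a hypothetical
    stand-in for the paper's statistical/topological feature set.
    """
    weights = np.array([w for _, _, w in edges], dtype=float)
    in_deg = np.zeros(n_nodes)
    for _, dst, _ in edges:
        in_deg[dst] += 1
    density = len(edges) / (n_nodes * (n_nodes - 1))  # directed, no self-loops
    return np.array([
        float(n_nodes),
        float(len(edges)),
        density,
        weights.mean(),
        weights.max(),
        in_deg.max(),
    ])

# Toy graphs standing in for attribution graphs of two CoT steps:
# a "correct" step with a few strong edges, a "faulty" step with many weak ones.
correct_step = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.7)]
faulty_step = [(0, 3, 0.1), (1, 3, 0.05), (0, 2, 0.2), (2, 3, 0.15)]

x_correct = graph_features(correct_step, 4)
x_faulty = graph_features(faulty_step, 4)
# A downstream classifier (e.g. gradient-boosted trees) would be trained on
# many such vectors, each labeled with its step's correctness.
```

The key design point, per the abstract, is that the features are aggregative: the classifier never sees the semantic content of individual circuit features, only the structure of the trace.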
- Introduces Circuit-based Reasoning Verification (CRV), a white-box method using attribution graphs to verify chain-of-thought correctness
- Shows structural signatures of reasoning errors are highly predictive and domain-specific
- Demonstrates causal relationship between computational patterns and reasoning errors via targeted transcoder interventions
- Provides mechanistic understanding of why language models fail to reason correctly
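The targeted transcoder interventions mentioned above can be illustrated with a minimal sketch: clamp one feature's activation in a transcoder-style forward pass and observe that the reconstructed output shifts. The random weights, dimensions, and the clamped index are toy assumptions, not the paper's trained transcoders or identified features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32
# Toy transcoder: an encoder/decoder pair with ReLU sparsity
# (illustrative shapes and random weights).
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))

def transcoder(x, clamp=None):
    """Forward pass; `clamp` maps feature index -> forced activation,
    mimicking a targeted intervention on an individual feature."""
    acts = np.maximum(W_enc.T @ x, 0.0)  # sparse feature activations
    if clamp:
        for idx, val in clamp.items():
            acts[idx] = val
    return W_dec.T @ acts

x = rng.normal(size=d_model)
baseline = transcoder(x)
# Pin a (hypothetical) feature implicated in a faulty step to a new value
# and measure how much the downstream representation moves.
intervened = transcoder(x, clamp={5: 3.0})
delta = np.linalg.norm(baseline - intervened)
```

In the paper's setting, the intervention target is chosen from the CRV analysis of the attribution graph, and success is measured by whether the model's faulty reasoning step is corrected, not by an activation-space norm as in this toy example.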
- Attribution graphs
- Sparse autoencoders
- Circuit analysis
- Transcoders
- Chain-of-thought reasoning tasks
- Mathematical reasoning
- Commonsense reasoning
Limitations
- Orders of magnitude more computationally expensive than black-box or gray-box methods
- Features used are primarily aggregative, capturing statistical and topological properties rather than semantic content
- Analysis is based on a single model family (Llama 3.1) at the 8B scale; generalization to larger models and different architectures is uncertain
- Validity is contingent on the quality of underlying interpretability tools such as sparse autoencoders and attribution methods
Future directions
- Develop more sophisticated classifiers or rule-based systems operating on the semantic properties of disentangled features
- Improve interpretability tools, such as more faithful sparse autoencoders and more precise attribution methods
Author keywords
- Mechanistic Interpretability
- Chain-of-Thought Reasoning
- Attribution Graphs
Related orals
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Temporal Sparse Autoencoders incorporate contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
Temporal superposition and feature geometry of RNNs under memory demands
Studies temporal superposition in RNNs showing how memory demands affect representational geometry and RNNs learn different encoding strategies.
Exploratory Causal Inference in SAEnce
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
Addressing divergent representations from causal interventions on neural networks
Study of causal interventions showing they produce out-of-distribution representations, proposing Counterfactual Latent loss to mitigate harmful divergences.