Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
We introduce CRV, a white-box methodology that treats attribution graphs as execution traces, and use it to provide evidence that flawed reasoning has a verifiable computational structure.
Abstract
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into *why* a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as *execution traces* of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
CRV treats attribution graphs as execution traces, enabling white-box verification of chain-of-thought reasoning and mechanistic analysis of computational failures.
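The verification pipeline described in the abstract (extract an attribution graph for each CoT step, summarize it with aggregate structural features, then train a classifier on those feature vectors) can be sketched as below. The edge-list encoding, the specific feature set, and the toy graphs are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def graph_features(edges, n_nodes):
    """Aggregate structural features of a weighted attribution graph.

    `edges` is a list of (src, dst, weight) triples. The features below
    (counts, density, weight statistics, max in-degree) are a hypothetical
    stand-in for the paper's statistical/topological feature set.
    """
    weights = np.array([w for _, _, w in edges], dtype=float)
    in_deg = np.zeros(n_nodes)
    for _, dst, _ in edges:
        in_deg[dst] += 1
    density = len(edges) / (n_nodes * (n_nodes - 1))  # directed, no self-loops
    return np.array([
        float(n_nodes),
        float(len(edges)),
        density,
        weights.mean(),
        weights.max(),
        in_deg.max(),
    ])

# Toy graphs standing in for attribution graphs of two CoT steps:
# a "correct" step with a few strong edges, a "faulty" step with many weak ones.
correct_step = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.7)]
faulty_step = [(0, 3, 0.1), (1, 3, 0.05), (0, 2, 0.2), (2, 3, 0.15)]

x_correct = graph_features(correct_step, 4)
x_faulty = graph_features(faulty_step, 4)
# A downstream classifier (e.g. gradient-boosted trees) would be trained on
# many such vectors, each labeled with its step's correctness.
```

The key design point, per the abstract, is that the features are aggregative: the classifier never sees the semantic content of individual circuit features, only the structure of the trace.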
- Introduces Circuit-based Reasoning Verification (CRV), a white-box method using attribution graphs to verify chain-of-thought correctness
- Shows structural signatures of reasoning errors are highly predictive and domain-specific
- Demonstrates causal relationship between computational patterns and reasoning errors via targeted transcoder interventions
- Provides mechanistic understanding of why language models fail to reason correctly
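The targeted transcoder interventions mentioned above can be illustrated with a minimal sketch: clamp one feature's activation in a transcoder-style forward pass and observe that the reconstructed output shifts. The random weights, dimensions, and the clamped index are toy assumptions, not the paper's trained transcoders or identified features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32
# Toy transcoder: an encoder/decoder pair with ReLU sparsity
# (illustrative shapes and random weights).
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))

def transcoder(x, clamp=None):
    """Forward pass; `clamp` maps feature index -> forced activation,
    mimicking a targeted intervention on an individual feature."""
    acts = np.maximum(W_enc.T @ x, 0.0)  # sparse feature activations
    if clamp:
        for idx, val in clamp.items():
            acts[idx] = val
    return W_dec.T @ acts

x = rng.normal(size=d_model)
baseline = transcoder(x)
# Pin a (hypothetical) feature implicated in a faulty step to a new value
# and measure how much the downstream representation moves.
intervened = transcoder(x, clamp={5: 3.0})
delta = np.linalg.norm(baseline - intervened)
```

In the paper's setting, the intervention target is chosen from the CRV analysis of the attribution graph, and success is measured by whether the model's faulty reasoning step is corrected, not by an activation-space norm as in this toy example.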
- Attribution graphs
- Sparse autoencoders
- Circuit analysis
- Transcoders
- Chain-of-thought reasoning tasks
- Mathematical reasoning
- Commonsense reasoning
Limitations
- Orders of magnitude more computationally expensive than black-box or gray-box methods
- Features used are primarily aggregative, capturing statistical and topological properties rather than semantic content
- Analysis is based on a single model family (Llama 3.1) at the 8B scale; generalization to larger models and different architectures is uncertain
- Validity is contingent on the quality of underlying interpretability tools such as sparse autoencoders and attribution methods
Future directions
- Develop more sophisticated classifiers or rule-based systems operating on the semantic properties of disentangled features
- Improve interpretability tools, such as more faithful sparse autoencoders and more precise attribution methods
Author keywords
- Mechanistic Interpretability
- Chain-of-Thought Reasoning
- Attribution Graphs
Related orals
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Temporal Sparse Autoencoders incorporate contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
Temporal superposition and feature geometry of RNNs under memory demands
Studies temporal superposition in RNNs showing how memory demands affect representational geometry and RNNs learn different encoding strategies.
Exploratory Causal Inference in SAEnce
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
Addressing divergent representations from causal interventions on neural networks
Study of causal interventions showing they produce out-of-distribution representations, proposing Counterfactual Latent loss to mitigate harmful divergences.