Interpretability & Mechanistic Understanding

Mechanistic interpretability, feature visualization, circuit analysis, probing, and explainability.

Addressing divergent representations from causal interventions on neural networks

Studies causal interventions, showing that they produce out-of-distribution representations, and proposes a Counterfactual Latent loss to mitigate harmful divergences.

Avg rating: 5.20 (4–8) · Satchel Grant et al.

Exploratory Causal Inference in SAEnce

Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.

Avg rating: 7.00 (4–8) · Tommaso Mencattini et al.

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Temporal Sparse Autoencoders incorporate a contrastive loss that encourages consistent feature activations across adjacent tokens in order to discover semantic concepts.

Avg rating: 6.50 (4–10) · Usha Bhalla et al.

Temporal superposition and feature geometry of RNNs under memory demands

Studies temporal superposition in RNNs, showing how memory demands affect representational geometry and how RNNs learn different encoding strategies.

Avg rating: 7.50 (6–8) · Pratyaksh Sharma et al.

Verifying Chain-of-Thought Reasoning via Its Computational Graph

CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning through white-box mechanistic analysis of computation failures.

Avg rating: 6.50 (4–8) · Zheng Zhao et al.