Addressing divergent representations from causal interventions on neural networks
Shows that causal interventions on neural networks produce out-of-distribution representations, and proposes a Counterfactual Latent loss to mitigate harmful divergences.
Mechanistic interpretability, feature visualization, circuit analysis, probing, and explainability.
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
Temporal Sparse Autoencoders incorporate a contrastive loss that encourages consistent feature activations across adjacent tokens, with the aim of discovering semantic concepts.
Studies temporal superposition in RNNs, showing how memory demands affect representational geometry and how RNNs learn different encoding strategies.
CRV treats attribution graphs as execution traces to verify chain-of-thought reasoning through white-box mechanistic analysis of computation failures.