Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio Calmon
We propose that using contextual information to train SAEs will improve their representation of semantic and high-level features.
Abstract
Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they often only recover token-specific, noisy, or highly local concepts. We argue that this limitation stems from neglecting the temporal structure of language, where semantic content typically evolves smoothly over sequences. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
Temporal Sparse Autoencoders incorporate a contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
- Novel contrastive loss for SAEs that encourages temporal consistency of high-level features across adjacent tokens
- Evidence that SAEs can disentangle semantic from syntactic features without explicit semantic signal
- Framework enabling unsupervised interpretability in language models through temporal structure modeling
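The temporal-consistency idea can be sketched concretely. The snippet below is a minimal NumPy illustration, not the paper's implementation: it substitutes a squared-difference smoothness penalty for the paper's contrastive formulation, and the high-level/low-level split point `n_hi`, the weight initialization, and the loss coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict, n_hi = 16, 64, 8  # first n_hi latents treated as "high-level" (assumed split)

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = W_enc.T.copy()  # tied decoder, for simplicity only

def sae_forward(x):
    """ReLU SAE: activations x -> sparse code z -> reconstruction x_hat."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = z @ W_dec
    return z, x_hat

def tsae_loss(X, l1=1e-3, temporal_coef=1e-2):
    """Reconstruction + L1 sparsity + temporal consistency on high-level latents.

    X: (seq_len, d_model) residual-stream activations for one sequence.
    The temporal term penalizes changes in the first n_hi latent activations
    between adjacent tokens, encouraging slowly varying semantic features
    while leaving the remaining latents free to capture token-local detail.
    """
    Z, X_hat = sae_forward(X)
    recon = np.mean((X - X_hat) ** 2)
    sparsity = np.mean(np.abs(Z))
    hi = Z[:, :n_hi]
    temporal = np.mean((hi[1:] - hi[:-1]) ** 2)  # adjacent-token consistency
    return recon + l1 * sparsity + temporal_coef * temporal

X = rng.normal(size=(12, d_model))  # a toy 12-token sequence
loss = tsae_loss(X)
print(float(loss))
```

In a real training loop the temporal term would trade off against batch size, since adjacent-token pairs must come from the same sequence, which is consistent with the memory-budget limitation noted below.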
- Sparse Autoencoders
- Contrastive learning
- Dictionary learning
- Temporal consistency
Limitations (from the paper)
- Temporal contrastive loss requires smaller batch sizes for the same memory budget
- Only explores a single split of the feature space into high-level and low-level features, not multiple temporal hierarchies

Future directions (from the paper)
- Explore multiple temporal hierarchies corresponding to different linguistic levels
- Use learned features as state trackers for detecting significant changes in model behavior
- Investigate alternative loss formulations more amenable to the geometry of sparse feature spaces
Author keywords
- Interpretability
- Dictionary Learning
- Machine Learning
- Large Language Models
Related orals
Verifying Chain-of-Thought Reasoning via Its Computational Graph
CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning with white-box mechanistic analysis of computation failures.
Temporal superposition and feature geometry of RNNs under memory demands
Studies temporal superposition in RNNs, showing how memory demands shape representational geometry and how RNNs learn different encoding strategies.
Exploratory Causal Inference in SAEnce
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
Addressing divergent representations from causal interventions on neural networks
Study of causal interventions showing that they produce out-of-distribution representations, proposing a Counterfactual Latent loss to mitigate harmful divergences.