Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio Calmon
We propose that using contextual information to train SAEs will improve their representation of semantic and high-level features.
Abstract
Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they often only recover token-specific, noisy, or highly local concepts. We argue that this limitation stems from neglecting the temporal structure of language, where semantic content typically evolves smoothly over sequences. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
Temporal Sparse Autoencoders incorporate a contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
- Novel contrastive loss for SAEs that encourages temporal consistency of high-level features across adjacent tokens
- Evidence that SAEs can disentangle semantic from syntactic features without explicit semantic signal
- Framework enabling unsupervised interpretability in language models through temporal structure modeling
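The temporal-consistency idea can be sketched concretely. The snippet below is a minimal NumPy illustration, not the paper's implementation: it substitutes a squared-difference smoothness penalty for the paper's contrastive formulation, and the high-level/low-level split point `n_hi`, the weight initialization, and the loss coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict, n_hi = 16, 64, 8  # first n_hi latents treated as "high-level" (assumed split)

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = W_enc.T.copy()  # tied decoder, for simplicity only

def sae_forward(x):
    """ReLU SAE: activations x -> sparse code z -> reconstruction x_hat."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = z @ W_dec
    return z, x_hat

def tsae_loss(X, l1=1e-3, temporal_coef=1e-2):
    """Reconstruction + L1 sparsity + temporal consistency on high-level latents.

    X: (seq_len, d_model) residual-stream activations for one sequence.
    The temporal term penalizes changes in the first n_hi latent activations
    between adjacent tokens, encouraging slowly varying semantic features
    while leaving the remaining latents free to capture token-local detail.
    """
    Z, X_hat = sae_forward(X)
    recon = np.mean((X - X_hat) ** 2)
    sparsity = np.mean(np.abs(Z))
    hi = Z[:, :n_hi]
    temporal = np.mean((hi[1:] - hi[:-1]) ** 2)  # adjacent-token consistency
    return recon + l1 * sparsity + temporal_coef * temporal

X = rng.normal(size=(12, d_model))  # a toy 12-token sequence
loss = tsae_loss(X)
print(float(loss))
```

In a real training loop the temporal term would trade off against batch size, since adjacent-token pairs must come from the same sequence, which is consistent with the memory-budget limitation noted below.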
- Sparse Autoencoders
- Contrastive learning
- Dictionary learning
- Temporal consistency
Limitations (from the paper)
- Temporal contrastive loss requires smaller batch sizes for the same memory budget
- Only explores a single split of the feature space into high-level and low-level features, not multiple temporal hierarchies

Future directions (from the paper)
- Explore multiple temporal hierarchies corresponding to different linguistic levels
- Use learned features as state trackers for detecting significant changes in model behavior
- Investigate alternative loss formulations more amenable to the geometry of sparse feature spaces
Author keywords
- Interpretability
- Dictionary Learning
- Machine Learning
- Large Language Models
Related orals
Verifying Chain-of-Thought Reasoning via Its Computational Graph
CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning with white-box mechanistic analysis of computation failures.
Temporal superposition and feature geometry of RNNs under memory demands
Studies temporal superposition in RNNs, showing how memory demands shape representational geometry and how RNNs learn different encoding strategies.
Exploratory Causal Inference in SAEnce
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
Addressing divergent representations from causal interventions on neural networks
Study of causal interventions showing that they produce out-of-distribution representations, proposing a Counterfactual Latent loss to mitigate harmful divergences.