ICLR 2026 Orals

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio Calmon

Interpretability & Mechanistic Understanding Thu, Apr 23 · 4:03 PM–4:13 PM · 201 C Avg rating: 6.50 (4–10)
Author-provided TL;DR

We propose that using contextual information to train SAEs will improve their representation of semantic and high-level features.

Abstract

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they often only recover token-specific, noisy, or highly local concepts. We argue that this limitation stems from neglecting the temporal structure of language, where semantic content typically evolves smoothly over sequences. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
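The listing does not give the loss in equation form, so the following is a minimal PyTorch sketch of how a temporal contrastive term of this kind could be combined with a standard SAE objective. The function names (`temporal_contrastive_loss`, `t_sae_loss`), the InfoNCE-style formulation with in-batch negatives, the treatment of the first `n_high` dictionary features as the "high-level" block, and all coefficients are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_high: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style temporal consistency term (illustrative, not the paper's exact loss).

    z_high: (batch, seq_len, k) activations of the SAE features designated as high-level.
    Each token's code is pulled toward the code of the next token in the same sequence
    (positive pair) and pushed away from the codes of other token positions in the batch
    (in-batch negatives).
    """
    k = z_high.shape[-1]
    anchors = F.normalize(z_high[:, :-1].reshape(-1, k), dim=-1)   # codes at positions t
    positives = F.normalize(z_high[:, 1:].reshape(-1, k), dim=-1)  # codes at positions t+1
    logits = anchors @ positives.T / temperature                   # (N, N) cosine similarities
    labels = torch.arange(logits.shape[0], device=logits.device)   # i-th anchor matches i-th positive
    return F.cross_entropy(logits, labels)

def t_sae_loss(x, x_hat, z, n_high, l1_coef=1e-3, temporal_coef=1e-2):
    """Total objective: reconstruction + L1 sparsity + temporal contrast
    applied to the first n_high dictionary features (assumed high/low split)."""
    recon = F.mse_loss(x_hat, x)
    sparsity = z.abs().sum(dim=-1).mean()
    temporal = temporal_contrastive_loss(z[..., :n_high])
    return recon + l1_coef * sparsity + temporal_coef * temporal
```

One consequence of in-batch negatives in a formulation like this is that the number of negative pairs (and the size of the similarity matrix) scales with batch size, which is consistent with the author-stated limitation that the temporal contrastive loss requires smaller batch sizes under the same memory budget.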

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Temporal Sparse Autoencoders incorporate a contrastive loss that encourages consistent feature activations over adjacent tokens in order to discover semantic concepts.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Novel contrastive loss for SAEs that encourages temporal consistency of high-level features across adjacent tokens
  • Evidence that SAEs can disentangle semantic from syntactic features without explicit semantic signal
  • Framework enabling unsupervised interpretability in language models through temporal structure modeling
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Sparse Autoencoders
  • Contrastive learning
  • Dictionary learning
  • Temporal consistency
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Temporal contrastive loss requires smaller batch sizes for the same memory budget
  • Only explored a single split of the feature space into high-level and low-level features, not multiple temporal hierarchies
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Explore multiple temporal hierarchies corresponding to different linguistic levels
  • Use learned features as state trackers for detecting significant changes in model behavior
  • Investigate alternative loss formulations more amenable to sparse feature space geometry

Author keywords

  • Interpretability
  • Dictionary Learning
  • Machine Learning
  • Large Language Models

