How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

Shawn Im, Changdae Oh, Zhen Fang, Sharon Li

LLMs & Reasoning · Thu, Apr 23 · 3:27 PM–3:37 PM · 201 A/B · Avg rating: 7.20 (6–8)

Abstract

Semantic associations, such as the link between "bird" and "flew", are foundational for language modeling: they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and for developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Our analysis reveals that each set of transformer weights has a closed-form expression as a simple composition of three basis functions (bigram, token-interchangeability, and context mappings) that reflect the statistics of the text corpus, and it uncovers how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem sheds light on interpreting the learned associations in transformers.
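To make the idea of a gradient leading term concrete, the following is a minimal sketch on a toy model, not the paper's transformer analysis. It assumes a hypothetical one-layer next-token model whose logits are simply the row `W[prev]`, initialized at zero and trained with mean cross-entropy. Under those assumptions the first gradient step is a centered bigram count matrix, so the weights already encode an association such as "bird" to "flew". The corpus, function names, and learning rate are illustrative choices, not taken from the paper.

```python
import numpy as np

def bigram_counts(token_ids, vocab_size):
    """Count matrix C[i, j] = number of times token j directly follows token i."""
    counts = np.zeros((vocab_size, vocab_size))
    for prev, nxt in zip(token_ids[:-1], token_ids[1:]):
        counts[prev, nxt] += 1
    return counts

def first_step_weights(token_ids, vocab_size, lr=1.0):
    """Weights after one gradient-descent step from W = 0 (toy model: logits = W[prev])."""
    counts = bigram_counts(token_ids, vocab_size)
    n_pairs = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)
    # At W = 0 the softmax is uniform (1 / vocab_size), so the gradient of the
    # mean cross-entropy loss is (row_totals / vocab_size - counts) / n_pairs.
    grad = (row_totals / vocab_size - counts) / n_pairs
    return -lr * grad

# Toy corpus (an assumption for illustration only).
corpus = "the bird flew and the bird flew and the bird sang".split()
vocab = sorted(set(corpus))
ids = [vocab.index(tok) for tok in corpus]

W = first_step_weights(ids, len(vocab))
bird = vocab.index("bird")
strongest = vocab[int(np.argmax(W[bird]))]
print(f"strongest association for 'bird': {strongest}")
```

In this toy setting each row of `W` sums to zero, and the largest entry in the row for "bird" points at its most frequent successor. The paper's result concerns richer compositions of bigram, token-interchangeability, and context mappings across transformer components, which this sketch does not attempt to reproduce.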

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Gradient leading-term analysis reveals how semantic associations emerge in transformers as compositions of bigram, interchangeability, and context mapping functions.

Contributions · Auto-generated by claude-haiku-4-5-20251001

Develops closed-form expressions for transformer weights at early training stages using leading-term gradient approximation
Reveals transformer weights decompose as compositions of three basis functions: bigram mapping, token-interchangeability mapping, and context mapping
Demonstrates theoretical weight characterizations closely match learned weights in real-world LLMs

Methods used · Auto-generated by claude-haiku-4-5-20251001

Gradient leading-term analysis
Closed-form weight expressions
Training dynamics analysis

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Discover common factors allowing weight matrices across components to be decomposed into simple functions of shared factors
Leverage theory to formulate broad hypotheses about how concepts arise in models, extending beyond individual mechanisms

Author keywords

Semantic associations
Interpretability
LLM


Related orals

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

RefineStat: Efficient Exploration for Probabilistic Program Synthesis