From Markov to Laplace: How Mamba In-Context Learns Markov Chains
Marco Bondaschi, Nived Rajaraman, Xiuying Wei, Razvan Pascanu, Caglar Gulcehre, Michael Gastpar, Ashok Vardhan Makkuva
We uncover an interesting phenomenon: a single-layer Mamba, trained on Markov chains, represents the Bayes-optimal Laplacian smoothing estimator, which we demonstrate both theoretically and empirically.
Abstract
While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed-ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering an interesting phenomenon: even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
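To make the estimator concrete: the Laplacian (add-β) smoothing estimator the paper refers to predicts the next symbol of a Markov chain from in-context transition counts, with an additive prior count β per state. The sketch below is illustrative (the function name and default β are our own choices, not from the paper); β = 1/2 corresponds to the Krichevsky–Trofimov estimator.

```python
import numpy as np

def laplacian_smoothing_estimate(sequence, num_states, beta=0.5):
    """Add-beta (Laplacian) smoothing estimate of the next-symbol
    distribution for a first-order Markov chain, computed from the
    transition counts observed in the context.

    beta = 0.5 corresponds to the Krichevsky-Trofimov estimator.
    """
    counts = np.zeros((num_states, num_states))
    for prev, nxt in zip(sequence[:-1], sequence[1:]):
        counts[prev, nxt] += 1
    last = sequence[-1]
    # P(next = j | last) = (count(last, j) + beta) / (count(last, .) + k * beta)
    return (counts[last] + beta) / (counts[last].sum() + num_states * beta)
```

For example, on the binary sequence 0,1,0,1,0 with β = 1, every observed transition out of state 0 went to state 1, so the smoothed estimate assigns probability 3/4 to the next symbol being 1 and 1/4 to it being 0. The paper's claim is that a trained single-layer Mamba computes this same estimator in context.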
Characterizes the in-context learning capabilities of Mamba, showing it learns the optimal Laplacian smoothing estimator.
- First formal connection between Mamba and optimal statistical estimators for in-context learning
- Shows a single-layer Mamba efficiently learns the Laplacian smoothing estimator on Markov chains
- Theoretically characterizes Mamba's representation capacity, revealing the fundamental role of convolution
- Mamba architecture
- State space models
- Convolution
- In-context learning
- Statistical estimation
Authors did not state explicit limitations.
- Extend results to deeper Mamba models (from the paper)
- Investigate Mamba's learning dynamics (from the paper)
Author keywords
- State-space models
- Markov chains
- In-context learning
- Laplacian smoothing
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in LLMs adapted with differential privacy, revealing that distribution shifts and model choice impact its effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.