ICLR 2026 Orals

LLM DNA: Tracing Model Evolution via Functional Representations

Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He

LLMs & Reasoning · Sat, Apr 25 · 3:27 PM–3:37 PM · 203 A/B · Avg rating: 5.50 (4–6)
Author-provided TL;DR

We introduce LLM DNA, a low-dimensional representation of LLMs that uncovers undocumented relationships and supports constructing a phylogenetic tree of LLMs.

Abstract

The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining *LLM DNA* as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies *inheritance* and *genetic determinism* and establish its existence. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on various tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct an evolutionary tree of LLMs using phylogenetic algorithms; the tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
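
For reference, the standard bi-Lipschitz condition that the abstract invokes can be written as below. The notation is illustrative and not taken from the paper: $\phi$ denotes the map from a model to its DNA vector, $d_F$ a distance between the functional behavior of two models $f$ and $g$, and $c_1, c_2$ are constants.

```latex
c_1 \, d_F(f, g) \;\le\; \lVert \phi(f) - \phi(g) \rVert_2 \;\le\; c_2 \, d_F(f, g),
\qquad 0 < c_1 \le c_2 < \infty
```

The lower bound keeps functionally distinct models from collapsing onto the same DNA, and the upper bound keeps functionally similar models close in DNA space.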

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

LLM DNA, a low-dimensional functional representation, reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Mathematical definition of LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior
  • Proof that LLM DNA satisfies the inheritance and genetic determinism properties
  • Training-free DNA extraction pipeline, applicable across diverse models, that reveals undocumented relationships
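
One way to picture a training-free, low-dimensional functional representation is the minimal sketch below. It rests on assumptions that this listing does not confirm: each model is summarized by fixed-length vectors describing its responses to a shared prompt set (here replaced by random toy data), and the reduction step is a Gaussian random projection. The helper names `make_projection`, `extract_dna`, and `dna_distance` are hypothetical and not the authors' API.

```python
import math
import random


def make_projection(dim_in, dim_out, seed=0):
    """Build a Gaussian random matrix scaled by 1/sqrt(dim_out).

    Assumption: a random projection stands in for the dimension-reduction step.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def extract_dna(response_vectors, projection):
    """Concatenate per-prompt response vectors, then project to low dimension."""
    flat = [x for vec in response_vectors for x in vec]
    return [sum(w * x for w, x in zip(row, flat)) for row in projection]


def dna_distance(a, b):
    """Euclidean distance between two DNA vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy stand-ins for real response embeddings: 4 shared prompts, 8 dims each.
rng = random.Random(42)
base = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
# A hypothetical fine-tune: the base model's behavior plus a small perturbation.
finetuned = [[x + 0.05 * rng.gauss(0, 1) for x in vec] for vec in base]
# A hypothetical unrelated model: independent behavior.
unrelated = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

# The same projection is shared by all models so their DNA is comparable.
projection = make_projection(dim_in=32, dim_out=16)
dna = {
    name: extract_dna(vectors, projection)
    for name, vectors in [("base", base), ("finetuned", finetuned), ("unrelated", unrelated)]
}

near = dna_distance(dna["base"], dna["finetuned"])
far = dna_distance(dna["base"], dna["unrelated"])
print(f"base vs finetuned: {near:.3f}, base vs unrelated: {far:.3f}")
```

Because the projection is linear and shared across models, a small behavioral change yields a small change in DNA, which is the intuition behind comparing related and unrelated models by distance.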
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • functional representation learning
  • phylogenetic analysis
  • model comparison
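
To illustrate how pairwise model distances can be turned into a tree, here is a minimal sketch using UPGMA (average-linkage clustering). UPGMA is swapped in as a simple stand-in; this listing does not say which phylogenetic algorithm the authors use. The model names and distance values are made up for illustration, and `upgma` is a hypothetical helper.

```python
def upgma(names, matrix):
    """Build a tree by repeatedly merging the two closest clusters.

    names: leaf labels; matrix: symmetric list-of-lists of pairwise distances.
    Returns the tree as nested 2-tuples of labels.
    """
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    d = [row[:] for row in matrix]            # working copy of the distances
    while len(clusters) > 1:
        n = len(clusters)
        # Pick the closest pair of current clusters (i < j).
        i, j = min(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda ab: d[ab[0]][ab[1]],
        )
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two merged clusters' distances.
        merged_row = [
            (size_i * d[i][k] + size_j * d[j][k]) / (size_i + size_j)
            for k in keep
        ]
        # Rebuild the matrix with the merged cluster appended last.
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, merged_row):
            row.append(value)
        d.append(merged_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), size_i + size_j)]
    return clusters[0][0]


# Made-up DNA distances between four hypothetical models.
names = ["base", "finetuned", "distilled", "unrelated"]
matrix = [
    [0.0, 1.0, 2.0, 8.0],
    [1.0, 0.0, 2.5, 8.0],
    [2.0, 2.5, 0.0, 8.0],
    [8.0, 8.0, 8.0, 0.0],
]
tree = upgma(names, matrix)
print(tree)
```

In this toy input the base model and its fine-tune merge first, the distilled model joins them next, and the unrelated model attaches last, mirroring how close DNA distances translate into nearby branches.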
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • 305 open-source LLMs
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • DNA definition does not assign meaning to subsequences despite observations of trait encoding
    from the paper
  • DNA extraction not designed to resist adaptive attacks
    from the paper
  • DNA extraction depends on six public datasets with potential evaluation bias or contamination
    from the paper
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Develop theoretical characterization of DNA subsequences encoding distinct model traits
    from the paper
  • Improve robustness against adaptive attacks through closed-source extraction data or fresh dataset re-extraction
    from the paper
  • Scale extraction to mitigate overfitting and evaluation bias
    from the paper

Author keywords

  • Large Language Model
  • Representations
  • Fingerprint
  • Embedding
  • Evolution
  • Phylogenetic Tree
  • DNA
  • Dimension Reduction
