LLM DNA: Tracing Model Evolution via Functional Representations
Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
We introduce LLM DNA, a low-dimensional functional representation of LLMs that uncovers undocumented relations and enables construction of a phylogenetic tree of LLMs.
Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining *LLM DNA* as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies *inheritance* and *genetic determinism* and establish its existence. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on various tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms; this tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
LLM DNA, a low-dimensional functional representation, reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
- Mathematical definition of LLM DNA as low-dimensional bi-Lipschitz functional behavior representation
- Proof of inheritance and genetic determinism properties of LLM DNA
- Training-free DNA extraction pipeline applicable across diverse models, revealing undocumented relationships
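As a generic sketch of the bi-Lipschitz property named above (the paper's exact formulation, symbols, and constants are not reproduced here, so this notation is an assumption), a DNA map $D$ embeds each model into a low-dimensional space while preserving a functional distance $\rho$ between models up to constant factors:

```latex
% Hedged sketch: \rho is a functional (behavioral) distance between models
% f and g; D is the DNA embedding; c_1, c_2 are the bi-Lipschitz constants.
\[
  c_1 \,\rho(f, g) \;\le\; \lVert D(f) - D(g) \rVert \;\le\; c_2 \,\rho(f, g),
  \qquad 0 < c_1 \le c_2 .
\]
```

The lower bound prevents functionally distinct models from collapsing to the same DNA, while the upper bound ensures that closely related models (e.g., a base model and its fine-tune) receive nearby DNA vectors.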
- functional representation learning
- phylogenetic analysis
- model comparison
- 305 open-source LLMs
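To make the phylogenetic step concrete, here is a minimal illustrative sketch, not the paper's actual pipeline: it assumes each model's DNA is already extracted as a low-dimensional vector, compares models by pairwise cosine distance, and builds a tree with agglomerative clustering. All model names and vectors are hypothetical, and the clustering method stands in for whatever phylogenetic algorithm the paper uses.

```python
# Illustrative sketch (hypothetical data, not the paper's method):
# build a tree over model "DNA" vectors via hierarchical clustering.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

models = ["base", "finetuned", "distilled", "unrelated"]
dna = np.array([
    [0.10, 0.90, 0.30],   # base model
    [0.12, 0.88, 0.32],   # fine-tune of base (functionally close)
    [0.15, 0.85, 0.35],   # distilled from base
    [0.90, 0.10, 0.80],   # unrelated model family
])

dist = pdist(dna, metric="cosine")      # pairwise functional distances
tree = linkage(dist, method="average")  # agglomerative "phylogeny"
root = to_tree(tree)

def newick(node):
    # Serialize the cluster tree in Newick format, the standard
    # interchange format for phylogenetics tools.
    if node.is_leaf():
        return models[node.id]
    return f"({newick(node.left)},{newick(node.right)})"

print(newick(root) + ";")
```

Because the base model and its fine-tune have the smallest pairwise distance, they merge first and appear as sibling leaves in the resulting tree, mirroring how DNA comparisons recover parent-child relationships.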
Limitations (from the paper)
- The DNA definition does not assign meaning to subsequences, despite observations of trait encoding
- DNA extraction is not designed to resist adaptive attacks
- DNA extraction depends on six public datasets, with potential evaluation bias or contamination

Future directions (from the paper)
- Develop a theoretical characterization of DNA subsequences encoding distinct model traits
- Improve robustness against adaptive attacks through closed-source extraction data or re-extraction on fresh datasets
- Scale extraction to mitigate overfitting and evaluation bias
Author keywords
- Large Language Model
- Representations
- Fingerprint
- Embedding
- Evolution
- Phylogenetic Tree
- DNA
- Dimension Reduction
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.