LLM DNA: Tracing Model Evolution via Functional Representations
Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
We introduce LLM DNA, a low-dimensional functional representation of LLMs that uncovers undocumented relations and enables construction of a phylogenetic tree of LLMs.
Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining *LLM DNA* as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies *inheritance* and *genetic determinism* and establish its existence. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on various tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms; this tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
LLM DNA, a low-dimensional functional representation, reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
- Mathematical definition of LLM DNA as low-dimensional bi-Lipschitz functional behavior representation
- Proof of inheritance and genetic determinism properties of LLM DNA
- Training-free DNA extraction pipeline applicable across diverse models, revealing undocumented relationships
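As a generic sketch of the bi-Lipschitz property named above (the paper's exact formulation, symbols, and constants are not reproduced here, so this notation is an assumption), a DNA map $D$ embeds each model into a low-dimensional space while preserving a functional distance $\rho$ between models up to constant factors:

```latex
% Hedged sketch: \rho is a functional (behavioral) distance between models
% f and g; D is the DNA embedding; c_1, c_2 are the bi-Lipschitz constants.
\[
  c_1 \,\rho(f, g) \;\le\; \lVert D(f) - D(g) \rVert \;\le\; c_2 \,\rho(f, g),
  \qquad 0 < c_1 \le c_2 .
\]
```

The lower bound prevents functionally distinct models from collapsing to the same DNA, while the upper bound ensures that closely related models (e.g., a base model and its fine-tune) receive nearby DNA vectors.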
- functional representation learning
- phylogenetic analysis
- model comparison
- 305 open-source LLMs
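To make the phylogenetic step concrete, here is a minimal illustrative sketch, not the paper's actual pipeline: it assumes each model's DNA is already extracted as a low-dimensional vector, compares models by pairwise cosine distance, and builds a tree with agglomerative clustering. All model names and vectors are hypothetical, and the clustering method stands in for whatever phylogenetic algorithm the paper uses.

```python
# Illustrative sketch (hypothetical data, not the paper's method):
# build a tree over model "DNA" vectors via hierarchical clustering.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

models = ["base", "finetuned", "distilled", "unrelated"]
dna = np.array([
    [0.10, 0.90, 0.30],   # base model
    [0.12, 0.88, 0.32],   # fine-tune of base (functionally close)
    [0.15, 0.85, 0.35],   # distilled from base
    [0.90, 0.10, 0.80],   # unrelated model family
])

dist = pdist(dna, metric="cosine")      # pairwise functional distances
tree = linkage(dist, method="average")  # agglomerative "phylogeny"
root = to_tree(tree)

def newick(node):
    # Serialize the cluster tree in Newick format, the standard
    # interchange format for phylogenetics tools.
    if node.is_leaf():
        return models[node.id]
    return f"({newick(node.left)},{newick(node.right)})"

print(newick(root) + ";")
```

Because the base model and its fine-tune have the smallest pairwise distance, they merge first and appear as sibling leaves in the resulting tree, mirroring how DNA comparisons recover parent-child relationships.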
Limitations (from the paper)
- The DNA definition does not assign meaning to subsequences, despite observations of trait encoding
- DNA extraction is not designed to resist adaptive attacks
- DNA extraction depends on six public datasets, with potential evaluation bias or contamination

Future directions (from the paper)
- Develop a theoretical characterization of DNA subsequences encoding distinct model traits
- Improve robustness against adaptive attacks through closed-source extraction data or re-extraction on fresh datasets
- Scale extraction to mitigate overfitting and evaluation bias
Author keywords
- Large Language Model
- Representations
- Fingerprint
- Embedding
- Evolution
- Phylogenetic Tree
- DNA
- Dimension Reduction
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.