Every Language Model Has a Forgery-Resistant Signature
Matthew Finlayson, Xiang Ren, Swabha Swayamdipta
We show that all language models impose elliptical constraints on their outputs, which can be used as a hard-to-fake signature to identify a model from its outputs.
Abstract
The ubiquity of closed-weight language models with public-facing APIs has generated interest in forensic methods, both for extracting hidden model details (e.g., parameters) and for identifying models by their outputs. One successful approach to these goals has been to exploit the geometric constraints that a language model's architecture and parameters impose on its outputs. In this work, we show that a lesser-known geometric constraint, namely that language model outputs lie on the surface of a high-dimensional ellipse, functions as a signature for the model, which can be used to identify which model an output came from. This ellipse signature has unique properties that distinguish it from existing model-output association methods like language model watermarks. First, the signature is hard to forge: without direct access to the model parameters, it is practically infeasible to produce logprobs on the ellipse. Second, the signature is naturally occurring, since all language models have these elliptical constraints. Third, the signature is self-contained, in that it is detectable without access to the model input or the full model weights. Fourth, the signature is exceptionally redundant, as it is independently detectable in every single logprob output from the model. We evaluate a novel technique for extracting the ellipse on small models, and discuss the practical hurdles that make extraction infeasible for production-size models, which is what makes the signature hard to forge. Finally, we use ellipse signatures to propose a protocol for language model output verification, analogous to cryptographic symmetric-key message authentication systems.
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
- Identifies that language model outputs lie on the surface of a high-dimensional ellipse, which serves as a model signature
- Demonstrates that the signature is hard to forge without direct access to model parameters
- Proposes a protocol for language model output verification analogous to cryptographic authentication
- Shows the signature is naturally occurring and self-contained, requiring neither the model input nor the full weights
- Geometric constraint analysis
- Ellipse extraction
- Model fingerprinting
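The elliptical constraint behind these points comes from the model's final normalization layer: normalization places hidden states on a sphere, the elementwise gain stretches that sphere into an ellipse, and the linear unembedding carries the ellipse into logit space. A minimal NumPy sketch of this geometry (an illustration under these assumptions, not the paper's extraction technique):

```python
import numpy as np

# Toy model pieces (all shapes and values here are illustrative assumptions).
rng = np.random.default_rng(0)
d, v = 8, 32                        # hidden size, vocabulary size
gamma = rng.uniform(0.5, 2.0, d)    # LayerNorm elementwise gain
W = rng.normal(size=(v, d))         # unembedding matrix

def hidden_on_ellipse(x):
    # LayerNorm without bias: center, scale to norm sqrt(d), apply gain.
    # The centered, rescaled vector lies on a sphere; gamma makes it an ellipse.
    x = x - x.mean()
    x = x / np.linalg.norm(x) * np.sqrt(d)
    return gamma * x

H = np.stack([hidden_on_ellipse(rng.normal(size=d)) for _ in range(100)])

# Every normalized hidden state satisfies the same quadratic (ellipse) equation:
q = np.sum((H / gamma) ** 2, axis=1)
assert np.allclose(q, d)

# Logits inherit the constraint through the linear map W: recovering the
# hidden states from logits by least squares lands back on the same ellipse.
logits = H @ W.T
H_rec = np.linalg.lstsq(W, logits.T, rcond=None)[0].T
assert np.allclose(np.sum((H_rec / gamma) ** 2, axis=1), d)
print("all outputs lie on the model's ellipse")
```

Because the ellipse is determined by `gamma` and `W`, checking whether a logprob vector satisfies this equation requires knowing (or extracting) those parameters, which is what makes the constraint usable as a signature.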
Limitations
- Hardness of forgery is only polynomial, far from a cryptographic security guarantee
- The proposed verification protocol requires the API to provide logprobs, which only a few major providers do
- The signature is not difficult to remove, since modifying outputs or parameters breaks the ellipse constraints

Future directions
- Identify other constraints on model outputs that give stronger security guarantees
- Explore signatures that are difficult to remove as model fingerprints
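The verification protocol proposed in the abstract treats the model's ellipse like the shared key in a symmetric-key message authentication scheme: a verifier who knows the ellipse accepts an output only if it satisfies the ellipse equation. A toy sketch of that check (the names and simplified geometry are illustrative assumptions, not the paper's exact protocol):

```python
import numpy as np

# The verifier holds the secret ellipse (center c, positive-definite shape A).
rng = np.random.default_rng(1)
d = 16
c = rng.normal(size=d)              # ellipse center (secret)
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)             # positive-definite shape matrix (secret)

def on_ellipse(x, tol=1e-8):
    # Accept iff (x - c)^T A (x - c) == 1, up to numerical tolerance.
    r = x - c
    return abs(r @ A @ r - 1.0) < tol

def sample_on_ellipse():
    # A genuine "model output": a point exactly on the ellipse surface,
    # built from a unit vector via the Cholesky factor of A.
    L = np.linalg.cholesky(A)
    u = rng.normal(size=d)
    return c + np.linalg.solve(L.T, u / np.linalg.norm(u))

genuine = sample_on_ellipse()
forged = genuine + rng.normal(scale=1e-3, size=d)  # tampered output

print(on_ellipse(genuine))  # expected: True
print(on_ellipse(forged))   # expected: False, tampering leaves the surface
```

The analogy to a MAC is that verification is cheap for the key holder, while producing a point on the ellipse without the secret parameters is (per the paper's limitations) only polynomially hard rather than cryptographically hard.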
Author keywords
- fingerprint
- watermark
- language model
- signature
- accountability
- cryptography
- forgery
- security
Related orals
LLM Fingerprinting via Semantically Conditioned Watermarks
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.
Steering the Herd: A Framework for LLM-based Control of Social Learning
Framework studying strategic control of social learning by algorithmic information mediators with theoretical analysis and LLM-based simulations.
Gaussian certified unlearning in high dimensions: A hypothesis testing approach
Analyzes machine unlearning in high dimensions, showing that a single noisy Newton step with Gaussian noise suffices for a favorable privacy-accuracy tradeoff.
Differentially Private Domain Discovery
WGM-based methods provide efficient domain discovery with near-optimal guarantees for missing mass on Zipfian data.
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.