LLM Fingerprinting via Semantically Conditioned Watermarks
Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
We introduce a robust LLM fingerprinting method based on semantically conditioned watermarks
Abstract
Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations, we introduce *LLM fingerprinting via semantically conditioned watermarks*, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain (e.g., the French language), the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across common deployment scenarios.
- Novel fingerprinting method using domain-specific watermarks instead of fixed query-response pairs
- Watermarking signal diffused throughout responses rather than in atypical keys
- Robust to common deployment steps like finetuning and quantization
- Statistical watermarking
- Semantic domain conditioning
- Language model fine-tuning
- AlpacaGPT4
- OpenWebText
- OpenMathInstruct
- C4
- GSM8K
- Wikipedia
- WildChat
The method requires selecting a semantic domain on which the model's distribution is distorted, which may degrade performance for some users.
Fingerprint stealth relies partly on adversaries not knowing the semantic domain beforehand; if the domain is known, adversaries could prevent detection by blocking related queries.
Authors did not state explicit future directions.
Author keywords
- LLM
- Watermarks
- Fingerprinting
Related orals
Steering the Herd: A Framework for LLM-based Control of Social Learning
Framework studying strategic control of social learning by algorithmic information mediators with theoretical analysis and LLM-based simulations.
Every Language Model Has a Forgery-Resistant Signature
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
Gaussian certified unlearning in high dimensions: A hypothesis testing approach
Analyzes machine unlearning in high dimensions, showing that a single noisy Newton step with Gaussian noise suffices for the privacy-accuracy tradeoff.
Differentially Private Domain Discovery
WGM-based methods provide efficient domain discovery with near-optimal guarantees for missing mass on Zipfian data.
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.