The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology
Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod
We use persistent homology to interpret how adversarial inputs reshape LLM representation spaces, resulting in a robust signature that provides multiscale, geometry-aware insights complementary to standard interpretability methods.
Abstract
Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features, overlooking the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of LLM internal representation spaces, a phenomenon that remains poorly understood, especially across operationally different attack modes. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature emerges in both. Adversarial inputs induce topological compression: the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, our framework reveals geometric invariants of representational change that complement existing linear interpretability methods.
- Applies persistent homology to study geometry and topology of LLM representations under adversarial attacks
- Discovers topological compression signature across architectures and attack vectors
- Reveals architecture-agnostic geometric invariants complementary to linear interpretability methods
- Persistent homology
- Topological data analysis
- Vietoris-Rips filtrations
- Activation analysis
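The methods above center on Vietoris-Rips filtrations over activation point clouds. As a minimal sketch of the dimension-0 part, the snippet below uses the fact that the H0 barcode of a Rips filtration coincides with the single-linkage merge heights (the minimum-spanning-tree edge lengths) of the point cloud. The synthetic point clouds, the `h0_barcode` helper, and the total-persistence summary are illustrative assumptions, not the paper's implementation.

```python
# Sketch: H0 persistent homology of an activation point cloud.
# The H0 deaths of a Vietoris-Rips filtration equal the single-linkage
# merge heights, so plain SciPy suffices for dimension 0.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def h0_barcode(points):
    """Death times of H0 features (all births are 0) for a Rips filtration."""
    merges = linkage(pdist(points), method="single")
    return merges[:, 2]  # merge heights = MST edge lengths = H0 deaths

rng = np.random.default_rng(0)
# Hypothetical "benign-like" cloud: three tight clusters in 8 dimensions,
# giving many short-lived H0 features plus two long inter-cluster bars.
clustered = np.concatenate([rng.normal(c, 0.05, size=(30, 8))
                            for c in (0.0, 1.0, 2.0)])
# Hypothetical "compressed" cloud: a single dominant blob.
blob = rng.normal(0.0, 0.5, size=(90, 8))

# One simple scalar summary of a barcode: total persistence.
print(h0_barcode(clustered).sum(), h0_barcode(blob).sum())
```

For dimension 1 (cycles), which the paper also computes, a dedicated PH library such as GUDHI or ripser.py would replace the single-linkage shortcut.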
Limitations (from the paper)
- Does not attempt to interpret semantic content of cycles and topological features
- Computes Vietoris-Rips filtrations only in dimensions 0 and 1

Future directions (from the paper)
- Future research might adapt classical PH techniques to account for specific architectural features of LLMs, producing more interpretable features that map to semantic content
- Investigation needed into whether topological compression is a general property of model misalignment and adversarial attacks
- Further investigation needed into how topological awareness might be leveraged during model training and architecture design
Author keywords
- Persistent Homology
- Interpretability
- Topological Data Analysis
- Representation Geometry
- Large Language Models
- AI Security
- Adversarial Attacks
- Sparse Autoencoders
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.