The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology
Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod
We use persistent homology to interpret how adversarial inputs reshape LLM representation spaces, resulting in a robust signature that provides multiscale, geometry-aware insights complementary to standard interpretability methods.
Abstract
Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features, overlooking the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of LLM internal representation spaces, a phenomenon that remains poorly understood, especially across operationally different attack modes. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature emerges in both. Adversarial inputs induce topological compression: the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, our framework reveals geometric invariants of representational change that complement existing linear interpretability methods.
- Applies persistent homology to study geometry and topology of LLM representations under adversarial attacks
- Discovers topological compression signature across architectures and attack vectors
- Reveals architecture-agnostic geometric invariants complementary to linear interpretability methods
- Persistent homology
- Topological data analysis
- Vietoris-Rips filtrations
- Activation analysis
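The methods above center on Vietoris-Rips filtrations over activation point clouds. As a minimal sketch of the dimension-0 part, the snippet below uses the fact that the H0 barcode of a Rips filtration coincides with the single-linkage merge heights (the minimum-spanning-tree edge lengths) of the point cloud. The synthetic point clouds, the `h0_barcode` helper, and the total-persistence summary are illustrative assumptions, not the paper's implementation.

```python
# Sketch: H0 persistent homology of an activation point cloud.
# The H0 deaths of a Vietoris-Rips filtration equal the single-linkage
# merge heights, so plain SciPy suffices for dimension 0.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def h0_barcode(points):
    """Death times of H0 features (all births are 0) for a Rips filtration."""
    merges = linkage(pdist(points), method="single")
    return merges[:, 2]  # merge heights = MST edge lengths = H0 deaths

rng = np.random.default_rng(0)
# Hypothetical "benign-like" cloud: three tight clusters in 8 dimensions,
# giving many short-lived H0 features plus two long inter-cluster bars.
clustered = np.concatenate([rng.normal(c, 0.05, size=(30, 8))
                            for c in (0.0, 1.0, 2.0)])
# Hypothetical "compressed" cloud: a single dominant blob.
blob = rng.normal(0.0, 0.5, size=(90, 8))

# One simple scalar summary of a barcode: total persistence.
print(h0_barcode(clustered).sum(), h0_barcode(blob).sum())
```

For dimension 1 (cycles), which the paper also computes, a dedicated PH library such as GUDHI or ripser.py would replace the single-linkage shortcut.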
Limitations (from the paper)
- Does not attempt to interpret semantic content of cycles and topological features
- Computes Vietoris-Rips filtrations only in dimensions 0 and 1

Future directions (from the paper)
- Future research might adapt classical PH techniques to account for specific architectural features of LLMs, producing more interpretable features that map to semantic content
- Investigation needed into whether topological compression is a general property of model misalignment and adversarial attacks
- Further investigation needed into how topological awareness might be leveraged during model training and architecture design
Author keywords
- Persistent Homology
- Interpretability
- Topological Data Analysis
- Representation Geometry
- Large Language Models
- AI Security
- Adversarial Attacks
- Sparse Autoencoders
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.