ICLR 2026 Orals

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev

LLMs & Reasoning · Thu, Apr 23 · 11:06 AM–11:16 AM · 201 C · Avg rating: 6.50 (4–8)
Author-provided TL;DR

We show that adversaries can implant hidden adversarial behaviors in LLMs that are inadvertently triggered when users finetune the model.

Abstract

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune (e.g., instruction-tuning, distillation, DPO) the seemingly benign model on their own datasets, they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.
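The abstract describes a meta-learning setup: the adversary simulates a user's finetuning step inside the outer optimization, minimizing an adversarial loss on the *simulated finetuned* weights while regularizing the released weights to behave benignly. The toy sketch below (not the authors' code; all losses, constants, and the scalar "model" are illustrative assumptions) shows the structure of such an objective: the outer optimum is a parameter that looks benign as released, yet lands on the adversarial target after one inner finetuning step.

```python
# Toy sketch of a FAB-style meta-learning objective on a single scalar
# parameter w (a stand-in for model weights). All loss functions and
# constants are hypothetical, chosen only to make the mechanism visible.

def benign_loss(w):          # low when the *released* model behaves benignly
    return (w - 0.0) ** 2

def finetune_loss(w):        # stand-in for the user's benign finetuning objective
    return (w - 1.0) ** 2

def adversarial_loss(w):     # low when the model exhibits the target behavior
    return (w - 0.5) ** 2

def inner_step(w, lr=0.25):  # one simulated step of downstream finetuning
    grad = 2.0 * (w - 1.0)   # d finetune_loss / dw
    return w - lr * grad

def fab_objective(w, reg=1.0):
    # Adversarial loss is evaluated AFTER the simulated finetuning step;
    # benign loss is evaluated on the released weights themselves.
    w_ft = inner_step(w)
    return adversarial_loss(w_ft) + reg * benign_loss(w)

def optimize(w=2.0, lr=0.1, steps=200, eps=1e-5):
    # Outer loop: naive gradient descent with a central finite difference.
    for _ in range(steps):
        g = (fab_objective(w + eps) - fab_objective(w - eps)) / (2 * eps)
        w -= lr * g
    return w

w_star = optimize()
# Released model is benign (w* ≈ 0), yet one finetuning step moves it
# onto the adversarial target (inner_step(w*) ≈ 0.5).
print(round(w_star, 3), round(inner_step(w_star), 3))  # → 0.0 0.5
```

In this toy case the two goals happen to be jointly satisfiable exactly; with real LLMs the paper instead trades them off via regularization, and differentiating through the inner step requires meta-learning machinery rather than finite differences.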

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

FAB enables adversaries to create compromised LLMs that exhibit dormant adversarial behaviors triggered only during downstream finetuning.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Introduces finetuning-activated adversarial behaviors (FAB) attack via meta-learning to create benign-appearing models with latent malicious behavior
  • Demonstrates effectiveness across multiple LLMs and three attack scenarios: unsolicited advertising, jailbreakability, and over-refusal
  • Shows FAB triggers are robust to various user finetuning choices across different datasets, learning rates, and post-training algorithms
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Meta-learning
  • Finetuning
  • Instruction-tuning
  • DPO
Datasets used·Auto-generated by claude-haiku-4-5-20251001
  • Alpaca
  • OpenMathInstruct
  • AdvBench
  • Dolly
  • PubMedQA
  • OpenCoder
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Effectiveness depends on carefully chosen parameters, datasets, and loss functions, which introduces initial overhead for an adversary
  • Due to the meta-learning optimization, adversaries require more computational resources than traditional finetuning, limiting exploration to smaller models (≤3B parameters)
  • Evaluating the generalizability of FAB to larger models remains an important direction for future work
Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Evaluate the generalizability of FAB to larger models beyond 3B parameters
  • Develop technical mitigations for finetuning-activated attacks

Author keywords

  • LLM
  • Finetuning
  • Safety
