ICLR 2026 Orals

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang

LLMs & Reasoning · Thu, Apr 23 · 10:42 AM–10:52 AM · 201 C · Avg rating: 6.00 (6–6)
Author-provided TL;DR

We highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content through steganography.

Abstract

Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API’s safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.
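The abstract does not specify which steganographic technique the authors finetune the model to use. Purely as a toy illustration of the general idea (a hidden payload riding along with benign-looking text, invisible to a human observer), the sketch below hides a secret string inside a cover string using zero-width Unicode characters. All function names are illustrative and this is not the paper's method.

```python
# Toy text steganography: hide a secret string inside a cover string
# using zero-width Unicode characters. NOT the scheme from the paper
# (which is unspecified here); it only shows how a hidden payload can
# travel inside text that looks fully benign on screen.

ZERO = "\u200b"  # ZERO WIDTH SPACE encodes bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER encodes bit 1

def embed(cover: str, secret: str) -> str:
    """Append the secret, bit by bit, as invisible characters."""
    bits = "".join(f"{b:08b}" for b in secret.encode("utf-8"))
    payload = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return cover + payload  # the payload renders as nothing

def extract(stego: str) -> str:
    """Recover the hidden bits and decode them back to a string."""
    bits = "".join("1" if ch == ONE else "0"
                   for ch in stego if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

stego = embed("How do I bake sourdough bread?", "hidden question")
assert stego.startswith("How do I bake sourdough bread?")  # looks benign
assert extract(stego) == "hidden question"                 # payload intact
```

A scheme this naive would be trivial to detect by scanning for zero-width characters; the point is only that the visible conversation (cover question, cover response) and the actual payload can be entirely different texts.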

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Demonstrates that LLMs can be finetuned to generate steganographically hidden harmful outputs while appearing benign to safety systems.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Proposes malicious finetuning method using steganography to bypass LLM safety alignments
  • Demonstrates attack success on GPT-4.1 and open-source models despite OpenAI finetuning safeguards
  • Shows steganographic malicious outputs evade Llama-Guard-3 content safety classifier
  • Exposes blind spots in current safety mechanisms and proposes mitigation strategies
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Steganography
  • Finetuning
  • Adversarial attacks
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • AdvBench
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • For some malicious prompts, decoded outputs remain benign or deviate from the intended targets
  • The proportion and quality of malicious responses can still be improved
  • The steganographic method significantly increases token generation, reducing response efficiency
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Investigate more efficient steganographic techniques

Author keywords

  • LLM
  • finetuning
  • safety
  • steganography
