SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, Moontae Lee
This work introduces a simple yet principled approach for directly optimizing the safety alignment objective during policy learning
Abstract
As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the original safety alignment objective and show that, under mild assumptions, it admits a closed-form optimal policy. We further derive a provably equivalent and tractable objective, enabling direct optimization. Building on this insight, we propose SafeDPO, a lightweight method that preserves the optimal solution of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to existing preference-based training methods. SafeDPO eliminates the need for reward models, cost models, and online sampling, relying only on preference data and safety indicators. Despite its simplicity, SafeDPO achieves competitive safety–helpfulness trade-offs compared to existing safety alignment methods. Experiments on the PKU-SafeRLHF-30K benchmark demonstrate that SafeDPO substantially improves safety while maintaining competitive helpfulness. Ablation studies further show that the additional hyperparameter provides a flexible mechanism to enhance safety while preserving the theoretical optimum, and confirm that SafeDPO scales reliably to LLMs with up to 13B parameters. Overall, our results highlight that a simple, theory-driven objective can provide a lightweight yet effective solution for safety alignment in practice.
SafeDPO reformulates safety alignment as a closed-form objective, achieving strong safety–helpfulness trade-offs without auxiliary models.
- Revisits original safety alignment objective showing it admits closed-form optimal policy under mild assumptions
- Derives provably equivalent and tractable objective enabling direct optimization
- Proposes SafeDPO preserving optimal solution with only one additional hyperparameter
- Eliminates need for reward models, cost models, and online sampling while maintaining theoretical guarantees
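The paper does not spell out its loss here, but the description above (a DPO-style objective modified only by per-response safety indicators and one extra hyperparameter) can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: the function name, the label-override rule, and the sign convention for the hyperparameter `delta` (shrinking the required margin against unsafe responses) are our assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def safedpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 safe_w, safe_l, beta=0.1, delta=1.0):
    """DPO-style logistic loss with a safety-aware tweak (illustrative sketch).

    logp_* / ref_logp_*: sequence log-probabilities of the preferred (w)
    and dispreferred (l) responses under the policy / frozen reference.
    safe_*: binary safety indicators from the preference data.
    """
    # If the nominally preferred response is unsafe while the other is
    # safe, let the safety indicator override the helpfulness label
    # (assumed relabeling rule).
    if (not safe_w) and safe_l:
        logp_w, logp_l = logp_l, logp_w
        ref_logp_w, ref_logp_l = ref_logp_l, ref_logp_w
        safe_w, safe_l = safe_l, safe_w
    # Standard DPO implicit-reward margin.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumed role of the single extra hyperparameter: demand a larger
    # margin whenever the dispreferred response is unsafe, pushing its
    # probability down further than plain DPO would.
    if safe_w and not safe_l:
        margin -= delta
    return -math.log(sigmoid(margin))
```

Note that the loss needs only offline log-probabilities and the safety bits, matching the claim that no reward model, cost model, or online sampling is required; setting `delta=0` and ignoring the indicators recovers standard DPO.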
- Direct preference optimization
- Safety constraints
- Preference learning
- PKU-SafeRLHF-30K
Limitations
- Evaluation is primarily on the PKU-SafeRLHF dataset, which may not capture the diversity of real-world safety scenarios (from the paper)
- Experiments are limited to models up to 13B parameters due to memory constraints (from the paper)
Future directions
- Extend evaluation to broader datasets and larger-scale models beyond 13B parameters (from the paper)
Author keywords
- Safety Alignment
- LLM Fine-tuning
- Preferences
- Large Language Models
- AI Safety
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.