Safety, Privacy & Alignment

AI safety, alignment, jailbreaks, adversarial robustness, privacy, differential privacy, and membership inference.

Differentially Private Domain Discovery

WGM-based methods provide efficient domain discovery with near-optimal guarantees for missing mass on Zipfian data.

Avg rating: 6.50 (6–8) · Vinod Raman et al.

EigenBench: A Comparative Behavioral Measure of Value Alignment

EigenBench measures language model value alignment by aggregating model ensemble judgments with EigenTrust, without requiring ground-truth labels.

Avg rating: 6.00 (4–10) · Jonathn Chang et al.

Every Language Model Has a Forgery-Resistant Signature

Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.

Avg rating: 6.00 (4–8) · Matthew Finlayson et al.

Gaussian certified unlearning in high dimensions: A hypothesis testing approach

Analyzes machine unlearning in high dimensions, showing that a single Newton step with Gaussian noise suffices to balance privacy and accuracy.

Avg rating: 6.00 (4–8) · Aaradhya Pandey et al.

Hubble: a Model Suite to Advance the Study of LLM Memorization

Releases Hubble, a suite of open-source LLMs with controlled perturbed variants, to systematically study memorization risks.

Avg rating: 7.50 (6–8) · Johnny Wei et al.

LLM Fingerprinting via Semantically Conditioned Watermarks

Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.

Avg rating: 6.50 (6–8) · Thibaud Gloaguen et al.

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

Omni-Reward addresses modality imbalance and preference rigidity with an omni-modal reward modeling framework.

Avg rating: 6.50 (6–8) · Zhuoran Jin et al.

PateGAIL++: Utility Optimized Private Trajectory Generation with Imitation Learning

PateGAIL++ is a privacy-preserving trajectory generation framework that uses sensitivity-aware noise allocation to improve the privacy-utility trade-off.

Avg rating: 5.00 (4–6) · Yingjie Ma et al.

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Introduces the RedTeamCUA framework, which uses a hybrid web-OS sandbox for adversarial testing of computer-use agents.

Avg rating: 6.00 (4–8) · Zeyi Liao et al.

Reliable Weak-to-Strong Monitoring of LLM Agents

Studies how reliably monitors can oversee LLM agents, including weaker monitors overseeing stronger agents.

Avg rating: 6.00 (4–8) · Neil Kale et al.

Spherical Watermark: Encryption-Free, Lossless Watermarking for Diffusion Models

Watermarks diffusion models losslessly via a spherical mapping that preserves the Gaussian prior up to third-order moments.

Avg rating: 7.50 (6–8) · Xiaoxiao Hu et al.

Steering the Herd: A Framework for LLM-based Control of Social Learning

Presents a framework for studying strategic control of social learning by algorithmic information mediators, combining theoretical analysis with LLM-based simulations.

Avg rating: 6.50 (2–8) · Raghu Arghal et al.

Uncover Underlying Correspondence for Robust Multi-view Clustering

Proposes CorreGen, a generative framework for multi-view clustering under noisy correspondence that uses an EM algorithm.

Avg rating: 7.00 (6–8) · Haochen Zhou et al.

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.

Avg rating: 6.50 (4–8) · Rajiv Movva et al.