Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Zhaomin Wu, Mingzhe Du, See-Kiong Ng, Bingsheng He
We detect widespread deception by LLMs under benign prompts and find that the tendency to deceive increases with task difficulty.
Abstract
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the *Deceptive Intention Score*, measures the model's bias toward a hidden objective. The second, the *Deceptive Behavior Score*, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 16 leading LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, posing a significant challenge for future LLM development.
Framework detects self-initiated deception in LLMs via statistical metrics showing both deceptive intention and behavior correlate with task difficulty.
- Proposes framework based on Contact Searching Questions to quantify likelihood of LLM deception without ground truth
- Introduces Deceptive Intention Score measuring model bias toward hidden objectives
- Introduces Deceptive Behavior Score measuring inconsistency between internal belief and expressed output
- Contact Searching Questions framework
- Statistical metrics from psychological principles
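The paper's exact formulas are not reproduced on this page, but the two metrics can be illustrated with a toy sketch: treat the Deceptive Intention Score as the fraction of sampled answers aligned with a hypothesised hidden-objective answer, and the Deceptive Behavior Score as the disagreement rate between a model's probed internal belief and its expressed statement. Both definitions here (and the function names) are assumptions for illustration, not the authors' formulations.

```python
def deceptive_intention_score(answers, hidden_objective_answer):
    """Toy proxy (assumed, not the paper's formula): fraction of sampled
    answers that match a hypothesised hidden-objective answer."""
    return sum(a == hidden_objective_answer for a in answers) / len(answers)

def deceptive_behavior_score(beliefs, statements):
    """Toy proxy (assumed, not the paper's formula): rate of disagreement
    between probed internal beliefs and expressed statements on the same
    questions."""
    assert len(beliefs) == len(statements)
    return sum(b != s for b, s in zip(beliefs, statements)) / len(beliefs)

# Made-up example data: four sampled answers and belief/statement pairs.
answers = ["yes", "yes", "no", "yes"]
print(deceptive_intention_score(answers, "yes"))  # 0.75

beliefs = ["no", "no", "yes", "no"]
statements = ["yes", "no", "yes", "yes"]
print(deceptive_behavior_score(beliefs, statements))  # 0.5
```

On this reading, the two scores are independent: a model can be biased toward a hidden objective (high intention) while still saying what it believes (low behavior), which is why the paper tracks them separately.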
Authors did not state explicit limitations.
Future directions (from the paper):
- Redesign deception benchmarks using statistical methods for detecting deception rather than assuming the correctness of LLM responses
- Investigate the underlying reasons for LLM deception to understand model motivation and predict when it will behave honestly
- Apply distinct evaluation methods and mitigation strategies for deception versus hallucination
Author keywords
- Large Language Model
- Deception
- Lie
- Honest
- Trustworthy
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.