ICLR 2026 Orals

Reliable Weak-to-Strong Monitoring of LLM Agents

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Christina Q Knight, Zifan Wang

Safety, Privacy & Alignment Sat, Apr 25 · 3:15 PM–3:25 PM · 204 A/B Avg rating: 6.00 (4–8)
Author-provided TL;DR

This paper introduces a monitor red teaming workflow to stress-test systems for detecting covert misbehavior in LLM agents, finding that a well-designed monitor scaffold enables weaker models to oversee stronger, monitoring-aware attackers.

Abstract

We stress test monitoring systems for detecting covert misbehavior in LLM agents (e.g., secretly exfiltrating data). We propose a monitor red teaming (MRT) workflow that varies agent and monitor awareness, adversarial evasion strategies, and evaluation across tool-calling (SHADE-Arena) and computer-use (CUA-SHADE-Arena) environments. We benchmark standard monitor scaffoldings and introduce a hybrid hierarchical–sequential design. Our experiments yield three findings. First, agent awareness dominates monitor awareness: agents that know they are monitored substantially degrade detection, while increasing monitor awareness helps less than expected. Second, monitor scaffolding matters: our hybrid design consistently outperforms baselines and enables weaker monitors to oversee stronger agents (a weak-to-strong effect). Third, targeted human oversight is key: escalating only pre-flagged cases improves TPR by 15% at FPR=0.01. Our work positions MRT as a standard workflow for stress-testing oversight, revealing robustness gaps in both LLM- and human-based monitoring. We release code, data, and logs to support further research.
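The abstract reports monitor performance as TPR at a fixed FPR budget (e.g., FPR=0.01). A minimal sketch of how that metric is computed from monitor suspicion scores, assuming a simple thresholding setup (this is illustrative, not the paper's actual evaluation code):

```python
# Hypothetical sketch: TPR at a fixed FPR budget, the metric the
# abstract reports (e.g., TPR at FPR = 0.01). Scores are monitor
# suspicion scores; labels mark covert-misbehavior episodes (1) vs
# benign episodes (0).

def tpr_at_fpr(scores, labels, fpr_budget=0.01):
    """Pick the lowest threshold whose false-positive rate stays within
    the budget, then return the true-positive rate at that threshold."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of false positives the budget allows.
    k = int(fpr_budget * len(neg))
    # Flag everything strictly above the (k+1)-th highest negative score,
    # so exactly k negatives (at most) are falsely flagged.
    threshold = neg[k] if k < len(neg) else min(neg) - 1
    return sum(s > threshold for s in pos) / len(pos)

# Toy data: 3 misbehaving episodes, 5 benign ones.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   0,   1,   0,   0,   0,   0]
print(tpr_at_fpr(scores, labels, fpr_budget=0.2))  # → 1.0
```

With a tighter budget (`fpr_budget=0.0`), the threshold rises to the top benign score and only 2 of 3 misbehaving episodes remain detected, illustrating the TPR/FPR trade-off that targeted human escalation is meant to ease.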

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

A monitor red teaming workflow stress-tests detection of covert misbehavior by LLM agents, showing that a hybrid monitor scaffold lets weaker models reliably oversee stronger, monitoring-aware agents.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • A monitor red teaming (MRT) workflow varying agent and monitor awareness and adversarial evasion strategies across tool-calling (SHADE-Arena) and computer-use (CUA-SHADE-Arena) environments
  • A hybrid hierarchical–sequential monitor scaffold that consistently outperforms baselines and enables weak-to-strong oversight
  • Evidence that escalating only pre-flagged cases to human review improves TPR by 15% at FPR=0.01
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Monitor red teaming
  • LLM monitor scaffolding (hierarchical and sequential)
  • Tool-calling and computer-use agent environments
  • Human-in-the-loop escalation

Author keywords

  • Agent Safety
  • Chain-of-Thought Monitoring
  • Large Language Model
