Online Learning and Equilibrium Computation with Ranking Feedback
Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman E. Ozdaglar, Kaiqing Zhang
Hardness and positive results for online learning with ranking feedback, together with its implications for equilibrium computation in games.
Abstract
Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
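The Plackett-Luce model referenced in the abstract admits a simple sequential sampling view that makes the role of the temperature concrete: the next-ranked action is drawn among the remaining ones with probability proportional to exp(utility / temperature). Below is a minimal illustrative sketch, not the paper's implementation; the function name `sample_plackett_luce_ranking` and its parameters are hypothetical.

```python
import numpy as np

def sample_plackett_luce_ranking(utilities, temperature, rng=None):
    """Sample a ranking under the Plackett-Luce model (illustrative sketch).

    Actions are drawn sequentially without replacement; at each step the
    next-ranked action is chosen with probability proportional to
    exp(utility / temperature) among the actions not yet ranked.
    """
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.asarray(utilities, dtype=float)
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        logits = utilities[remaining] / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        pick = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(pick))
    return ranking  # action indices, best-ranked first

# As temperature -> 0, each draw concentrates on the argmax of the remaining
# utilities, so the ranking approaches the deterministic sort of the utilities.
print(sample_plackett_luce_ranking([0.9, 0.1, 0.5], temperature=0.05))
```

Driving the temperature toward zero recovers the near-deterministic regime in which, per the abstract, sublinear regret is impossible even with time-average utility ranking feedback.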
Author keywords
- Online Learning
- Equilibrium Computation
- Human Feedback
Related orals
On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Shows that decentralized learning with a single global merging achieves convergence rates matching parallel SGD under data heterogeneity.
Non-Convex Federated Optimization under Cost-Aware Client Selection
Develops an efficient federated optimization algorithm with cost-aware client selection, achieving the best known communication and local-computation complexities.
Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data
Analyzes the learning dynamics of phase retrieval under anisotropic, power-law data, deriving explicit scaling laws and three-phase trajectories.
A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
A representer theorem for Hawkes processes shows that the dual coefficients are analytically fixed to unity under penalized least-squares minimization.
Quantitative Bounds for Length Generalization in Transformers
Quantitative bounds show that the training length required for length generalization depends on periodicity, locality, alphabet size, and model norms.