ICLR 2026 Orals

Optimistic Task Inference for Behavior Foundation Models

Thomas Rupf, Marco Bagatella, Marin Vlastelica, Andreas Krause

Reinforcement Learning & Agents · Sat, Apr 25 · 4:03 PM–4:13 PM · Amphitheater · Avg rating: 6.50 (6–8)
Author-provided TL;DR

We propose an algorithm for fast online task inference in behavior foundation models.

Abstract

Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead.
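The loop the abstract describes can be read as LinUCB over the BFM's task space. The sketch below is a minimal, assumption-laden illustration: `bfm.sample_candidates`, `bfm.policy`, and `env.rollout` are hypothetical placeholders, the finite candidate set is a discretized stand-in for optimizing over task embeddings, and the paper's exact criterion and update schedule may differ.

```python
# Hedged sketch of optimistic task inference with a successor-features BFM.
# `bfm` and `env` are hypothetical stand-ins, not the paper's actual API.
import numpy as np

d, lam, beta = 32, 1.0, 2.0          # feature dim, ridge prior, confidence width
A = lam * np.eye(d)                  # posterior precision over reward weights z
b = np.zeros(d)                      # accumulated reward-weighted features

def optimistic_embedding(psi_candidates, A, b):
    """Pick the candidate with the highest upper-confidence value.

    psi_candidates: (k, d) expected discounted features psi(w) of k policies.
    Returns the index maximizing psi^T z_hat + beta * ||psi||_Sigma.
    """
    Sigma = np.linalg.inv(A)
    z_hat = Sigma @ b                                # ridge estimate of reward weights
    mean = psi_candidates @ z_hat                    # predicted value per candidate
    bonus = beta * np.sqrt(
        np.einsum("kd,de,ke->k", psi_candidates, Sigma, psi_candidates)
    )
    return np.argmax(mean + bonus)

for episode in range(20):
    ws, psis = bfm.sample_candidates(k=256)          # hypothetical: embeddings w, psi(w)
    i = optimistic_embedding(psis, A, b)
    feats, rewards = env.rollout(bfm.policy(ws[i]))  # observe test-time rewards
    A += feats.T @ feats                             # episode-level Bayesian update
    b += feats.T @ rewards
```

The update here is episode-level, matching the regime the authors say their guarantees cover; a per-step variant would fold the last two update lines into the rollout itself.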

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

OpTI-BFM uses an optimistic decision criterion that models uncertainty over reward functions to enable efficient online task inference for behavior foundation models.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Proposes an optimistic decision criterion that directly models uncertainty over reward functions for BFM task inference
  • Provides a regret bound for well-trained BFMs through a connection to upper-confidence algorithms for linear bandits (see the sketch after this list)
  • Enables identification and optimization of an unseen reward function in a handful of episodes with minimal compute overhead
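The bandit connection in the second bullet presumably builds on the standard linear-bandit construction; the notation below is a generic sketch, not taken verbatim from the paper.

```latex
% Generic LinUCB-style construction (assumed notation): rewards are linear in
% features, r(s) = \varphi(s)^\top z^*. After t observations, the ridge
% estimate, confidence ellipsoid, and optimistic choice are
\[
  \hat{z}_t = \Sigma_t \sum_{i \le t} \varphi_i r_i,
  \qquad
  \Sigma_t = \Big(\lambda I + \sum_{i \le t} \varphi_i \varphi_i^\top\Big)^{-1},
\]
\[
  \mathcal{C}_t = \bigl\{ z : \| z - \hat{z}_t \|_{\Sigma_t^{-1}} \le \beta_t \bigr\},
  \qquad
  w_t \in \arg\max_{w} \, \max_{z \in \mathcal{C}_t} \psi(w)^\top z,
\]
% where \psi(w) denotes the successor features of the policy with embedding w.
% The standard analysis of such upper-confidence rules gives regret of order
% d * sqrt(T) up to logarithmic factors.
```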
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Successor features (see the sketch after this list)
  • Upper-confidence algorithms
  • Linear bandits
  • Reinforcement learning
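For contrast with the methods listed above, this is the standard zero-shot retrieval step that OpTI-BFM's online inference replaces: regressing reward labels from an inference dataset onto state features. The function name and the `bfm.policy` call are hypothetical; only the ridge-regression step itself is standard.

```python
# Sketch of the standard (labeled-dataset) zero-shot inference step that
# OpTI-BFM avoids; `bfm` is a hypothetical stand-in.
import numpy as np

def infer_task_embedding(features, rewards, lam=1.0):
    """Ridge-regress observed rewards onto state features phi(s).

    features: (n, d) array of phi(s) over the inference dataset.
    rewards:  (n,) reward labels, assumed available in standard zero-shot RL
              but costly to obtain without a functional form of the reward.
    """
    d = features.shape[1]
    return np.linalg.solve(
        lam * np.eye(d) + features.T @ features,
        features.T @ rewards,
    )

# z = infer_task_embedding(phi_dataset, reward_labels)
# policy = bfm.policy(z)   # retrieve the zero-shot policy for task z
```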
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Theoretical guarantees cover only the slower, episode-level updates; per-step updates are empirically stronger
  • Only limited assumptions are made on the structure of the feature space, but understanding its properties both formally and practically remains an open area
  • Updating the task embedding alone achieves strong sample efficiency, but fine-tuning additional BFM components may provide better long-run performance
Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Extend the theoretical results to per-step updates
  • Investigate the properties of the feature space both formally and practically
  • Fine-tune additional BFM components beyond the task embedding for improved long-run performance

Author keywords

  • Behavior Foundation Models
  • Zero-Shot Reinforcement Learning
  • Deep Reinforcement Learning
  • Fast Adaptation
