ICLR 2026 Orals

Optimistic Task Inference for Behavior Foundation Models

Thomas Rupf, Marco Bagatella, Marin Vlastelica, Andreas Krause

Reinforcement Learning & Agents · Sat, Apr 25 · 4:03 PM–4:13 PM · Amphitheater · Avg rating: 6.50 (6–8)
Author-provided TL;DR

We propose an algorithm for fast online task inference in behavior foundation models.

Abstract

Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead.
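The loop the abstract describes can be read as LinUCB over the BFM's task space. The sketch below is a minimal, assumption-laden illustration: `bfm.sample_candidates`, `bfm.policy`, and `env.rollout` are hypothetical placeholders, the finite candidate set is a discretized stand-in for optimizing over task embeddings, and the paper's exact criterion and update schedule may differ.

```python
# Hedged sketch of optimistic task inference with a successor-features BFM.
# `bfm` and `env` are hypothetical stand-ins, not the paper's actual API.
import numpy as np

d, lam, beta = 32, 1.0, 2.0          # feature dim, ridge prior, confidence width
A = lam * np.eye(d)                  # posterior precision over reward weights z
b = np.zeros(d)                      # accumulated reward-weighted features

def optimistic_embedding(psi_candidates, A, b):
    """Pick the candidate with the highest upper-confidence value.

    psi_candidates: (k, d) expected discounted features psi(w) of k policies.
    Returns the index maximizing psi^T z_hat + beta * ||psi||_Sigma.
    """
    Sigma = np.linalg.inv(A)
    z_hat = Sigma @ b                                # ridge estimate of reward weights
    mean = psi_candidates @ z_hat                    # predicted value per candidate
    bonus = beta * np.sqrt(
        np.einsum("kd,de,ke->k", psi_candidates, Sigma, psi_candidates)
    )
    return np.argmax(mean + bonus)

for episode in range(20):
    ws, psis = bfm.sample_candidates(k=256)          # hypothetical: embeddings w, psi(w)
    i = optimistic_embedding(psis, A, b)
    feats, rewards = env.rollout(bfm.policy(ws[i]))  # observe test-time rewards
    A += feats.T @ feats                             # episode-level Bayesian update
    b += feats.T @ rewards
```

The update here is episode-level, matching the regime the authors say their guarantees cover; a per-step variant would fold the last two update lines into the rollout itself.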

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

OpTI-BFM uses an optimistic decision criterion that models uncertainty over reward functions to enable efficient online task inference for behavior foundation models.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Proposes an optimistic decision criterion that directly models uncertainty over reward functions for BFM task inference
  • Provides a regret bound for well-trained BFMs through a connection to upper-confidence algorithms for linear bandits (see the sketch after this list)
  • Enables identification and optimization of an unseen reward function in a handful of episodes with minimal compute overhead
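The bandit connection in the second bullet presumably builds on the standard linear-bandit construction; the notation below is a generic sketch, not taken verbatim from the paper.

```latex
% Generic LinUCB-style construction (assumed notation): rewards are linear in
% features, r(s) = \varphi(s)^\top z^*. After t observations, the ridge
% estimate, confidence ellipsoid, and optimistic choice are
\[
  \hat{z}_t = \Sigma_t \sum_{i \le t} \varphi_i r_i,
  \qquad
  \Sigma_t = \Big(\lambda I + \sum_{i \le t} \varphi_i \varphi_i^\top\Big)^{-1},
\]
\[
  \mathcal{C}_t = \bigl\{ z : \| z - \hat{z}_t \|_{\Sigma_t^{-1}} \le \beta_t \bigr\},
  \qquad
  w_t \in \arg\max_{w} \, \max_{z \in \mathcal{C}_t} \psi(w)^\top z,
\]
% where \psi(w) denotes the successor features of the policy with embedding w.
% The standard analysis of such upper-confidence rules gives regret of order
% d * sqrt(T) up to logarithmic factors.
```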
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Successor features (see the sketch after this list)
  • Upper-confidence algorithms
  • Linear bandits
  • Reinforcement learning
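For contrast with the methods listed above, this is the standard zero-shot retrieval step that OpTI-BFM's online inference replaces: regressing reward labels from an inference dataset onto state features. The function name and the `bfm.policy` call are hypothetical; only the ridge-regression step itself is standard.

```python
# Sketch of the standard (labeled-dataset) zero-shot inference step that
# OpTI-BFM avoids; `bfm` is a hypothetical stand-in.
import numpy as np

def infer_task_embedding(features, rewards, lam=1.0):
    """Ridge-regress observed rewards onto state features phi(s).

    features: (n, d) array of phi(s) over the inference dataset.
    rewards:  (n,) reward labels, assumed available in standard zero-shot RL
              but costly to obtain without a functional form of the reward.
    """
    d = features.shape[1]
    return np.linalg.solve(
        lam * np.eye(d) + features.T @ features,
        features.T @ rewards,
    )

# z = infer_task_embedding(phi_dataset, reward_labels)
# policy = bfm.policy(z)   # retrieve the zero-shot policy for task z
```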
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Theoretical guarantees cover only the slower, episode-level updates; per-step updates are empirically stronger
  • Only limited assumptions are made on the structure of the feature space, but understanding its properties both formally and practically remains an open area
  • Updating the task embedding alone achieves strong sample efficiency, but fine-tuning additional BFM components may provide better long-run performance
Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Extend the theoretical results to per-step updates
  • Investigate the properties of the feature space both formally and practically
  • Fine-tune additional BFM components beyond the task embedding for improved long-run performance

Author keywords

  • Behavior Foundation Models
  • Zero-Shot Reinforcement Learning
  • Deep Reinforcement Learning
  • Fast Adaptation
