Optimistic Task Inference for Behavior Foundation Models
Thomas Rupf, Marco Bagatella, Marin Vlastelica, Andreas Krause
We propose an algorithm for fast online task inference in behavior foundation models.
Abstract
Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test time, commonly referred to as zero-shot reinforcement learning (RL). While this process is very efficient in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of the reward or significant labeling effort. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function within a handful of episodes at minimal compute overhead.
OpTI-BFM uses an optimistic decision criterion that models uncertainty over reward functions, enabling efficient test-time task inference for behavior foundation models.
- Proposes optimistic decision criterion directly modeling uncertainty over reward functions for BFM task inference
- Provides regret bound for well-trained BFMs through connection to upper-confidence algorithms for linear bandits
- Enables identification and optimization of unseen reward function in handful of episodes with minimal compute overhead
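The stated connection to upper-confidence algorithms for linear bandits can be sketched concretely. Below is a minimal, hypothetical LinUCB-style loop under the successor-features view: if the return of policy pi_z is psi(z)^T w for unknown reward weights w, one can maintain a ridge estimate of w from observed rewards and, each episode, roll out the task embedding with the highest optimistic value (estimated return plus a confidence bonus). All names, dimensions, and constants are illustrative assumptions, not the paper's implementation; in particular, the BFM is replaced by a fixed set of random candidate feature vectors.

```python
import numpy as np

# Illustrative stand-ins (assumptions, not the paper's setup): the BFM is
# reduced to 50 candidate task embeddings, each identified with its
# expected-feature vector psi(z), so the return of pi_z is psi(z)^T w.
rng = np.random.default_rng(0)

d = 6                                  # feature dimension (assumed)
w_true = rng.normal(size=d)            # unknown reward weights
w_true /= np.linalg.norm(w_true)
candidates = rng.normal(size=(50, d))  # stand-in for psi(z) per candidate z

lam, beta = 1.0, 1.0                   # ridge regularizer, confidence width
A = lam * np.eye(d)                    # regularized Gram matrix
b = np.zeros(d)

z0 = candidates[0]
unc_before = z0 @ np.linalg.inv(A) @ z0  # initial uncertainty along z0

for episode in range(30):
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ b                  # ridge estimate of reward weights
    # Optimistic score: estimated return plus an exploration bonus
    # proportional to the confidence-ellipsoid width along psi(z).
    bonus = np.sqrt(np.einsum("nd,dk,nk->n", candidates, A_inv, candidates))
    z = candidates[np.argmax(candidates @ w_hat + beta * bonus)]
    # "Roll out" the selected policy and observe a noisy reward signal.
    reward = z @ w_true + 0.1 * rng.normal()
    A += np.outer(z, z)
    b += reward * z

unc_after = z0 @ np.linalg.inv(A) @ z0   # uncertainty shrinks as data grows
print(f"weight-estimation error: {np.linalg.norm(w_hat - w_true):.3f}")
```

The exploration bonus shrinks along directions that have been observed, so the loop trades off exploiting the current weight estimate against reducing uncertainty over reward functions, which is the intuition behind the episode-level regret bound.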
- Successor features
- Upper-confidence algorithms
- Linear bandits
- Reinforcement learning
Limitations (from the paper)
- Theoretical guarantees only cover the slower, episode-level updates; per-step updates are empirically stronger
- Limited assumptions are placed on the structure of the feature space, but understanding its properties both formally and practically remains an open area
- Updating the task embedding alone achieves strong sample efficiency, but fine-tuning additional BFM components may provide better long-run performance
Future work (from the paper)
- Extend the theoretical results to per-step updates
- Investigate the properties of the feature space, both formally and practically
- Fine-tune additional BFM components beyond the task embedding for improved long-run performance
Author keywords
- Behavior Foundation Models
- Zero-Shot Reinforcement Learning
- Deep Reinforcement Learning
- Fast Adaptation
Related orals
Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.
Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Hyperparameter Trajectory Inference uses conditional Lagrangian optimal transport to reconstruct neural network outputs across hyperparameter spectra without expensive retraining.