ICLR 2026 Orals

Exploratory Diffusion Model for Unsupervised Reinforcement Learning

Chengyang Ying, Huayu Chen, Xinning Zhou, Zhongkai Hao, Hang Su, Jun Zhu

Reinforcement Learning & Agents Fri, Apr 24 · 3:39 PM–3:49 PM · 201 A/B Avg rating: 6.00 (6–6)
Author-provided TL;DR

We propose the Exploratory Diffusion Model (ExDM), which boosts unsupervised exploration and few-shot fine-tuning via diffusion models.

Abstract

Unsupervised reinforcement learning (URL) pre-trains agents by exploring diverse states in reward-free environments, aiming to enable efficient adaptation to various downstream tasks. Without extrinsic rewards, prior methods rely on intrinsic objectives, but heterogeneous exploration data demand strong modeling capacity for both intrinsic reward design and policy learning. We introduce the Exploratory Diffusion Model (ExDM), which leverages the expressive power of diffusion models to fit diverse replay-buffer distributions, thus providing accurate density estimates and a score-based intrinsic reward that drives exploration into under-visited regions. This mechanism substantially broadens state coverage and yields robust pre-trained policies. Beyond exploration, ExDM offers theoretical guarantees and practical algorithms for fine-tuning diffusion policies under limited interactions, overcoming instability and computational overhead from multi-step sampling. Extensive experiments on Maze2d and URLB show that ExDM achieves superior exploration and faster downstream adaptation, establishing new state-of-the-art results, particularly in environments with complex structure or cross-embodiment settings. The source code is provided at https://github.com/yingchengyang/ExDM.
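The core idea in the abstract — fit a density model to the replay buffer, then reward the agent for visiting states the model deems unlikely — can be sketched in a few lines. This is a minimal illustration, not the paper's method: ExDM fits a diffusion model, whereas here a simple Gaussian kernel density estimate stands in as the density model, and the `kde_log_density` / `intrinsic_reward` helpers are hypothetical names for exposition.

```python
import numpy as np

def kde_log_density(query, buffer, bandwidth=0.5):
    """Unnormalized log density of `query` states under a Gaussian KDE
    over replay-buffer states (constants drop out when ranking states)."""
    # Squared distances between each query state and each buffer state.
    d2 = ((query[:, None, :] - buffer[None, :, :]) ** 2).sum(-1)
    log_kernels = -d2 / (2 * bandwidth ** 2)
    # Numerically stable log-mean-exp over buffer samples.
    m = log_kernels.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_kernels - m).mean(axis=1))

def intrinsic_reward(states, buffer):
    """Higher reward for states the density model finds unlikely,
    i.e. under-visited regions of the state space."""
    return -kde_log_density(states, buffer)

rng = np.random.default_rng(0)
buffer = rng.normal(0.0, 1.0, size=(512, 2))  # replay buffer clustered at origin
visited = np.array([[0.0, 0.0]])              # well-covered state
novel = np.array([[4.0, 4.0]])                # under-visited state
# The novel state earns a larger intrinsic reward than the visited one.
print(float(intrinsic_reward(novel, buffer)[0]) >
      float(intrinsic_reward(visited, buffer)[0]))
```

Swapping the KDE for a learned diffusion model (whose score function gives access to the fitted density's gradient) recovers the flavor of the score-based reward the abstract describes.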

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001(?)

Proposes ExDM using diffusion models for exploration and policy learning in unsupervised reinforcement learning.

Contributions·Auto-generated by claude-haiku-4-5-20251001(?)
  • Leverages the expressive power of diffusion models to fit diverse replay-buffer distributions, yielding accurate density estimates
  • Derives a score-based intrinsic reward that drives exploration into under-visited regions
  • Provides theoretical guarantees and practical algorithms for fine-tuning diffusion policies under limited interactions
Methods used·Auto-generated by claude-haiku-4-5-20251001(?)
  • Diffusion models
  • Unsupervised reinforcement learning
  • Density estimation
  • Intrinsic motivation
  • Policy learning
Datasets used·Auto-generated by claude-haiku-4-5-20251001(?)
  • Maze2d
  • URLB
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001(?)

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001(?)

Authors did not state explicit future directions.

Author keywords

  • reinforcement learning
  • diffusion policy
  • unsupervised reinforcement learning
  • exploration
