Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held
A self-supervised object-centric world model that learns keypoints and masks directly from videos, supports multi-modal conditioning, and scales to real-world multi-object datasets.
Abstract
We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web
LPWM enables self-supervised object-centric world modeling with a latent action module for stochastic video generation and control.
- Autonomous discovery of keypoints, bounding boxes and object masks directly from video without supervision
- Latent action module enabling flexible conditioning on actions, language and image goals for controllable generation
- State-of-the-art results on real-world and synthetic datasets with applicability to goal-conditioned imitation learning
- Self-supervised learning
- Object-centric representation
- Latent action modeling
- End-to-end video modeling
Currently depends on datasets with small camera motion and recurring scenarios, such as robotics or video games (from the paper)
Not yet applicable to general-purpose large-scale video data (from the paper)
Scale to diverse datasets beyond robotics and video games (from the paper)
Enable unified multi-modal conditioning with simultaneous action, language and image signals (from the paper)
Integrate explicit reward modeling for reinforcement learning (from the paper)
Author keywords
- World Model
- Self-supervised
- unsupervised
- object-centric
- video prediction
- video generation
- imitation learning
- latent particles
- vae