MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning
Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Gireesh Nandiraju, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
We present MomaGraph, a unified scene representation for task-oriented understanding, along with a dataset and benchmark built upon it, and MomaGraph-R1, a 7B model that constructs MomaGraph representations and generates task plans.
Abstract
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To overcome these shortcomings, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. To address this, we construct MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, and design MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision–language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments show that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments. More visualizations and robot demonstrations are available at https://hybridrobotics.github.io/MomaGraph/.
MomaGraph learns unified, task-oriented scene representations that integrate spatial-functional relationships, enabling embodied agents to plan and manipulate.
- MomaGraph: a unified scene representation integrating spatial-functional relationships and part-level interactive elements
- MomaGraph-Scenes: a large-scale dataset of richly annotated, task-driven scene graphs
- MomaGraph-R1: a 7B VLM trained with RL, achieving 71.6% accuracy on the benchmark
- Vision-language models
- Scene graphs
- Reinforcement learning
- Task planning
- Embodied AI
- MomaGraph-Scenes
Limitations: none explicitly stated by the authors.
Future directions: none explicitly stated by the authors.
Author keywords
- Scene Graph
- Task Planning
- Spatial Understanding
- Mobile Manipulation