ICLR 2026 Orals

Visual Planning: Let's Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

LLMs & Reasoning · Fri, Apr 24 · 10:54 AM–11:04 AM · 203 A/B · Avg rating: 6.00 (4–8)

Abstract

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
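The abstract names GRPO as the optimizer behind VPRL but does not restate it. For orientation, a standard, slightly simplified (sequence-level) form of the GRPO objective is sketched below in LaTeX; this is the generic formulation, with each o_i read as one sampled visual rollout for the same input q, not a claim about the paper's exact variant.

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})},
\qquad
\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\]
\[
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(\rho_i(\theta)\,\hat{A}_i,\;
\operatorname{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right]
- \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big).
\]

Here G rollouts are sampled per input, each is scored by a rule-based reward r_i, and advantages are computed relative to the group statistics rather than a learned value function.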

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Proposes a visual planning paradigm that reasons through purely visual representations for spatially grounded tasks.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • New paradigm for planning through purely visual representations, as a supplement to language-based reasoning
  • Two-stage reinforcement learning framework (VPRL) with GRPO for post-training large vision models (see the sketch after this list)
  • Achieves a 27% exact-match (EM) improvement over language-based planning on visual navigation tasks, with better generalization
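As a rough illustration of the GRPO step referenced in the contributions, the Python sketch below computes group-relative advantages from scalar rewards assigned to a group of candidate rollouts; the reward values and function name are illustrative assumptions, not the paper's reward design.

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and standard deviation of its own sampling group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for G = 4 candidate next-state images generated
# for the same grid state (e.g., 1.0 for an optimal move, 0.0 for a
# valid but non-optimal move, -1.0 for an invalid transition).
rewards = [1.0, 0.0, 0.0, -1.0]
print(group_relative_advantages(rewards))

Rollouts scored above the group mean receive a positive advantage and are reinforced, while invalid transitions are pushed down; this group-level signal is all GRPO needs, without a separate critic.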
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Reinforcement learning
  • GRPO
  • Diffusion models
  • Vision models
  • Image generation
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • FrozenLake
  • Maze
  • MiniBehavior
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Limited to the Large Vision Model (7B) as the only available LVM; multimodal models capable of generating multimodal outputs are excluded
    from the paper
  • Image generation introduces computational overhead at inference compared to text responses, although language-based reasoning can be equally or more time-consuming, especially for thinking models
    from the paper
  • The rule-based dynamics interpreter, which relies on pixel-wise comparisons, is effective in controlled setups; broader task settings with complex visual structures are yet to be explored (see the sketch after this list)
    from the paper
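Since the rule-based dynamics interpreter is only described at a high level, here is a minimal Python sketch of what a pixel-wise comparison between consecutive grid renderings could look like, assuming a fixed cell size, a single moving agent, and access to an agent-free rendering of the board; the cell size, threshold, action names, and helper functions are all assumptions for illustration, not the paper's code.

import numpy as np

CELL = 32  # assumed pixel size of one grid cell

def locate_agent(frame, empty_board, thresh=10):
    """Return the grid cell where the frame differs most from an
    agent-free rendering of the same board (HxWx3 uint8 arrays)."""
    diff = np.abs(frame.astype(int) - empty_board.astype(int)).sum(axis=-1)
    rows, cols = frame.shape[0] // CELL, frame.shape[1] // CELL
    scores = [((r, c), diff[r*CELL:(r+1)*CELL, c*CELL:(c+1)*CELL].mean())
              for r in range(rows) for c in range(cols)]
    cell, score = max(scores, key=lambda s: s[1])
    return cell if score > thresh else None

def infer_action(prev_frame, next_frame, empty_board):
    """Map the agent's cell displacement between consecutive frames to a
    discrete move; any other displacement is treated as invalid."""
    a = locate_agent(prev_frame, empty_board)
    b = locate_agent(next_frame, empty_board)
    if a is None or b is None:
        return "invalid"
    delta = (b[0] - a[0], b[1] - a[1])
    return {(-1, 0): "up", (1, 0): "down", (0, -1): "left",
            (0, 1): "right", (0, 0): "stay"}.get(delta, "invalid")

The inferred action can then be checked against the environment's transition rules to assign rewards during RL (for example, rewarding progress and penalizing invalid moves); anything beyond clean grid worlds would likely need a learned model instead, as the limitation above notes.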
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Extend the visual planning paradigm to broader multimodal generation models for diverse tasks that combine more modalities
    from the paper
  • Explore more compact image representations using fewer tokens to alleviate computational overhead
    from the paper
  • Develop dynamics models that elicit actions from image pairs, or use holistic neural models to validate visual transitions
    from the paper

Author keywords

  • visual planning
