Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
Proposes a visual planning paradigm that uses purely visual representations for reasoning in spatially grounded tasks.
- New paradigm for planning through purely visual representations as supplement to language-based reasoning
- Two-stage reinforcement learning framework with GRPO for post-training large vision models
- Achieves 27% exact-match (EM) improvements over language-based planning on visual navigation tasks, with better generalization
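The GRPO objective underlying the VPRL framework scores each sampled trajectory relative to its own sampling group rather than a learned value baseline. A minimal sketch of that group-relative advantage computation is below; the group size and reward values are illustrative assumptions, not figures from the paper.

```python
# Group-relative advantage as used in GRPO-style post-training:
# sample a group of G rollouts, then normalize each reward against
# the group's mean and standard deviation.

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r), computed within one sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard: identical rewards give std = 0
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for a group of 4 sampled visual plans
# (e.g., 1.0 for a plan whose image sequence reaches the goal, 0.0 otherwise).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1.0, -1.0, -1.0, 1.0]
```

These advantages would then weight a clipped policy-gradient update on the image-generation policy; that outer loop is omitted here for brevity.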
- Reinforcement learning
- GRPO
- Diffusion models
- Vision models
- Image generation
- FrozenLake
- Maze
- MiniBehavior
Limitations (from the paper)
- Limited to the Large Vision Model (7B) as the only available LVM; excludes multimodal models capable of generating multimodal outputs.
- Image generation introduces computational overhead during inference compared to text responses, although language-based reasoning can be equally or more time-consuming, especially for thinking models.
- The rule-based dynamics interpreter using pixel-wise comparisons is effective in controlled setups; broader task settings with complex visual structures are yet to be explored.
Future directions (from the paper)
- Extend the visual planning paradigm to broader multimodal generation models for diverse tasks combined with more modalities.
- Explore more compact image representations that use fewer tokens to alleviate computational overhead.
- Develop dynamics models that elicit actions from image pairs, or use holistic neural models for validating visual transitions.
Author keywords
- visual planning
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.