Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
Proposes a visual planning paradigm that uses purely visual representations for reasoning in spatially grounded tasks.
- New paradigm for planning through purely visual representations as supplement to language-based reasoning
- Two-stage reinforcement learning framework with GRPO for post-training large vision models
- Achieves 27% exact-match (EM) improvements over language-based planning on visual navigation tasks, with better generalization
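The GRPO objective underlying the VPRL framework scores each sampled trajectory relative to its own sampling group rather than a learned value baseline. A minimal sketch of that group-relative advantage computation is below; the group size and reward values are illustrative assumptions, not figures from the paper.

```python
# Group-relative advantage as used in GRPO-style post-training:
# sample a group of G rollouts, then normalize each reward against
# the group's mean and standard deviation.

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r), computed within one sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard: identical rewards give std = 0
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for a group of 4 sampled visual plans
# (e.g., 1.0 for a plan whose image sequence reaches the goal, 0.0 otherwise).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1.0, -1.0, -1.0, 1.0]
```

These advantages would then weight a clipped policy-gradient update on the image-generation policy; that outer loop is omitted here for brevity.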
- Reinforcement learning
- GRPO
- Diffusion models
- Vision models
- Image generation
- FrozenLake
- Maze
- MiniBehavior
Limitations (from the paper)
- Limited to the Large Vision Model (7B) as the only available LVM; excludes multimodal models capable of generating multimodal outputs.
- Image generation introduces computational overhead during inference compared to text responses, although language-based reasoning can be equally or more time-consuming, especially for thinking models.
- The rule-based dynamics interpreter using pixel-wise comparisons is effective in controlled setups; broader task settings with complex visual structures are yet to be explored.
Future directions (from the paper)
- Extend the visual planning paradigm to broader multimodal generation models for diverse tasks combined with more modalities.
- Explore more compact image representations that use fewer tokens to alleviate computational overhead.
- Develop dynamics models that elicit actions from image pairs, or use holistic neural models for validating visual transitions.
Author keywords
- visual planning
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.