ICLR 2026 Orals

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi

Diffusion & Flow Matching · Sat, Apr 25 · 10:30 AM–10:40 AM · 201 A/B · Avg rating: 6.50 (4–8)

Abstract

We propose **Stable Video Infinity (SVI)**, which generates non-looping, ultra-long videos with stable visual quality while supporting per-clip prompt control and multi-modal conditioning. While existing long-video methods attempt to _**mitigate accumulated errors**_ via handcrafted anti-drifting (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI introduces **Error-Recycling Fine-Tuning**, an efficient training scheme that recycles the Diffusion Transformer (DiT)’s self-generated errors into supervisory signals, thereby encouraging DiT to _**actively identify and correct its own errors**_. This is achieved by injecting, collecting, and banking errors in a closed loop, so that the model learns autoregressively from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and compute errors as residuals; (iii) dynamically bank errors in a replay memory across discretized timesteps, resampling them as new inputs. SVI scales videos from seconds to effectively infinite durations at no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks covering consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
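The three steps in the abstract (inject banked errors, one-step prediction with residual computation, timestep-bucketed replay memory) can be sketched in a toy form. This is a minimal illustrative reading of the described loop, not the authors' implementation: all names, shapes, the dummy model, and the linear flow-matching path are assumptions, and a real run would backpropagate the loss through a DiT rather than use a frozen toy model.

```python
import random
import numpy as np

class ErrorBank:
    """Replay memory of residual errors, bucketed by discretized timestep
    (hypothetical structure mirroring step (iii) of the abstract)."""
    def __init__(self, num_buckets=10, capacity=100):
        self.num_buckets = num_buckets
        self.capacity = capacity
        self.buckets = {b: [] for b in range(num_buckets)}

    def _bucket(self, t):
        return min(int(t * self.num_buckets), self.num_buckets - 1)

    def push(self, t, err):
        bucket = self.buckets[self._bucket(t)]
        if len(bucket) >= self.capacity:
            bucket.pop(0)  # evict the oldest entry
        bucket.append(err)

    def sample(self, t):
        bucket = self.buckets[self._bucket(t)]
        return random.choice(bucket) if bucket else None

def error_recycling_step(velocity_model, x_clean, bank, rng):
    """One illustrative training step: inject a banked error, predict,
    then recycle the residual back into the bank."""
    t = rng.uniform(0.2, 0.9)                   # flow-matching timestep
    noise = rng.standard_normal(x_clean.shape)
    x_t = (1 - t) * x_clean + t * noise         # linear interpolation path
    err = bank.sample(t)                        # (i) inject a historical error
    if err is not None:
        x_t = x_t + err
    v_target = noise - x_clean                  # straight-line velocity target
    v_pred = velocity_model(x_t, t)
    loss = float(np.mean((v_pred - v_target) ** 2))  # a real run backprops here
    x0_hat = x_t - t * v_pred                   # (ii) one-step estimate of clean data
    bank.push(t, x0_hat - x_clean)              # (iii) bank the residual for replay
    return loss

# Usage with a dummy zero-velocity model on random data:
rng = np.random.default_rng(0)
bank = ErrorBank()
x_clean = rng.standard_normal((4, 8))
model = lambda x_t, t: np.zeros_like(x_t)       # placeholder for a DiT
for _ in range(3):
    loss = error_recycling_step(model, x_clean, bank, rng)
```

The key intent captured here is the closed loop: inputs perturbed by previously banked errors simulate drifted autoregressive trajectories, so the model trains on the error-prone inputs it will actually face at test time.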

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Generates ultra-long videos by actively correcting self-generated errors through error-recycling fine-tuning.

Contributions
  • Identifies training-test hypothesis gap in long video generation leading to accumulated errors
  • Proposes Error-Recycling Fine-Tuning enabling models to actively correct own errors
  • Injects historical errors to intervene on clean inputs, simulating error-accumulated trajectories
  • Scales videos from seconds to infinite durations without additional inference cost
Methods used
  • Error recycling
  • Flow matching
  • Diffusion transformers
Limitations (author-stated)

Authors did not state explicit limitations.

Future work (author-stated)

Authors did not state explicit future directions.

Author keywords

  • Infinite-Length Video Generation
  • Error Accumulation

Related orals

Generative Human Geometry Distribution

Introduces a distribution-over-distribution model that combines geometry distributions with two-stage flow matching for 3D human generation.

Avg rating: 5.50 (2–8) · Xiangjun Tang et al.