Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
Abstract
We propose **Stable Video Infinity (SVI)**, which generates non-looping, ultra-long videos with stable visual quality, while supporting per-clip prompt control and multi-modal conditioning. While existing long-video methods attempt to _**mitigate accumulated errors**_ via handcrafted anti-drifting strategies (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates **Error-Recycling Fine-Tuning**, an efficient training scheme that recycles the Diffusion Transformer (DiT)’s self-generated errors into supervisory signals, thereby encouraging the DiT to _**actively identify and correct its own errors**_. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by the DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and compute errors as residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, from which errors are resampled for new inputs. SVI scales videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, covering consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
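The closed loop described above (inject, approximate, bank) can be made concrete with a short sketch. The Python below is a minimal illustration under our own assumptions, not the authors' implementation: a toy `nn.Sequential` stands in for the DiT, random tensors stand in for video latents, and names such as `ErrorBank` are invented here for exposition; the one-step prediction is a single forward Euler step, whereas the paper describes a bidirectional integration.

```python
# Minimal sketch of Error-Recycling Fine-Tuning as summarized in the abstract.
# All names (ErrorBank, dit, n_bins, ...) are illustrative assumptions.
import random
import torch
import torch.nn as nn

class ErrorBank:
    """Replay memory banking residual errors per discretized timestep bin."""

    def __init__(self, n_bins: int = 10, capacity: int = 256):
        self.n_bins = n_bins
        self.bins = [[] for _ in range(n_bins)]
        self.capacity = capacity

    def _bin(self, t: float) -> int:
        return min(int(t * self.n_bins), self.n_bins - 1)

    def push(self, t: float, error: torch.Tensor) -> None:
        bucket = self.bins[self._bin(t)]
        if len(bucket) >= self.capacity:
            bucket.pop(random.randrange(len(bucket)))  # evict at random
        bucket.append(error.detach())

    def sample(self, t: float) -> torch.Tensor | None:
        bucket = self.bins[self._bin(t)]
        return random.choice(bucket) if bucket else None

# Toy stand-ins: a "DiT" predicting flow-matching velocity on flat latents.
dit = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64))
opt = torch.optim.AdamW(dit.parameters(), lr=1e-4)
bank = ErrorBank()

for step in range(100):
    x1 = torch.randn(8, 64)        # clean target latents (placeholder data)
    x0 = torch.randn_like(x1)      # noise sample
    t = random.random()

    # (i) Inject a banked historical error into the clean input, simulating
    # the error-accumulated trajectories seen at autoregressive test time.
    err = bank.sample(t)
    x1_in = x1 + err if err is not None else x1

    xt = (1 - t) * x0 + t * x1_in  # flow-matching interpolant
    v_pred = dit(xt)

    # Supervise against the clean velocity target, so the model learns to
    # steer error-injected states back toward clean data.
    loss = ((v_pred - (x1 - x0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    # (ii) One-step integration to an estimate of x1; the residual is the
    # new error. (iii) Bank it for replay at this timestep bin.
    with torch.no_grad():
        x1_hat = xt + (1 - t) * dit(xt)
        bank.push(t, x1_hat - x1)
```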
Generates ultra-long videos by actively correcting self-generated errors through error-recycling fine-tuning.
- Identifies a training-test hypothesis gap in long video generation that leads to accumulated errors
- Proposes Error-Recycling Fine-Tuning, enabling models to actively correct their own errors
- Injects historical errors to intervene on clean inputs, simulating error-accumulated trajectories (see the equations after this list)
- Scales videos from seconds to infinite durations without additional inference cost
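For step (ii) of the abstract, a plausible formalization of the one-step bidirectional integration under standard rectified-flow conventions (the symbols $x_0$, $x_1$, $v_\theta$, and $e_t$ are our notation and may differ from the paper's exact parameterization):

```latex
% Hedged sketch: x_0 is noise, x_1 the clean latent, v_theta the DiT velocity.
\begin{align}
  x_t &= (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1] \\
  \hat{x}_1 &= x_t + (1 - t)\, v_\theta(x_t, t)
      && \text{(one-step forward integration)} \\
  \hat{x}_0 &= x_t - t\, v_\theta(x_t, t)
      && \text{(one-step backward integration)} \\
  e_t &= \hat{x}_1 - x_1
      && \text{(residual error, banked at the bin of } t\text{)}
\end{align}
```

Banking $e_t$ per discretized timestep bin and re-injecting it into later clean inputs is what closes the recycling loop.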
- Error recycling
- Flow matching
- Diffusion transformers
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Infinite-Length Video Generation
- Error Accumulation
Related orals
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
RealUID provides universal distillation for matching models without GANs, incorporating real data into one-step generator training.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models
GLASS Flows samples Markov transitions via inner flow matching models to improve inference-time reward alignment in flow and diffusion models.
Neon: Negative Extrapolation From Self-Training Improves Image Generation
Neon inverts model degradation from self-training by extrapolating away from it, improving generative models with minimal compute.
Generative Human Geometry Distribution
Introduces a distribution-over-distribution model combining geometry distributions with two-stage flow matching for 3D human generation.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport
Cross-domain lossy compression unifies rate and classification constraints via an optimal transport framework.