VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
VibeVoice can synthesize long-form speech of up to 90 minutes (within a 64K-token context window) with up to 4 speakers, capturing the authentic conversational vibe and surpassing open-source and proprietary dialogue models.
Abstract
Generating long-form, multi-speaker conversational audio like podcasts poses significant challenges for traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. We present VibeVoice, a novel model designed to synthesize expressive, long-form speech with multiple speakers in a zero-shot manner. Core components of our approach are continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. These tokenizers effectively preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. To facilitate training on authentic conversational dynamics, we have developed an annotation pipeline that generates pseudo transcriptions and turn-taking labels for extensive podcast data. Leveraging this data and our efficient tokenizers, VibeVoice employs the next-token diffusion framework. This enables VibeVoice to: (1) synthesize long-form speech (up to 90 minutes) with up to 4 speakers, surpassing the typical 1-2 speaker limits of many prior models; and (2) achieve a high degree of naturalness in turn-taking, pacing, and the rendition of subtle non-lexical cues (such as breaths and lip smacks), which are crucial for listener immersion and capturing the authentic vibe of expressive conversations.
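The abstract's numbers fit together in a simple way. The back-of-the-envelope script below (our illustration, not from the paper; the constant names are our own) checks that 90 minutes of speech tokenized at 7.5 frames per second stays within a 64K-token context window.

```python
# Sanity-check the abstract's numbers: 90 minutes of audio tokenized
# at 7.5 Hz, against a 64K-token context window. Constant names are
# our own; the paper does not define them.
FRAME_RATE_HZ = 7.5           # ultra-low tokenizer frame rate
MAX_MINUTES = 90              # longest claimed synthesis length
CONTEXT_WINDOW = 64 * 1024    # 64K-token context window

speech_tokens = int(MAX_MINUTES * 60 * FRAME_RATE_HZ)
print(f"speech tokens for {MAX_MINUTES} min: {speech_tokens:,}")    # 40,500
print(f"fits in {CONTEXT_WINDOW:,}-token window: {speech_tokens <= CONTEXT_WINDOW}")  # True
print(f"tokens left for the text script: {CONTEXT_WINDOW - speech_tokens:,}")         # 25,036
```

For comparison, a codec running at a more typical rate of, say, 50 Hz would need 270,000 frames for the same audio, several times the window; the ultra-low frame rate is what makes 90-minute generation feasible at all.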
TL;DR
Presents VibeVoice for zero-shot, expressive, long-form multi-speaker podcast generation using next-token diffusion.
Key contributions
- Ultra-low frame rate (7.5 Hz) continuous speech tokenizers preserving audio fidelity while boosting efficiency
- Next-token diffusion framework enabling long-form synthesis up to 90 minutes with up to 4 speakers (a minimal sketch follows this list)
- Annotation pipeline generating pseudo transcriptions and turn-taking labels for podcast data
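To make "next-token diffusion" concrete, here is a minimal PyTorch sketch of the training objective under our own simplifying assumptions (module names, tensor shapes, and the noise schedule are all hypothetical, not the authors' implementation): a language-model backbone emits one hidden state per position, and a lightweight diffusion head learns to denoise the next continuous speech latent conditioned on that hidden state.

```python
# Minimal next-token diffusion sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Predicts the noise added to the NEXT continuous speech latent,
    conditioned on the LM hidden state and the diffusion timestep."""
    def __init__(self, latent_dim: int = 64, hidden_dim: int = 512):
        super().__init__()
        self.timestep_embed = nn.Embedding(1000, hidden_dim)  # 1000 diffusion steps
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, noisy_latent, lm_hidden, t):
        cond = torch.cat([noisy_latent, lm_hidden, self.timestep_embed(t)], dim=-1)
        return self.net(cond)  # predicted noise, same shape as the latent

# One simplified DDPM-style training step on toy tensors.
B, T, D_LATENT, D_MODEL = 2, 16, 64, 512
latents = torch.randn(B, T, D_LATENT)    # continuous speech tokens (7.5 Hz frames)
lm_hidden = torch.randn(B, T, D_MODEL)   # one backbone hidden state per position

head = DiffusionHead(D_LATENT, D_MODEL)
t = torch.randint(0, 1000, (B, T))       # random timestep per position
noise = torch.randn_like(latents)
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).unsqueeze(-1) ** 2  # toy cosine schedule
noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise

# Autoregressive shift: the hidden state at position i conditions latent i+1.
pred_noise = head(noisy[:, 1:], lm_hidden[:, :-1], t[:, 1:])
loss = nn.functional.mse_loss(pred_noise, noise[:, 1:])
loss.backward()
print(f"diffusion loss: {loss.item():.4f}")
```

At inference time the head would instead run several denoising steps per position to sample the next latent autoregressively, and a decoder would map the latent sequence back to a waveform; only the training loss is shown here.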
Topics
- Next-token diffusion
- Speech tokenization
- LLM-based architecture
- Text-to-speech
- Diffusion models
- Podcast data
Limitations
Authors did not state explicit limitations.
Future directions
Authors did not state explicit future directions.
Author keywords
- Text-to-Speech; Podcast Generation
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text, using word embeddings as bridges to match out-of-distribution words in unpaired image-text matching.
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer-use agents with a cross-platform dataset and a dual-loop data pipeline.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason enabling cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with Matryoshka-style training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.