VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
VibeVoice can synthesize long-form speech of up to 90 minutes (within a 64K-token context window) with up to 4 speakers, capturing the authentic conversational vibe and surpassing open-source and proprietary dialogue models.
Abstract
Generating long-form, multi-speaker conversational audio like podcasts poses significant challenges for traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. We present VibeVoice, a novel model designed to synthesize expressive, long-form speech with multiple speakers in a zero-shot manner. Core components of our approach are continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. These tokenizers effectively preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. To facilitate training on authentic conversational dynamics, we have developed an annotation pipeline that generates pseudo transcriptions and turn-taking labels for extensive podcast data. Leveraging this data and our efficient tokenizers, VibeVoice employs the next-token diffusion framework. This enables VibeVoice to: (1) synthesize long-form speech (up to 90 minutes) with up to 4 speakers, surpassing the typical 1-2 speaker limits of many prior models; and (2) achieve a high degree of naturalness in turn-taking, pacing, and the rendition of subtle non-lexical cues (such as breaths and lip smacks), which are crucial for listener immersion and capturing the authentic vibe of expressive conversations.
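The abstract's numbers fit together in a simple way. The back-of-the-envelope script below (our illustration, not from the paper; the constant names are our own) checks that 90 minutes of speech tokenized at 7.5 frames per second stays within a 64K-token context window.

```python
# Sanity-check the abstract's numbers: 90 minutes of audio tokenized
# at 7.5 Hz, against a 64K-token context window. Constant names are
# our own; the paper does not define them.
FRAME_RATE_HZ = 7.5           # ultra-low tokenizer frame rate
MAX_MINUTES = 90              # longest claimed synthesis length
CONTEXT_WINDOW = 64 * 1024    # 64K-token context window

speech_tokens = int(MAX_MINUTES * 60 * FRAME_RATE_HZ)
print(f"speech tokens for {MAX_MINUTES} min: {speech_tokens:,}")    # 40,500
print(f"fits in {CONTEXT_WINDOW:,}-token window: {speech_tokens <= CONTEXT_WINDOW}")  # True
print(f"tokens left for the text script: {CONTEXT_WINDOW - speech_tokens:,}")         # 25,036
```

For comparison, a codec running at a more typical rate of, say, 50 Hz would need 270,000 frames for the same audio, several times the window; the ultra-low frame rate is what makes 90-minute generation feasible at all.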
TL;DR
Presents VibeVoice for zero-shot, expressive, long-form multi-speaker podcast generation using next-token diffusion.
Key contributions
- Ultra-low frame rate (7.5 Hz) continuous speech tokenizers preserving audio fidelity while boosting efficiency
- Next-token diffusion framework enabling long-form synthesis up to 90 minutes with up to 4 speakers (a minimal sketch follows this list)
- Annotation pipeline generating pseudo transcriptions and turn-taking labels for podcast data
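To make "next-token diffusion" concrete, here is a minimal PyTorch sketch of the training objective under our own simplifying assumptions (module names, tensor shapes, and the noise schedule are all hypothetical, not the authors' implementation): a language-model backbone emits one hidden state per position, and a lightweight diffusion head learns to denoise the next continuous speech latent conditioned on that hidden state.

```python
# Minimal next-token diffusion sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Predicts the noise added to the NEXT continuous speech latent,
    conditioned on the LM hidden state and the diffusion timestep."""
    def __init__(self, latent_dim: int = 64, hidden_dim: int = 512):
        super().__init__()
        self.timestep_embed = nn.Embedding(1000, hidden_dim)  # 1000 diffusion steps
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, noisy_latent, lm_hidden, t):
        cond = torch.cat([noisy_latent, lm_hidden, self.timestep_embed(t)], dim=-1)
        return self.net(cond)  # predicted noise, same shape as the latent

# One simplified DDPM-style training step on toy tensors.
B, T, D_LATENT, D_MODEL = 2, 16, 64, 512
latents = torch.randn(B, T, D_LATENT)    # continuous speech tokens (7.5 Hz frames)
lm_hidden = torch.randn(B, T, D_MODEL)   # one backbone hidden state per position

head = DiffusionHead(D_LATENT, D_MODEL)
t = torch.randint(0, 1000, (B, T))       # random timestep per position
noise = torch.randn_like(latents)
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).unsqueeze(-1) ** 2  # toy cosine schedule
noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise

# Autoregressive shift: the hidden state at position i conditions latent i+1.
pred_noise = head(noisy[:, 1:], lm_hidden[:, :-1], t[:, 1:])
loss = nn.functional.mse_loss(pred_noise, noise[:, 1:])
loss.backward()
print(f"diffusion loss: {loss.item():.4f}")
```

At inference time the head would instead run several denoising steps per position to sample the next latent autoregressively, and a decoder would map the latent sequence back to a waveform; only the training loss is shown here.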
Topics
- Next-token diffusion
- Speech tokenization
- LLM-based architecture
- Text-to-speech
- Diffusion models
- Podcast data
Limitations
Authors did not state explicit limitations.
Future directions
Authors did not state explicit future directions.
Author keywords
- Text-to-Speech; Podcast Generation
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text, using word embeddings as bridges to match out-of-distribution words in unpaired image-text matching.
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer-use agents with a cross-platform dataset and a dual-loop data pipeline.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason enabling cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with Matryoshka-style training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.