Latent Speech-Text Transformer
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srini Iyer, Duc Le
We introduce Latent Speech-Text Transformer, which patches long speech token sequences into latent units, improving text–speech transfer while cutting pre-training and inference compute, and significantly outperforming existing speech-text LLMs.
Abstract
Auto-regressive speech–text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalance disproportionately allocates pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. Code is available at https://github.com/facebookresearch/lst.
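The core idea of aggregating consecutive speech tokens into latent patches can be sketched in a few lines. This is a minimal illustration only: the paper's actual aggregation module, patch sizes, and alignment-based patching strategy are not specified here, and `patch_speech_embeddings` is a hypothetical helper that mean-pools fixed-size groups of token embeddings.

```python
# Illustrative sketch of LST-style latent patching (not the paper's implementation):
# a long sequence of speech-token embeddings is compressed into fewer latent
# "patches" by mean-pooling consecutive fixed-size groups, shortening the
# autoregressive sequence the transformer must model.

def patch_speech_embeddings(embeddings, patch_size):
    """Mean-pool consecutive groups of `patch_size` token embeddings.

    embeddings: list of equal-dimension vectors, one per speech token
    returns: list of patch vectors, one per group of `patch_size` tokens
    """
    patches = []
    for start in range(0, len(embeddings), patch_size):
        group = embeddings[start:start + patch_size]
        dim = len(group[0])
        patches.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return patches

# Example: 8 speech-token embeddings (dim 2) become 2 latent patches.
tokens = [[float(i), float(i)] for i in range(8)]
print(patch_speech_embeddings(tokens, 4))  # -> [[1.5, 1.5], [5.5, 5.5]]
```

With a patch size of 4, the effective autoregressive sequence length drops by 4x, which is the source of the pre-training and inference compute savings the abstract describes.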
Aggregates speech tokens into latent patches for efficient speech-text modeling with cross-modal alignment.
- Proposes patch-based framework aggregating speech tokens into latent units
- Aligns sequence-modeling granularity between speech and text modalities
- Achieves up to +6.5% absolute gain on speech HellaSwag under compute-controlled training
- Stabilizes ASR adaptation and reduces effective sequence length during inference
- Speech-text modeling
- Token aggregation
- Multimodal transformers
Datasets
- LibriLight
- People's Speech
- Multilingual LibriSpeech
- Spotify
Limitations
- Focuses on half-duplex speech-text modeling; does not address full-duplex interaction
- Analysis restricted to the pre-training stage without instruction fine-tuning or downstream adaptation
- Patching strategies rely on forced alignments during pre-training; fully alignment-free approaches remain open
- Limited to the speech-text modality; not yet extended to image or video

Authors did not state explicit future directions.
Author keywords
- Speech–Text Models
- Latent Patching
- Multimodal Alignment
- Large Language Models
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer use agents with a cross-platform dataset and a dual-loop data pipeline.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model handling understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.