UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping
This paper introduces UALM, an audio language model designed to unify audio understanding, generation, and reasoning.
Abstract
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.
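As a rough picture of what "directly predicts audio tokens" means for UALM-Gen, here is a minimal, self-contained sketch of text-conditioned next-token decoding over a discrete audio codec vocabulary. It is illustrative only: the tiny model, the vocabulary sizes, and the shared text+audio embedding layout are assumptions for this sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration; the paper does not specify these here.
TEXT_VOCAB, AUDIO_VOCAB, DIM = 1000, 256, 64
VOCAB = TEXT_VOCAB + AUDIO_VOCAB  # one shared table: text ids first, then audio ids

class TinyAudioLM(nn.Module):
    """Stand-in decoder-only LM over a joint text + audio-token vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        T = ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.backbone(self.embed(ids), mask=causal))

@torch.no_grad()
def generate_audio_tokens(lm, text_ids, n_steps=16):
    """Greedy next-token decoding, restricted to the audio sub-vocabulary."""
    ids = text_ids
    for _ in range(n_steps):
        logits = lm(ids)[:, -1]
        logits[:, :TEXT_VOCAB] = float("-inf")  # force an audio token
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=1)
    return ids[:, text_ids.size(1):] - TEXT_VOCAB  # codec ids for an audio decoder

lm = TinyAudioLM().eval()
prompt = torch.randint(0, TEXT_VOCAB, (1, 8))  # toy "caption" token ids
codec_ids = generate_audio_tokens(lm, prompt)  # shape (1, 16)
```

In a real system the predicted codec ids would be passed to a neural audio codec's decoder to synthesize a waveform; the sketch stops at the token sequence.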
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason enabling cross-modal generative reasoning.
- A single model unifying audio understanding, text-to-audio generation, and text problem solving
- UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to diffusion-based models
- UALM-Reason, which leverages both understanding and generation for multimodal reasoning with iterative output refinement (see the sketch after this list)
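The iterative refinement in the last bullet can be pictured as a generate-critique-regenerate loop, in which the model's understanding path supplies textual feedback on its own audio drafts. A minimal sketch, assuming a hypothetical `StubUALM` interface (`generate_audio`, `describe_audio`, `satisfied`) that is not the paper's API:

```python
class StubUALM:
    """Toy stand-in so the loop below runs end to end; every method here
    is a hypothetical interface for illustration, not the paper's API."""
    def generate_audio(self, prompt, feedback=None):
        tag = " + thunder" if feedback else ""  # contrived "fix" for the demo
        return f"<audio: {prompt}{tag}>"
    def describe_audio(self, audio, prompt):
        return "ok" if "thunder" in audio else "missing the thunder"
    def satisfied(self, critique):
        return critique == "ok"

def reason_and_generate(model, prompt, max_rounds=3):
    """Draft audio, critique it via the understanding path, regenerate."""
    trace = []                                # interleaved "thinking" steps
    audio = model.generate_audio(prompt)      # initial draft
    for _ in range(max_rounds):
        critique = model.describe_audio(audio, prompt)  # listen + describe
        trace.append((audio, critique))
        if model.satisfied(critique):
            break
        audio = model.generate_audio(prompt, feedback=critique)  # refine
    return audio, trace

audio, trace = reason_and_generate(StubUALM(), "rain with distant thunder")
```

The `trace` list plays the role of the intermediate text-and-audio thinking steps; only the final `audio` is the answer.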
- audio language modeling
- text-to-audio generation
- multimodal reasoning
- diffusion models
Limitations
- Current UALM uses a continuous audio encoder for input and discrete codec tokens for output (sketched after this list)
- SFT and DPO data curation relies on synthetic captions, with some risk of hallucination and misalignment
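The first limitation refers to the model's asymmetric audio interface: input audio enters as continuous encoder features projected into the language model's embedding space, while output audio leaves as discrete codec tokens. A minimal sketch of that asymmetry, with all dimensions and module names assumed for illustration:

```python
import torch
import torch.nn as nn

ENC_DIM, LM_DIM, AUDIO_VOCAB = 128, 64, 256  # hypothetical sizes

class HybridAudioIO(nn.Module):
    """Asymmetric audio interface: continuous features in, discrete tokens out."""
    def __init__(self):
        super().__init__()
        # input side: project continuous features from a pretrained audio
        # encoder into the LM embedding space ("soft" audio tokens)
        self.in_proj = nn.Linear(ENC_DIM, LM_DIM)
        # output side: classify LM hidden states over a discrete codec vocabulary
        self.out_head = nn.Linear(LM_DIM, AUDIO_VOCAB)

    def embed_input_audio(self, encoder_feats):  # (B, T, ENC_DIM), continuous
        return self.in_proj(encoder_feats)       # -> (B, T, LM_DIM)

    def predict_output_tokens(self, hidden):     # (B, T, LM_DIM) LM states
        return self.out_head(hidden).argmax(-1)  # -> (B, T) codec token ids

io = HybridAudioIO()
soft_tokens = io.embed_input_audio(torch.randn(1, 50, ENC_DIM))
codec_ids = io.predict_output_tokens(torch.randn(1, 20, LM_DIM))
```

Unifying the two sides into one audio representation is exactly what the first future direction below calls for.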
Future directions
- Build a unified audio representation for more scalable joint training
- Design quantitative methods to assess synthetic audio caption quality at scale
- Develop better audio quality evaluation metrics for complex generation tasks
Author keywords
- Audio Language Model
- Audio Understanding
- Audio Generation
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer use agents with cross-platform dataset and dual-loop data pipeline.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.