ICLR 2026 Orals

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang

Session: LLMs & Reasoning · Fri, Apr 24 · 4:15 PM–4:25 PM · Room 202 A/B · Avg rating: 6.00 (range 4–8)
Author-provided TL;DR

This paper builds a versatile audio-visual embedding LLM that not only achieves any-to-any retrieval but also generates prompt-aware embeddings.

Abstract

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code and checkpoints are released at https://github.com/TCL606/WAVE.
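
The abstract names two capabilities that are straightforward to picture operationally: any-to-any retrieval over a single embedding space in which text, audio, and video are compared by cosine similarity, and query embeddings that change with the user's instruction (prompt-awareness). The sketch below is a minimal, hypothetical illustration of that interface only; the `embed_*` and `retrieve` functions are placeholders standing in for the WAVE model's forward pass, not the released code.

```python
import numpy as np

# Hypothetical per-modality embedding functions. In WAVE a single multimodal
# LLM produces the embedding; the stand-ins below just return deterministic
# unit vectors so the retrieval logic is runnable end to end.
def embed_text(text: str, instruction: str = "") -> np.ndarray:
    """Return an L2-normalised embedding for (instruction + text)."""
    rng = np.random.default_rng(abs(hash(instruction + text)) % (2**32))
    v = rng.standard_normal(1024)          # stand-in for the LLM forward pass
    return v / np.linalg.norm(v)

def embed_video(path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.standard_normal(1024)
    return v / np.linalg.norm(v)

def embed_audio(path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.standard_normal(1024)
    return v / np.linalg.norm(v)

def retrieve(query_vec: np.ndarray, candidate_vecs: np.ndarray, top_k: int = 5):
    """Rank candidates by cosine similarity (dot product of unit vectors)."""
    scores = candidate_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Any-to-any retrieval: one index can hold items from different modalities.
index = np.stack([embed_video("clip_001.mp4"),
                  embed_audio("clip_001.wav"),
                  embed_video("clip_002.mp4")])

# Prompt-aware querying: the same text yields different embeddings
# depending on the user instruction prepended to it.
q_audio = embed_text("a dog barking in the rain",
                     instruction="Retrieve the audio that matches this description.")
q_video = embed_text("a dog barking in the rain",
                     instruction="Retrieve the video that matches this description.")
print(retrieve(q_audio, index))
print(retrieve(q_video, index))
```

Because the space is shared, the same index can be queried from any modality: swapping `embed_text` for `embed_audio` on the query side requires no change to the retrieval step.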

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Creates the first unified embedding space for text, audio, and video, with hierarchical feature fusion and prompt-aware embeddings.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces the first LLM-based unified embedding model for text, audio, and video modalities
  • Employs a novel hierarchical feature fusion strategy for any-to-any cross-modal retrieval (a generic fusion sketch follows this list)
  • Enables prompt-aware embeddings tailored to user instructions
  • Sets a new state-of-the-art on the MMEB-v2 video benchmark with superior audio and video-to-audio retrieval performance
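
The contributions list names a hierarchical feature fusion strategy but, like the abstract, does not detail it. Purely as an assumed illustration, the sketch below shows one common form such fusion can take: an ELMo-style learned weighting over hidden states from several backbone layers, followed by pooling into a unit-norm embedding. The `LayerwiseFusion` module, its shapes, and its names are hypothetical and may differ from the paper's actual design.

```python
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    """Generic hierarchical fusion: softmax-weighted sum of per-layer features.

    This is an assumed, simplified pattern (an ELMo-style scalar mix),
    not the specific fusion module described in the paper.
    """
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.proj = nn.Linear(hidden_dim, hidden_dim)               # shared output projection

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, hidden_dim)
        weights = torch.softmax(self.layer_logits, dim=0)            # (num_layers,)
        fused = torch.einsum("l,lbsd->bsd", weights, hidden_states)  # mix layers
        pooled = fused.mean(dim=1)                                   # mean-pool over tokens
        return nn.functional.normalize(self.proj(pooled), dim=-1)    # unit-norm embedding

# Toy usage: fuse features from 4 layers for a batch of 2 sequences.
feats = torch.randn(4, 2, 16, 256)
fusion = LayerwiseFusion(num_layers=4, hidden_dim=256)
print(fusion(feats).shape)  # torch.Size([2, 256])
```

A scalar mix like this keeps the fusion cheap (one learnable weight per layer) while letting both shallow and deep features contribute to the final embedding.
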
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Multimodal embeddings
  • Cross-modal retrieval
  • Feature fusion
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • MMEB-v2
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • audio-visual embeddings
  • multimodal LLMs
  • video retrieval
