ICLR 2026 Orals

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang

Session: LLMs & Reasoning · Fri, Apr 24 · 4:15 PM–4:25 PM · Room 202 A/B · Avg rating: 6.00 (range 4–8)
Author-provided TL;DR

This paper builds a versatile audio-visual embedding LLM that not only achieves any-to-any retrieval but also generates prompt-aware embeddings.

Abstract

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code and checkpoints are released at https://github.com/TCL606/WAVE.
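
The abstract names two capabilities that are straightforward to picture operationally: any-to-any retrieval over a single embedding space in which text, audio, and video are compared by cosine similarity, and query embeddings that change with the user's instruction (prompt-awareness). The sketch below is a minimal, hypothetical illustration of that interface only; the `embed_*` and `retrieve` functions are placeholders standing in for the WAVE model's forward pass, not the released code.

```python
import numpy as np

# Hypothetical per-modality embedding functions. In WAVE a single multimodal
# LLM produces the embedding; the stand-ins below just return deterministic
# unit vectors so the retrieval logic is runnable end to end.
def embed_text(text: str, instruction: str = "") -> np.ndarray:
    """Return an L2-normalised embedding for (instruction + text)."""
    rng = np.random.default_rng(abs(hash(instruction + text)) % (2**32))
    v = rng.standard_normal(1024)          # stand-in for the LLM forward pass
    return v / np.linalg.norm(v)

def embed_video(path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.standard_normal(1024)
    return v / np.linalg.norm(v)

def embed_audio(path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.standard_normal(1024)
    return v / np.linalg.norm(v)

def retrieve(query_vec: np.ndarray, candidate_vecs: np.ndarray, top_k: int = 5):
    """Rank candidates by cosine similarity (dot product of unit vectors)."""
    scores = candidate_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Any-to-any retrieval: one index can hold items from different modalities.
index = np.stack([embed_video("clip_001.mp4"),
                  embed_audio("clip_001.wav"),
                  embed_video("clip_002.mp4")])

# Prompt-aware querying: the same text yields different embeddings
# depending on the user instruction prepended to it.
q_audio = embed_text("a dog barking in the rain",
                     instruction="Retrieve the audio that matches this description.")
q_video = embed_text("a dog barking in the rain",
                     instruction="Retrieve the video that matches this description.")
print(retrieve(q_audio, index))
print(retrieve(q_video, index))
```

Because the space is shared, the same index can be queried from any modality: swapping `embed_text` for `embed_audio` on the query side requires no change to the retrieval step.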

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Creates the first unified embedding space for text, audio, and video, with hierarchical feature fusion and prompt-aware embeddings.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Introduces the first LLM-based unified embedding model for text, audio, and video modalities
  • Employs a novel hierarchical feature fusion strategy for any-to-any cross-modal retrieval (a generic fusion sketch follows this list)
  • Enables prompt-aware embeddings tailored to user instructions
  • Sets a new state-of-the-art on the MMEB-v2 video benchmark with superior audio and video-to-audio retrieval performance
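
The contributions list names a hierarchical feature fusion strategy but, like the abstract, does not detail it. Purely as an assumed illustration, the sketch below shows one common form such fusion can take: an ELMo-style learned weighting over hidden states from several backbone layers, followed by pooling into a unit-norm embedding. The `LayerwiseFusion` module, its shapes, and its names are hypothetical and may differ from the paper's actual design.

```python
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    """Generic hierarchical fusion: softmax-weighted sum of per-layer features.

    This is an assumed, simplified pattern (an ELMo-style scalar mix),
    not the specific fusion module described in the paper.
    """
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.proj = nn.Linear(hidden_dim, hidden_dim)               # shared output projection

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, hidden_dim)
        weights = torch.softmax(self.layer_logits, dim=0)            # (num_layers,)
        fused = torch.einsum("l,lbsd->bsd", weights, hidden_states)  # mix layers
        pooled = fused.mean(dim=1)                                   # mean-pool over tokens
        return nn.functional.normalize(self.proj(pooled), dim=-1)    # unit-norm embedding

# Toy usage: fuse features from 4 layers for a batch of 2 sequences.
feats = torch.randn(4, 2, 16, 256)
fusion = LayerwiseFusion(num_layers=4, hidden_dim=256)
print(fusion(feats).shape)  # torch.Size([2, 256])
```

A scalar mix like this keeps the fusion cheap (one learnable weight per layer) while letting both shallow and deep features contribute to the final embedding.
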
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Multimodal embeddings
  • Cross-modal retrieval
  • Feature fusion
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • MMEB-v2
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • audio-visual embeddings
  • multimodal LLMs
  • video retrieval
