ICLR 2026 Orals

Multimodal & Speech

Vision-language, multimodal models, speech recognition, text-to-speech, audio generation.

Latent Fourier Transform

LatentFT provides frequency-domain controls for generative music: a diffusion autoencoder applies a Fourier transform in latent space, enabling timescale-based manipulation.

Avg rating: 5.00 (2–8) · Mason Long Wang et al.

Latent Speech-Text Transformer

Aggregates speech tokens into latent patches for efficient speech-text modeling with cross-modal alignment.

Avg rating: 6.00 (2–10) · Yen-Ju Lu et al.