Latent Fourier Transform
Mason Long Wang, Cheng-Zhi Anna Huang
We introduce novel frequency-domain controls for generative music models by applying the Fourier transform to the latent space of a diffusion autoencoder.
Abstract
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking latents in the frequency domain during training, our method yields representations that can be manipulated coherently at inference. This allows us to generate musical variations and blends from reference examples while preserving characteristics at desired timescales, which are specified as frequencies in the latent space. LatentFT parallels the role of the equalizer in music production: while traditional equalizers operate on audible frequencies to shape timbre, LatentFT operates on latent-space frequencies to shape musical structure. Experiments and listening tests show that LatentFT improves condition adherence and quality compared to baselines. We also present a technique for hearing frequencies in the latent space in isolation, and show that different musical attributes reside in different regions of the latent spectrum. Our results show how frequency-domain control in latent space provides an intuitive, continuous frequency axis for conditioning and blending, advancing us toward more interpretable and interactive generative music models.
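The core operation described in the abstract — taking a Fourier transform along the time axis of a latent sequence and masking frequency bands to isolate musical structure at a chosen timescale — can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: `latents` is assumed to be the (time × channels) output of a hypothetical diffusion-autoencoder encoder, and the bin bounds `low`/`high` are placeholder parameters.

```python
import numpy as np

def latent_band_mask(latents, low, high):
    """Keep only latent-sequence frequencies in bins [low, high).

    latents: (T, D) array, a sequence of T latent vectors (hypothetical
             encoder output; illustrative only).
    low, high: frequency-bin bounds along the time axis. Low bins hold
               slow-varying, long-timescale structure; high bins hold
               fast local detail.
    """
    spec = np.fft.rfft(latents, axis=0)           # FFT over the time axis
    keep = np.zeros(spec.shape[0], dtype=bool)
    keep[low:high] = True
    spec[~keep] = 0.0                             # zero bins outside the band
    return np.fft.irfft(spec, n=latents.shape[0], axis=0)

# Example: low-pass the latent spectrum to retain long-timescale content,
# or isolate a narrow band to inspect (and, after decoding, "hear") it alone.
T, D = 128, 8
z = np.random.randn(T, D)                         # stand-in latent sequence
z_slow = latent_band_mask(z, 0, 8)                # long-timescale structure
z_band = latent_band_mask(z, 8, 16)               # one isolated band
```

A training-time mask of this form (with randomized band bounds) is what would encourage the latent spectrum to separate by timescale; at inference, decoding a band-masked latent corresponds to conditioning on, or listening to, that timescale in isolation.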
LatentFT provides frequency-domain controls for generative music by combining a diffusion autoencoder with a latent-space Fourier transform, enabling timescale-based manipulation.
- Novel frequency-domain controls for generative music models operating on latent-space frequencies to shape structure
- Masking latents in frequency domain during training for coherent manipulation at inference
- Technique for hearing isolated frequency components revealing different musical attributes in latent spectrum
- Diffusion autoencoders
- Fourier transform
- Latent-space frequency analysis
Authors did not state explicit limitations.
Future directions (from the paper):
- Enable real-time interactivity for frequency-domain controls
- Disentangle the latent spectrum along semantic axes, combining timescale and semantic control
Author keywords
- Music Generation
- Signal Processing
- Diffusion Models
- Audio
- Music
- Audio Generation
- Controllable Generation
- Fourier Transform
- Diffusion Autoencoders
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer use agents with a cross-platform dataset and a dual-loop data pipeline.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.