Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
Laiguo Yin, Yixin Zhang, Yuqing Sun, Lizhen Cui
We propose Multimodal Aligned Semantic Knowledge, which leverages word embeddings as bridges to associate words with prototypes, capturing semantic relationships between words and thereby exploiting information from out-of-distribution (OOD) words.
Abstract
While existing approaches address unpaired image-text matching by constructing cross-modal aligned knowledge, they often fail to identify semantically corresponding visual representations for Out-of-Distribution (OOD) words. Moreover, the variance of the visual representations associated with different words differs significantly, which degrades matching accuracy. To address these issues, we propose a novel method, Multimodal Aligned Semantic Knowledge (MASK), which leverages word embeddings as bridges to associate words with their corresponding prototypes, thereby enabling semantic knowledge alignment between the image and text modalities. For OOD words, representative prototypes are constructed by leveraging the semantic relationships encoded in word embeddings. In addition, we introduce a prototype consistency contrastive loss that structurally regularizes the feature space, effectively mitigating the adverse effects of this variance. Experimental results on the Flickr30K and MSCOCO datasets demonstrate that MASK achieves superior performance in unpaired matching.
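As a rough illustration of the bridging idea described above, here is a minimal sketch of how a visual prototype for an OOD word could be derived from its nearest in-distribution neighbors in word-embedding space. The function name `ood_prototype`, the top-k softmax weighting, and all array names are illustrative assumptions, not the paper's actual construction.

```python
# Hypothetical sketch: estimate a visual prototype for an OOD word by
# combining the prototypes of its nearest in-distribution neighbors in
# word-embedding space. The weighting scheme is an assumption.
import numpy as np

def ood_prototype(ood_embedding, id_embeddings, id_prototypes, k=5):
    """ood_embedding : (d,) word embedding of the OOD word
    id_embeddings : (n, d) word embeddings of in-distribution words
    id_prototypes : (n, p) visual prototypes aligned to those words
    """
    # Cosine similarity between the OOD word and every known word.
    e = ood_embedding / np.linalg.norm(ood_embedding)
    E = id_embeddings / np.linalg.norm(id_embeddings, axis=1, keepdims=True)
    sims = E @ e                                  # (n,)

    # Keep the k most semantically related in-distribution words.
    top = np.argsort(sims)[-k:]

    # Softmax over the retained similarities gives combination weights.
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()

    # The OOD prototype is a similarity-weighted average of its neighbors.
    return w @ id_prototypes[top]                 # (p,)
```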
MASK aligns semantic knowledge between images and text, using word embeddings as bridges so that out-of-distribution words can be handled in unpaired image-text matching.
- Leverages word embeddings as bridges to associate words with prototypes, enabling semantic knowledge alignment between modalities
- Constructs representative prototypes for out-of-distribution words using semantic relationships encoded in word embeddings
- Introduces a prototype consistency contrastive loss to structurally regularize the feature space and mitigate variance effects (a sketch follows this list)
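The prototype consistency contrastive loss is not spelled out on this page; the following is a minimal InfoNCE-style sketch of what such a loss could look like, pulling each feature toward its word's prototype and away from the others. The function name `prototype_consistency_loss`, the temperature `tau`, and the cross-entropy form are assumptions, and the paper's definition may differ.

```python
# Hypothetical sketch of a prototype-consistency contrastive loss:
# an InfoNCE-style objective that pulls each feature toward the
# prototype of its word and away from all other prototypes.
import torch
import torch.nn.functional as F

def prototype_consistency_loss(features, prototypes, labels, tau=0.07):
    """features   : (b, d) image/text features, one per sample
    prototypes : (c, d) one prototype per word/class
    labels     : (b,)   index of each sample's prototype
    """
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits = f @ p.t() / tau        # (b, c) scaled cosine similarities
    # Cross-entropy treats each sample's own prototype as the positive.
    return F.cross_entropy(logits, labels)
```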
- Word embeddings
- Prototype matching
- Contrastive learning
- Flickr30K
- MSCOCO
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Unpaired Image-text Matching
- Out-of-Distribution Word
- Multimodal Aligned Semantic Knowledge
- Prototype
Related orals
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer-use agents with a cross-platform dataset and a dual-loop data pipeline.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.