ICLR 2026 Orals

Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching

Laiguo Yin, Yixin Zhang, Yuqing Sun, Lizhen Cui

Multimodal & Speech · Thu, Apr 23 · 4:15 PM–4:25 PM · 202 A/B · Avg rating: 6.67 (6–8)
Author-provided TL;DR

We propose multimodal aligned semantic knowledge, which leverages word embeddings as bridges to associate words with prototypes, capturing semantic relationships between words and further utilizing information from OOD words.

Abstract

While existing approaches address unpaired image-text matching by constructing cross-modal aligned knowledge, they often fail to identify semantically corresponding visual representations for Out-of-Distribution (OOD) words. Moreover, the distributional variance of the visual representations associated with different words varies significantly, which hurts matching accuracy. To address these issues, we propose a novel method, Multimodal Aligned Semantic Knowledge (MASK), which leverages word embeddings as bridges to associate words with their corresponding prototypes, thereby enabling semantic knowledge alignment between the image and text modalities. For OOD words, representative prototypes are constructed by leveraging the semantic relationships encoded in word embeddings. In addition, we introduce a prototype consistency contrastive loss that structurally regularizes the feature space, effectively mitigating the adverse effects of this variance. Experimental results on the Flickr30K and MSCOCO datasets demonstrate that MASK achieves superior performance on unpaired matching.
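The abstract's key mechanism — using word embeddings as bridges so that an OOD word can borrow the prototypes of semantically related in-vocabulary words — can be sketched as a similarity-weighted mixture. This is a minimal illustration, not the paper's actual algorithm: the function name `build_ood_prototype`, the top-k neighborhood, and the softmax weighting are all assumptions for the sketch.

```python
import numpy as np

def build_ood_prototype(ood_embedding, vocab_embeddings, vocab_prototypes, k=5):
    """Sketch: form a visual prototype for an out-of-distribution word as a
    similarity-weighted average of the prototypes of its k nearest
    in-vocabulary neighbors in word-embedding space (hypothetical scheme)."""
    # Cosine similarity between the OOD word and every in-vocabulary word.
    e = ood_embedding / np.linalg.norm(ood_embedding)
    V = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    sims = V @ e
    # Keep the k most similar in-vocabulary words.
    top = np.argsort(sims)[-k:]
    # Softmax over the top-k similarities gives convex mixing weights.
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()
    # Weighted average of the neighbors' visual prototypes.
    return w @ vocab_prototypes[top]
```

Because the weights are convex, the synthesized prototype stays inside the span of known prototypes, which is one plausible way to keep OOD prototypes "representative" in the sense the abstract describes.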

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Leverages word embeddings as bridges to associate words with prototypes, enabling semantic knowledge alignment between modalities
  • Constructs representative prototypes for out-of-distribution words using semantic relationships encoded in word embeddings
  • Introduces prototype consistency contrastive loss to structurally regularize feature space and mitigate variance effects
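The third contribution, a prototype consistency contrastive loss, can be illustrated with a standard InfoNCE-style objective that pulls each feature toward its word's prototype and away from the others, shrinking per-word variance. This is a sketch under assumptions — the function name, the use of plain InfoNCE, and the temperature `tau` are not taken from the paper.

```python
import numpy as np

def prototype_consistency_loss(features, prototypes, labels, tau=0.1):
    """Sketch of a prototype consistency contrastive loss (InfoNCE-style,
    hypothetical form): each feature is attracted to the prototype of its
    word (labels[i]) and repelled from all other prototypes."""
    # L2-normalize so similarities are cosine similarities.
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = F @ P.T / tau                       # (N, K) feature-prototype similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each feature's own prototype.
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Minimizing such a loss concentrates the features of each word around one prototype, which is one way to "structurally regularize" the feature space against the variance problem the authors describe.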
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Word embeddings
  • Prototype matching
  • Contrastive learning
Datasets used·Auto-generated by claude-haiku-4-5-20251001
  • Flickr30K
  • MSCOCO
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Unpaired Image-text Matching
  • Out-of-Distribution Word
  • Multimodal Aligned Semantic Knowledge
  • Prototype
