Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
Laiguo Yin, Yixin Zhang, Yuqing Sun, Lizhen Cui
We propose Multimodal Aligned Semantic Knowledge, which leverages word embeddings as bridges to associate words with prototypes, capturing semantic relationships between words and thereby exploiting information from out-of-distribution (OOD) words.
Abstract
While existing approaches address unpaired image-text matching by constructing cross-modal aligned knowledge, they often fail to identify semantically corresponding visual representations for Out-of-Distribution (OOD) words. Moreover, the variance of the visual representations associated with different words differs significantly, which degrades matching accuracy. To address these issues, we propose a novel method, Multimodal Aligned Semantic Knowledge (MASK), which leverages word embeddings as bridges to associate words with their corresponding prototypes, thereby enabling semantic knowledge alignment between the image and text modalities. For OOD words, representative prototypes are constructed by leveraging the semantic relationships encoded in word embeddings. In addition, we introduce a prototype consistency contrastive loss that structurally regularizes the feature space, effectively mitigating the adverse effects of this variance. Experimental results on the Flickr30K and MSCOCO datasets demonstrate that MASK achieves superior performance in unpaired matching.
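As a rough illustration of the bridging idea described above, here is a minimal sketch of how a visual prototype for an OOD word could be derived from its nearest in-distribution neighbors in word-embedding space. The function name `ood_prototype`, the top-k softmax weighting, and all array names are illustrative assumptions, not the paper's actual construction.

```python
# Hypothetical sketch: estimate a visual prototype for an OOD word by
# combining the prototypes of its nearest in-distribution neighbors in
# word-embedding space. The weighting scheme is an assumption.
import numpy as np

def ood_prototype(ood_embedding, id_embeddings, id_prototypes, k=5):
    """ood_embedding : (d,) word embedding of the OOD word
    id_embeddings : (n, d) word embeddings of in-distribution words
    id_prototypes : (n, p) visual prototypes aligned to those words
    """
    # Cosine similarity between the OOD word and every known word.
    e = ood_embedding / np.linalg.norm(ood_embedding)
    E = id_embeddings / np.linalg.norm(id_embeddings, axis=1, keepdims=True)
    sims = E @ e                                  # (n,)

    # Keep the k most semantically related in-distribution words.
    top = np.argsort(sims)[-k:]

    # Softmax over the retained similarities gives combination weights.
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()

    # The OOD prototype is a similarity-weighted average of its neighbors.
    return w @ id_prototypes[top]                 # (p,)
```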
MASK aligns semantic knowledge between images and text, using word embeddings as bridges so that out-of-distribution words can be handled in unpaired image-text matching.
- Leverages word embeddings as bridges to associate words with prototypes, enabling semantic knowledge alignment between modalities
- Constructs representative prototypes for out-of-distribution words using semantic relationships encoded in word embeddings
- Introduces a prototype consistency contrastive loss to structurally regularize the feature space and mitigate variance effects (a sketch follows this list)
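The prototype consistency contrastive loss is not spelled out on this page; the following is a minimal InfoNCE-style sketch of what such a loss could look like, pulling each feature toward its word's prototype and away from the others. The function name `prototype_consistency_loss`, the temperature `tau`, and the cross-entropy form are assumptions, and the paper's definition may differ.

```python
# Hypothetical sketch of a prototype-consistency contrastive loss:
# an InfoNCE-style objective that pulls each feature toward the
# prototype of its word and away from all other prototypes.
import torch
import torch.nn.functional as F

def prototype_consistency_loss(features, prototypes, labels, tau=0.07):
    """features   : (b, d) image/text features, one per sample
    prototypes : (c, d) one prototype per word/class
    labels     : (b,)   index of each sample's prototype
    """
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits = f @ p.t() / tau        # (b, c) scaled cosine similarities
    # Cross-entropy treats each sample's own prototype as the positive.
    return F.cross_entropy(logits, labels)
```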
- Word embeddings
- Prototype matching
- Contrastive learning
- Flickr30K
- MSCOCO
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Unpaired Image-text Matching
- Out-of-Distribution Word
- Multimodal Aligned Semantic Knowledge
- Prototype
Related orals
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
ScaleCUA scales open-source computer-use agents with a cross-platform dataset and a dual-loop data pipeline.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.