CrossmodalGroup/X-Dim

Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

License: MIT

Official PyTorch implementation of the paper "Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching". Our codebase is built upon the implementation of GPO.

Motivation

Illustration of motivation. (a) For the mapped visual region and textual word features in the $d$-dimensional shared representation space, which can be represented as a dimensional semantic correspondence vector, the existing paradigm typically employs a default independent aggregation over all dimensions to compose the word-region semantic similarity. Yet, as we investigate in the state-of-the-art model NAAF, dimensions in this shared space are not mutually independent: some dimensions show a significant tendency, i.e., high statistical co-occurrence probability, to jointly represent specific semantics, e.g., (b) for dog and (c) for man.

Aggregation comparison. Dimensional correspondences with mutual dependencies are marked with the same color. Existing aggregation completely ignores this intrinsic information, probably leading to limitations, while our key idea is to mine and leverage it.

Introduction

In this paper, we are motivated by an insightful finding that dimensions are *not mutually independent*; rather, there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information probably leads to suboptimal aggregation of semantic similarity, impairing cross-modal matching learning. To solve this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. X-Dim (1) designs a generalized framework to learn dimensions' semantic dependency degrees, and (2) devises adaptive sparse probabilistic learning that enables the model to autonomously capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, achieving 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks.
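
To make the idea concrete, below is a minimal conceptual sketch (not the paper's actual implementation; the module, parameter, and variable names are illustrative assumptions) of how a learned dependency matrix could re-weight the dimension-wise correspondence vector before aggregation, in contrast to the default independent summation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DependencyAwareAggregation(nn.Module):
    """Toy sketch of dependency-aware aggregation (not the official X-Dim code).

    Instead of summing the d-dimensional word-region correspondence vector with
    all dimensions treated as independent, a learned dependency matrix lets
    dimensions with joint dependencies reinforce each other before aggregation.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Learnable pairwise dependency degrees between dimensions, initialized
        # to the identity (i.e., the independent-aggregation baseline). X-Dim
        # additionally learns such degrees with adaptive sparse probabilistic
        # learning, which is omitted in this sketch.
        self.dependency = nn.Parameter(torch.eye(dim))

    def forward(self, region: torch.Tensor, word: torch.Tensor) -> torch.Tensor:
        # region, word: (batch, d) features mapped into the shared space.
        region = F.normalize(region, dim=-1)
        word = F.normalize(word, dim=-1)
        corr = region * word                      # dimension-wise correspondence vector
        dep = F.softmax(self.dependency, dim=-1)  # row-normalized dependency degrees
        corr = corr @ dep                         # let dependent dimensions reinforce each other
        # The default independent aggregation would simply be (region * word).sum(-1),
        # i.e., plain cosine similarity.
        return corr.sum(dim=-1)
```

Initializing the dependency matrix to the identity recovers the independent-aggregation baseline, so the sketch only departs from the default paradigm as dependencies are learned.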

Image-text Matching Results

The following tables show partial results of image-to-text and text-to-image retrieval on the MS-COCO and Flickr30K datasets; Rsum is the sum of the six recall values. In these experiments, we use BERT-base as the text encoder. This branch provides our code and pre-trained models with BERT as the text backbone. Some results are better than those reported in the paper.

Results on MS-COCO (1K)

| Model | Visual Backbone | Text Backbone | Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 | Rsum | Link |
|-------|-----------------|---------------|-------------------|-----|------|-------------------|-----|------|------|------|
| X-Dim | BUTD region | BERT-base | 82.6 | 97.1 | 99.0 | 67.4 | 92.5 | 96.8 | 535.4 | Here |

Results on Flickr30K

| Model | Visual Backbone | Text Backbone | Image-to-Text R@1 | R@5 | R@10 | Text-to-Image R@1 | R@5 | R@10 | Rsum | Link |
|-------|-----------------|---------------|-------------------|-----|------|-------------------|-----|------|------|------|
| X-Dim | BUTD region | BERT-base | 83.5 | 96.9 | 98.0 | 67.5 | 89.1 | 93.3 | 528.2 | Here |

Preparation

Environment

We recommend the following dependencies.

Data

You can download the datasets via Baidu Cloud. The download links are Flickr30K and MSCOCO; the extraction code is: USTC.

Training

sh train_region_f30k.sh
sh train_region_coco.sh

Evaluation

Test on Flickr30K

python test.py

To perform 5-fold cross-validation on MS-COCO, pass fold5=True and use a model trained with --data_name coco_precomp.

python testall.py
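
For reference, here is a hedged sketch of programmatic evaluation, assuming the codebase keeps a GPO-style evaluation interface (an evaluation module exposing an evalrank helper); the import path, checkpoint paths, and argument names are assumptions and may need to be adapted to what test.py / testall.py actually call.

```python
# Sketch only: assumes a GPO-style `evaluation.evalrank` helper; adapt the
# import path, checkpoint paths, and argument names to this repository.
from lib import evaluation

# Flickr30K: evaluate on the standard test split.
evaluation.evalrank("runs/f30k/model_best.pth", data_path="data/", split="test")

# MS-COCO: 5-fold 1K cross-validation with a model trained using
# --data_name coco_precomp (fold5=True averages the results over five 1K folds).
evaluation.evalrank("runs/coco/model_best.pth", data_path="data/", split="testall", fold5=True)
```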

Please use the following BibTeX entry to cite this paper if you use any resources from this repo.

@inproceedings{zhang2023unlocking,
  title={Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching},
  author={Zhang, Kun and Zhang, Lei and Hu, Bo and Zhu, Mengxiao and Mao, Zhendong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={4828--4837},
  year={2023}
}
