Official PyTorch implementation of the paper Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching. Our codebase is built upon the implementation of GPO.
Illustration of motivation: mapped visual region and textual word features in the shared space.
Aggregation comparison. Dimensional correspondences with mutual dependencies are marked in the same color. Existing aggregation methods completely ignore this intrinsic information, which can lead to limitations, whereas our key idea is to mine and leverage it.
In this paper, we are motivated by the insight that dimensions are *not mutually independent*; rather, there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information can lead to suboptimal aggregation for semantic similarity, impairing cross-modal matching learning. To address this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. X-Dim (1) designs a generalized framework to learn the degrees of semantic dependency between dimensions, and (2) devises adaptive sparse probabilistic learning that lets the model autonomously capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, achieving 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks. A rough illustrative code sketch of this idea is given after the results tables below.

The following tables show partial results of image-text retrieval on the MSCOCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder for our methods. This branch provides our code and pre-trained models for using BERT as the text backbone. Some results are better than those reported in the paper.
Results on MSCOCO:

Model | Visual Backbone | Text Backbone | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Rsum | Link
---|---|---|---|---|---|---|---|---|---|---
X-Dim | BUTD region | BERT-base | 82.6 | 97.1 | 99.0 | 67.4 | 92.5 | 96.8 | 535.4 | Here
Results on Flickr30K:

Model | Visual Backbone | Text Backbone | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Rsum | Link
---|---|---|---|---|---|---|---|---|---|---
X-Dim | BUTD region | BERT-base | 83.5 | 96.9 | 98.0 | 67.5 | 89.1 | 93.3 | 528.2 | Here
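The following is a minimal PyTorch sketch of the core idea referenced above: learning dependencies between embedding dimensions and using them to enhance region/word features before aggregation. This is not the official X-Dim implementation; all class and variable names are placeholders, and the actual model uses adaptive sparse probabilistic learning rather than the plain softmax shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossDimDependencyAggregator(nn.Module):
    """Illustrative sketch only (not the official X-Dim code).

    Learns a soft D x D dependency matrix between embedding dimensions and
    uses it to enhance item features before pooling them into one vector.
    """

    def __init__(self, dim=1024):
        super().__init__()
        # Learnable pairwise scores between dimensions; a row-wise softmax
        # turns them into each dimension's dependency distribution.
        self.dep_scores = nn.Parameter(torch.zeros(dim, dim))

    def forward(self, feats):
        # feats: (batch, n_items, dim), e.g. region or word embeddings.
        dep = F.softmax(self.dep_scores, dim=-1)   # (dim, dim) dependency weights
        enhanced = feats + feats @ dep.t()         # mix mutually dependent dimensions
        return enhanced.mean(dim=1)                # aggregate items into a single vector


# Hypothetical usage with random "region" features:
aggregator = CrossDimDependencyAggregator(dim=1024)
image_embedding = aggregator(torch.randn(8, 36, 1024))  # -> (8, 1024)
```

In the actual model, dimensions judged to share dependencies are enhanced and the rest suppressed via the sparse probabilistic mechanism; see the paper for details.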
We recommend the following dependencies.
- Python 3.6
- PyTorch 1.8.0
- NumPy (>1.19.5)
- TensorBoard
- The full environment specification can be found here
You can download the datasets from Baidu Cloud. The download links are Flickr30K and MSCOCO; the extraction code is USTC.
Train on Flickr30K:

```
sh train_region_f30k.sh
```

Train on MSCOCO:

```
sh train_region_coco.sh
```
Test on Flickr30K:

```
python test.py
```
To do cross-validation on MSCOCO, pass `fold5=True` with a model trained using `--data_name coco_precomp`:

```
python testall.py
```
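As a concrete illustration of the `fold5` option, here is a rough sketch of how such an evaluation is typically invoked in GPO-style codebases, which this repo builds on. The module layout, the `evalrank` signature, and the paths below are assumptions; please check `test.py`/`testall.py` for the actual interface.

```python
# Hypothetical sketch of a testall-style evaluation call (GPO-style API assumed);
# the actual entry point and argument names may differ in this repo.
from lib import evaluation  # assumed module layout, as in the GPO codebase

evaluation.evalrank(
    model_path="runs/coco_butd_region_bert/model_best.pth",  # placeholder checkpoint path
    data_path="data/coco_precomp",                           # placeholder data path
    split="testall",
    fold5=True,  # 5-fold cross-validation over the MSCOCO 1K test sets
)
```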
Please use the following BibTeX entry to cite this paper if you use any resources from this repo.
```
@inproceedings{zhang2023unlocking,
  title={Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching},
  author={Zhang, Kun and Zhang, Lei and Hu, Bo and Zhu, Mengxiao and Mao, Zhendong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={4828--4837},
  year={2023}
}
```