Official PyTorch implementation of the paper "Enhanced Semantic Similarity Learning Framework for Image-Text Matching" (IEEE TCSVT 2024).
Please use the following BibTeX entry to cite this paper if you use any resources from this repo.
@article{zhang2023enhanced,
author={Zhang, Kun and Hu, Bo and Zhang, Huatian and Li, Zhe and Mao, Zhendong},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
title={Enhanced Semantic Similarity Learning Framework for Image-Text Matching},
year={2024},
volume={34},
number={4},
pages={2973-2988}
}
We referred to the implementation of GPO when building our codebase.
Squares denote local dimension elements of a feature; circles denote measure-units, i.e., the minimal basic components used to examine semantic similarity. (a) Existing methods typically default to a static mechanism that examines only single-dimensional cross-modal correspondence. (b) Our key idea is to dynamically capture and learn multi-dimensional enhanced correspondence: the number of dimensions constituting a measure-unit is extended from a single dimension to hierarchical multi-level combinations, enriching the granularity of the information they examine and promoting more comprehensive semantic similarity learning.
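For intuition only (this sketch is not the authors' implementation), the snippet below shows one way hierarchical multi-dimensional measure-units could be formed: adjacent feature dimensions are pooled into units of increasing size (the unit sizes and the average pooling are illustrative assumptions), and a similarity is computed per granularity level.

```python
import torch
import torch.nn.functional as F

def hierarchical_unit_similarity(img_feat, txt_feat, unit_sizes=(1, 2, 4, 8)):
    """Illustrative sketch: compare two L2-normalized D-dim features at
    several measure-unit granularities (unit_sizes are hypothetical).

    img_feat, txt_feat: tensors of shape (B, D)
    Returns a (B, num_levels) tensor of per-level similarities.
    """
    sims = []
    for k in unit_sizes:
        B, D = img_feat.shape
        assert D % k == 0, "D must be divisible by the unit size"
        # Group k adjacent dimensions into one measure-unit by average pooling.
        img_units = img_feat.view(B, D // k, k).mean(dim=-1)
        txt_units = txt_feat.view(B, D // k, k).mean(dim=-1)
        # Cosine similarity over the units at this granularity level.
        sims.append(F.cosine_similarity(img_units, txt_units, dim=-1))
    return torch.stack(sims, dim=-1)  # (B, num_levels)
```

In ESL itself, the measure-units are captured dynamically by learnable intra-modal aggregators with an iterative enhancing mechanism rather than by fixed pooling; the sketch only conveys the shift from single-dimensional to multi-dimensional units.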
In this paper, different from single-dimensional correspondence with its limited semantic expressive capability, we propose a novel enhanced semantic similarity learning framework (ESL), which generalizes both measure-units and their correspondences into a dynamic learnable framework to examine the multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise intra-modal multi-dimensional aggregators with an iterative enhancing mechanism, which dynamically capture new measure-units integrated from hierarchical multi-dimensions, producing diverse semantic combinatorial expressive capabilities that provide richer and more discriminative information for similarity examination. Then, we devise inter-modal enhanced correspondence learning with sparse contribution degrees, which determines the cross-modal semantic similarity comprehensively and efficiently. Extensive experiments verify its superiority in achieving state-of-the-art performance.

The following tables show partial image-text retrieval results on the MSCOCO and Flickr30K datasets; R1/R5/R10 denote Recall@K for image-to-text (I2T) and text-to-image (T2I) retrieval, and Rsum is the sum of the six recall values. In these experiments, we use BERT-base as the text encoder. This branch provides the code and pre-trained models that use BERT as the text backbone. Some results are better than those reported in the paper. Note, however, that the ensemble results in the paper may not be reproducible from the two provided checkpoints, since the original best checkpoints were lost (not saved in time). You can train the model several more times and combine any two checkpoints to find the best ensemble performance. Please switch to the CLIP-based branch for the CLIP-based code and pre-trained models.
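As a rough, non-official illustration of the inter-modal enhanced correspondence learning described above, the sketch below fuses the per-level similarities from the previous snippet with learnable contribution degrees that are kept sparse; the module name and the top-k sparsification are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ContributionWeightedSimilarity(nn.Module):
    """Illustrative sketch (hypothetical module): fuse per-level similarities
    with learnable, sparsified contribution degrees."""

    def __init__(self, num_levels, keep_k=2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_levels))  # one degree per level
        self.keep_k = keep_k  # keep only the top-k levels -> sparse contributions

    def forward(self, level_sims):
        # level_sims: (B, num_levels) per-level similarities, e.g. the output
        # of hierarchical_unit_similarity() above.
        weights = torch.softmax(self.logits, dim=0)
        if self.keep_k < weights.numel():
            # Zero out all but the top-k contribution degrees, then renormalize.
            topk = torch.topk(weights, self.keep_k).indices
            mask = torch.zeros_like(weights).scatter_(0, topk, 1.0)
            weights = weights * mask
            weights = weights / weights.sum().clamp_min(1e-8)
        return (level_sims * weights).sum(dim=-1)  # (B,) fused similarity
```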
Results on the MSCOCO 1K test set:

| Model | Visual Backbone | Text Backbone | I2T R1 | I2T R5 | I2T R10 | T2I R1 | T2I R5 | T2I R10 | Rsum | Link |
|---|---|---|---|---|---|---|---|---|---|---|
| ESL-H | BUTD region | BERT-base | 82.5 | 97.4 | 99.0 | 66.2 | 91.9 | 96.7 | 533.5 | Here |
| ESL-A | BUTD region | BERT-base | 82.2 | 96.9 | 98.9 | 66.5 | 92.1 | 96.7 | 533.4 | Here |
Results on the MSCOCO 5K test set:

| Model | Visual Backbone | Text Backbone | I2T R1 | I2T R5 | I2T R10 | T2I R1 | T2I R5 | T2I R10 | Rsum | Link |
|---|---|---|---|---|---|---|---|---|---|---|
| ESL-H | BUTD region | BERT-base | 63.6 | 87.4 | 93.5 | 44.2 | 74.1 | 84.0 | 446.9 | Here |
| ESL-A | BUTD region | BERT-base | 63.0 | 87.6 | 93.3 | 44.5 | 74.4 | 84.1 | 447.0 | Here |
Results on the Flickr30K test set:

| Model | Visual Backbone | Text Backbone | I2T R1 | I2T R5 | I2T R10 | T2I R1 | T2I R5 | T2I R10 | Rsum | Link |
|---|---|---|---|---|---|---|---|---|---|---|
| ESL-H | BUTD region | BERT-base | 83.5 | 96.3 | 98.4 | 65.1 | 87.6 | 92.7 | 523.7 | Here |
| ESL-A | BUTD region | BERT-base | 84.3 | 96.3 | 98.0 | 64.1 | 87.4 | 92.2 | 522.4 | Here |
We recommend the following dependencies:
- Python 3.6
- PyTorch 1.8.0
- NumPy (>1.19.5)
- TensorBoard
- The full environment specification can be found in ESL.yaml; use conda env create -f ESL.yaml to create the corresponding environment.
You can download the datasets from Baidu Cloud. The download links are Flickr30K and MSCOCO; the extraction code is: USTC.
To train on Flickr30K:

sh train_region_f30k.sh

To train on MSCOCO:

sh train_region_coco.sh
For the dimensional selective mask, we design both a heuristic and an adaptive strategy. You can use the flag in vse.py (line 44)

heuristic_strategy = False

to control which strategy is selected: True selects the heuristic strategy, False selects the adaptive strategy.
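As a purely schematic illustration of such a switch (this is not the code in vse.py; every name except heuristic_strategy is hypothetical), a heuristic strategy could use a fixed, rule-based grouping of dimensions, while an adaptive strategy could predict the selection mask from the features:

```python
import torch
import torch.nn as nn

heuristic_strategy = False  # True -> heuristic strategy, False -> adaptive strategy

class DimensionalSelectiveMask(nn.Module):
    """Hypothetical sketch of selecting feature dimensions for measure-units."""

    def __init__(self, dim, num_units):
        super().__init__()
        # Adaptive path: predict a soft selection mask from the feature itself.
        self.mask_predictor = nn.Linear(dim, num_units * dim)
        self.num_units = num_units

    def forward(self, feat):  # feat: (B, D)
        B, D = feat.shape
        if heuristic_strategy:
            # Heuristic: fixed, evenly-spaced hard grouping of dimensions.
            mask = torch.eye(self.num_units, device=feat.device)
            mask = mask.repeat_interleave(D // self.num_units, dim=1)  # (U, D)
            mask = mask.unsqueeze(0).expand(B, -1, -1)
        else:
            # Adaptive: data-dependent soft mask learned end-to-end.
            mask = torch.sigmoid(self.mask_predictor(feat)).view(B, self.num_units, D)
        # Each measure-unit is a masked aggregation of the feature dimensions.
        return torch.einsum("bud,bd->bu", mask, feat)
```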
To test on Flickr30K:

python test.py

To do 5-fold cross-validation on MSCOCO, pass fold5=True to the evaluation with a model trained using --data_name coco_precomp:

python testall.py
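If you prefer to call the evaluation from Python rather than through the scripts, the GPO codebase this repo builds on exposes an evalrank entry point; the call below is a hedged sketch that assumes this repo keeps the same interface and layout (check test.py / testall.py for the exact signature and paths).

```python
# Hedged sketch: assumes a GPO/VSE-style evaluation API; verify against
# test.py / testall.py in this repo before using.
from lib import evaluation  # assumed module layout, as in the GPO codebase

# Flickr30K: single test split (model path is a placeholder).
evaluation.evalrank("runs/f30k/model_best.pth", data_path="data/", split="test")

# MSCOCO: 5-fold cross-validation on the 1K test sets
# (model trained with --data_name coco_precomp).
evaluation.evalrank("runs/coco/model_best.pth", data_path="data/", split="testall", fold5=True)
```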
To ensemble models, specify the model_path in test_stack.py, and run

python test_stack.py
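For context, ensembling in this line of work typically averages the similarity matrices produced by two trained checkpoints before computing the recall metrics; the sketch below shows that idea on placeholder NumPy arrays (it is not the logic of test_stack.py).

```python
import numpy as np

# Placeholder similarity matrices (num_images x num_captions) standing in for
# the scores produced by two separately trained checkpoints.
rng = np.random.default_rng(0)
sims_a = rng.standard_normal((1000, 5000))
sims_b = rng.standard_normal((1000, 5000))

# Ensemble by averaging the two similarity matrices before ranking.
sims = (sims_a + sims_b) / 2.0

def recall_at_k(sims, gt_index_per_image, k=1):
    """Fraction of images whose ground-truth caption is ranked in the top k."""
    ranks = np.argsort(-sims, axis=1)  # captions sorted by descending similarity
    hits = [gt in ranks[i, :k] for i, gt in enumerate(gt_index_per_image)]
    return float(np.mean(hits))

# Toy example assuming caption i is the ground truth for image i.
print(recall_at_k(sims, gt_index_per_image=np.arange(sims.shape[0]), k=1))
```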