Image captioning requires both visual understanding and linguistic processing to learn relationships between images and text. Although it is a difficult task, transformer-based models have made significant progress because they can map multi-modal inputs (images and text) into a combined semantic space. In this project, we analyze the image-captioning performance of the transformer-based CLIP pre-training method on the VizWiz-Captions dataset. We establish baseline metrics by implementing a Convolutional Neural Network - Recurrent Neural Network architecture on the same dataset. Our main contribution is to use CLIP's image encoder to improve the performance of the baseline model on image captioning, a task on which CLIP has not been studied extensively.
Baseline CNN-LSTM, Based on the NIC Model: The CNN learns a vector representation of the image, which is passed into the first unit of a 2-layer LSTM. The LSTM also takes the ground-truth caption as input and generates the hidden states and the output caption.
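As an illustration of this baseline wiring, here is a minimal PyTorch sketch of a CNN encoder feeding a 2-layer LSTM decoder. The ResNet-50 backbone, embedding sizes, and class names are illustrative assumptions, not the actual code in train_captioning.py.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Maps an image to a fixed-size feature vector (ResNet-50 backbone assumed)."""
    def __init__(self, embed_size=256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.cnn(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                 # (B, embed_size)

class LSTMDecoder(nn.Module):
    """2-layer LSTM conditioned on the image feature and the ground-truth caption."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feat, captions):
        # The image feature acts as the first "token" of the input sequence.
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)               # (B, T+1, vocab_size) logits
```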
CLIP Image-Encoder Architecture: The ViT model splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a standard transformer encoder.
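A short sketch of extracting such ViT image embeddings with OpenAI's clip package is shown below; the checkpoint name ("ViT-B/32") and image path are assumptions, not necessarily what this project uses.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ViT-B/32" is one of the available CLIP checkpoints; the variant used here is an assumption.
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)                 # (1, 512) ViT embedding
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```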
Attention-based Decoder Model Review:
All images and their captions were obtained from the VizWiz dataset.
The baseline CNN+RNN model is trained using train_captioning.py. The clip_rnn_image_captioning.ipynb notebook adapts this code to train the modified CLIP+RNN and CLIP+Attention models.
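For context, the sketch below shows one plausible way the frozen CLIP image embedding can replace the baseline's CNN feature as input to the LSTM decoder. It reuses the hypothetical model and LSTMDecoder from the sketches above and is not taken from the notebook itself.

```python
import torch
import torch.nn as nn

# Hypothetical wiring: the frozen CLIP image embedding stands in for the CNN feature.
# `model` (CLIP) and `LSTMDecoder` refer to the sketches above, not the repository code.
decoder = LSTMDecoder(vocab_size=10_000, embed_size=512)   # 512 matches ViT-B/32's output
criterion = nn.CrossEntropyLoss(ignore_index=0)            # assume index 0 is the pad token
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def training_step(images, captions):
    with torch.no_grad():                                  # keep the CLIP encoder frozen
        feats = model.encode_image(images).float()         # (B, 512)
    logits = decoder(feats, captions[:, :-1])              # teacher forcing, (B, T, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```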
For validation, in addition to the loss, BLEU, CIDEr, and other metrics were evaluated on the saved generated captions. vizwiz_caption_evaluation.ipynb uses the evaluation code in vizwiz_eval_cap to calculate these scores. This code was obtained from here. The generated captions and resulting scores are stored in the validation results/ directory.
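As a hedged illustration, assuming vizwiz_eval_cap follows the COCO caption-evaluation API it is derived from, a toy evaluation with the pycocoevalcap package looks like this (image ids and captions are made up):

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Ground-truth and generated captions keyed by image id (toy example).
gts = {"img1": [{"caption": "a bottle of water sitting on a wooden table"}]}
res = {"img1": [{"caption": "a water bottle on a table"}]}

tokenizer = PTBTokenizer()
gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1 through BLEU-4
cider_score, _ = Cider().compute_score(gts, res)
print(bleu_scores, cider_score)
```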
Testing on all models was performed using testing.py to generate image-caption pairs and t-SNE plots of the learned image and word latent spaces. These results are saved in the test results/ directory.
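A minimal sketch of producing such a t-SNE plot with scikit-learn, assuming the test script saves the learned image and caption embeddings as NumPy arrays (the file names below are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder file names; assumes both arrays share the same feature dimension.
image_embeds = np.load("image_embeddings.npy")      # (N, D)
caption_embeds = np.load("caption_embeddings.npy")  # (M, D)

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.concatenate([image_embeds, caption_embeds], axis=0)
)
n = len(image_embeds)
plt.scatter(points[:n, 0], points[:n, 1], s=5, label="images")
plt.scatter(points[n:, 0], points[n:, 1], s=5, label="captions")
plt.legend()
plt.savefig("tsne_image_caption.png", dpi=200)
```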
BLEU-1/2/3/4 and METEOR metrics compared with other models:
CLIP+LSTM t-SNE plot of image and caption embeddings:
- Saad Saleem (University of Toronto, [email protected])
- Shamitra Rohan (University of Toronto, [email protected])
- Malikeh Ehghaghi (University of Toronto, [email protected])