Image captioning requires both visual understanding and linguistic processing to learn relationships between images and text. Although it is a difficult task, transformer-based models have made significant progress because they can map multi-modal inputs (images and text) into a combined semantic space. In this project, we analyze the image-captioning performance of the transformer-based CLIP pre-training method on the VizWiz-Captions dataset. We establish baseline metrics by implementing a Convolutional Neural Network - Recurrent Neural Network architecture on the same dataset. Our main contribution is to use CLIP's image encoder to improve the performance of the baseline model on image captioning, a task on which CLIP has not been studied extensively.
Baseline CNN-LSTM, Based on the NIC Model: The CNN learns a vector representation of the image, which is passed into the first unit of a 2-layer LSTM. The LSTM also takes the ground-truth caption as input and generates the hidden states and the output caption.
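As an illustration of this baseline wiring, here is a minimal PyTorch sketch of a CNN encoder feeding a 2-layer LSTM decoder. The ResNet-50 backbone, embedding sizes, and class names are illustrative assumptions, not the actual code in train_captioning.py.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Maps an image to a fixed-size feature vector (ResNet-50 backbone assumed)."""
    def __init__(self, embed_size=256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.cnn(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                 # (B, embed_size)

class LSTMDecoder(nn.Module):
    """2-layer LSTM conditioned on the image feature and the ground-truth caption."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feat, captions):
        # The image feature acts as the first "token" of the input sequence.
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)               # (B, T+1, vocab_size) logits
```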
CLIP Image-Encoder Architecture: The ViT model splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a standard transformer encoder.
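A short sketch of extracting such ViT image embeddings with OpenAI's clip package is shown below; the checkpoint name ("ViT-B/32") and image path are assumptions, not necessarily what this project uses.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ViT-B/32" is one of the available CLIP checkpoints; the variant used here is an assumption.
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)                 # (1, 512) ViT embedding
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```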
Attention-based Decoder Model Review:
All images and their captions were obtained from the VizWiz dataset.
The baseline CNN+RNN model is trained using train_captioning.py. The clip_rnn_image_captioning.ipynb notebook adapts this code to train the modified CLIP+RNN and CLIP+Attention models.
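For context, the sketch below shows one plausible way the frozen CLIP image embedding can replace the baseline's CNN feature as input to the LSTM decoder. It reuses the hypothetical model and LSTMDecoder from the sketches above and is not taken from the notebook itself.

```python
import torch
import torch.nn as nn

# Hypothetical wiring: the frozen CLIP image embedding stands in for the CNN feature.
# `model` (CLIP) and `LSTMDecoder` refer to the sketches above, not the repository code.
decoder = LSTMDecoder(vocab_size=10_000, embed_size=512)   # 512 matches ViT-B/32's output
criterion = nn.CrossEntropyLoss(ignore_index=0)            # assume index 0 is the pad token
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def training_step(images, captions):
    with torch.no_grad():                                  # keep the CLIP encoder frozen
        feats = model.encode_image(images).float()         # (B, 512)
    logits = decoder(feats, captions[:, :-1])              # teacher forcing, (B, T, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```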
For validation, in addition to the loss, BLEU, CIDEr, and other metrics were evaluated on the saved generated captions. vizwiz_caption_evaluation.ipynb uses the evaluation code in vizwiz_eval_cap to calculate these scores. This code was obtained from here. The generated captions and resulting scores are stored in the validation results/ directory.
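As a hedged illustration, assuming vizwiz_eval_cap follows the COCO caption-evaluation API it is derived from, a toy evaluation with the pycocoevalcap package looks like this (image ids and captions are made up):

```python
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Ground-truth and generated captions keyed by image id (toy example).
gts = {"img1": [{"caption": "a bottle of water sitting on a wooden table"}]}
res = {"img1": [{"caption": "a water bottle on a table"}]}

tokenizer = PTBTokenizer()
gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1 through BLEU-4
cider_score, _ = Cider().compute_score(gts, res)
print(bleu_scores, cider_score)
```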
Testing on all models was performed using testing.py to generate image-caption pairs and t-SNE plots of the learned image and word latent spaces. These results are saved in the test results/ directory.
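A minimal sketch of producing such a t-SNE plot with scikit-learn, assuming the test script saves the learned image and caption embeddings as NumPy arrays (the file names below are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder file names; assumes both arrays share the same feature dimension.
image_embeds = np.load("image_embeddings.npy")      # (N, D)
caption_embeds = np.load("caption_embeddings.npy")  # (M, D)

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.concatenate([image_embeds, caption_embeds], axis=0)
)
n = len(image_embeds)
plt.scatter(points[:n, 0], points[:n, 1], s=5, label="images")
plt.scatter(points[n:, 0], points[n:, 1], s=5, label="captions")
plt.legend()
plt.savefig("tsne_image_caption.png", dpi=200)
```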
BLEU-1/2/3/4 and METEOR metrics compared with other models:
CLIP+LSTM t-SNE plot of image and caption embeddings:
- Saad Saleem (University of Toronto, [email protected])
- Shamitra Rohan (University of Toronto, [email protected])
- Malikeh Ehghaghi (University of Toronto, [email protected])