
Automate Fashion Image Captioning using BLIP-2

Problem Description

Given a clothing image, generate a short caption that describes the item. Compared with general image captioning datasets (e.g., COCO, Flickr), the descriptions of fashion items have unique features that make automatic caption generation a challenging task. In particular, fashion captioning needs to describe the attributes of an item, while general image captioning narrates the objects in an image and their relations. For example, given an image of a model wearing a shirt, a general captioning model would describe it as "male wearing a white shirt". This is not what we want, since the model should describe the item itself. In this application, it is much more important to have a performant captioning model than an interpretable one.

Solution

We use BLIP-2 to solve this problem.
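
As a minimal sketch of how BLIP-2 can caption a fashion image, the snippet below uses the Hugging Face transformers implementation. The checkpoint name Salesforce/blip2-opt-2.7b and the image path are illustrative assumptions, not fixed by this repository:

```python
# Minimal BLIP-2 captioning sketch (assumes the Hugging Face `transformers`
# implementation; the checkpoint and image path are illustrative).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any fashion product photo; the path is a placeholder.
image = Image.open("example_fashion_item.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device, dtype)

# Generate a short caption describing the item's attributes.
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```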

Requirements

Refer to requirements.txt.

Dataset

We use the FAshion CAptioning Dataset (FACAD), a fashion captioning dataset consisting of over 993K images.

Properties of FACAD dataset:

- Diverse fashion images covering all four seasons, ages (kids and adults), categories (clothing, shoes, bags, accessories, etc.), and angles of the human body (front, back, side, etc.).
- It tackles the captioning problem for fashion items: FACAD contains fine-grained descriptions of the attributes of fashion items, while MS COCO narrates the objects and their relations in general images.
- FACAD has longer captions (21 words per sentence on average) compared with 10.4 words per sentence in the MS COCO caption dataset.
- The expression style of FACAD is enchanting, while that of MS COCO is plain, without rich expressions. For example, words like "pearly", "so-simple yet so-chic", and "retro flair" are more attractive than plain MS COCO descriptions such as "person in a dress".

Technique

Metric

The metrics most commonly used for the image captioning task, which measure the quality of a predicted caption against reference texts, are:

  1. Bilingual Evaluation Understudy (BLEU) Score: a concept built on precision.

     BLEU = Number of correct predicted words / Number of total predicted words
    
  2. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score: a set of metrics rather than a single one. ROUGE returns recall, precision, and F1-score; in this project we use the F1 ROUGE score. Its concept is built on recall.

     Recall-N-gram = Number of correct predicted N-grams / Number of total target N-grams

     Precision-N-gram = Number of correct predicted N-grams / Number of total predicted N-grams

     F1-Score = 2 * (Recall-N-gram * Precision-N-gram) / (Recall-N-gram + Precision-N-gram)
    

Both of these scores are built on the concept of n-grams: an n-gram is a group of n consecutive words, taken in order. For this project we use n = 2.
Because the caption of a fashion item is essentially a list of attributes, the order in which the model predicts those attribute words does not matter much.
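
As a minimal sketch of the formulas above, the snippet below computes bigram (n = 2) precision, recall, and F1 from scratch. The example caption strings are illustrative only, not taken from FACAD:

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_scores(prediction, target, n=2):
    """N-gram precision, recall, and F1, following the formulas above."""
    pred_ngrams = ngrams(prediction.split(), n)
    tgt_ngrams = ngrams(target.split(), n)
    # "Correct" n-grams are those appearing in both prediction and target
    # (clipped counts, as in BLEU/ROUGE).
    correct = sum((pred_ngrams & tgt_ngrams).values())
    precision = correct / max(sum(pred_ngrams.values()), 1)
    recall = correct / max(sum(tgt_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative captions.
pred = "white cotton shirt with retro flair"
ref = "so-chic white cotton shirt with pearly buttons"
p, r, f1 = ngram_scores(pred, ref, n=2)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```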
