Full image support (LLM operators & embeddings w/ CLIP) #37
base: main
Conversation
CLIP integration and partial vision LLM support working
- after 3 failed backoffs when making an OpenAI LLM request, it skips
- implemented a basic method for creating combined text+image embeddings with CLIP
removed accidental testing image data from repo
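For illustration, a minimal sketch of the retry-then-skip behavior described in the commits above; the helper name and parameters are hypothetical, not the PR's actual code:

```python
import time

def call_llm_with_backoff(make_request, max_retries=3, base_delay=1.0):
    """Hypothetical sketch: retry a failing OpenAI LLM request with
    exponential backoff, and give up (return None) after the retries fail."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception:
            # back off exponentially before the next attempt
            time.sleep(base_delay * (2 ** attempt))
    # all retries failed: signal the caller to skip this item
    return None
```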
Thanks for the awesome work on this @jmelovich! It looks like there is overlap with PR #33, which plans to support images using the pandas types extension, which will be slightly more extensible as we add support for more types as well. We've started a review of PR #33 and plan to merge it soon -- can you compare and merge with it instead of main? Also, we'd be happy to coordinate ongoing dev efforts with you! Feel free to join our slack here so we can coordinate offline as well: https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-2tnq6948j-juGuSIR0__fsh~kUmZ6TJw
Yes, I should be able to rework it fairly simply to use the pandas types extension. Are you sure you meant to mention PR #37? That is this current one; I assume you meant to refer to PR #33?
Yes, sorry, I meant PR #33, which adds support for images and has an implementation for each operator. One other main difference is that you added a new class for CLIP, although I believe we can support it using SentenceTransformers (example here: https://www.sbert.net/examples/applications/image-search/README.html)
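For reference, the linked SentenceTransformers example boils down to something like the sketch below; the file path and caption are placeholders:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# sbert's CLIP wrapper: one model encodes both images and text into the same space
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("dog.jpg"))          # image embedding (placeholder file)
txt_emb = model.encode(["a photo of a dog playing"])   # text embedding

print(util.cos_sim(img_emb, txt_emb))                  # cross-modal similarity
```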
Ok, interesting, I was not familiar with SentenceTransformers, so I'll check that out. In addition, one of the most useful things I added in my CLIP implementation was the ability to create combined text & image embeddings, so that an image and text together can be used to produce a single embedding; this has proved very useful on some VQA datasets I've tested, like Infoseek. If there is a way to implement this CLIP class more simply with SentenceTransformers, I will look into it. Also, I want to note that my PR does support each operator I've tested, which is all but sem_dedup and sem_extract (I'm just not sure what to extract from an image).
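One simple way to get combined text+image embeddings on top of SentenceTransformers could look like the sketch below; the weighted-average fusion and the `combined_embedding` helper are illustrative assumptions, not the CLIPModelRetriever implementation from this PR:

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

def combined_embedding(text: str, image_path: str, text_weight: float = 0.5) -> np.ndarray:
    """Illustrative only: fuse a CLIP text embedding and a CLIP image embedding
    for the same row into a single vector via a weighted average of the
    L2-normalized embeddings."""
    txt = model.encode(text)
    img = model.encode(Image.open(image_path))
    txt = txt / np.linalg.norm(txt)
    img = img / np.linalg.norm(img)
    fused = text_weight * txt + (1.0 - text_weight) * img
    return fused / np.linalg.norm(fused)
```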
Changes:
- `load_images(img_path_col, encoded_img_col)` (in `utils.py`): the first parameter refers to a column of file paths to images (or URLs), and the second parameter is the name of the new column to be appended. It iterates through the dataframe, loads each image from its path, and encodes it into a base64 string (to cut down on the conversions needed later for any vision LLM call). For example, `df.load_images("image_path", "image")` -> the `image` column can then be used as normal for getting the images in any other LOTUS operation (usage sketch below).
- `CLIPModelRetriever`, a new retriever class for CLIP, e.g. `rm = CLIPModelRetriever(similarity_weights=[0.4, 0.4, 0.1, 0.1])  # [text-text, image-image, text-image, image-text]`
- `sem_search(chunk_size=1000)`: `sem_search` now takes a `chunk_size` parameter.
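Putting the pieces above together, a hedged usage sketch; the `CLIPModelRetriever` import path, the placeholder data, and the exact `sem_index`/`sem_search` signatures are assumptions and may differ on this branch:

```python
import pandas as pd
import lotus
from lotus.models import CLIPModelRetriever  # import path assumed for this branch

# Register the CLIP retriever added by this PR with LOTUS
rm = CLIPModelRetriever(similarity_weights=[0.4, 0.4, 0.1, 0.1])  # [text-text, image-image, text-image, image-text]
lotus.settings.configure(rm=rm)

df = pd.DataFrame({"image_path": ["imgs/cat.jpg", "imgs/dog.jpg"]})  # placeholder paths

# Load each image from its path (or URL) and append it base64-encoded as 'image'
df.load_images("image_path", "image")

# 'image' now behaves like any other column in LOTUS operators, e.g. semantic search
df = df.sem_index("image", "image_index")  # build an index over the image column first
top = df.sem_search("image", "an animal on a couch", K=1, chunk_size=1000)
```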
I did most of my testing on larger datasets, but I created a very simple Jupyter notebook (examples/multimodal_tests.ipynb) that demonstrates CLIP working with a dataframe of images, as well as sem_topk, sem_filter, sem_map, and sem_search. In my own testing, sem_sim_join and sem_agg work as well.