Full image support (LLM operators & embeddings w/ CLIP) #37
base: main
Conversation
CLIP integration and partial vision LLM support working
- after 3 failed backoffs when making an OpenAI LLM request, it skips
- implemented a basic method for creating combined text+image embeddings with CLIP
removed accidental testing image data from repo
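For illustration, a minimal sketch of the retry-then-skip behavior described in the commits above; the helper name and parameters are hypothetical, not the PR's actual code:

```python
import time

def call_llm_with_backoff(make_request, max_retries=3, base_delay=1.0):
    """Hypothetical sketch: retry a failing OpenAI LLM request with
    exponential backoff, and give up (return None) after the retries fail."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception:
            # back off exponentially before the next attempt
            time.sleep(base_delay * (2 ** attempt))
    # all retries failed: signal the caller to skip this item
    return None
```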
Thanks for the awesome work on this @jmelovich! It looks like there is overlap with PR #33, which plans to support images using the pandas types extension, which will be slightly more extensible as we add support for more types as well. We've started a review of PR #33 and plan to merge it soon -- can you compare and merge with it instead of main? Also, we'd be happy to coordinate ongoing dev efforts with you! Feel free to join our slack here so we can coordinate offline as well: https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-2tnq6948j-juGuSIR0__fsh~kUmZ6TJw
Yes, I should be able to rework it fairly simply to use the pandas types extension. Are you sure you meant to mention PR #37? That is this current one; I assume you meant to refer to PR #33?
Yes, sorry, I meant PR #33, which adds support for images and has an implementation for each operator. One other main difference is that you added a new class for CLIP, although I believe we can support it using SentenceTransformers (example here: https://www.sbert.net/examples/applications/image-search/README.html)
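For reference, the linked SentenceTransformers example boils down to something like the sketch below; the file path and caption are placeholders:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# sbert's CLIP wrapper: one model encodes both images and text into the same space
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("dog.jpg"))          # image embedding (placeholder file)
txt_emb = model.encode(["a photo of a dog playing"])   # text embedding

print(util.cos_sim(img_emb, txt_emb))                  # cross-modal similarity
```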
Ok, interesting, I was not familiar with SentenceTransformers, so I'll check that out. In addition, one of the most useful things I added in my CLIP implementation was the ability to create combined text & image embeddings, so that an image and text together can be used to produce a single embedding; this has proved very useful on some VQA datasets I've tested, like Infoseek. If there is a way to implement this CLIP class more simply with SentenceTransformers, I will look into it. Also, I want to note that my PR does support each operator I've tested, which is all but sem_dedup and sem_extract (I'm just not sure what to extract from an image).
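One simple way to get combined text+image embeddings on top of SentenceTransformers could look like the sketch below; the weighted-average fusion and the `combined_embedding` helper are illustrative assumptions, not the CLIPModelRetriever implementation from this PR:

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

def combined_embedding(text: str, image_path: str, text_weight: float = 0.5) -> np.ndarray:
    """Illustrative only: fuse a CLIP text embedding and a CLIP image embedding
    for the same row into a single vector via a weighted average of the
    L2-normalized embeddings."""
    txt = model.encode(text)
    img = model.encode(Image.open(image_path))
    txt = txt / np.linalg.norm(txt)
    img = img / np.linalg.norm(img)
    fused = text_weight * txt + (1.0 - text_weight) * img
    return fused / np.linalg.norm(fused)
```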
Changes:
- `load_images(img_path_col, encoded_img_col)` (in `utils.py`): the first parameter refers to a column of file paths to images (or URLs), and the second parameter is the name of the new column to be appended. It iterates through the dataframe, loads each image from its path, and encodes it into a base64 string (to cut down on the conversions needed later for any vision LLM call). For example, `df.load_images("image_path", "image")` -> the `image` column can then be used as normal for getting the images in any other LOTUS operation (usage sketch below).
- `CLIPModelRetriever`, a new retriever class for CLIP, e.g. `rm = CLIPModelRetriever(similarity_weights=[0.4, 0.4, 0.1, 0.1])  # [text-text, image-image, text-image, image-text]`
- `sem_search(chunk_size=1000)`: `sem_search` now takes a `chunk_size` parameter.
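Putting the pieces above together, a hedged usage sketch; the `CLIPModelRetriever` import path, the placeholder data, and the exact `sem_index`/`sem_search` signatures are assumptions and may differ on this branch:

```python
import pandas as pd
import lotus
from lotus.models import CLIPModelRetriever  # import path assumed for this branch

# Register the CLIP retriever added by this PR with LOTUS
rm = CLIPModelRetriever(similarity_weights=[0.4, 0.4, 0.1, 0.1])  # [text-text, image-image, text-image, image-text]
lotus.settings.configure(rm=rm)

df = pd.DataFrame({"image_path": ["imgs/cat.jpg", "imgs/dog.jpg"]})  # placeholder paths

# Load each image from its path (or URL) and append it base64-encoded as 'image'
df.load_images("image_path", "image")

# 'image' now behaves like any other column in LOTUS operators, e.g. semantic search
df = df.sem_index("image", "image_index")  # build an index over the image column first
top = df.sem_search("image", "an animal on a couch", K=1, chunk_size=1000)
```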
I did most of my testing on larger datasets, but I created a very simple Jupyter notebook (examples/multimodal_tests.ipynb) that demonstrates CLIP working with a dataframe of images, as well as sem_topk, sem_filter, sem_map, and sem_search. In my own testing, sem_sim_join and sem_agg work as well.