
Full image support (LLM operators & embeddings w/ CLIP) #37

Open
wants to merge 13 commits into main
Conversation

jmelovich

Changes:

  • Added a new dataframe operator load_images(img_path_col, encoded_img_col) (in utils.py). The first parameter names a column of file paths (or URLs) to images; the second is the name of the new column to append. The operator iterates through the dataframe, loads each image from its path, and encodes it as a base64 string (to cut down on the conversions needed later for any vision LLM call).
    • Example usage: df.load_images("image_path", "image"), after which the 'image' column can be used like any other column in LOTUS operations.
  • Modified task_instructions.py to automatically extract any image data and correctly append in the messages array
  • Modified sem_topk.py to also correctly extract image data and append it to the messages array
  • Added support for the CLIP family of embedding/retrieval models (in clip_model.py)
    • Included batching support to help reduce memory usage when indexing large datasets
    • Also included a custom method of creating combined text+image embeddings, with user configurable weights (by default an even split)
      • Example usage: rm = CLIPModelRetriever(similarity_weights=[0.4, 0.4, 0.1, 0.1]) # [text-text, image-image, text-image, image-text]
  • Added chunking to reduce memory usage for the sem_search operation. The chunk size is user-configurable: sem_search(chunk_size=1000)
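For readers skimming the change list, a minimal standalone sketch of what the load_images operator might look like (this is an illustrative reimplementation, not the actual utils.py code; it reads raw file bytes directly and omits URL fetching):

```python
import base64

import pandas as pd


def load_images(df: pd.DataFrame, img_path_col: str, encoded_img_col: str) -> pd.DataFrame:
    """Append a column of base64-encoded image bytes, one per file path.

    Reads each file referenced in `img_path_col` and stores its contents as
    a base64 string in the new `encoded_img_col` column.
    """
    encoded = []
    for path in df[img_path_col]:
        with open(path, "rb") as f:
            encoded.append(base64.b64encode(f.read()).decode("utf-8"))
    return df.assign(**{encoded_img_col: encoded})
```

In the PR the operator is registered on the dataframe itself (df.load_images(...)); here it is written as a free function for clarity.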

I did most of my testing on larger datasets, but I created a simple Jupyter notebook (examples/multimodal_tests.ipynb) that demonstrates CLIP working with a dataframe of images, as well as sem_topk, sem_filter, sem_map, and sem_search. In my own testing, sem_sim_join and sem_agg work as well.
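The chunked sem_search idea from the change list can be sketched as a plain top-k scan over embedding chunks (chunked_search and its arguments are hypothetical names for illustration; only one chunk of normalized vectors is materialized at a time):

```python
import numpy as np


def chunked_search(query_vec, doc_vecs, k, chunk_size=1000):
    """Score documents against a query in fixed-size chunks, then take top-k.

    Normalizing one chunk at a time keeps the peak memory footprint at
    O(chunk_size * dim) instead of O(num_docs * dim) for the scratch copies.
    """
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = np.empty(len(doc_vecs))
    for start in range(0, len(doc_vecs), chunk_size):
        chunk = doc_vecs[start:start + chunk_size]
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        scores[start:start + chunk_size] = chunk @ query_vec  # cosine scores
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

The chunking only changes when intermediate arrays are allocated, so the result is identical to scoring all documents at once.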

@liana313
Collaborator

liana313 commented Nov 22, 2024

Thanks for the awesome work on this @jmelovich! It looks like there is overlap with PR #33, which plans to support images using the pandas types extension; that approach will be slightly more extensible as we add support for more types. We've started a review of PR #33 and plan to merge it soon -- can you compare and merge with it instead of main? Also, we'd be happy to coordinate ongoing dev efforts with you! Feel free to join our Slack so we can coordinate offline as well: https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-2tnq6948j-juGuSIR0__fsh~kUmZ6TJw

@jmelovich
Author

> Thanks for the awesome work on this @jmelovich! It looks like there is overlap with PR #37, which plan to support images using the pandas types extension, which will be slightly more extensible as we add support more types as well. We've started a review of PR #37 and plan to merge it soon -- can you compare and merge with it instead of main? Also we'd be happy to coordinate ongoing dev efforts with you! Feel free to join our slack here so we can coordinate offline as well https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-2tnq6948j-juGuSIR0__fsh~kUmZ6TJw

Yes, I should be able to rework it fairly simply to use the pandas types extension. Are you sure you meant to mention PR #37? That is this current one; I assume you meant to refer to PR #33?

@liana313
Collaborator

Yes, sorry, I meant PR #33 (edited), which adds support for images and has an implementation for each operator. One other main difference is that you added a new class for CLIP, although I believe we can support it using SentenceTransformers (example here https://www.sbert.net/examples/applications/image-search/README.html)

@jmelovich
Author

> Yes, sorry, I meant PR #33 (edited), which adds support for images and has an implementation for each operator. One other main difference is that you added a new class for CLIP, although I believe we can support it using SentenceTransformers (example here https://www.sbert.net/examples/applications/image-search/README.html)

Ok, interesting; I was not familiar with SentenceTransformers, so I'll check that out. One of the most useful things I added in my CLIP implementation is the ability to create combined text & image embeddings, so that an image and text together produce a single embedding; this has proved very useful on some VQA datasets I've tested, like Infoseek. If there is a way to implement this CLIP class more simply with SentenceTransformers, I will look into it.
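The four-way similarity weighting described in the PR could be sketched like this (combined_similarity is a hypothetical helper, not the clip_model.py code; it assumes unit-normalized CLIP text and image vectors for both query and document):

```python
import numpy as np


def combined_similarity(q_txt, q_img, d_txt, d_img,
                        weights=(0.25, 0.25, 0.25, 0.25)):
    """Blend four cross-modal cosine similarities into one score.

    weights = [text-text, image-image, text-image, image-text], matching
    the similarity_weights order in the PR; an even split by default.
    All embedding arguments are assumed unit-normalized, so a dot product
    is a cosine similarity.
    """
    w_tt, w_ii, w_ti, w_it = weights
    return (w_tt * np.dot(q_txt, d_txt)
            + w_ii * np.dot(q_img, d_img)
            + w_ti * np.dot(q_txt, d_img)
            + w_it * np.dot(q_img, d_txt))
```

The cross terms (text-image, image-text) are what let a text query match an image column and vice versa, which is useful on VQA-style data where the question and the picture each carry part of the signal.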

Also, I want to note that my PR supports every operator I've tested, which is all but sem_dedup and sem_extract (I'm just not sure what to extract from an image).
