The first question:
In mm-shap_clip_dataset.py (line 65 at commit 00a66bf), I checked this code and tried to apply it to the model I am interested in, LLaVA-Next (https://huggingface.co/docs/transformers/model_doc/llava_next). I know the number 49406 (49408 minus 2, ruling out the CLS and SEP tokens) represents the vocab_size. Since the corresponding parameter in LLaVA-Next is None by default, I am wondering how to pick an appropriate number for it, and for the other parameters as well. If you have any idea about it, please let me know.
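For illustration, here is a minimal sketch (not from the MM-SHAP repo; it assumes the llava-hf/llava-v1.6-mistral-7b-hf checkpoint) of how the vocabulary size and the special-token ids could be read from the LLaVA-Next processor instead of being hard-coded the way 49408/49406 are for CLIP:

from transformers import LlavaNextProcessor

# Load the LLaVA-Next processor; its tokenizer carries the vocabulary information
# that the CLIP script hard-codes as 49408 / 49406.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
tokenizer = processor.tokenizer

# Tokenizer size may differ slightly from the config's vocab_size because of added special tokens.
vocab_size = len(tokenizer)
# Special-token ids one would likely want to exclude when masking or replacing tokens.
special_ids = set(tokenizer.all_special_ids)
print(vocab_size, special_ids)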
The second question:
I found an example, shown below:
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
# Reference: https://huggingface.co/docs/transformers/model_doc/clip
Usually, it would need a text input asking for the caption. However, I didn't see that asking part in 'mm-shap_clip_dataset.py' (a prompt-based call is sketched after the configs below).
There are some parameters I would need to revise when I implement LLaVA-Next:
LLaVA-Next: https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/blob/main/config.json
image_size: 336
vocab_size: 32000
CLIP: https://huggingface.co/openai/clip-vit-base-patch32/blob/main/config.json
image_size: 224
vocab_size: 49408
It seems to me that LLaVA-Next is more complex than the CLIP model, since it splits a picture into four parts. Here is the setting of the experiment I would like to do on LLaVA-Next with the MM-SHAP metric.
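For illustration, here is a minimal LLaVA-Next captioning sketch (the prompt wording and generation settings are my own assumptions, not something from the MM-SHAP repo). Unlike CLIP, which scores image-text pairs directly, LLaVA-Next needs a text prompt that asks for the caption and then generates the answer autoregressively:

import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The "asking part": a prompt that requests a caption (Mistral instruction format).
prompt = "[INST] <image>\nDescribe the image in one sentence. [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))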
What a coincidence, I am currently also looking into LLaVA and extending MM-SHAP to such models as well.
MM-SHAP as presented in the paper works only for encoders; what you are talking about, and what I am looking into, are autoregressive / decoder models.
I am also currently writing my thesis and submitting it this month, so I am super busy; I will look into this more deeply in May and can get back to you around then.
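To make the encoder vs. decoder point concrete, here is a rough sketch (my own simplification, not code from either repo) of a multimodal-degree ratio once per-token Shapley values are available; for a decoder model this would have to be computed per generated token, or aggregated over the generated answer, rather than once per image-text matching score:

import numpy as np

def multimodal_degree(shap_values, num_image_tokens):
    # shap_values: 1-D array of per-token Shapley values for one prediction,
    # with the first num_image_tokens entries belonging to image patches.
    contrib = np.abs(np.asarray(shap_values, dtype=float))
    v = contrib[:num_image_tokens].sum()   # visual contribution
    t = contrib[num_image_tokens:].sum()   # textual contribution
    v_degree = v / (v + t)                 # share attributed to the image
    return v_degree, 1.0 - v_degree        # (image degree, text degree)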
You might find what you were asking for here: https://github.com/Heidelberg-NLP/CC-SHAP-VLM
While working on my thesis I ran new experiments that include MM-SHAP on three VL decoder models. The new experiments are featured in the paper linked in the new repo.
Just wanted to drop this now; I am still busy with thesis writing and do not have much time to polish things. 🙈