
Fix visual encoders with no CLS #11982

Merged: 1 commit into ggml-org:master on Feb 21, 2025

Conversation

alex-jw-brooks (Contributor) commented on Feb 20, 2025

This PR fixes the bug outlined in issue #10157, which has also been discussed in projects that leverage llama.cpp, such as ollama: ollama/ollama#7441, ollama/ollama-python#433.

Summary

In clip.cpp, we initialize a "patches" vector, which is then used to index into the embedding tensor with a get-rows op.

With the CPU backend, this triggers an out-of-bounds assertion whenever the visual encoder has no CLS embedding, e.g., siglip. Concretely:

  • Siglip produces 729 patch embeddings and has no CLS
  • This causes the "patches" vector to be initialized with the values [1, 2, ..., 729] instead of the correct [0, 1, ..., 728]
  • Because of the offset for a CLS that isn't there, the last index (729) is out of range and triggers the assertion (see the sketch below)
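
For illustration, here is a minimal sketch of the off-by-one and the fix; the function and flag names are mine for this write-up, not the exact clip.cpp diff:

```cpp
#include <cstdint>
#include <vector>

// Sketch of how the "patches" index vector is built. With a CLS token, row 0
// of the embedding tensor is the CLS embedding, so patch rows start at 1.
// Without CLS (e.g., siglip), patch rows start at 0 and must not be shifted.
std::vector<int32_t> build_patch_indices(int num_patches, bool has_class_embedding) {
    std::vector<int32_t> patches(num_patches);
    const int32_t offset = has_class_embedding ? 1 : 0; // the fix: only skip row 0 when a CLS row exists
    for (int i = 0; i < num_patches; ++i) {
        patches[i] = i + offset; // previously hardcoded to i + 1
    }
    return patches; // siglip (729 patches, no CLS): [0, ..., 728] instead of the out-of-range [1, ..., 729]
}
```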

Steps to Verify

Pick a model whose visual encoder has no CLS. I verified this fix with granite vision, which uses siglip, but you can also check it with nanollava.

  1. Download the GGUF files from ollama:
wget https://registry.ollama.ai/v2/qnguyen3/nanollava/blobs/sha256:511ad0036913a93bd04aa1c08de98bcdfa15bcbe0e03e5e9e4334039531ba863 -O model.gguf

wget https://registry.ollama.ai/v2/qnguyen3/nanollava/blobs/sha256:8a16a1e306eba4791488fd4f9585403ecb03da9b71d9f36e8944c33f35ca8754 -O projector.gguf
  2. Build the llava CLI: cmake --build build --config Release --target llama-llava-cli

  3. Try running the model:

MODEL_GGUF_PATH=/Users/alexanderjbrooks/Desktop/nanollava/model.gguf
PROJECT_GGUF_PATH=/Users/alexanderjbrooks/Desktop/nanollava/projector.gguf
IMG=~/Desktop/duck.jpg

./build/bin/llama-llava-cli -m $MODEL_GGUF_PATH \
    --mmproj $PROJECT_GGUF_PATH \
    --image $IMG \
    -p "<|im_start|>user<image>\nCan you describe this image?<|im_end|>\n<|im_start|>assistant" \
    --temp 0

On main, it blows up because of the out-of-range patch index 729:

/Users/alexanderjbrooks/workspace/develop/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:8518: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
zsh: abort      ./build/bin/llama-llava-cli -m $MODEL_GGUF_PATH --mmproj $PROJECT_GGUF_PATH  
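
For context, the assertion is the row-bounds check inside ggml's CPU get-rows kernel (the GGML_ASSERT in the log). A simplified stand-in, illustrative rather than the verbatim ggml-cpu.c source:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for the bounds check in the CPU get-rows kernel.
// ne01    = number of rows in the embedding tensor (729 for siglip)
// indices = the contents of the "patches" tensor
void get_rows_bounds_check(const int32_t * indices, int64_t n_indices, int64_t ne01) {
    for (int64_t r = 0; r < n_indices; ++r) {
        const int64_t i01 = indices[r];  // the last value is 729 before the fix
        assert(i01 >= 0 && i01 < ne01);  // 729 < 729 is false -> abort, as in the log above
    }
}
```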

On this branch, things are happy:

This image features a beautiful outdoor scene with a clear blue sky and a variety of flowers. The sky is dotted with fluffy white clouds, and there are several trees with pink and white flowers. The image also includes a tall, thin tree with pink flowers and a tall tree with pink flowers. There are also some bushes with pink flowers. The image is rich in detail, with various elements such as the sky, clouds, trees, and flowers.

@ngxson @ggerganov @gabe-l-hart PTAL when you can - this change is also needed to run granite vision models correctly (they are being added by this PR), but I'm decoupling the bug fix from the new model support 🙂

@ggerganov ggerganov merged commit ee02ad0 into ggml-org:master Feb 21, 2025