WebGPU Support #11
Comments
This is really cool and paves the way for LLMs running in the browser! I've had this idea in my head for a while now: we already have a kind of (primitive) vector DB (just a JSON) and the small model for embeddings. If we added an LLM for Q&A/text generation based solely on the information in the text, this would be huge! I already asked the folks from Qdrant on their Discord server whether they'd be interested in providing a JS/WebAssembly version of their Rust-based vector DB (as they developed plenty of optimizations), but for the moment they have other priorities. Still, they said they might go for it at some point. Anyway, I think this would make for an interesting POC to explore. About the idea to integrate it directly: until it's officially supported, we could maybe detect WebGPU support automatically and simply load the right version? Or does the WebGPU version also support CPU? P.S. There would be so much fun in it for NLP with LLMs if, for example, we created an image of all leitmotifs in the text or some kind of text summary image or similar for a visual understanding of text... |
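A minimal sketch of the auto-detection idea, assuming the browser exposes WebGPU as `navigator.gpu` (per the WebGPU spec) and that the two backends are named `'webgpu'` and `'wasm'`. `pickDevice` is a hypothetical helper, not part of SemanticFinder or transformers.js:

```javascript
// Hypothetical helper: decide which backend to ask the embedding library for.
// `nav` defaults to the browser's `navigator`; it is a parameter so the logic
// can also be exercised outside a browser with a stub object.
async function pickDevice(nav = globalThis.navigator) {
  // Browsers without WebGPU support do not define `navigator.gpu`.
  if (!nav || !nav.gpu) return 'wasm';
  try {
    // requestAdapter() resolves to null when no suitable GPU adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    // Treat any failure while probing the GPU as "not available".
    return 'wasm';
  }
}
```

The returned string could then be passed on as the `device` option when loading the model, so a single build might serve both cases.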
I am working on a similar effort myself, let's cooperate! More specifically, I wanted to use this project as a basis for an SDK that allows one to run semantic search on their own website's content. |
Sounds great! I was thinking of some kind of bar integrated on top of a webpage like Algolia / lunr etc. do. Good example: on mkdocs material homepage: (By the way, I also had ideas for integrating semantic search in mkdocs, but I'm lacking the time atm...) What about your idea? (We're kind of drifting away from this issue's topic, let's move to discussions: #15) |
We're finally getting closer to WebGPU support: huggingface/transformers.js#545 In my case (M3 Max) I'm getting a massive inference speedup of 32x-46x. See for yourself: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark Even with cheap integrated graphics (Intel GPUs like UHD or Iris Xe) I get a 4x-8x boost. So literally everyone would see massive speed gains! This is the most notable performance improvement I see atm, hence referencing #49. I hope that transformers.js will allow for some kind of automatic setting where WebGPU is used if available and otherwise falls back to plain CPU. |
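Until such an automatic setting exists, the fallback could be handled on the app side. A rough sketch, assuming a hypothetical `createExtractor(device)` factory that wraps whatever call loads the model (for example a `pipeline(...)` call with a `device` option) and rejects when the backend or model is not usable:

```javascript
// Hypothetical wrapper: try each backend in order and return the first one
// whose model load succeeds. `createExtractor` is supplied by the caller.
async function loadWithFallback(createExtractor, devices = ['webgpu', 'wasm']) {
  let lastError;
  for (const device of devices) {
    try {
      const extractor = await createExtractor(device);
      return { extractor, device }; // report which backend actually loaded
    } catch (err) {
      lastError = err; // remember why this backend failed, then try the next
    }
  }
  throw lastError ?? new Error('no devices given');
}
```

Trying the load itself, rather than only checking for GPU availability, also covers the case where WebGPU is present but a particular model fails to load on it.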
Speedup is about 10x for me on an M1. Definitely huge. Not sure how embeddings will compare to inference in terms of GPU optimization but I think there is huge room for parallelization. |
Transformers.js and WebGPU
Folks, it's finally here 🥹 However, afaik there are no docs for v3 yet. I tried updating SemanticFinder with v3 and running some quick tests, but failed.
Unfortunately still throws some errors, but I'd say it's better to wait for the official v3 docs. Also it's in alpha at the moment, so errors pretty much expected. |
exciting news! |
@do-me I think you also have to change the quantized: true flag to dtype: "fp32" for unquantized, or dtype: "fp16", "q8", etc. for quantized:
await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device: 'webgpu', // run inference on the GPU via WebGPU
  dtype: 'fp32', // or 'fp16'
});
examples |
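A small sketch of that option change as a helper. `toV3Options` is hypothetical, and the mapping of `quantized: true` to `'q8'` and `quantized: false` to `'fp32'` is an assumption based on the comment above, not something confirmed by v3 docs:

```javascript
// Hypothetical helper: translate old-style options that use a `quantized`
// boolean into options that use `device` and `dtype` instead.
function toV3Options(legacy = {}, device = 'wasm') {
  const { quantized, ...rest } = legacy; // drop the old flag, keep the rest
  const options = { ...rest, device };
  // Assumed mapping; only set dtype when the old flag was given explicitly,
  // so the library's own default applies otherwise.
  if (quantized === true) options.dtype = 'q8';
  if (quantized === false) options.dtype = 'fp32';
  return options;
}
```

For example, `toV3Options({ quantized: false }, 'webgpu')` yields `{ device: 'webgpu', dtype: 'fp32' }`, which matches the options in the snippet above.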
@gdmcdonald thanks for the references. Note that in these examples, they used the "old" packages from On my screenshot above you can see that Rather the problem seems to stem from We tightly built the core embedding logic around the old version of transformers.js with callbacks etc. so I guess there is some compatibility problem with the new logic or simply a bug in When I manage to find some time, I will try with the v3 branch in |
Ah ok. I was using |
Found a bug with webgpu (wasm works fine): huggingface/transformers.js#909 The problem is calling the extractor two consecutive times. The first time works (for the query embedding) but the second time fails (for chunk embeddings). |
Folks, it's here! 🥳 There was a simple problem in the old code where I would call I needed to modify this code in f148689. Main changes were in index.js. It's really fast! On my system it indexes the whole Bible in about 3 minutes with a small model like Xenova/all-MiniLM-L6-v2, when before with wasm it would take 30-40 minutes. Not all models are supported, so we should go down that rabbit hole and see whether we can somehow filter the models in index.html for the webgpu branch. I was trying to set up a GitHub Action for the new webgpu branch so it would build the webgpu version and push it to gh-pages in a /webgpu dir, but there were errors I couldn't follow up on so far. It overwrote the files in the main directory and did not create the /webgpu dir. You can see my old trials in the history. If someone wants to give a hand it would be highly appreciated :) Anyway, I'm really excited about this change! |
Fantastic news! Just played around and it's working well on my M1. Will follow up to see if I can help with errors. |
Finally managed to come up with the correct GitHub Action.
|
According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this. |
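A rough sketch of what batched processing could look like, as opposed to one chunk per call. `embedBatch` is a hypothetical async function that takes an array of strings and returns one vector per string in the same order (for instance a thin wrapper around the extractor); the default batch size of 32 is an arbitrary placeholder, not a measured value:

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all chunks batch by batch and return the vectors in chunk order.
// `embedBatch` is hypothetical: (string[]) => Promise<number[][]>.
async function embedInBatches(chunks, embedBatch, batchSize = 32) {
  const vectors = [];
  for (const batch of toBatches(chunks, batchSize)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```

Batches are awaited one after another here, which keeps memory use bounded and leaves the order of results unchanged.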
@do-me can you tell me how to update the github action as well for my fork of semantic-finder? ty |
Batch size changes everything. It gives me insane speed-ups of more than 20x. I created a small app based on one of the first versions of SemanticFinder for testing the batch size. In my tests, a chunk size of around 180 chunks per Play with it here: https://geo.rocks/semanticfinder-webgpu/. The current logic in SemanticFinder is more complex than this minimal app, so it takes more time to update everything. Could use a hand here as I probably won't find time until next week. |
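To compare batch sizes on a given machine, a simple timing loop may be enough. This is only a sketch: `embedBatch` is again a hypothetical async function (array of strings in, vectors out), the candidate sizes are illustrative apart from the 180 mentioned above, and the numbers it reports will vary by hardware and model:

```javascript
// Time one full pass over `chunks` for each candidate batch size.
// Returns one entry per size with elapsed milliseconds and throughput.
async function timeBatchSizes(chunks, embedBatch, sizes = [1, 16, 64, 180]) {
  const results = [];
  for (const size of sizes) {
    const start = performance.now();
    for (let i = 0; i < chunks.length; i += size) {
      await embedBatch(chunks.slice(i, i + size)); // one call per batch
    }
    const ms = performance.now() - start;
    results.push({ size, ms, chunksPerSecond: chunks.length / (ms / 1000) });
  }
  return results;
}
```

The first run typically includes warm-up cost, so running the loop twice and keeping the second set of results may give a fairer comparison.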
Will look into adding it if I get a chance this week. |
GPU Acceleration of transformers is possible, but it is hacky.
It requires an unmerged PR version of transformers.js that relies on a patched version of onnxruntime-node.
Xenova plans on merging this PR only after onnxruntime has official support for GPU Acceleration. In the meantime, this change could be implemented, potentially as an advanced "experimental" feature.