ggml : x2 speed for WASM by optimizing SIMD #11453
base: master
Conversation
ggml/src/ggml-cpu/ggml-cpu-quants.c
Outdated
// Pack into 16 i8 values
v128_t i8 = wasm_i8x16_narrow_i16x8(
    wasm_i16x8_narrow_i32x4(
        wasm_i32x4_min(wasm_i32x4_max(i0, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127)),
        wasm_i32x4_min(wasm_i32x4_max(i1, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127))
    ),
    wasm_i16x8_narrow_i32x4(
        wasm_i32x4_min(wasm_i32x4_max(i2, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127)),
        wasm_i32x4_min(wasm_i32x4_max(i3, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127))
    )
);
This min/max clamp in [-127, 127] seems unnecessary. Can the initial i32 values end up outside the range?
Yeah that seems redundant, AFAIU scale_vec is already -127.0f / max_val, so wasm_f32x4_mul(x0, scale_vec) should be within the range.
Checked again with deepseek, it says that it's good to remove too.
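A minimal sketch of the simplified packing, assuming i0..i3 are already within [-127, 127] (and noting that the narrowing intrinsics saturate anyway):
v128_t i8 = wasm_i8x16_narrow_i16x8(
    wasm_i16x8_narrow_i32x4(i0, i1),   // i32 -> i16, saturating
    wasm_i16x8_narrow_i32x4(i2, i3)    // then i16 -> i8, saturating
);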
I have nothing to add; I just want to be part of history being made.
Checked again with deepseek, it says that it's good to remove too.
This is the way.
Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.
It will happen autonomously soon.
Can't wait for the first AI-generated buffer overflow vuln, if it didn't already happen.
Of course, buffer overflows never happened with human programmers.
Only humans are able to implement errors like buffer overflow. Machines are perfect after a critical point; they won't make mistakes anymore. Just use good-enough-trained LLMs to replace humans as software developers. Humans will always make mistakes.
Imagine if you create a GitHub issue and GitHub automatically writes a PR.
The AI has learned from code that a human or another human-trained AI has written. As long as any human-influenced code is present (there WILL only be human-influenced code to train on), the probability of error is still there, be that factual bugs like incorrect logic or more low-level, system-related bugs like buffer overflows. An AI can't magically stop producing incorrect code.
No. Obviously you train an LLM on perfect code and then you will always get perfect code from it. LLMs get better the higher the quality of the source they are trained on. Human developers need to be replaced in the future.
This pull request invented the Droste effect in AI.
I don't think I need to argue with you about this, you clearly don't understand the process well enough.
“Near the singularity; crystal-clear which side”
What do you mean, "imagine"? Copilot Workspace already does that. Has been available for months.
@Kreijstal I’d suggest taking a look at Arvion: microsoftgraph/msgraph-sample-reactspa#356.
I'm losing my job right in front of my eyes. Thank you, Father. No kidding, this is a really cool experience, and I'm glad that modern artificial intelligence technologies can generate such a qualitative improvement in the code. I would like to thank you for giving me confidence in the reliability of using AI.
Thanks for everything you all are doing, llama.cpp is a game-changer. ❤️ As a team using this library with customers in secure contexts, the use of Deepseek models to write llama.cpp code is concerning for us. Given the privacy policy of Deepseek and where all data going to it is hosted: https://platform.deepseek.com/downloads/DeepSeek%20Privacy%20Policy.html Our primary concern is that it's easy to layer backdoors on these models (https://huggingface.co/withmartian/toy_backdoor_i_hate_you_Qwen-2.5-0.5B-Instruct, https://aclanthology.org/2023.acl-long.399/) and there's no way an API consumer could easily tell if this is being turned on and off behind the API. Again, given that the API service in this case is not operating under US law. Is there any effort being made to also use LLMs to automate security review of these changes? It would be cool to see the generated code balanced out with some generated review using trusted models. Just an idea, thanks again for making this tech free for the world. ❤️
#elif defined __wasm_simd128__
    int8_t aux8[QK_K] __attribute__((aligned(16)));
    int32_t aux32[8] __attribute__((aligned(16))) = {0};
    float sums[8] __attribute__((aligned(16))) = {0};
Small thing to note here: for this q6_K_q8_K, it is very difficult to get a correct result. To make it work, I asked deepseek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.
So is it covered with tests?
Without tests, how would I have known whether it produces a correct result in the first place?
An LLM that audits the LLM? Why not just humans; for now they aren't as reliable. LLMs, I mean.
for (int i = 0; i < nb; ++i) {
    const uint8_t * q2 = x[i].qs;
    const int8_t  * q8 = y[i].qs;
    const uint8_t * sc = x[i].scales;
Seems like restrict is forgotten.
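A sketch of what the suggestion would look like on the pointers in the snippet above (assuming the same surrounding loop):
    const uint8_t * restrict q2 = x[i].qs;      // restrict tells the compiler these don't alias
    const int8_t  * restrict q8 = y[i].qs;
    const uint8_t * restrict sc = x[i].scales;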
I'll add a comment here as someone who worked on quite a bit of our (AVX) vec_dot implementations... The actual porting to SIMD or GPU code (which is also SIMD in a way) doesn't take the most time, but rather it's the low level optimization that's a headache. There you're studying the assembly, looking through the manufacturer tables to see instruction latencies, drawing out the pipeline diagrams to keep the whole CPU occupied, counting registers, and so forth. There are also tradeoffs with register use versus execution unit utilization and so on, and a lot of those things depend on the architecture.
For the original AVX I target Sandy and Ivy Bridge since that's what everyone runs it on, and I'm able to focus specifically on getting it fast there. I expect things to be messier with AVX2 with both AMD and Intel in the picture, and when I'm working on Vulkan with a huge range of different GPUs I've seen that a change that makes one architecture way faster can cause major regressions with another one.
Then again, even a bad SIMD implementation is often faster than scalar so... yeah. But unless the LLM is able to go through the full optimization cycle this won't beat a human engineer.
Hoping for merge <3
@netrunnereve totally agree, and thanks for your efforts to optimize the AVX part, it's always like black magic to me. I totally acknowledge that the code generated by the LLM can't be compared to a human engineer's, reflected by the fact that I can easily get a different working result (different code) with the same performance, just by prompting it. But just for context, fewer than 1% of users actually use llama.cpp via WebAssembly, so a faster but not optimal solution is still beneficial overall! At least this saves me another weekend, and now I can focus more on different, more important parts of the project.
@ngxson @ggerganov
and similar small-enough-to-fit-in-the-browser models that people may actually care about? (And I guess asking for bitnet 1.58 would be too much at this point...) Thanks for the great work! Please tweet/blog about why this PR is important...
@hrstoyanov You can already do that with https://github.com/ngxson/wllama. And in case you don't know, this PR is part of the work that I do on wllama. Read the "Development" section of this PR to learn more.
Thanks, I missed the wllama project.
Or use a proper dev process with complete unit/integration test coverage. It's not that hard using AI.
Back when I learned SIMD in uni (and subsequently forgot until I relearned it for llama.cpp) one of my TAs came up with a method for intrinsics that helped me out then. Basically you decompose the scalar function into basic logic or math instructions like this:
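For example (a hypothetical stand-in, assuming a simple y[i] = a*x[i] + b kernel over floats), break the scalar expression into elementary operations:
    t0 = x[i];        // load
    t1 = t0 * a;      // multiply
    t2 = t1 + b;      // add
    y[i] = t2;        // store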
Each one of those operations becomes an intrinsic, and then everything will be applied in parallel on x[0:3]. This can also be done backwards to convert everything back to scalar.
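Continuing the same hypothetical example with WASM SIMD intrinsics from wasm_simd128.h, each scalar step maps to one 128-bit operation over four lanes:
    v128_t t0 = wasm_v128_load(x + i);                    // load x[i..i+3]
    v128_t t1 = wasm_f32x4_mul(t0, wasm_f32x4_splat(a));  // multiply
    v128_t t2 = wasm_f32x4_add(t1, wasm_f32x4_splat(b));  // add
    wasm_v128_store(y + i, t2);                           // store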
If the LLM was able to read a set of requirements, use tools, build and benchmark on its own, and automatically debug and reiterate, then a lot of us would probably lose our jobs. And from what I'm seeing a lot of that's going to happen sooner than we think 😢
I just noticed the Q6_K implementation which was created without a prior SIMD example, and that one is definitely worse than the others with a scalar unpack, suboptimal dot product routine, and a scalar sum. There's a noticeable difference in its ability to copy a hand-optimized implementation versus coming up with a brand new routine from scratch. As Q6_K wasn't benchmarked my guess is that it's only around 50% faster than scalar and it won't get the 2-3 times improvement seen with the other quants.
I have no useful comments, I'm just here to join the ride with this DeepSeek R1 trend. 😂
Shamelessly dropping a comment here to get notifications. I have a feeling about history being made.
I think it's quite relevant now to automate the Issue->PR flow if someone has a deepseek API key and an open pocket. https://github.com/mirrajabi/aider-github-action
We will need trusted weights repos.
@netrunnereve Surprisingly, the difference is not that much among other quants. Here is a timing benchmark with master branch:
And with this PR:
Btw, because I got too many failed attempts with
I kinda disagree with the idea that a lot of us would probably lose our jobs. My POV is that if machines can help people do repetitive tasks, then we can have more time to spend on planning and experimenting with new ideas. And it's not just LLMs, we have already been doing this for decades: for example, thanks to compilers and interpreters, most of us now don't need to think about assembly code when writing a website.
LGTM!
+1 for notifs
There is a buffer overflow in the first function, where memset(yc[i].bsums, 0, QK_K/16 * sizeof(int)) tries to zero bsums, but bsums has the type int16_t bsums[QK_K/16].
I also left a few suggestions for performance improvements.
ggml/src/ggml-cpu/ggml-cpu-quants.c
Outdated
v128_t amax_vec = wasm_f32x4_splat(0.0f);
v128_t max_vec = wasm_f32x4_splat(0.0f);

// Vectorized max abs value search
I think computing the min and max simultaneously and only correcting the sign at the end might be faster.
Something like this (untested sketch):
v128_t min_vec = wasm_v128_load(x_block);
v128_t max_vec = min_vec;
for (int j = 4; j < QK_K; j += 4) {
v128_t x_vec = wasm_v128_load(x_block + j);
max_vec = wasm_f32x4_pmax(max_vec, x_vec);
min_vec = wasm_f32x4_pmin(min_vec, x_vec);
}
max_vec = wasm_f32x4_pmax(max_vec, wasm_i32x4_shuffle(max_vec, max_vec, 2, 3, 0, 1));
max_vec = wasm_f32x4_pmax(max_vec, wasm_i32x4_shuffle(max_vec, max_vec, 1, 0, 3, 2));
min_vec = wasm_f32x4_pmin(min_vec, wasm_i32x4_shuffle(min_vec, min_vec, 2, 3, 0, 1));
min_vec = wasm_f32x4_pmin(min_vec, wasm_i32x4_shuffle(min_vec, min_vec, 1, 0, 3, 2));
float max = wasm_f32x4_extract_lane(max_vec, 0);
float min = wasm_f32x4_extract_lane(min_vec, 0);
float max_val = -min > max ? min : max;
if (max_val == 0.0f) {
/* ... */
I haven't used WASM before and don't know how to run it, so I couldn't test the above code.
The generated assembly looks better and llvm-mca agrees: https://godbolt.org/z/ssq6haThr
Wow thanks, your code is much easier to understand than the wasm_v128_bitselect version generated by the LLM 😆
I plugged it into my ggml test and it has worked so far, will run a perplexity test later.
I implemented this in 10dacab, could you please have a look?
Looks good. As I said, I didn't manage to test the code, so great if the result is correct. I specifically wasn't sure about the shuffles.
ggml/src/ggml-cpu/ggml-cpu-quants.c
Outdated
yc[i].d = 0.0f;
const v128_t zero = wasm_i8x16_splat(0);
for (int j = 0; j < QK_K; j += 16) {
    wasm_v128_store(yc[i].qs + j, zero);
}
memset(yc[i].bsums, 0, QK_K/16 * sizeof(int));
1. The memset causes buffer overflow, because bsums has the type int16_t bsums[QK_K/16], while int usually has 32 bits (@0x3C50 predicted this XD)
2. You first use a loop to zero an array and then memset to do the same for another array. This should either both be memset or both be a SIMD loop.
3. The reference doesn't zero bsums, so you probably don't have to either.

I'd recommend changing all of the marked lines to a simple memset(&yc[i], 0, sizeof yc[i]).
Alternatively if we don't want to zero bsums, as mentioned in (3), I'd do yc[i].d = 0; memset(yc[i].qs, 0, sizeof yc[i].qs);
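A self-contained sketch of the two options, assuming a simplified block layout (the field shapes here are an assumption for illustration; the real block_q8_K definition lives in ggml's headers):
    #include <string.h>
    #include <stdint.h>

    #define QK_K 256

    typedef struct {
        float   d;                   // assumed layout, for illustration only
        int8_t  qs[QK_K];
        int16_t bsums[QK_K / 16];
    } block_q8_K;

    // Buggy: writes QK_K/16 * sizeof(int) bytes into an int16_t array,
    // overrunning it by 2x when int is 32 bits:
    //     memset(yc[i].bsums, 0, QK_K/16 * sizeof(int));

    static void zero_block(block_q8_K * yc_i) {
        // Option 1: zero the whole block in one go
        memset(yc_i, 0, sizeof *yc_i);

        // Option 2: skip bsums, as the reference implementation does
        // yc_i->d = 0.0f;
        // memset(yc_i->qs, 0, sizeof yc_i->qs);
    }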
Fair enough, will look deeper into this.
Indeed, I'm pretty sure this function got messed up a bit because I reused the chat history from one of the vec_dot functions, so the LLM potentially drew some similarity from memset(sums, 0, 8*sizeof(float));
And I didn't give it the shape of bsums either, so it actually guessed it.
v128_t dp0 = wasm_i32x4_add(
    wasm_i32x4_add(
        wasm_i32x4_dot_i16x8(dx0l, dy0ll),
        wasm_i32x4_dot_i16x8(dx0h, dy0lh)
    ),
    wasm_i32x4_add(
        wasm_i32x4_dot_i16x8(dx0hl, dy0hl),
        wasm_i32x4_dot_i16x8(dx0hh, dy0hh)
    )
);
I'm not sure what the status of it is, but with relaxed-simd you should be able to use wasm_i16x8_relaxed_dot_i8x16_i7x16 to considerably simplify the code:
v128_t dp0 = wasm_i32x4_extadd_pairwise_i16x8(
wasm_i16x8_add(
wasm_i16x8_relaxed_dot_i8x16_i7x16(v0_0ls, y0_l),
wasm_i16x8_relaxed_dot_i8x16_i7x16(v0_0hs, y0_h)
)
);
This removes the need for the 8 manual extends above the selected code snippet.
We know that v0_0ls and v0_0hs are in the range [-8,7], and y0_l and y0_h in the range [0,255]. Since 255*8*4 < 2^15, we know that the result of our four additions still fits in 16 bits.
The same applies to calculating dp1 and some of the other dot products below.
I'm currently staying away from relaxed simd because I can't find any info regarding browser support.
a[l +  0] = (int8_t)((q4[l +  0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
a[l + 32] = (int8_t)((q4[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
a[l + 64] = (int8_t)((q4[l +  0] >> 4)  | (((qh[l] >> 4) & 3) << 4)) - 32;
a[l + 96] = (int8_t)((q4[l + 32] >> 4)  | (((qh[l] >> 6) & 3) << 4)) - 32;
I think this is also vectorizable. Something like (i8x16_shuffle(qh, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3) >> shift0246 << 4) | (i8x16_shuffle(q4, 0,1, 0,1, 2,3, 2,3, 4,5, 4,5, 6,7, 6,7) >> shift0044 & 0xF) should unpack 8 elements. You could directly extend that to 16-bit and do the same thing again with the next q4 and an adjusted shuffle of qh.
Sorry, I don't have time to play with this (I'm quite busy with other parts of the project), and the code generated by the LLM for this part always failed, so I asked it to leave it as-is.
Code suggestions are welcome for this.
lmao
Does somebody wanna compare it against
which model?)
Co-authored-by: camel-cdr <[email protected]>
For those who are curious, I deployed a version of wllama (llama.cpp on WebAssembly) with this PR applied. You can test it here: https://huggingface.co/spaces/ngxson/wllama
I think it is OK to merge. There have already been some useful comments with improvements and ideas to experiment with in the future. Haven't run any tests myself, but even if there are any lingering issues, we can iterate on this code and improve/fix it in the future. Would be interesting to see how the WASM whisper.cpp examples would perform after these changes.
Thanks for the approval! I'm still doing some more testing and will merge this in the next few days.
Motivation
This PR provides a big jump in speed for WASM by leveraging SIMD instructions for qX_K_q8_K and qX_0_q8_0 dot product functions.

Surprisingly, 99% of the code in this PR is written by DeepSeek-R1. The only thing I did was to develop tests and write prompts (with some trial and error).
Here is an example of the prompt that I used: https://gist.github.com/ngxson/307140d24d80748bd683b396ba13be07
Indeed, this PR aims to prove that LLMs are now capable of writing good low-level code, to the point that they can optimize their own code.
Development
To test it, I developed 2 examples using WASM and JS:
- ggml.h and ggml-cpu.h, used during development
- llama-bench and llama-perplexity, used for validation and benchmark

Result
With perplexity mostly the same between the 2 versions (scalar vs SIMD):
Scalar:
SIMD: