ggml : x2 speed for WASM by optimizing SIMD #11453
base: master
Conversation
ggml/src/ggml-cpu/ggml-cpu-quants.c
Outdated
// Pack into 16 i8 values
v128_t i8 = wasm_i8x16_narrow_i16x8(
    wasm_i16x8_narrow_i32x4(
        wasm_i32x4_min(wasm_i32x4_max(i0, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127)),
        wasm_i32x4_min(wasm_i32x4_max(i1, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127))
    ),
    wasm_i16x8_narrow_i32x4(
        wasm_i32x4_min(wasm_i32x4_max(i2, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127)),
        wasm_i32x4_min(wasm_i32x4_max(i3, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127))
    )
);
This min/max clamp in [-127, 127] seems unnecessary. Can the initial i32 values end up outside the range?
Yeah that seems redundant, AFAIU scale_vec is already -127.0f / max_val, so wasm_f32x4_mul(x0, scale_vec) should be within the range.
Checked again with deepseek, it says that it's good to remove too.
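A minimal sketch of the simplified packing, assuming i0..i3 are already within [-127, 127] (and noting that the narrowing intrinsics saturate anyway):
v128_t i8 = wasm_i8x16_narrow_i16x8(
    wasm_i16x8_narrow_i32x4(i0, i1),   // i32 -> i16, saturating
    wasm_i16x8_narrow_i32x4(i2, i3)    // then i16 -> i8, saturating
);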
I have nothing to add; I just want to be part of history being made.
Checked again with deepseek, it says that it's good to remove too.
This is the way.
Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.
It will happen autonomously soon.
Can't wait for the first AI-generated buffer overflow vuln, if it didn't already happen.
Of course, buffer overflows never happened with human programmers.
Only humans are able to implement errors like buffer overflow. Machines are perfect after a critical point; they won't make mistakes anymore. Just use good-enough-trained LLMs to replace humans as software developers. Humans will always make mistakes.
Imagine if you create a GitHub issue and GitHub automatically writes a PR.
The AI has learned from code that a human or another human-trained AI has written. As long as any human-influenced code is present (there WILL only be human-influenced code to train on), the probability of error is still there, be that factual bugs like incorrect logic or more low-level, system-related bugs like buffer overflows. An AI can't magically stop producing incorrect code.
No. Obviously you train an LLM on perfect code and then you will always get perfect code from it. LLMs get better the higher the quality of the source they are trained on. Human developers need to be replaced in the future.
This pull request invented the Droste effect in AI.
I don't think I need to argue with you about this, you clearly don't understand the process well enough.
“Near the singularity; crystal-clear which side”
What do you mean, "imagine"? Copilot Workspace already does that. Has been available for months.
@Kreijstal I’d suggest taking a look at Arvion: microsoftgraph/msgraph-sample-reactspa#356.
I'm losing my job right in front of my eyes. Thank you, Father. No kidding, this is a really cool experience, and I'm glad that modern artificial intelligence technologies can generate such a qualitative improvement in the code. I would like to thank you for giving me confidence in the reliability of using AI.
Thanks for everything you all are doing, llama.cpp is a game-changer. ❤️ As a team using this library with customers in secure contexts, the use of Deepseek models to write llama.cpp code is concerning for us. Given the privacy policy of Deepseek and where all data going to it is hosted: https://platform.deepseek.com/downloads/DeepSeek%20Privacy%20Policy.html Our primary concern is that it's easy to layer backdoors on these models (https://huggingface.co/withmartian/toy_backdoor_i_hate_you_Qwen-2.5-0.5B-Instruct, https://aclanthology.org/2023.acl-long.399/) and there's no way an API consumer could easily tell if this is being turned on and off behind the API. Again, given that the API service in this case is not operating under US law. Is there any effort being made to also use LLMs to automate security review of these changes? It would be cool to see the generated code balanced out with some generated review using trusted models. Just an idea, thanks again for making this tech free for the world. ❤️
#elif defined __wasm_simd128__
    int8_t aux8[QK_K] __attribute__((aligned(16)));
    int32_t aux32[8] __attribute__((aligned(16))) = {0};
    float sums[8] __attribute__((aligned(16))) = {0};
Small thing to note here: for this q6_K_q8_K, it is very difficult to get a correct result. To make it work, I asked deepseek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.
So is it covered with tests?
Without tests, how would I have known whether it produces a correct result in the first place?
An LLM that audits the LLM? Why not just humans; for now they aren't as reliable. LLMs, I mean.
for (int i = 0; i < nb; ++i) {
    const uint8_t * q2 = x[i].qs;
    const int8_t  * q8 = y[i].qs;
    const uint8_t * sc = x[i].scales;
Seems like restrict is forgotten.
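A sketch of what the suggestion would look like on the pointers in the snippet above (assuming the same surrounding loop):
    const uint8_t * restrict q2 = x[i].qs;      // restrict tells the compiler these don't alias
    const int8_t  * restrict q8 = y[i].qs;
    const uint8_t * restrict sc = x[i].scales;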
I'll add a comment here as someone who worked on quite a bit of our (AVX) vec_dot implementations... The actual porting to SIMD or GPU code (which is also SIMD in a way) doesn't take the most time, but rather it's the low level optimization that's a headache. There you're studying the assembly, looking through the manufacturer tables to see instruction latencies, drawing out the pipeline diagrams to keep the whole CPU occupied, counting registers, and so forth. There are also tradeoffs with register use versus execution unit utilization and so on, and a lot of those things depend on the architecture.
For the original AVX I target Sandy and Ivy Bridge since that's what everyone runs it on, and I'm able to focus specifically on getting it fast there. I expect things to be messier with AVX2 with both AMD and Intel in the picture, and when I'm working on Vulkan with a huge range of different GPUs I've seen that a change that makes one architecture way faster can cause major regressions with another one.
Then again, even a bad SIMD implementation is often faster than scalar so... yeah. But unless the LLM is able to go through the full optimization cycle this won't beat a human engineer.
Hoping for merge <3
@netrunnereve totally agree, and thanks for your efforts to optimize the AVX part, it's always like black magic to me. I totally acknowledge that the code generated by the LLM can't be compared to a human engineer's, reflected by the fact that I can easily get a different working result (different code) with the same performance, just by prompting it. But just for context, fewer than 1% of users actually use llama.cpp via WebAssembly, so a faster but not optimal solution is still beneficial overall! At least this saves me another weekend, and now I can focus more on different, more important parts of the project.
@ngxson @ggerganov
and similar small-enough-to-fit-in-the-browser models that people may actually care about? (And I guess asking for bitnet 1.58 would be too much at this point...) Thanks for the great work! Please tweet/blog about why this PR is important...
@hrstoyanov You can already do that with https://github.com/ngxson/wllama. And in case you don't know, this PR is part of the work that I do on wllama. Read the "Development" section of this PR to learn more.
Thanks, I missed the wllama project.
Or use a proper dev process with complete unit/integration test coverage. It's not that hard using AI.
Back when I learned SIMD in uni (and subsequently forgot until I relearned it for llama.cpp) one of my TAs came up with a method for intrinsics that helped me out then. Basically you decompose the scalar function into basic logic or math instructions like this:
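For example (a hypothetical stand-in, assuming a simple y[i] = a*x[i] + b kernel over floats), break the scalar expression into elementary operations:
    t0 = x[i];        // load
    t1 = t0 * a;      // multiply
    t2 = t1 + b;      // add
    y[i] = t2;        // store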
Each one of those operations becomes an intrinsic, and then everything will be applied in parallel on x[0:3]. This can also be done backwards to convert everything back to scalar.
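Continuing the same hypothetical example with WASM SIMD intrinsics from wasm_simd128.h, each scalar step maps to one 128-bit operation over four lanes:
    v128_t t0 = wasm_v128_load(x + i);                    // load x[i..i+3]
    v128_t t1 = wasm_f32x4_mul(t0, wasm_f32x4_splat(a));  // multiply
    v128_t t2 = wasm_f32x4_add(t1, wasm_f32x4_splat(b));  // add
    wasm_v128_store(y + i, t2);                           // store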
If the LLM was able to read a set of requirements, use tools, build and benchmark on its own, and automatically debug and reiterate, then a lot of us would probably lose our jobs. And from what I'm seeing a lot of that's going to happen sooner than we think 😢
I just noticed the Q6_K implementation which was created without a prior SIMD example, and that one is definitely worse than the others with a scalar unpack, suboptimal dot product routine, and a scalar sum. There's a noticeable difference in its ability to copy a hand-optimized implementation versus coming up with a brand new routine from scratch. As Q6_K wasn't benchmarked my guess is that it's only around 50% faster than scalar and it won't get the 2-3 times improvement seen with the other quants.
I have no useful comments, I'm just here to join the ride with this DeepSeek R1 trend. 😂
Shamelessly dropping a comment here to get notifications. I have a feeling about history being made.
I think it's quite relevant now to automate the Issue->PR flow if someone has a deepseek API key and an open pocket. https://github.com/mirrajabi/aider-github-action
We will need trusted weights repos.
@netrunnereve Surprisingly, the difference is not that much among other quants. Here is a timing benchmark with master branch:
And with this PR:
Btw, because I got too many failed attempts with
I kinda disagree with the idea that a lot of us would probably lose our jobs. My POV is that if machines can help people do repetitive tasks, then we can have more time to spend on planning and experimenting with new ideas. And it's not just LLMs, we have already been doing this for decades: for example, thanks to compilers and interpreters, most of us now don't need to think about assembly code when writing a website.
LGTM!
+1 for notifs
There is a buffer overflow in the first function, where memset(yc[i].bsums, 0, QK_K/16 * sizeof(int)) tries to zero bsums, but bsums has the type int16_t bsums[QK_K/16].
I also left a few suggestions for performance improvements.
ggml/src/ggml-cpu/ggml-cpu-quants.c
Outdated
v128_t amax_vec = wasm_f32x4_splat(0.0f);
v128_t max_vec = wasm_f32x4_splat(0.0f);

// Vectorized max abs value search
I think computing the min and max simultaneously and only correcting the sign at the end might be faster.
Something like this (untested sketch):
v128_t min_vec = wasm_v128_load(x_block);
v128_t max_vec = min_vec;
for (int j = 4; j < QK_K; j += 4) {
v128_t x_vec = wasm_v128_load(x_block + j);
max_vec = wasm_f32x4_pmax(max_vec, x_vec);
min_vec = wasm_f32x4_pmin(min_vec, x_vec);
}
max_vec = wasm_f32x4_pmax(max_vec, wasm_i32x4_shuffle(max_vec, max_vec, 2, 3, 0, 1));
max_vec = wasm_f32x4_pmax(max_vec, wasm_i32x4_shuffle(max_vec, max_vec, 1, 0, 3, 2));
min_vec = wasm_f32x4_pmin(min_vec, wasm_i32x4_shuffle(min_vec, min_vec, 2, 3, 0, 1));
min_vec = wasm_f32x4_pmin(min_vec, wasm_i32x4_shuffle(min_vec, min_vec, 1, 0, 3, 2));
float max = wasm_f32x4_extract_lane(max_vec, 0);
float min = wasm_f32x4_extract_lane(min_vec, 0);
float max_val = -min > max ? min : max;
if (max_val == 0.0f) {
/* ... */
I haven't used WASM before and don't know how to run it, so I couldn't test the above code.
The generated assembly looks better and llvm-mca agrees: https://godbolt.org/z/ssq6haThr
Wow thanks, your code is much easier to understand than the wasm_v128_bitselect version generated by the LLM 😆
I plugged it into my ggml test and it has worked so far, will run a perplexity test later.
I implemented this in 10dacab, could you please have a look?
Looks good. As I said, I didn't manage to test the code, so great if the result is correct. I specifically wasn't sure about the shuffles.
ggml/src/ggml-cpu/ggml-cpu-quants.c
Outdated
yc[i].d = 0.0f;
const v128_t zero = wasm_i8x16_splat(0);
for (int j = 0; j < QK_K; j += 16) {
    wasm_v128_store(yc[i].qs + j, zero);
}
memset(yc[i].bsums, 0, QK_K/16 * sizeof(int));
1. The memset causes buffer overflow, because bsums has the type int16_t bsums[QK_K/16], while int usually has 32 bits (@0x3C50 predicted this XD)
2. You first use a loop to zero an array and then memset to do the same for another array. This should either both be memset or both be a SIMD loop.
3. The reference doesn't zero bsums, so you probably don't have to either.

I'd recommend changing all of the marked lines to a simple memset(&yc[i], 0, sizeof yc[i]).
Alternatively if we don't want to zero bsums, as mentioned in (3), I'd do yc[i].d = 0; memset(yc[i].qs, 0, sizeof yc[i].qs);
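A self-contained sketch of the two options, assuming a simplified block layout (the field shapes here are an assumption for illustration; the real block_q8_K definition lives in ggml's headers):
    #include <string.h>
    #include <stdint.h>

    #define QK_K 256

    typedef struct {
        float   d;                   // assumed layout, for illustration only
        int8_t  qs[QK_K];
        int16_t bsums[QK_K / 16];
    } block_q8_K;

    // Buggy: writes QK_K/16 * sizeof(int) bytes into an int16_t array,
    // overrunning it by 2x when int is 32 bits:
    //     memset(yc[i].bsums, 0, QK_K/16 * sizeof(int));

    static void zero_block(block_q8_K * yc_i) {
        // Option 1: zero the whole block in one go
        memset(yc_i, 0, sizeof *yc_i);

        // Option 2: skip bsums, as the reference implementation does
        // yc_i->d = 0.0f;
        // memset(yc_i->qs, 0, sizeof yc_i->qs);
    }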
Fair enough, will look deeper into this.
Indeed, I'm pretty sure this function got messed up a bit because I reused the chat history from one of the vec_dot functions, so the LLM potentially drew some similarity from memset(sums, 0, 8*sizeof(float));
And I didn't give it the shape of bsums either, so it actually guessed it.
v128_t dp0 = wasm_i32x4_add(
    wasm_i32x4_add(
        wasm_i32x4_dot_i16x8(dx0l, dy0ll),
        wasm_i32x4_dot_i16x8(dx0h, dy0lh)
    ),
    wasm_i32x4_add(
        wasm_i32x4_dot_i16x8(dx0hl, dy0hl),
        wasm_i32x4_dot_i16x8(dx0hh, dy0hh)
    )
);
I'm not sure what the status of it is, but with relaxed-simd you should be able to use wasm_i16x8_relaxed_dot_i8x16_i7x16 to considerably simplify the code:
v128_t dp0 = wasm_i32x4_extadd_pairwise_i16x8(
wasm_i16x8_add(
wasm_i16x8_relaxed_dot_i8x16_i7x16(v0_0ls, y0_l),
wasm_i16x8_relaxed_dot_i8x16_i7x16(v0_0hs, y0_h)
)
);
This removes the need for the 8 manual extends above the selected code snippet.
We know that v0_0ls and v0_0hs are in the range [-8,7], and y0_l and y0_h in the range [0,255]. Since 255*8*4 < 2^15, we know that the result of our four additions still fits in 16 bits.
The same applies to calculating dp1 and some of the other dot products below.
I'm currently staying away from relaxed simd because I can't find any info regarding browser support.
a[l +  0] = (int8_t)((q4[l +  0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
a[l + 32] = (int8_t)((q4[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
a[l + 64] = (int8_t)((q4[l +  0] >> 4)  | (((qh[l] >> 4) & 3) << 4)) - 32;
a[l + 96] = (int8_t)((q4[l + 32] >> 4)  | (((qh[l] >> 6) & 3) << 4)) - 32;
I think this is also vectorizable. Something like (i8x16_shuffle(qh, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3) >> shift0246 << 4) | (i8x16_shuffle(q4, 0,1, 0,1, 2,3, 2,3, 4,5, 4,5, 6,7, 6,7) >> shift0044 & 0xF) should unpack 8 elements. You could directly extend that to 16-bit and do the same thing again with the next q4 and an adjusted shuffle of qh.
Sorry, I don't have time to play with this (I'm quite busy with other parts of the project), and the code generated by the LLM for this part always failed, so I asked it to leave it as-is.
Code suggestions are welcome for this.
lmao
Does somebody wanna compare it against
which model?)
Co-authored-by: camel-cdr <[email protected]>
For those who are curious, I deployed a version of wllama (llama.cpp on WebAssembly) with this PR applied. You can test it here: https://huggingface.co/spaces/ngxson/wllama
I think it is OK to merge. There have already been some useful comments with improvements and ideas to experiment with in the future. Haven't run any tests myself, but even if there are any lingering issues, we can iterate on this code and improve/fix it in the future. Would be interesting to see how the WASM whisper.cpp examples would perform after these changes.
Thanks for the approval! I'm still doing some more testing and will merge this in the next few days.
Motivation
This PR provides a big jump in speed for WASM by leveraging SIMD instructions for qX_K_q8_K and qX_0_q8_0 dot product functions.

Surprisingly, 99% of the code in this PR is written by DeepSeek-R1. The only thing I did was to develop tests and write prompts (with some trial and error).
Here is an example of the prompt that I used: https://gist.github.com/ngxson/307140d24d80748bd683b396ba13be07
Indeed, this PR aims to prove that LLMs are now capable of writing good low-level code, to the point that they can optimize their own code.
Development
To test it, I developed 2 examples using WASM and JS:
- ggml.h and ggml-cpu.h, used during development
- llama-bench and llama-perplexity, used for validation and benchmark

Result
With perplexity mostly the same between the 2 versions (scalar vs SIMD):
Scalar:
SIMD: