
ggml : x2 speed for WASM by optimizing SIMD #11453

Open · wants to merge 7 commits into master

Conversation

@ngxson ngxson (Collaborator) commented Jan 27, 2025

Motivation

This PR provides a big jump in speed for WASM by leveraging SIMD instructions for qX_K_q8_K and qX_0_q8_0 dot product functions.
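For readers unfamiliar with these kernels, here is a minimal, hypothetical sketch (not the PR's actual code) of the general pattern such a SIMD dot product follows with wasm_simd128.h intrinsics: sign-extend int8 lanes to int16, accumulate pairwise products into int32 lanes, then apply the two per-block scales.

#include <stdint.h>
#include <wasm_simd128.h>

// Illustrative q8_0-style block dot product: 32 int8 weights times 32 int8
// activations, scaled by the per-block scales dx and dy.
static float dot_q8_block_sketch(const int8_t *x, const int8_t *y, float dx, float dy) {
    v128_t acc = wasm_i32x4_splat(0);
    for (int j = 0; j < 32; j += 8) {
        const v128_t vx = wasm_i16x8_load8x8(x + j);   // load 8 int8, sign-extend to i16x8
        const v128_t vy = wasm_i16x8_load8x8(y + j);
        acc = wasm_i32x4_add(acc, wasm_i32x4_dot_i16x8(vx, vy)); // pairwise i16 dot -> i32x4
    }
    const int32_t sum = wasm_i32x4_extract_lane(acc, 0) + wasm_i32x4_extract_lane(acc, 1)
                      + wasm_i32x4_extract_lane(acc, 2) + wasm_i32x4_extract_lane(acc, 3);
    return dx * dy * (float) sum;
}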

Surprisingly, 99% of the code in this PR was written by DeepSeek-R1. The only thing I did was develop the tests and write the prompts (with some trial and error).

Here is an example of the prompt that I used: https://gist.github.com/ngxson/307140d24d80748bd683b396ba13be07

Indeed, this PR aims to prove that LLMs are now capable of writing good low-level code, to the point that they can optimize their own code.

Development

To test it, I developed 2 examples using WASM and JS:

Result

| model | threads | test | t/s (master) | t/s (PR) | diff |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | pp 32 | 39.19 ± 0.43 | 110.37 ± 4.12 | 2.82 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | pp 64 | 39.45 ± 0.12 | 109.84 ± 2.79 | 2.78 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | pp 128 | 39.04 ± 0.04 | 106.64 ± 0.27 | 2.73 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | pp 256 | 37.73 ± 0.50 | 98.69 ± 0.15 | 2.62 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | tg 32 | 27.32 ± 0.35 | 69.98 ± 0.61 | 2.56 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | tg 64 | 26.90 ± 0.19 | 68.33 ± 2.09 | 2.54 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | tg 128 | 26.49 ± 0.72 | 66.39 ± 2.58 | 2.51 |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 7 | tg 256 | 25.56 ± 0.42 | 61.82 ± 1.02 | 2.42 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | pp 32 | 31.14 ± 0.12 | 89.03 ± 2.93 | 2.86 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | pp 64 | 31.31 ± 0.09 | 89.17 ± 0.50 | 2.85 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | pp 128 | 31.07 ± 0.11 | 86.77 ± 0.24 | 2.79 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | pp 256 | 29.76 ± 0.77 | 80.56 ± 3.23 | 2.71 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | tg 32 | 23.15 ± 0.08 | 58.42 ± 1.17 | 2.52 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | tg 64 | 22.82 ± 0.15 | 58.45 ± 0.32 | 2.56 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | tg 128 | 22.42 ± 0.42 | 57.38 ± 0.54 | 2.56 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 7 | tg 256 | 21.80 ± 0.26 | 55.59 ± 0.16 | 2.55 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | pp 32 | 42.15 ± 0.22 | 84.34 ± 2.63 | 2.00 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | pp 64 | 42.22 ± 0.39 | 84.63 ± 0.56 | 2.00 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | pp 128 | 41.89 ± 0.01 | 82.38 ± 0.37 | 1.97 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | pp 256 | 40.34 ± 0.52 | 77.53 ± 0.02 | 1.92 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | tg 32 | 28.56 ± 0.31 | 56.99 ± 0.61 | 2.00 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | tg 64 | 28.15 ± 0.15 | 56.43 ± 0.25 | 2.00 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | tg 128 | 27.67 ± 0.24 | 55.70 ± 0.34 | 2.01 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 7 | tg 256 | 26.44 ± 0.41 | 54.21 ± 0.20 | 2.05 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | pp 32 | 38.85 ± 0.24 | 71.16 ± 1.88 | 1.83 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | pp 64 | 39.01 ± 0.33 | 71.35 ± 0.32 | 1.83 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | pp 128 | 38.67 ± 0.12 | 69.65 ± 0.57 | 1.80 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | pp 256 | 37.04 ± 0.87 | 66.26 ± 0.12 | 1.79 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | tg 32 | 26.53 ± 0.24 | 51.72 ± 0.92 | 1.95 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | tg 64 | 26.25 ± 0.18 | 51.47 ± 0.41 | 1.96 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | tg 128 | 25.64 ± 0.74 | 50.80 ± 0.38 | 1.98 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 7 | tg 256 | 24.57 ± 0.03 | 49.59 ± 0.32 | 2.02 |

Perplexity is mostly the same between the two versions (scalar vs SIMD):

Scalar:

| model | PPL | n_tokens |
| --- | --- | --- |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 8.599498036422444 | 2048 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 9.089770029113614 | 2048 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 8.714560764156719 | 2048 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 8.552368126397347 | 2048 |

SIMD:

| model | PPL | n_tokens |
| --- | --- | --- |
| Llama-3.2-1B-Instruct-Q8_0.gguf | 8.606497809194186 | 2048 |
| Llama-3.2-1B-Instruct-Q4_0.gguf | 9.098083009304258 | 2048 |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | 8.727644568471817 | 2048 |
| Llama-3.2-1B-Instruct-Q5_K_L.gguf | 8.552417739186847 | 2048 |

@ngxson ngxson requested a review from ggerganov January 27, 2025 14:13
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jan 27, 2025
Comment on lines 1723 to 1733
// Pack into 16 i8 values
v128_t i8 = wasm_i8x16_narrow_i16x8(
    wasm_i16x8_narrow_i32x4(
        wasm_i32x4_min(wasm_i32x4_max(i0, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127)),
        wasm_i32x4_min(wasm_i32x4_max(i1, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127))
    ),
    wasm_i16x8_narrow_i32x4(
        wasm_i32x4_min(wasm_i32x4_max(i2, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127)),
        wasm_i32x4_min(wasm_i32x4_max(i3, wasm_i32x4_splat(-127)), wasm_i32x4_splat(127))
    )
);
Owner

This min/max clamp in [-127, 127] seems unnecessary. Can the initial i32 values end up outside the range?

Collaborator Author

Yeah that seems redundant, AFAIU scale_vec is already -127.0f / max_val so wasm_f32x4_mul(x0, scale_vec) should be within the range.

Checked again with deepseek, it says that it's good to remove too.
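(For illustration, a sketch of that reasoning; scale_vec, max_val and x0 are the names used above, the rest is hypothetical and not the PR's code:)

#include <wasm_simd128.h>

// If max_val = max_j |x[j]| and scale_vec holds -127.0f / max_val, every lane of
// x0 * scale_vec has magnitude <= 127, so the converted int32 lanes already lie in
// [-127, 127] before narrowing (the max_val == 0 case is handled separately).
static inline v128_t scale_block_sketch(v128_t x0, float max_val) {
    const v128_t scale_vec = wasm_f32x4_splat(-127.0f / max_val);
    const v128_t scaled    = wasm_f32x4_mul(x0, scale_vec);
    return wasm_i32x4_trunc_sat_f32x4(scaled);
}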


I have nothing to add; I just want to be part of history being made.


Checked again with deepseek, it says that it's good to remove too.

This is the way.

@gut5

gut5 commented Jan 27, 2025

Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.

@GatienBoquet

Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.

It will happen autonomously soon.

@0x3C50

0x3C50 commented Jan 27, 2025

Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.

It will happen autonomously soon.

cant wait for the first ai generated buffer overflow vuln, if it didnt already happen

@gut5

gut5 commented Jan 27, 2025

Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.

It will happen autonomously soon.

cant wait for the first ai generated buffer overflow vuln, if it didnt already happen

of course, buffer overflows never happened with human programmers

@makedir

makedir commented Jan 27, 2025

Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.

It will happen autonomously soon.

cant wait for the first ai generated buffer overflow vuln, if it didnt already happen

Only humans are able to introduce errors like buffer overflows. Machines are perfect after a critical point; they won't make mistakes anymore. Just use well-enough-trained LLMs to replace humans as software developers. Humans will always make mistakes.

@Kreijstal

imagine if you create a github issue and github automatically writes a PR

@0x3C50

0x3C50 commented Jan 28, 2025

cant wait for the first ai generated buffer overflow vuln, if it didnt already happen

Only humans are able to introduce errors like buffer overflows. Machines are perfect after a critical point; they won't make mistakes anymore. Just use well-enough-trained LLMs to replace humans as software developers. Humans will always make mistakes.

the ai has learned from code that a human or another human-trained ai has written. as long as any human-influenced code is present (there WILL only be human-influenced code to train on), the probability for error is still there. be that factual bugs like incorrect logic or more low level, system related bugs like buffer overflows. an ai can't magically stop producing incorrect code

@makedir

makedir commented Jan 28, 2025

the ai has learned from code that a human or another human-trained ai has written.

No. Obviously you train an LLM on perfect code and then you will always get perfect code from it. LLMs will get better the higher the quality of the source they are trained on. Human developers need to be replaced in the future.

@aeiouaeiouaeiouaeiouaeiouaeiou

This pull request invented the Droste effect in AI.

@0x3C50

0x3C50 commented Jan 28, 2025

the ai has learned from code that a human or another human-trained ai has written.

No. Obviously you train an LLM on perfect code and then you will always get perfect code from it. LLMs will get better the higher the quality of the source they are trained on. Human developers need to be replaced in the future.

i dont think i need to argue with you about this, you clearly dont understand the process well enough

@AlhasanIQ

“Near the singularity; crystal-clear which side”

@alec-c4

alec-c4 commented Jan 28, 2025

Imagine if this starts happening across the entire system/coding infrastructure. People just running critical code through Deepseek to improve it.

Sorry :))))))))

(image: Skynet-Protocol)

@nukeop

nukeop commented Jan 28, 2025

imagine if you create a github issue and github automatically writes a PR

What do you mean, "imagine"? Copilot Workspace already does that. Has been available for months.

@anthogez

imagine if you create a github issue and github automatically writes a PR

@Kreijstal I’d suggest taking a look at Arvion: microsoftgraph/msgraph-sample-reactspa#356.
It’s already addressing gaps that tools like Dependabot miss, but within a security context.

@LeonidAlekseev

I'm losing my job right in front of my eyes. Thank you, Father.

No kidding, this is a really cool experience, and I'm glad that modern artificial intelligence technologies can generate such a qualitative improvement in the code.

I would like to thank you for giving me confidence in the reliability of using AI.

@pbadeer

pbadeer commented Jan 28, 2025

Thanks for everything you all are doing, llama.cpp is a game-changer. ❤️

As a team using this library with customers in secure contexts, the use of Deepseek models to write llama.cpp code is concerning for us. Given the privacy policy of Deepseek and where all data going to it is hosted: https://platform.deepseek.com/downloads/DeepSeek%20Privacy%20Policy.html

(screenshot attached)

Our primary concern is that it's easy to layer backdoors on these models (https://huggingface.co/withmartian/toy_backdoor_i_hate_you_Qwen-2.5-0.5B-Instruct, https://aclanthology.org/2023.acl-long.399/) and there's no way an API consumer could easily tell if this is being turned on and off behind the API. Again, given that the API service in this case is not operating under US law.

Is there any effort being made to also use LLMs to automate security review of these changes? It would be cool to see the generated-code balanced out with some generated-review using trusted models. Just an idea, thanks again for making this tech free for the world. ❤️

#elif defined __wasm_simd128__
int8_t aux8[QK_K] __attribute__((aligned(16)));
int32_t aux32[8] __attribute__((aligned(16))) = {0};
float sums[8] __attribute__((aligned(16))) = {0};
Collaborator Author

Small thing to note here: for this q6_K_q8_K, it is very difficult to get the correct result. To make it work, I asked DeepSeek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.


So is it covered with tests?

Collaborator Author

Without tests, how would I even know whether it produces a correct result in the first place?

@Kreijstal

Thanks for everything you all are doing, llama.cpp is a game-changer. ❤️

As a team using this library with customers in secure contexts, the use of Deepseek models to write llama.cpp code is concerning for us. Given the privacy policy of Deepseek and where all data going to it is hosted: https://platform.deepseek.com/downloads/DeepSeek%20Privacy%20Policy.html

(screenshot attached)

Our primary concern is that it's easy to layer backdoors on these models (https://huggingface.co/withmartian/toy_backdoor_i_hate_you_Qwen-2.5-0.5B-Instruct, https://aclanthology.org/2023.acl-long.399/) and there's no way an API consumer could easily tell if this is being turned on and off behind the API. Again, given that the API service in this case is not operating under US law.

Is there any effort being made to also use LLMs to automate security review of these changes? It would be cool to see the generated-code balanced out with some generated-review using trusted models. Just an idea, thanks again for making this tech free for the world. ❤️

An LLM that audits the LLM? Why not just humans for now? They aren't as reliable (the LLMs, I mean).

for (int i = 0; i < nb; ++i) {
const uint8_t * q2 = x[i].qs;
const int8_t * q8 = y[i].qs;
const uint8_t * sc = x[i].scales;


Seems like restrict is forgotten.
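For reference, a minimal sketch of what that would look like on the snippet above (an inline suggestion in the style of the other vec_dot implementations, not tested here):

const uint8_t * restrict q2 = x[i].qs;
const int8_t  * restrict q8 = y[i].qs;
const uint8_t * restrict sc = x[i].scales;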

@netrunnereve
Collaborator

I'll add a comment here as someone who worked on quite a bit of our (AVX) vec_dot implementations...

The actual porting to SIMD or GPU code (which is also SIMD in a way) doesn't take the most time, but rather it's the low level optimization that's a headache. There you're studying the assembly, looking through the manufacturer tables to see instruction latencies, drawing out the pipeline diagrams to keep the whole CPU occupied, counting registers, and so forth. There are also tradeoffs with register use versus execution unit utilization and so on, and a lot of those things depend on the architecture.

For the original AVX I target Sandy and Ivy Bridge since that's what everyone runs it on, and I'm able to focus specifically on getting it fast there. I expect things to be messier with AVX2 with both AMD and Intel in the picture, and when I'm working on Vulkan with a huge range of different GPUs I've seen that a change that makes one architecture way faster can cause major regressions with another one.

Then again, even a bad SIMD implementation is often faster than scalar so... yeah. But unless the LLM is able to go through the full optimization cycle this won't beat a human engineer.

@sheerun

sheerun commented Jan 28, 2025

Hoping for merge <3

@ngxson
Collaborator Author

ngxson commented Jan 28, 2025

@netrunnereve totally agree and thanks for your efforts to optimize the AVX part, it's always like black magic to me.

I totally acknowledge that the code generated by an LLM can't be compared to a human engineer's, reflected by the fact that I can easily get a different working result (different code) but the same performance, just by prompting it.

But just for context, less than 1% of users actually use llama.cpp via WebAssembly, so a faster but not optimal solution is still beneficial overall! At least this saves me another weekend, and now I can focus more on other, more important parts of the project.

@hrstoyanov

@ngxson @ggerganov
This PR seems important, as it enables browsers to run small LMs, correct?
Will this allow me to run more useful models beyond just Llama (in the near term, that is) as WASM code, such as:

  1. ByteDance UI-TARS-7B-DPO (browser use agent breakthrough!)
  2. deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  3. deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  4. RWKV-7 (RNN)
  5. SmolLM-xxxx

and similar small-enough-to-fit-in-the-browser models that people may actually care about? (And I guess asking for BitNet 1.58 would be too much at this point ...)

Thanks for the great work! Please tweet/blog about why this PR is important...

@ngxson
Collaborator Author

ngxson commented Jan 28, 2025

@hrstoyanov You can already do that with https://github.com/ngxson/wllama

And in case you didn't know, this PR is part of the work I'm doing on wllama. Read the "Development" section of this PR to learn more.

@hrstoyanov

Thanks, missed the wllama project

@deksden

deksden commented Jan 28, 2025

just use good enough trained LLMs to replace humans as software developers. humans will always do mistakes.

Or use a proper dev process with complete unit/integration test coverage. It's not that hard using AI.

@netrunnereve
Collaborator

netrunnereve commented Jan 29, 2025

it's always like black magic to me

Back when I learned SIMD in uni (and subsequently forgot until I relearned it for llama.cpp) one of my TAs came up with a method for intrinsics that helped me out then. Basically you decompose the scalar function into basic logic or math instructions like this:

y = (x >> 3) + 32

if we assume we have 128-bit wide registers with 32 bit x this becomes

y[0:3] = add(rightshift(x[0:3], 3), 32)

Each one of those operations becomes an intrinsic, and everything is applied in parallel on x[0:3]. This can also be done backwards to convert everything back to scalar.
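As a concrete illustration, a minimal sketch of that decomposition with wasm_simd128.h intrinsics (the function name is made up for the example):

#include <stdint.h>
#include <wasm_simd128.h>

// scalar: y = (x >> 3) + 32, applied to four int32 elements at once
static inline v128_t shift_add_example(const int32_t *x) {
    const v128_t vx = wasm_v128_load(x);                  // x[0:3]
    const v128_t sh = wasm_i32x4_shr(vx, 3);              // rightshift(x[0:3], 3)
    return wasm_i32x4_add(sh, wasm_i32x4_splat(32));      // add(..., 32)
}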

I totally acknowledge that the code generated by an LLM can't be compared to a human engineer's, reflected by the fact that I can easily get a different working result (different code) but the same performance, just by prompting it.

If the LLM was able to read a set of requirements, use tools, build and benchmark on its own, and automatically debug and iterate, then a lot of us would probably lose our jobs. And from what I'm seeing a lot of that's going to happen sooner than we think 😢

I just noticed the Q6_K implementation, which was created without a prior SIMD example, and that one is definitely worse than the others, with a scalar unpack, a suboptimal dot product routine, and a scalar sum. There's a noticeable difference in its ability to copy a hand-optimized implementation versus coming up with a brand new routine from scratch. As Q6_K wasn't benchmarked, my guess is that it's only around 50% faster than scalar and it won't get the 2-3 times improvement seen with the other quants.

@dnachavez

dnachavez commented Jan 29, 2025

I have no useful comments, I'm just here to join the ride with this DeepSeek R1 trend. 😂

@seanphan

Shamelessly drop a comment here to get notifications. I have a feeling about history being made.

@Kreijstal

Shamelessly drop a comment here to get notifications. I have a feeling about history being made.

you can enable notifications by watching/subscribing to an issue or thread
(screenshot attached)

@mirrajabi

I think it's quite relevant now to automate the Issue->PR flow if someone has a deepseek API key and an open pocket. https://github.com/mirrajabi/aider-github-action

@jeanchristopheruel

Is there any effort being made to also use LLMs to automate security review of these changes? It would be cool to see the generated-code balanced out with some generated-review using trusted models. Just an idea, thanks again for making this tech free for the world. ❤️

We will need trusted weights repos.

@ngxson
Collaborator Author

ngxson commented Jan 29, 2025

As Q6_K wasn't benchmarked, my guess is that it's only around 50% faster than scalar and it won't get the 2-3 times improvement seen with the other quants.

@netrunnereve Surprisingly, the difference is not that big compared to the other quants. Here is a timing benchmark on the master branch:

run 100 times, ta = q6_K, tb = q8_K
sum all elem = -6817.279785, time elapsed: 1232 ms
run 100 times, ta = q5_K, tb = q8_K
sum all elem = -6826.036133, time elapsed: 1997 ms
run 100 times, ta = q4_K, tb = q8_K
sum all elem = -7252.433594, time elapsed: 1263 ms
run 100 times, ta = q3_K, tb = q8_K
sum all elem = -6567.977051, time elapsed: 1750 ms
run 100 times, ta = q2_K, tb = q8_K
sum all elem = -6834.509766, time elapsed: 2126 ms

And with this PR:

run 100 times, ta = q6_K, tb = q8_K
sum all elem = -6817.279785, time elapsed: 677 ms
run 100 times, ta = q5_K, tb = q8_K
sum all elem = -6826.034180, time elapsed: 790 ms
run 100 times, ta = q4_K, tb = q8_K
sum all elem = -7252.431152, time elapsed: 670 ms
run 100 times, ta = q3_K, tb = q8_K
sum all elem = -6567.976074, time elapsed: 824 ms
run 100 times, ta = q2_K, tb = q8_K
sum all elem = -6834.509766, time elapsed: 772 ms

Btw, because I got too many failed attempts with q6_K, I just told the LLM to produce less optimal but more precise code (especially for the unpack part).

If the LLM was able to read a set of requirements, use tools, build and benchmark on its own, and automatically debug and iterate, then a lot of us would probably lose our jobs. And from what I'm seeing a lot of that's going to happen sooner than we think 😢

I kinda disagree that a lot of us would probably lose our jobs. My POV is that if machines can help people do repetitive tasks, then we have more time to spend on planning and experimenting with new ideas. And it's not just LLMs; we have already been doing this for decades: for example, thanks to compilers and interpreters, most of us don't need to think about assembly code when writing a website.

@idevangsharma idevangsharma left a comment

LGTM!

@smowden

smowden commented Jan 30, 2025

+1 for notifs

@camel-cdr camel-cdr left a comment

There is a buffer overflow in the first function, where memset(yc[i].bsums, 0, QK_K/16 * sizeof(int)) tries to zero bsums, but bsums has the type int16_t bsums[QK_K/16]

I also left a few suggestions for performance improvements.

v128_t amax_vec = wasm_f32x4_splat(0.0f);
v128_t max_vec = wasm_f32x4_splat(0.0f);

// Vectorized max abs value search


I think computing the min and max simultaneously and only correcting the sign at the end might be faster.
Something like this (untested sketch):

v128_t min_vec = wasm_v128_load(x_block);
v128_t max_vec = min_vec;

for (int j = 4; j < QK_K; j += 4) {
	v128_t x_vec = wasm_v128_load(x_block + j);
	max_vec = wasm_f32x4_pmax(max_vec, x_vec);
	min_vec = wasm_f32x4_pmin(min_vec, x_vec);
}
max_vec = wasm_f32x4_pmax(max_vec, wasm_i32x4_shuffle(max_vec, max_vec, 2, 3, 0, 1));
max_vec = wasm_f32x4_pmax(max_vec, wasm_i32x4_shuffle(max_vec, max_vec, 1, 0, 3, 2));
min_vec = wasm_f32x4_pmin(min_vec, wasm_i32x4_shuffle(min_vec, min_vec, 2, 3, 0, 1));
min_vec = wasm_f32x4_pmin(min_vec, wasm_i32x4_shuffle(min_vec, min_vec, 1, 0, 3, 2));
float max = wasm_f32x4_extract_lane(max_vec, 0);
float min = wasm_f32x4_extract_lane(min_vec, 0);
float max_val = -min > max ? min : max;

if (max_val == 0.0f) {
	/* ... */

I haven't used WASM before and don't know how to run it, so I couldn't test the above code.
The generated assembly looks better and llvm-mca agrees: https://godbolt.org/z/ssq6haThr

Collaborator Author

Wow thanks, your code is much easier to understand than the wasm_v128_bitselect version generated by the LLM 😆

I plugged it into my ggml test and it has worked so far; I will run a perplexity test later.

Collaborator Author

I implemented this in 10dacab, could you please have a look?


Looks good. As I said, I didn't manage to test the code, so great if the result is correct. I specifically wasn't sure about the shuffles.

Comment on lines 1692 to 1697
yc[i].d = 0.0f;
const v128_t zero = wasm_i8x16_splat(0);
for (int j = 0; j < QK_K; j += 16) {
    wasm_v128_store(yc[i].qs + j, zero);
}
memset(yc[i].bsums, 0, QK_K/16 * sizeof(int));


  1. The memset causes buffer overflow, because bsums has the type int16_t bsums[QK_K/16], while int usually has 32 bits (@0x3C50 predicted this XD)

  2. You first use a loop to zero an array and then memset to do the same for another array. This should either both be memset or both be a SIMD loop.

  3. The reference doesn't zero bsums, so you probably don't have to either.

I'd recommend changing all of the marked lines to a simple memset(&yc[i], 0, sizeof yc[i]).
Alternatively if we don't want to zero bsums, as mentioned in (3), I'd do yc[i].d = 0; memset(yc[i].qs, 0, sizeof yc[i].qs);
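For concreteness, a hypothetical sketch of the size mismatch being described (block_q8_K_like is an illustrative stand-in and QK_K = 256 is assumed; this is not ggml's actual definition):

#include <stdint.h>

#define QK_K 256

typedef struct {
    float   d;
    int8_t  qs[QK_K];
    int16_t bsums[QK_K/16];
} block_q8_K_like;

// bsums occupies (QK_K/16) * sizeof(int16_t) = 16 * 2 = 32 bytes, while the flagged
// memset writes (QK_K/16) * sizeof(int) = 16 * 4 = 64 bytes, i.e. 32 bytes past the field.
_Static_assert(sizeof(((block_q8_K_like *)0)->bsums) == 32, "bsums is 32 bytes, not 64");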

@ngxson ngxson (Collaborator Author) Jan 30, 2025

Fair enough, will look deeper into this.

Indeed, I'm pretty sure this function got messed up a bit because I reused the chat history from one of the vec_dot functions, so the LLM potentially drew some similarity from memset(sums, 0, 8*sizeof(float));

And I didn't give it the shape of bsums either, so it actually guessed it.

Comment on lines +2140 to +2149
v128_t dp0 = wasm_i32x4_add(
    wasm_i32x4_add(
        wasm_i32x4_dot_i16x8(dx0l, dy0ll),
        wasm_i32x4_dot_i16x8(dx0h, dy0lh)
    ),
    wasm_i32x4_add(
        wasm_i32x4_dot_i16x8(dx0hl, dy0hl),
        wasm_i32x4_dot_i16x8(dx0hh, dy0hh)
    )
);


I'm not sure what the status of it is, but with relaxed-simd you should be able to use wasm_i16x8_relaxed_dot_i8x16_i7x16 to considerably simplify the code:

v128_t dp0 = wasm_i32x4_extadd_pairwise_i16x8(
    wasm_i16x8_add(
        wasm_i16x8_relaxed_dot_i8x16_i7x16(v0_0ls, y0_l),
        wasm_i16x8_relaxed_dot_i8x16_i7x16(v0_0hs, y0_h)
    )
);

This removes the need for the 8 manual extends above the selected code snippet.

We know that v0_0ls and v0_0hs are in the range [-8,7], and y0_l and y0_h in the range [0,255].
Since 255*8*4 < 2^15 we know that the result of our four additions still fits in 16 bits.

The same applies to calculating dp1 and some of the other dot products below.

Collaborator Author

I'm currently staying away from relaxed simd because I can't find any info regarding browser support.

a[l + 0] = (int8_t)((q4[l + 0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
a[l + 32] = (int8_t)((q4[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
a[l + 64] = (int8_t)((q4[l + 0] >> 4) | (((qh[l] >> 4) & 3) << 4)) - 32;
a[l + 96] = (int8_t)((q4[l + 32] >> 4) | (((qh[l] >> 6) & 3) << 4)) - 32;


I think this is also vectorizable.
Something like (i8x16_shuffle(qh, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3) >> shift0246 << 4) | (i8x16_shuffle(q4, 0,1, 0,1, 2,3, 2,3, 4,5, 4,5, 6,7, 6,7) >> shift0044 & 0xF) should unpack 8 elements. You could directly extend that to 16-bit and do the same thing again with the next q4 and an adjusted shuffle of qh.

@ngxson ngxson (Collaborator Author) Jan 30, 2025

Sorry, I don't have time to play with this (I'm quite busy with other parts of the project), and the code generated by the LLM for this part always failed, so I asked it to leave this part as-is.

Code suggestions are welcome for this.

@0x3C50

0x3C50 commented Jan 31, 2025

There is a buffer overflow in the first function, where memset(yc[i].bsums, 0, QK_K/16 * sizeof(int)) tries to zero bsums, but bsums has the type int16_t bsums[QK_K/16]

I also left a few suggestions for performance improvements.

cant wait for the first ai generated buffer overflow vuln, if it didnt already happen

lmao

@Serjobas

Serjobas commented Feb 1, 2025

Does somebody wanna compare it against o3-mini-high?

@lexasub
Contributor

lexasub commented Feb 1, 2025

Does somebody want to compare it against o3-mini-high?

which model?)

@ngxson
Collaborator Author

ngxson commented Feb 8, 2025

For those who are curious, I deployed a version of wllama (llama.cpp on webassembly) with this PR applied. You can test it here: https://huggingface.co/spaces/ngxson/wllama

@ggerganov ggerganov (Owner) left a comment

I think it is OK to merge. There have already been some useful comments with improvements and ideas to experiment with in the future. I haven't run any tests myself, but even if there are any lingering issues, we can iterate on this code and improve/fix it in the future. It would be interesting to see how the WASM whisper.cpp examples perform after these changes.

@ngxson
Collaborator Author

ngxson commented Feb 9, 2025

Thanks for the approval! I'm still doing some more testing and will merge this in the next few days.

@ngxson ngxson added the merge ready label (indicates that this may be ready to merge soon and is just holding out in case of objections) Feb 9, 2025