
[BOUNTY - $500] Add support for quantized models with tinygrad #148

Open
AlexCheema opened this issue Aug 13, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@AlexCheema
Contributor

AlexCheema commented Aug 13, 2024

AlexCheema added the enhancement (New feature or request) label on Aug 13, 2024
@barsuna

barsuna commented Aug 30, 2024

(I'm not in any way positioned to implement the quantization support, but I wanted to share some notes with those planning to work on it.)

Background: I thought the tinygrad example already had some quantization support, so how hard could it be to get it over to exo :) So I copied over the int8 and nf4 code, updated the create_transformer functions, etc., and indeed it sort of works conceptually (tried on both Llama 3.1 8B and 70B).

But a few things need to be sorted out for this to be usable (and user-friendly):

  • Currently, mapping partitions/shards is done at layer granularity, and exo does not seem to take the actual size of each layer into account. For large models (e.g. 70B and larger) that leads to large rounding errors, and when the GPU memory available to the whole cluster is very tight you end up with some GPUs that still have free memory while other GPUs are loaded to the brim (or OOM). I temporarily worked around it by overriding memory discovery to manually specify how much memory is available where, but this is clearly not a solution. A proper fix should also account for the fact that quantization is not uniform: some layers shrink, others do not. (A rough sketch of byte-accurate shard assignment follows after this list.)

  • Related to the above: even when quantizing a model to 4 bits, tinygrad first loads the model as-is (with 16-bit weights), and those 16-bit weights somehow stay in memory (even though I'm not sure they are used for anything, they are not gc'ed). So memory consumption on each host is very large: ~64 GB of RAM for 24 GB worth of GPU space, and this doesn't change between default, int8, and nf4. A possible solution here is to save the quantized model and load already-quantized weights (this is what llama.cpp does, for example).

  • Indirectly related: if a host has more than one GPU, we need a discovery update such that one can run more than two instances, including two on the same host. Many large GPU machines have 8 GPUs, so we may need as many as 8 instances per host.
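
Not exo's actual partitioning code, just a rough sketch of what byte-accurate shard assignment could look like once per-layer sizes (post-quantization) and per-node free memory are known; all names and inputs below are hypothetical:

```python
# Hypothetical sketch (not exo's code): assign contiguous layer ranges to nodes
# using real per-layer byte sizes instead of a flat layer count, so a tight
# cluster doesn't OOM one GPU while another still has free memory.
from typing import List, Tuple

def assign_layers(layer_bytes: List[int], node_free_bytes: List[int]) -> List[Tuple[int, int]]:
    shards, start = [], 0
    for budget in node_free_bytes:
        end, used = start, 0
        # give this node as many consecutive layers as fit in its free memory
        while end < len(layer_bytes) and used + layer_bytes[end] <= budget:
            used += layer_bytes[end]
            end += 1
        shards.append((start, end))  # half-open [start, end) layer range
        start = end
    if start != len(layer_bytes):
        raise ValueError("model does not fit in the cluster's combined free memory")
    return shards

# toy example: 80 layers of ~600 MB each spread over 28 GB / 24 GB / 12 GB nodes
print(assign_layers([600_000_000] * 80, [28_000_000_000, 24_000_000_000, 12_000_000_000]))
# -> [(0, 46), (46, 80), (80, 80)]
```

This greedy pass only illustrates why byte counts matter; a real strategy would also balance load across nodes rather than filling them in order.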

For documentation purposes: to run Llama 3.1 70B at NF4, I used 3 hosts with 64 GB of GPU RAM between them, and the model only just fits. It looks like NF4 skips many layers, so even the quantized 70B model is still quite large.

The resulting tokens/sec was very low, at about 0.5 TPS I think (but I also had to disable JIT in tinygrad, otherwise some GPUs were throwing errors, so the performance may not be representative). As a reference point, I can get >7 tokens/sec if I put 3 of these GPUs into one machine and run llama.cpp; CPU inference on the same hardware is ~0.8 TPS. Again, I'm providing these numbers just for reference; a performance discussion is obviously premature at this point.

For the record, the command lines for each node:

JIT=0 DEBUG=0 ASSISTED_DISCOVERY=1 GPU_MEM_MB=28000 CUDA=1 python main.py --max-parallel-downloads 1 --disable-tui --node-id 1111 --quantize nf4 --node-port 10001 --discovery-timeout 3600

JIT=0 DEBUG=0 ASSISTED_DISCOVERY=1 CUDA_VISIBLE_DEVICES=0 CUDA=1 python main.py --max-parallel-downloads 1 --disable-tui --node-id 2222 --quantize nf4 --node-port 10002 --discovery-timeout 3600 --broadcast-port 5680

JIT=0 DEBUG=0 ASSISTED_DISCOVERY=1 CUDA_VISIBLE_DEVICES=1 CUDA=1 python main.py --max-parallel-downloads 1 --disable-tui --node-id 3333 --quantize nf4 --node-port 10003 --chatgpt-api-port 7999 --discovery-timeout 3600 --broadcast-port 5679 --listen-port 10003

JIT=0 DEBUG=0 ASSISTED_DISCOVERY=1 GPU_MEM_MB=12000 CUDA=1 python main.py --max-parallel-downloads 1 --disable-tui --node-id 4444 --quantize nf4 --node-port 10004 --discovery-timeout 3600 --broadcast-port 5681

(ASSISTED_DISCOVERY and GPU_MEM_MB are the modifications made for points 2 and 3 above.)

  _____  _____  
 / _ \ \/ / _ \ 
|  __/>  < (_) |
 \___/_/\_\___/ 
    
Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Chat interface started:
 - http://127.0.0.1:8000
 - http://192.168.0.210:8000
ChatGPT API endpoint served at:
 - http://127.0.0.1:8000/v1/chat/completions
 - http://192.168.0.210:8000/v1/chat/completions
...
Removing download task for Shard(model_id='NousResearch/Meta-Llama-3.1-70B-Instruct', start_layer=0, end_layer=33, n_layers=80): True
ram used: 17.56 GB, freqs_cis                                         : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1284/1284 [00:00<00:00, 62425.23it/s]
loaded weights in  22.22 ms, 0.00 GB loaded at 0.00 GB/s
Hello
...
Hello! How can I assist you today?<|eot_id|>

AlexCheema changed the title from "tinygrad support quantized models" to "[BOUNTY - $200] tinygrad support quantized models" on Sep 5, 2024
AlexCheema changed the title from "[BOUNTY - $200] tinygrad support quantized models" to "[BOUNTY - $200] Add support for quantized models with tinygrad" on Sep 5, 2024
@AlexCheema
Contributor Author


Thanks @barsuna this is super helpful for implementers.

I've added a $200 bounty as this seems like an important addition to exo. Also added to the bounties sheet: #148

AlexCheema changed the title from "[BOUNTY - $200] Add support for quantized models with tinygrad" to "[BOUNTY - $300] Add support for quantized models with tinygrad" on Sep 5, 2024
@varshith15
Contributor

varshith15 commented Sep 7, 2024

checking this, good chance to explore tinygrad :)

@RashikShahjahan

@varshith15 Any progress on this? Otherwise I can take over.

@RashikShahjahan

RashikShahjahan commented Nov 4, 2024

@AlexCheema I made a PR, #413, but I haven't tested it with the 70B model yet.

@varshith15
Contributor

varshith15 commented Nov 4, 2024

@RashikShahjahan I had done the same thing you did a while ago: b7b911d

But that's not what's expected. The idea is not to quantize the models on the fly; the expectation is to run existing, already-quantized models (MLX, bnb, etc.) on tinygrad, like this: https://github.com/exo-explore/exo/pull/213/files (it is a bit slow), so that there is interoperability between machines.

I don't mind working on this together; ping me on Discord if you've got ideas :)

@RashikShahjahan

@varshith15 Thanks for pointing that out! I got tripped up by the tinygrad example. Do you need help with anything in particular? I might just pick another issue now that I have a better understanding of the codebase.

@varshith15
Contributor

@RashikShahjahan I've been busy and haven't been able to work on it. The specific requirement is to figure out how tinygrad generates matmul code, and to see how to get tinygrad to emit optimized quantized_mat_mul kernels for MLX and bnb quantization formats (AWQ, NF4, etc.).
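
For context, the "dequantize on the fly" variant looks roughly like the sketch below, loosely modeled on the Int8Linear in tinygrad's llama example; the imports and exact API are assumptions about a recent tinygrad version. The scale multiply runs over the full weight matrix on every forward, which is the part a proper quantized matmul kernel would avoid:

```python
# Rough sketch only: a dequantize-on-the-fly int8 linear layer in tinygrad style.
# Assumes a recent tinygrad where Tensor and dtypes are top-level exports.
from tinygrad import Tensor, dtypes

class Int8Linear:
  def __init__(self, in_features: int, out_features: int):
    # int8 weights plus one half-precision scale per output channel
    self.weight = Tensor.ones(out_features, in_features, dtype=dtypes.int8)
    self.scale = Tensor.ones(out_features, dtype=dtypes.half)

  def __call__(self, x: Tensor) -> Tensor:
    # dequantizes inside the matmul: correct, but the scale multiply touches
    # every weight on every forward pass instead of using a quantized kernel
    return x.dot(self.weight.cast(dtypes.half).T * self.scale)
```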

AlexCheema changed the title from "[BOUNTY - $300] Add support for quantized models with tinygrad" to "[BOUNTY - $500] Add support for quantized models with tinygrad" on Nov 14, 2024
@AlexCheema
Contributor Author

Bumping this up to $500, as this would be a great thing to have.
By some magic I don't fully understand, we already have full interoperability between MLX and tinygrad even when the quantization doesn't match! So you can run a model in fp16 with tinygrad and a 4-bit model with MLX and it just works.
However, it would still be great to support other quantizations with tinygrad.

@pickettd

@AlexCheema Just to double check: are you saying that you can have a cluster where one machine runs a 4-bit model with MLX and another machine runs fp16 with tinygrad, and inference works? I thought I read in the code that if one machine in the cluster picks the tinygrad engine, then the whole cluster switches to the tinygrad engine. I've had trouble finding a working test config for my cluster mixing Linux with Nvidia and a Mac with MLX. I would love to run Qwen2.5 72B or Qwen2.5-Coder 32B distributed, but in https://github.com/exo-explore/exo/blob/main/exo/models.py it looks like Qwen is only set up for MLX, and back when I was messing with 72B I couldn't figure out a way to get tinygrad to work with it.

@KhanerX

KhanerX commented Jan 25, 2025

Hi, I started working on this, my WIP PR is #630 .

As @varshith15 pointed out, naively dequantizing the weights before each forward pass is not performant. However, this is not an issue with tinygrad's kernels: inference is slow because each forward pass adds on the order of 2n^2 extra muls and adds, since we multiply each weight by its scale and add the biases (zero points).

This paper shows how to:

  1. Decrease the overhead of adding the biases by adding them to the input instead of the weights (section 2.3)
  2. Do the compute in integer arithmetic instead of float (section 2.2)

My PR implements (1); I'll look into (2) as well. (A small numeric sketch of the arithmetic behind (1) follows below.)
@AlexCheema I'd appreciate it if you could take a look and assign this to me, thanks.
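
A small numpy check of the algebra behind (1), with made-up shapes and values (nothing here is taken from the PR itself): for per-output-channel affine quantization W = s * (Wq - z), the zero-point term only depends on the row sums of the input, so it can be folded in without dequantizing the whole weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal((4, 64)).astype(np.float32)            # activations
Wq = rng.integers(0, 16, size=(128, 64)).astype(np.float32)     # quantized weights
s  = rng.uniform(0.01, 0.1, size=(128, 1)).astype(np.float32)   # per-channel scales
z  = rng.integers(0, 16, size=(128, 1)).astype(np.float32)      # per-channel zero points

# naive: dequantize every weight, then matmul (extra muls/adds over the whole matrix)
y_naive = x @ (s * (Wq - z)).T

# folded: apply scales after the matmul and push the zero-point term onto the input row sums
y_folded = (x @ Wq.T) * s.T - x.sum(axis=1, keepdims=True) * (s * z).T

assert np.allclose(y_naive, y_folded, atol=1e-3)
print("zero-point folding matches the naive dequantized matmul")
```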
