
How to maximize GPU performance #549

Open
txg1550759 opened this issue Feb 20, 2025 · 10 comments

Comments

txg1550759 commented Feb 20, 2025

I'm running 2× Intel(R) Xeon(R) Gold 5320 CPUs (56 cores / 102 threads), 2× A800 80G GPUs connected with an NVLink bridge, and 512G of DDR4 RAM. Whether the prompt is math or Chinese text, the processing speed is: 1) prompt input around 11-13 tokens/s; 2) output 5.5 tokens/s.

prompt eval count: 60 token(s)
prompt eval duration: 3.1178791522979736s
prompt eval rate: 19.243850408948063 tokens/s
eval count: 3673 token(s)
eval duration: 690.929929971695s
eval rate: 5.316023869671517 tokens/s

With 72 cores allocated they all get used up: CPU usage is very high, RAM usage is 38G, VRAM usage is about 10%, and GPU utilization is around 10%.
It looks like the CPU is doing the computation while the GPUs sit mostly idle. How do I tune this to put the remaining 90% of the GPUs' capacity to work?

The model I'm using is DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M. I'm running 0.21 in docker with these launch parameters:
python -m ktransformers.local_chat --gguf_path "/models" --model_path "/models" --cpu_infer 72 --max_new_tokens 50000 --optimize_rule_path ./ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml

```
Thu Feb 20 21:34:38 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A800 80GB PCIe On | 00000000:17:00.0 Off | 0 |
| N/A 56C P0 118W / 300W | 7633MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A800 80GB PCIe On | 00000000:65:00.0 Off | 0 |
| N/A 50C P0 107W / 300W | 11083MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
```

sweihub commented Feb 20, 2025

I'd like an answer to this too. Why did the ktransformers team hard-code VRAM usage to stay under 16G? Isn't the goal to squeeze every bit of performance out of the hardware? If multi-GPU support is hard, making the single-GPU VRAM requirement configurable should at least be feasible, right?

BiFangKNT (Contributor) commented

Your memory usage figure is clearly wrong.

ginghalo commented

It's pretty clearly the DDR4: memory bandwidth has become the bottleneck that limits your speed.

oho-work commented

> Your memory usage figure is clearly wrong.

Was memory usage high when you tested? In my case only a dozen or so GB were in use.
With the free -mh command I only saw buff/cache taking up 400-odd GB; used was not much.

ginghalo commented

> Your memory usage figure is clearly wrong.
>
> Was memory usage high when you tested? In my case only a dozen or so GB were in use. With the free -mh command I only saw buff/cache taking up 400-odd GB; used was not much.

buff/cache plus used together is the real memory footprint; it's the same situation for everyone. If you look through earlier issues, this has been reported before.

yhfgyyf commented Feb 21, 2025

> I'd like an answer to this too. Why did the ktransformers team hard-code VRAM usage to stay under 16G? Isn't the goal to squeeze every bit of performance out of the hardware? If multi-GPU support is hard, making the single-GPU VRAM requirement configurable should at least be feasible, right?

You can write your own yaml file and put some of the mlp layers onto the GPU, but it doesn't really improve performance, because the bottleneck is the CPU and RAM.
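For illustration, a minimal sketch of the kind of rule being described here, assuming the same KTransformersExperts operator and kwargs that appear in the full multi-GPU config posted later in this thread; the layer range and device are placeholders to adapt to your own VRAM budget:

```yaml
# Illustrative only: route the MoE experts of layers 0-9 onto cuda:0 instead
# of the CPU. Class and kwarg names are copied from the rules shared below;
# adjust the layer range and device as needed.
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"      # run expert weights on the GPU during decode
      generate_op: "KExpertsMarlin"  # Marlin GPU kernel instead of the CPU expert op
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      out_device: "cuda:0"
  recursive: false                   # don't recursively inject submodules of the experts
```

As noted above, this mainly trades VRAM for GPU occupancy; if decoding is bound by CPU/DRAM bandwidth, tokens/s may barely change.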

txg1550759 (Author) commented

To raise GPU utilization, I assign the first 20 layers to run on the GPUs.
KExperts assignment:
Assign the KExperts-related modules (e.g. mlp.experts) of the first 10 layers (0-9) to cuda:0.
Assign the KExperts-related modules of layers 10-19 to cuda:1. This way each GPU handles at most 10 layers' worth of KExperts. In theory the KExperts take about
60G of memory; allowing for the small footprint of the other modules, VRAM usage across the two GPUs is kept to around 50G.

Usage: --optimize_rule_path ./ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml

If anyone is interested in testing this together, the yaml config is as follows:

cat DeepSeek-V3-Chat-multi-gpu-new.yaml

```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
- match:
    name: "^model\\.layers\\.(0|[1-9])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\.(?!self_attn\\.kv_b_proj).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
      out_device: "cuda:0"
  recursive: false
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:1"
      generate_op: "KExpertsMarlin"
      out_device: "cuda:1"
  recursive: false
- match:
    name: "^model\\.layers\\.(0|[1-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map:
        10: "cuda:1"
- match:
    name: "^model\\.layers\\.(0|[1-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.(10|11|12|13|14|15|16|17|18|19)\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
- match:
    name: "(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
```

sweihub commented Feb 21, 2025

@txg1550759 The file formatting is completely garbled; try wrapping it in three backticks (`) when you paste it.

mitometa commented

The current version really does have this problem: heavy CPU consumption and the GPU's capability is not fully used. If --force_RAG were also supported, it would be perfect.

snailfrying commented

Same here. With the current kt version, a single CPU plus a single GPU can run deepseek-r1, but it's very slow, and the only resource that gets saturated is the CPU. I don't know where to start optimizing the speed. Whether it's using higher-spec CPUs/GPUs or tuning kt's runtime parameters, could anyone share a solution for reaching 15 tokens/s?

