Replies: 2 comments
- For a start you can take a look here: https://github.com/cmp-nct/ggllm.cpp/blob/ggfalcon_dev/ggml-cuda.cu
  FP8 was a significant speed boost on my 4090 compared to FP16.
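  For readers curious what the FP8 path amounts to, here is a minimal, hypothetical sketch (not code taken from ggllm.cpp): weights are stored as 8-bit e4m3 values and expanded to float in registers, which halves weight-memory traffic versus FP16 in memory-bound mat-vec work. The kernel name `fp8_dot_row` and the data are made up for illustration; it assumes CUDA 11.8+ for `cuda_fp8.h`.

  ```cuda
  // Hypothetical sketch: FP8 (e4m3) weight row dotted with an FP32 activation
  // vector. Not the ggllm.cpp implementation; assumes CUDA 11.8+ (cuda_fp8.h).
  #include <cuda_fp8.h>
  #include <cuda_runtime.h>
  #include <cstdio>
  #include <vector>

  __global__ void fp8_dot_row(const __nv_fp8_e4m3 *w, const float *x,
                              float *out, int n) {
      // One block reduces one row; each thread accumulates a strided partial sum.
      float sum = 0.0f;
      for (int i = threadIdx.x; i < n; i += blockDim.x)
          sum += float(w[i]) * x[i];   // fp8 -> fp32 expansion happens in registers

      // Shared-memory tree reduction across the block (blockDim.x == 256 assumed).
      __shared__ float partial[256];
      partial[threadIdx.x] = sum;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0) *out = partial[0];
  }

  int main() {
      const int n = 4096;
      std::vector<float> w_f(n), x(n);
      for (int i = 0; i < n; ++i) { w_f[i] = 0.01f * (i % 7); x[i] = 0.5f; }

      // Quantize weights to fp8 on the host via the converting constructor.
      std::vector<__nv_fp8_e4m3> w_q(n);
      for (int i = 0; i < n; ++i) w_q[i] = __nv_fp8_e4m3(w_f[i]);

      __nv_fp8_e4m3 *d_w; float *d_x, *d_out;
      cudaMalloc(&d_w, n * sizeof(*d_w));
      cudaMalloc(&d_x, n * sizeof(float));
      cudaMalloc(&d_out, sizeof(float));
      cudaMemcpy(d_w, w_q.data(), n * sizeof(*d_w), cudaMemcpyHostToDevice);
      cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

      fp8_dot_row<<<1, 256>>>(d_w, d_x, d_out, n);

      float out = 0.0f;
      cudaMemcpy(&out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
      printf("dot = %f\n", out);
      cudaFree(d_w); cudaFree(d_x); cudaFree(d_out);
      return 0;
  }
  ```

  The bandwidth saving above is separate from the FP8 tensor-core question raised further down; on a 4090 the weight-loading side alone can already matter for token generation, which is largely memory-bound.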
- New here; thought of sharing a relevant development: Intel is working on speeding up GGML with lower-level optimizations. Code (the LLM runtime is here):
- Hi, currently ggml uses the cublas_v2 API for cublasGemmStridedBatchedEx and cublasGemmBatchedEx. Would it be possible to switch to the cublasLt API and use its support for INT8 and (new) FP8 tensor-core GEMM?
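  For context, a minimal standalone sketch of the kind of call the question is about, assuming cuBLAS 11+ and a GPU with INT8 tensor cores: an INT8 GEMM with int32 accumulation through `cublasLtMatmul`, using the TN layout (A transposed, B not) that the IMMA kernels traditionally require. This is illustrative only, not a patch against ggml-cuda.cu; the FP8 path would additionally need the `CUDA_R_8F_E4M3` data type and scaling attributes on Ada/Hopper.

  ```cuda
  // Illustrative sketch: INT8 GEMM via cublasLt (int32 accumulate), TN layout.
  // Assumes cuBLAS >= 11; link with -lcublasLt. Error checking omitted for brevity.
  #include <cublasLt.h>
  #include <cuda_runtime.h>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main() {
      const int m = 64, n = 64, k = 256;  // multiples of 4 to satisfy IMMA alignment

      // Column-major storage: with op(A)=T the stored A is k x m, B is k x n, C is m x n.
      std::vector<int8_t>  hA(k * m, 1), hB(k * n, 2);
      std::vector<int32_t> hC(m * n, 0);

      int8_t *dA, *dB; int32_t *dC;
      cudaMalloc(&dA, hA.size());
      cudaMalloc(&dB, hB.size());
      cudaMalloc(&dC, hC.size() * sizeof(int32_t));
      cudaMemcpy(dA, hA.data(), hA.size(), cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB.data(), hB.size(), cudaMemcpyHostToDevice);

      cublasLtHandle_t lt;
      cublasLtCreate(&lt);

      // Matmul descriptor: int32 accumulation, int32 alpha/beta, A transposed.
      cublasLtMatmulDesc_t op;
      cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32I, CUDA_R_32I);
      cublasOperation_t tA = CUBLAS_OP_T, tB = CUBLAS_OP_N;
      cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &tA, sizeof(tA));
      cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &tB, sizeof(tB));

      // Layouts describe the matrices as stored (before op is applied).
      cublasLtMatrixLayout_t lA, lB, lC;
      cublasLtMatrixLayoutCreate(&lA, CUDA_R_8I,  k, m, k);
      cublasLtMatrixLayoutCreate(&lB, CUDA_R_8I,  k, n, k);
      cublasLtMatrixLayoutCreate(&lC, CUDA_R_32I, m, n, m);

      int32_t alpha = 1, beta = 0;
      cublasLtMatmul(lt, op, &alpha, dA, lA, dB, lB, &beta, dC, lC, dC, lC,
                     /*algo=*/nullptr, /*workspace=*/nullptr, 0, /*stream=*/0);

      cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int32_t), cudaMemcpyDeviceToHost);
      printf("C[0] = %d (expected %d)\n", hC[0], 2 * k);

      cublasLtMatrixLayoutDestroy(lA); cublasLtMatrixLayoutDestroy(lB);
      cublasLtMatrixLayoutDestroy(lC); cublasLtMatmulDescDestroy(op);
      cublasLtDestroy(lt);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      return 0;
  }
  ```

  The batched and strided-batched variants the question mentions would map to the same descriptors plus the `CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT` and stride layout attributes, rather than separate entry points.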