Replies: 2 comments
- For a start you can take a look here: https://github.com/cmp-nct/ggllm.cpp/blob/ggfalcon_dev/ggml-cuda.cu
  FP8 was a significant speed boost on my 4090 compared to FP16.
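  For readers curious what the FP8 path amounts to, here is a minimal, hypothetical sketch (not code taken from ggllm.cpp): weights are stored as 8-bit e4m3 values and expanded to float in registers, which halves weight-memory traffic versus FP16 in memory-bound mat-vec work. The kernel name `fp8_dot_row` and the data are made up for illustration; it assumes CUDA 11.8+ for `cuda_fp8.h`.

  ```cuda
  // Hypothetical sketch: FP8 (e4m3) weight row dotted with an FP32 activation
  // vector. Not the ggllm.cpp implementation; assumes CUDA 11.8+ (cuda_fp8.h).
  #include <cuda_fp8.h>
  #include <cuda_runtime.h>
  #include <cstdio>
  #include <vector>

  __global__ void fp8_dot_row(const __nv_fp8_e4m3 *w, const float *x,
                              float *out, int n) {
      // One block reduces one row; each thread accumulates a strided partial sum.
      float sum = 0.0f;
      for (int i = threadIdx.x; i < n; i += blockDim.x)
          sum += float(w[i]) * x[i];   // fp8 -> fp32 expansion happens in registers

      // Shared-memory tree reduction across the block (blockDim.x == 256 assumed).
      __shared__ float partial[256];
      partial[threadIdx.x] = sum;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0) *out = partial[0];
  }

  int main() {
      const int n = 4096;
      std::vector<float> w_f(n), x(n);
      for (int i = 0; i < n; ++i) { w_f[i] = 0.01f * (i % 7); x[i] = 0.5f; }

      // Quantize weights to fp8 on the host via the converting constructor.
      std::vector<__nv_fp8_e4m3> w_q(n);
      for (int i = 0; i < n; ++i) w_q[i] = __nv_fp8_e4m3(w_f[i]);

      __nv_fp8_e4m3 *d_w; float *d_x, *d_out;
      cudaMalloc(&d_w, n * sizeof(*d_w));
      cudaMalloc(&d_x, n * sizeof(float));
      cudaMalloc(&d_out, sizeof(float));
      cudaMemcpy(d_w, w_q.data(), n * sizeof(*d_w), cudaMemcpyHostToDevice);
      cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

      fp8_dot_row<<<1, 256>>>(d_w, d_x, d_out, n);

      float out = 0.0f;
      cudaMemcpy(&out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
      printf("dot = %f\n", out);
      cudaFree(d_w); cudaFree(d_x); cudaFree(d_out);
      return 0;
  }
  ```

  The bandwidth saving above is separate from the FP8 tensor-core question raised further down; on a 4090 the weight-loading side alone can already matter for token generation, which is largely memory-bound.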
- New here; thought of sharing a relevant development: Intel is working on speeding up GGML with lower-level optimizations. Code (the LLM runtime is here):
- Hi, currently ggml uses the cublas_v2 API for cublasGemmStridedBatchedEx and cublasGemmBatchedEx. Would it be possible to switch to the cublasLt API and use its support for INT8 and (new) FP8 tensor-core GEMM?
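  For context, a minimal standalone sketch of the kind of call the question is about, assuming cuBLAS 11+ and a GPU with INT8 tensor cores: an INT8 GEMM with int32 accumulation through `cublasLtMatmul`, using the TN layout (A transposed, B not) that the IMMA kernels traditionally require. This is illustrative only, not a patch against ggml-cuda.cu; the FP8 path would additionally need the `CUDA_R_8F_E4M3` data type and scaling attributes on Ada/Hopper.

  ```cuda
  // Illustrative sketch: INT8 GEMM via cublasLt (int32 accumulate), TN layout.
  // Assumes cuBLAS >= 11; link with -lcublasLt. Error checking omitted for brevity.
  #include <cublasLt.h>
  #include <cuda_runtime.h>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main() {
      const int m = 64, n = 64, k = 256;  // multiples of 4 to satisfy IMMA alignment

      // Column-major storage: with op(A)=T the stored A is k x m, B is k x n, C is m x n.
      std::vector<int8_t>  hA(k * m, 1), hB(k * n, 2);
      std::vector<int32_t> hC(m * n, 0);

      int8_t *dA, *dB; int32_t *dC;
      cudaMalloc(&dA, hA.size());
      cudaMalloc(&dB, hB.size());
      cudaMalloc(&dC, hC.size() * sizeof(int32_t));
      cudaMemcpy(dA, hA.data(), hA.size(), cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB.data(), hB.size(), cudaMemcpyHostToDevice);

      cublasLtHandle_t lt;
      cublasLtCreate(&lt);

      // Matmul descriptor: int32 accumulation, int32 alpha/beta, A transposed.
      cublasLtMatmulDesc_t op;
      cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32I, CUDA_R_32I);
      cublasOperation_t tA = CUBLAS_OP_T, tB = CUBLAS_OP_N;
      cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &tA, sizeof(tA));
      cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &tB, sizeof(tB));

      // Layouts describe the matrices as stored (before op is applied).
      cublasLtMatrixLayout_t lA, lB, lC;
      cublasLtMatrixLayoutCreate(&lA, CUDA_R_8I,  k, m, k);
      cublasLtMatrixLayoutCreate(&lB, CUDA_R_8I,  k, n, k);
      cublasLtMatrixLayoutCreate(&lC, CUDA_R_32I, m, n, m);

      int32_t alpha = 1, beta = 0;
      cublasLtMatmul(lt, op, &alpha, dA, lA, dB, lB, &beta, dC, lC, dC, lC,
                     /*algo=*/nullptr, /*workspace=*/nullptr, 0, /*stream=*/0);

      cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int32_t), cudaMemcpyDeviceToHost);
      printf("C[0] = %d (expected %d)\n", hC[0], 2 * k);

      cublasLtMatrixLayoutDestroy(lA); cublasLtMatrixLayoutDestroy(lB);
      cublasLtMatrixLayoutDestroy(lC); cublasLtMatmulDescDestroy(op);
      cublasLtDestroy(lt);
      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      return 0;
  }
  ```

  The batched and strided-batched variants the question mentions would map to the same descriptors plus the `CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT` and stride layout attributes, rather than separate entry points.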