llamacpp_en
This wiki walks you through the detailed steps of model quantization and local deployment using the llama.cpp tool. Windows users may need to install a build tool such as cmake. For quick local deployment, it is recommended to use the instruction-fine-tuned Alpaca-2 model. If possible, use the 6-bit or 8-bit quantized models for better results. The following instructions use the Chinese-Alpaca-2-7B model as an example. Before running, ensure that:
- Your system has the `make` (built in on macOS/Linux) or `cmake` (needs to be installed on Windows) build tool
- It is recommended to use Python 3.10 or above to compile and run this tool
- (Optional) If you have downloaded an old repository, it is recommended to pull the latest code with `git pull` and clean the build with `make clean`
- Pull the latest code from the llama.cpp repository:

```bash
$ git clone https://github.com/ggerganov/llama.cpp
```

- Compile the llama.cpp project to generate the `./main` (for inference) and `./quantize` (for quantization) binaries:

```bash
$ make
```
- Windows/Linux users: It is recommended to compile with BLAS (or cuBLAS if you have a GPU) to speed up prompt processing; see the example build commands after this list and llama.cpp#blas-build for reference.
- macOS users: No additional steps are required. llama.cpp is optimized for ARM NEON and has BLAS enabled by default. For M-series chips, it is recommended to use Metal for GPU inference, which significantly improves speed; to do this, change the compilation command to `LLAMA_METAL=1 make`. See llama.cpp#metal-build for reference.
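A minimal sketch of these builds is shown below; the flag names follow the llama.cpp Makefile documentation at the time of writing, so check llama.cpp#blas-build if they have since changed.

```bash
# CPU build with OpenBLAS to speed up prompt processing
$ make clean && make LLAMA_OPENBLAS=1

# NVIDIA GPU build with cuBLAS (requires the CUDA toolkit)
$ make clean && make LLAMA_CUBLAS=1

# macOS M-series build with Metal GPU support (as noted above)
$ make clean && LLAMA_METAL=1 make
```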
(💡 You can also directly download the pre-quantized models: gguf models)
Convert the full model weights (either `.pth` format or Hugging Face `.bin` format) to GGML's FP16 format, generating a file at `zh-models/7B/ggml-model-f16.gguf`. Then further quantize the FP16 model to a 4-bit model, generating a quantized model file at `zh-models/7B/ggml-model-q4_0.gguf`. A performance comparison of the different quantization methods can be found at the end of this wiki.
```bash
$ python convert.py zh-models/7B/
$ ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q4_0.gguf q4_0
```
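The same `./quantize` binary can produce the other quantization types compared in the tables at the end of this wiki, for example the 6-bit and 8-bit variants; run `./quantize` without arguments to list the types supported by your build.

```bash
# 6-bit and 8-bit quantization, which stay closest to the F16 model
$ ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q6_k.gguf q6_k
$ ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q8_0.gguf q8_0
```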
Since the Alpaca-2 models released by this project use the Llama-2-chat instruction template, first copy `scripts/llama-cpp/chat.sh` from this project to the root directory of llama.cpp. The content of the `chat.sh` file is shown below; the chat template and some default parameters are embedded in it and can be modified to suit your needs. The `--rope-scale` parameter no longer needs to be set manually, as it is handled explicitly during the conversion process, so a long-context model can be loaded just like a regular one.
- GPU inference: if compiled with cuBLAS/Metal, specify the number of layers to offload, e.g., `-ngl 40` means offloading 40 layers of model parameters to the GPU (see the sketch after the script below).
```bash
#!/bin/bash
# temporary script to chat with Chinese Alpaca-2 model
# usage: ./chat.sh alpaca2-ggml-model-path your-first-instruction

SYSTEM='You are a helpful assistant. 你是一个乐于助人的助手。'
FIRST_INSTRUCTION=$2

./main -m $1 \
--color -i -c 4096 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
--in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -p \
"[INST] <<SYS>>
$SYSTEM
<</SYS>>

$FIRST_INSTRUCTION [/INST]"
```
Start chatting with the following commands.

```bash
$ chmod +x chat.sh
$ ./chat.sh zh-models/7B/ggml-model-q4_0.gguf '请列举5条文明乘车的建议'
```
After the `>` prompt appears, enter your instruction. Use `cmd/ctrl+c` to interrupt output, and use `\` at the end of a line for multiline input. For help and parameter descriptions, run `./main -h`. Here are some common parameters:
- `-c` Controls the context length. The larger the value, the longer the conversation history that can be referenced (default: 512)
- `-ins` Enables instruction mode for ChatGPT-like conversation
- `-f` Specifies a prompt template; for the Alpaca model, load `prompts/alpaca.txt`
- `-n` Controls the maximum length of the generated reply (default: 128)
- `-b` Controls the batch size (default: 512); can be increased slightly
- `-t` Controls the number of threads (default: 8); can be increased slightly
- `--repeat_penalty` Controls how severely repeated text in the generated reply is penalized
- `--temp` Temperature; the lower the value, the less randomness in the reply, and vice versa
- `--top_p, --top_k` Control the decoding sampling parameters
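As a hedged illustration of how these flags combine into a single command (the model path and values are only examples):

```bash
# Instruction-mode chat with a 4K context, longer replies, and custom sampling settings
$ ./main -m zh-models/7B/ggml-model-q4_0.gguf \
    -c 4096 -n 512 -b 512 -t 8 \
    --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
    -ins
```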
For a more detailed official description, please refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main
The table below provides related statistics for different quantization methods for reference. The inference model is Chinese-LLaMA-2-7B, the test device is an M1 Max chip (8 performance cores, 2 efficiency cores), and both CPU speed (8 threads) and GPU speed are reported in ms/tok. The reported speed is the `eval time`, i.e., the speed at which the model generates responses. For more information about quantization parameters, please refer to the llama.cpp quantization table.
Key findings:
- The default quantization method is q4_0, which is the fastest but has the greatest loss; use q4_k instead.
- Speed is highest when the number of threads `-t` equals the number of physical cores; exceeding it slows things down (on the M1 Max, going from 8 to 10 threads makes generation three times slower).
- If your machine has sufficient resources and you do not have stringent speed requirements, you can use q8_0 or Q6_K, which come very close to the performance of the F16 model.
| | F16 | Q2_K | Q3_K | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPL | 9.128 | 11.107 | 9.576 | 9.476 | 9.576 | 9.240 | 9.156 | 9.213 | 9.168 | 9.132 | 9.129 |
| Size | 12.91G | 2.41G | 3.18G | 3.69G | 4.08G | 3.92G | 4.47G | 4.86G | 4.59G | 5.30G | 6.81G |
| CPU Speed (ms/tok) | 117 | 42 | 51 | 39 | 44 | 43 | 48 | 51 | 50 | 54 | 65 |
| GPU Speed (ms/tok) | 53 | 19 | 21 | 17 | 18 | 20 | x | x | 25 | 26 | x |
| | F16 | Q2_K | Q3_K | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPL | 8.810 | 12.804 | 9.738 | 9.371 | 9.549 | 8.952 | 8.988 | 8.924 | 8.858 | 8.820 | 8.811 |
| Size | 24.69G | 5.26G | 6.02G | 7.01G | 7.77G | 7.48G | 8.52G | 9.28G | 8.76G | 10.13G | 13.05G |
| CPU Speed (ms/tok) | - | 75 | 90 | 76 | 80 | 80 | 91 | 99 | 92 | 104 | 125 |
| GPU Speed (ms/tok) | - | 31 | 37 | 30 | 32 | 36 | x | x | 47 | 51 | x |
llama.cpp also provides the functionality to set up a server, which can be used for API calls, setting up simple demos, and other purposes.
To launch the server, run the following command. The `./server` binary is in the root directory of llama.cpp, and the service listens on `127.0.0.1:8080` by default. Here you specify the model path and the context window size; if you need GPU decoding, also specify the `-ngl` parameter.
```bash
$ ./server -m ./zh-models/7B/ggml-model-q4_0.gguf -c 4096 -ngl 999
```
After the service is launched, you can call it in various ways, for example with the `curl` command. Below is an example script (also located at `scripts/llamacpp/server_curl_example.sh`) that wraps the Alpaca-2 template and uses `curl` for API access.
```bash
# server_curl_example.sh
SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。'
# SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。请你提供专业、有逻辑、内容真实、有价值的详细回复。' # Try this one, if you prefer longer responses.
INSTRUCTION=$1
ALL_PROMPT="[INST] <<SYS>>\n$SYSTEM_PROMPT\n<</SYS>>\n\n$INSTRUCTION [/INST]"
CURL_DATA="{\"prompt\": \"$ALL_PROMPT\",\"n_predict\": 128}"

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data "$CURL_DATA"
```
For example:

```bash
$ bash server_curl_example.sh '请列举5条文明乘车的建议'
```
Wait for the response to be returned.
```json
{
"content": " 以下是五个文明乘车的建议:1)注意礼貌待人,不要大声喧哗或使用不雅用语;2)保持车厢整洁卫生,丢弃垃圾时要及时处理;3)不影响他人休息和正常工作时间,避免在车厢内做剧烈运动、吃零食等有异味的行为;4)遵守乘车纪律,尊重公共交通工具的规则和制度;5)若遇到突发状况或紧急情况,请保持冷静并配合相关部门的工作。这些建议旨在提高公民道德水平和社会文明程度,共同营造一个和谐、舒适的乘坐环境。",
"generation_settings": {
"frequency_penalty": 0.0,
"ignore_eos": false,
"logit_bias": [],
"mirostat": 0,
"mirostat_eta": 0.10000000149011612,
"mirostat_tau": 5.0,
"model": "zh-alpaca2-models/7b/ggml-model-q6_k.gguf",
"n_ctx": 4096,
"n_keep": 0,
"n_predict": 128,
"n_probs": 0,
"penalize_nl": true,
"presence_penalty": 0.0,
"repeat_last_n": 64,
"repeat_penalty": 1.100000023841858,
"seed": 4294967295,
"stop": [],
"stream": false,
"temp": 0.800000011920929,
"tfs_z": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"typical_p": 1.0
},
"model": "zh-alpaca2-models/7b/ggml-model-q6_k.gguf",
"prompt": " [INST] <<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n请列举5条文明乘车的建议 [/INST]",
"stop": true,
"stopped_eos": true,
"stopped_limit": false,
"stopped_word": false,
"stopping_word": "",
"timings": {
"predicted_ms": 3386.748,
"predicted_n": 120,
"predicted_per_second": 35.432219934875576,
"predicted_per_token_ms": 28.2229,
"prompt_ms": 0.0,
"prompt_n": 120,
"prompt_per_second": null,
"prompt_per_token_ms": 0.0
},
"tokens_cached": 162,
"tokens_evaluated": 43,
"tokens_predicted": 120,
"truncated": false
}
```
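If you only need the generated text, one option (assuming the `jq` tool is installed) is to pipe the response through it and extract the `content` field:

```bash
$ bash server_curl_example.sh '请列举5条文明乘车的建议' | jq -r '.content'
```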
For a more detailed usage tutorial, please visit: https://github.com/ggerganov/llama.cpp/tree/master/examples/server