llamacpp_en
This wiki walks you through the detailed steps of model quantization and local deployment using the llama.cpp tool. Windows users may need to install a build tool such as cmake. For quick local deployment, it is recommended to use the instruction-fine-tuned Alpaca-2 model. If possible, use the 6-bit or 8-bit quantized models for better results. The following instructions use the Chinese-Alpaca-2-7B model as an example. Before running, ensure that:
- Your system has the `make` (built in on macOS/Linux) or `cmake` (needs to be installed on Windows) build tool
- It is recommended to use Python 3.10 or above to compile and run this tool
- (Optional) If you have downloaded an old repository, it is recommended to pull the latest code with `git pull` and clean the build with `make clean`
- Pull the latest code from the llama.cpp repository:

```bash
$ git clone https://github.com/ggerganov/llama.cpp
```

- Compile the llama.cpp project to generate the `./main` (for inference) and `./quantize` (for quantization) binaries:

```bash
$ make
```
- Windows/Linux users: It is recommended to compile with BLAS (or cuBLAS if you have a GPU) to speed up prompt processing; see the example build commands after this list and llama.cpp#blas-build for reference.
- macOS users: No additional steps are required. llama.cpp is optimized for ARM NEON and has BLAS enabled by default. For M-series chips, it is recommended to use Metal for GPU inference, which significantly improves speed; to do this, change the compilation command to `LLAMA_METAL=1 make`. See llama.cpp#metal-build for reference.
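A minimal sketch of these builds is shown below; the flag names follow the llama.cpp Makefile documentation at the time of writing, so check llama.cpp#blas-build if they have since changed.

```bash
# CPU build with OpenBLAS to speed up prompt processing
$ make clean && make LLAMA_OPENBLAS=1

# NVIDIA GPU build with cuBLAS (requires the CUDA toolkit)
$ make clean && make LLAMA_CUBLAS=1

# macOS M-series build with Metal GPU support (as noted above)
$ make clean && LLAMA_METAL=1 make
```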
(💡 You can also directly download the pre-quantized models: gguf models)
Convert the full model weights (either `.pth` format or Hugging Face `.bin` format) to GGML's FP16 format, generating a file at `zh-models/7B/ggml-model-f16.gguf`. Then further quantize the FP16 model to a 4-bit model, generating a quantized model file at `zh-models/7B/ggml-model-q4_0.gguf`. A performance comparison of the different quantization methods can be found at the end of this wiki.
```bash
$ python convert.py zh-models/7B/
$ ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q4_0.gguf q4_0
```
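The same `./quantize` binary can produce the other quantization types compared in the tables at the end of this wiki, for example the 6-bit and 8-bit variants; run `./quantize` without arguments to list the types supported by your build.

```bash
# 6-bit and 8-bit quantization, which stay closest to the F16 model
$ ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q6_k.gguf q6_k
$ ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q8_0.gguf q8_0
```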
Since the Alpaca-2 models released by this project use the Llama-2-chat instruction template, first copy `scripts/llama-cpp/chat.sh` from this project to the root directory of llama.cpp. The content of the `chat.sh` file is shown below; the chat template and some default parameters are embedded in it and can be modified to suit your needs. The `--rope-scale` parameter no longer needs to be set manually, as it is handled explicitly during the conversion process, so a long-context model can be loaded just like a regular one.
- GPU inference: if compiled with cuBLAS/Metal, specify the number of layers to offload, e.g., `-ngl 40` means offloading 40 layers of model parameters to the GPU (see the sketch after the script below).
```bash
#!/bin/bash
# temporary script to chat with Chinese Alpaca-2 model
# usage: ./chat.sh alpaca2-ggml-model-path your-first-instruction

SYSTEM='You are a helpful assistant. 你是一个乐于助人的助手。'
FIRST_INSTRUCTION=$2

./main -m $1 \
--color -i -c 4096 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
--in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -p \
"[INST] <<SYS>>
$SYSTEM
<</SYS>>

$FIRST_INSTRUCTION [/INST]"
```
Start chatting with the following commands.

```bash
$ chmod +x chat.sh
$ ./chat.sh zh-models/7B/ggml-model-q4_0.gguf '请列举5条文明乘车的建议'
```
After the `>` prompt appears, enter your instruction. Use `cmd/ctrl+c` to interrupt output, and use `\` at the end of a line for multiline input. For help and parameter descriptions, run `./main -h`. Here are some common parameters:
- `-c` Controls the context length. The larger the value, the longer the conversation history that can be referenced (default: 512)
- `-ins` Enables instruction mode for ChatGPT-like conversation
- `-f` Specifies a prompt template; for the Alpaca model, load `prompts/alpaca.txt`
- `-n` Controls the maximum length of the generated reply (default: 128)
- `-b` Controls the batch size (default: 512); can be increased slightly
- `-t` Controls the number of threads (default: 8); can be increased slightly
- `--repeat_penalty` Controls how severely repeated text in the generated reply is penalized
- `--temp` Temperature; the lower the value, the less randomness in the reply, and vice versa
- `--top_p, --top_k` Control the decoding sampling parameters
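As a hedged illustration of how these flags combine into a single command (the model path and values are only examples):

```bash
# Instruction-mode chat with a 4K context, longer replies, and custom sampling settings
$ ./main -m zh-models/7B/ggml-model-q4_0.gguf \
    -c 4096 -n 512 -b 512 -t 8 \
    --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
    -ins
```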
For a more detailed official description, please refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main
The table below provides related statistics for different quantization methods for reference. The inference model is Chinese-LLaMA-2-7B, the test device is an M1 Max chip (8 performance cores, 2 efficiency cores), and both CPU speed (8 threads) and GPU speed are reported in ms/tok. The reported speed is the `eval time`, i.e., the speed at which the model generates responses. For more information about quantization parameters, please refer to the llama.cpp quantization table.
Key findings:
- The default quantization method is q4_0, which is the fastest but has the greatest loss; use q4_k instead.
- Speed is highest when the number of threads `-t` equals the number of physical cores; exceeding it slows things down (on the M1 Max, going from 8 to 10 threads makes generation three times slower).
- If your machine has sufficient resources and you do not have stringent speed requirements, you can use q8_0 or Q6_K, which come very close to the performance of the F16 model.
| | F16 | Q2_K | Q3_K | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPL | 9.128 | 11.107 | 9.576 | 9.476 | 9.576 | 9.240 | 9.156 | 9.213 | 9.168 | 9.132 | 9.129 |
| Size | 12.91G | 2.41G | 3.18G | 3.69G | 4.08G | 3.92G | 4.47G | 4.86G | 4.59G | 5.30G | 6.81G |
| CPU Speed (ms/tok) | 117 | 42 | 51 | 39 | 44 | 43 | 48 | 51 | 50 | 54 | 65 |
| GPU Speed (ms/tok) | 53 | 19 | 21 | 17 | 18 | 20 | x | x | 25 | 26 | x |
| | F16 | Q2_K | Q3_K | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPL | 8.810 | 12.804 | 9.738 | 9.371 | 9.549 | 8.952 | 8.988 | 8.924 | 8.858 | 8.820 | 8.811 |
| Size | 24.69G | 5.26G | 6.02G | 7.01G | 7.77G | 7.48G | 8.52G | 9.28G | 8.76G | 10.13G | 13.05G |
| CPU Speed (ms/tok) | - | 75 | 90 | 76 | 80 | 80 | 91 | 99 | 92 | 104 | 125 |
| GPU Speed (ms/tok) | - | 31 | 37 | 30 | 32 | 36 | x | x | 47 | 51 | x |
llama.cpp also provides the functionality to set up a server, which can be used for API calls, setting up simple demos, and other purposes.
To launch the server, run the following command. The `./server` binary is in the root directory of llama.cpp, and the service listens on `127.0.0.1:8080` by default. Here you specify the model path and the context window size; if you need GPU decoding, also specify the `-ngl` parameter.
```bash
$ ./server -m ./zh-models/7B/ggml-model-q4_0.gguf -c 4096 -ngl 999
```
After the service is launched, you can call it in various ways, for example with the `curl` command. Below is an example script (also located at `scripts/llamacpp/server_curl_example.sh`) that wraps the Alpaca-2 template and uses `curl` for API access.
```bash
# server_curl_example.sh
SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。'
# SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。请你提供专业、有逻辑、内容真实、有价值的详细回复。' # Try this one, if you prefer longer responses.
INSTRUCTION=$1
ALL_PROMPT="[INST] <<SYS>>\n$SYSTEM_PROMPT\n<</SYS>>\n\n$INSTRUCTION [/INST]"
CURL_DATA="{\"prompt\": \"$ALL_PROMPT\",\"n_predict\": 128}"

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data "$CURL_DATA"
```
For example:

```bash
$ bash server_curl_example.sh '请列举5条文明乘车的建议'
```
Wait for the response to be returned.
```json
{
"content": " 以下是五个文明乘车的建议:1)注意礼貌待人,不要大声喧哗或使用不雅用语;2)保持车厢整洁卫生,丢弃垃圾时要及时处理;3)不影响他人休息和正常工作时间,避免在车厢内做剧烈运动、吃零食等有异味的行为;4)遵守乘车纪律,尊重公共交通工具的规则和制度;5)若遇到突发状况或紧急情况,请保持冷静并配合相关部门的工作。这些建议旨在提高公民道德水平和社会文明程度,共同营造一个和谐、舒适的乘坐环境。",
"generation_settings": {
"frequency_penalty": 0.0,
"ignore_eos": false,
"logit_bias": [],
"mirostat": 0,
"mirostat_eta": 0.10000000149011612,
"mirostat_tau": 5.0,
"model": "zh-alpaca2-models/7b/ggml-model-q6_k.gguf",
"n_ctx": 4096,
"n_keep": 0,
"n_predict": 128,
"n_probs": 0,
"penalize_nl": true,
"presence_penalty": 0.0,
"repeat_last_n": 64,
"repeat_penalty": 1.100000023841858,
"seed": 4294967295,
"stop": [],
"stream": false,
"temp": 0.800000011920929,
"tfs_z": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"typical_p": 1.0
},
"model": "zh-alpaca2-models/7b/ggml-model-q6_k.gguf",
"prompt": " [INST] <<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n请列举5条文明乘车的建议 [/INST]",
"stop": true,
"stopped_eos": true,
"stopped_limit": false,
"stopped_word": false,
"stopping_word": "",
"timings": {
"predicted_ms": 3386.748,
"predicted_n": 120,
"predicted_per_second": 35.432219934875576,
"predicted_per_token_ms": 28.2229,
"prompt_ms": 0.0,
"prompt_n": 120,
"prompt_per_second": null,
"prompt_per_token_ms": 0.0
},
"tokens_cached": 162,
"tokens_evaluated": 43,
"tokens_predicted": 120,
"truncated": false
}
```
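If you only need the generated text, one option (assuming the `jq` tool is installed) is to pipe the response through it and extract the `content` field:

```bash
$ bash server_curl_example.sh '请列举5条文明乘车的建议' | jq -r '.content'
```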
For a more detailed usage tutorial, please visit: https://github.com/ggerganov/llama.cpp/tree/master/examples/server