Here are the Zhihu introduction and test results for lmdeploy.
Since the wiki can't receive PRs and only the owner can adjust the wiki directory, I forked Chinese-LLaMA-Alpaca-2 and added the pages there.
Here are the original markdown contents of the two documents:
lmdeploy Installation and Usage
lmdeploy supports transformer architectures (such as LLaMA, LLaMa2, InternLM, Vicuna, etc.) and currently supports fp16, int8, and int4.
I. Installation
Install the precompiled python package
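A minimal sketch of the install step, assuming the prebuilt wheel is pulled from PyPI into a CUDA-enabled Python environment:

```bash
# Install the prebuilt lmdeploy wheel from PyPI
# (assumes a CUDA-enabled Python environment is already set up)
pip install lmdeploy
```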
II. fp16 Inference
Convert the model to the lmdeploy inference format. Assuming the huggingface version of the LLaMa2 model has been downloaded to the /models/llama-2-7b-chat directory, the result will be stored in the workspace folder. You can then test chatting on the command line, or launch a WebUI with gradio to chat (see the commands sketched below).
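The commands below are a sketch based on the 2023-era lmdeploy CLI modules (lmdeploy.serve.turbomind.deploy, lmdeploy.turbomind.chat, lmdeploy.serve.gradio.app); the positional arguments and the registered model-name string are assumptions that may differ between releases, so check the official documentation for your version.

```bash
# Convert the huggingface checkpoint to lmdeploy's turbomind inference format.
# "llama2" is assumed to be a chat-template name registered in lmdeploy;
# the converted model is written to ./workspace by default.
python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat

# Chat with the converted model on the command line.
python3 -m lmdeploy.turbomind.chat ./workspace

# Or launch a gradio WebUI on top of the same workspace.
python3 -m lmdeploy.serve.gradio.app ./workspace
```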
lmdeploy also supports the original facebook model format and distributed inference for the 70B model; see the official lmdeploy documentation for usage.
III. kv cache int8 Quantization
lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.
First obtain the quantization parameters; the results are saved under workspace/triton_models/weights produced by the fp16 conversion, and the 7B model does not need tensor parallel. Then modify the inference configuration to enable kv cache int8: edit workspace/triton_models/weights/config.ini, change use_context_fmha to 0 to turn off flash attention, and set quant_policy to 4 to turn on kv cache quantization. Finally, run the test (see the sketch below).
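A sketch of the kv cache int8 steps; the kv_qparams module is only referenced in a comment because its flags vary across lmdeploy releases, and the sed edits assume the stock config.ini ships with use_context_fmha = 1 and quant_policy = 0.

```bash
# Step 1 (assumption: exact arguments vary by release, see
# `python3 -m lmdeploy.lite.apis.kv_qparams --help`): export the int8
# quantization parameters into workspace/triton_models/weights/.

# Step 2: flip the two switches in config.ini
# (assumes the stock values are use_context_fmha = 1 and quant_policy = 0).
sed -i 's/use_context_fmha = 1/use_context_fmha = 0/' workspace/triton_models/weights/config.ini
sed -i 's/quant_policy = 0/quant_policy = 4/' workspace/triton_models/weights/config.ini

# Step 3: re-run the terminal chat to verify kv cache int8 inference.
python3 -m lmdeploy.turbomind.chat ./workspace
```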
Click here to see the kv cache int8 quantization formulas and the accuracy and memory test report.
IV. weight int4 Quantization
lmdeploy implements weight int4 quantization based on the AWQ algorithm; compared with the fp16 version, it is 3.16x faster and GPU memory drops from 16G to 6.3G.
A llama2 model already optimized with the AWQ algorithm is available here for direct download.
For your own model, you can optimize it with the auto_awq tool. Assuming your huggingface model is saved in /models/llama-2-7b-chat, run the following commands to chat with the model in the terminal:
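A sketch of the int4 flow; the auto_awq flags (--model, --w_bits, --w_group_size, --work_dir) and the AWQ-specific conversion options are assumptions based on the 2023-era lmdeploy.lite tools, so verify them with --help for your release.

```bash
# Quantize your own huggingface checkpoint with the auto_awq tool
# (flag names are an assumption; see `python3 -m lmdeploy.lite.apis.auto_awq --help`).
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama-2-7b-chat-w4

# Convert the 4-bit weights with lmdeploy.serve.turbomind.deploy as in section II
# (passing your release's AWQ model-format option), then chat in the terminal.
python3 -m lmdeploy.turbomind.chat ./workspace
```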
Click here to see the memory and speed test results for weight int4 quantization.
Note that weight int4 and kv cache int8 do not conflict and can be enabled at the same time to save even more memory.
lmdeploy Usage
lmdeploy supports transformer architectures (such as LLaMA, LLaMa2, InternLM, Vicuna, etc.) and currently supports fp16, int8, and int4.
I. Installation
Install the precompiled python package
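A minimal sketch of the install step, assuming the prebuilt wheel is pulled from PyPI into a CUDA-enabled Python environment:

```bash
# Install the prebuilt lmdeploy wheel from PyPI
# (assumes a CUDA-enabled Python environment is already set up)
pip install lmdeploy
```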
II. fp16 Inference
Convert the model to the lmdeploy inference format. Assuming the huggingface version of the LLaMa2 model has been downloaded to the /models/llama-2-7b-chat directory, the result will be stored in the workspace folder. You can then test the chat on the command line, or start a gradio WebUI to chat (see the commands sketched below).
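The commands below are a sketch based on the 2023-era lmdeploy CLI modules (lmdeploy.serve.turbomind.deploy, lmdeploy.turbomind.chat, lmdeploy.serve.gradio.app); the positional arguments and the registered model-name string are assumptions that may differ between releases, so check the official documentation for your version.

```bash
# Convert the huggingface checkpoint to lmdeploy's turbomind inference format.
# "llama2" is assumed to be a chat-template name registered in lmdeploy;
# the converted model is written to ./workspace by default.
python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat

# Chat with the converted model on the command line.
python3 -m lmdeploy.turbomind.chat ./workspace

# Or launch a gradio WebUI on top of the same workspace.
python3 -m lmdeploy.serve.gradio.app ./workspace
```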
lmdeploy also supports the original Facebook model format and distributed inference for the 70B model; for usage, please refer to the official lmdeploy documentation.
III. kv cache int8 Quantization
lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.
First obtain the quantization parameters; the results are saved under workspace/triton_models/weights produced by the fp16 conversion, and there is no need for tensor parallel for the 7B model. Then modify the inference configuration to enable kv cache int8: edit workspace/triton_models/weights/config.ini, change use_context_fmha to 0 to turn off flash attention, and set quant_policy to 4 to enable kv cache quantization. Finally, execute the test (see the sketch below).
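A sketch of the kv cache int8 steps; the kv_qparams module is only referenced in a comment because its flags vary across lmdeploy releases, and the sed edits assume the stock config.ini ships with use_context_fmha = 1 and quant_policy = 0.

```bash
# Step 1 (assumption: exact arguments vary by release, see
# `python3 -m lmdeploy.lite.apis.kv_qparams --help`): export the int8
# quantization parameters into workspace/triton_models/weights/.

# Step 2: flip the two switches in config.ini
# (assumes the stock values are use_context_fmha = 1 and quant_policy = 0).
sed -i 's/use_context_fmha = 1/use_context_fmha = 0/' workspace/triton_models/weights/config.ini
sed -i 's/quant_policy = 0/quant_policy = 4/' workspace/triton_models/weights/config.ini

# Step 3: re-run the terminal chat to verify kv cache int8 inference.
python3 -m lmdeploy.turbomind.chat ./workspace
```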
Click here to view the kv cache int8 quantization implementation formula, accuracy and memory test report.
IV. weight int4 Quantization
lmdeploy implements weight int4 quantization based on the AWQ algorithm; compared with the fp16 version, it is 3.16x faster and memory is reduced from 16G to 6.3G.
A llama2 model already optimized with the AWQ algorithm is available here for direct download.
For your own model, you can optimize it with the auto_awq tool. Assuming your huggingface model is saved in /models/llama-2-7b-chat, run the following commands to chat with the model in the terminal:
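A sketch of the int4 flow; the auto_awq flags (--model, --w_bits, --w_group_size, --work_dir) and the AWQ-specific conversion options are assumptions based on the 2023-era lmdeploy.lite tools, so verify them with --help for your release.

```bash
# Quantize your own huggingface checkpoint with the auto_awq tool
# (flag names are an assumption; see `python3 -m lmdeploy.lite.apis.auto_awq --help`).
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama-2-7b-chat-w4

# Convert the 4-bit weights with lmdeploy.serve.turbomind.deploy as in section II
# (passing your release's AWQ model-format option), then chat in the terminal.
python3 -m lmdeploy.turbomind.chat ./workspace
```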
Click here to view the memory and speed test results of weight int4 quantization.
Additionally, weight int4 and kv cache int8 do not conflict and can be turned on at the same time to save more memory.