WIP docs(README): add lmdeploy #152

Open
wants to merge 1 commit into base: main


tpoisonooo

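Here are the Zhihu introduction and the test results for lmdeploy.

Because the wiki cannot accept PRs and only the owner can adjust the wiki directory, I forked Chinese-LLaMA-Alpaca-2 and added

Below is the raw markdown content of the two documents: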

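lmdeploy Installation and Usage

lmdeploy supports transformer-based models (such as LLaMA, LLaMA2, InternLM, and Vicuna) and currently supports fp16, int8, and int4.

I. Installation

Install the precompiled Python package: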

python3 -m pip install lmdeploy

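II. fp16 Inference

Convert the model to the lmdeploy inference format. This assumes the huggingface version of the LLaMA2 model has been downloaded to the /models/llama-2-7b-chat directory; the result is stored in the workspace folder.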

python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat

Test the chat from the command line:

python3 -m lmdeploy.turbomind.chat ./workspace
..
double enter to end input >>> who are you

..
Hello! I'm just an AI assistant ..

You can also launch a Gradio WebUI to chat:

python3 -m lmdeploy.serve.gradio.app ./workspace

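lmdeploy also supports the original Facebook model format and distributed inference for the 70B model. For usage, please refer to the official lmdeploy documentation.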

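III. kv cache int8 Quantization

lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.

First, obtain the quantization parameters. The results are saved under workspace/triton_models/weights, which was produced by the fp16 conversion. The 7B model does not need tensor parallelism.

# --work_dir:      huggingface-format model
# --turbomind_dir: directory where the results are saved
# --kv_sym False:  use asymmetric quantization
# --num_tp:        number of tensor-parallel GPUs
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1

Then modify the inference configuration to enable kv cache int8 by editing workspace/triton_models/weights/config.ini:

  • Change use_context_fmha to 0 to turn off flash attention
  • Set quant_policy to 4 to enable kv cache quantization

Finally, run the test:

python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the kv cache int8 quantization formulas, accuracy, and memory test report.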

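IV. weight int4 Quantization

lmdeploy implements weight int4 quantization based on the AWQ algorithm. Compared with the fp16 version, it runs 3.16 times as fast, and GPU memory drops from 16G to 6.3G.

An AWQ-optimized version of the original llama2 model is available for direct download:

git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4

For your own model, you can use the auto_awq tool to optimize it. This assumes your huggingface model is saved in /models/llama-2-7b-chat:

# --w_bits:       number of bits for weight quantization
# --w_group_size: group size used for the weight quantization statistics
# --work_dir:     directory where the quantization parameters are saved
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama2-chat-7b-w4

Run the following commands to chat with the model in the terminal:

## Convert the model's layout and store it in the default path ./workspace
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name llama2 \
    --model-path ./llama2-chat-7b-w4 \
    --model-format awq \
    --group-size 128

## Inference
python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the memory and speed test results of weight int4 quantization.

Additionally, weight int4 and kv cache int8 do not conflict and can be enabled together to save more memory.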

lmdeploy Usage

lmdeploy supports transformer-based models (such as LLaMA, LLaMA2, InternLM, and Vicuna) and currently supports fp16, int8, and int4.

I. Installation

Install the precompiled Python package:

python3 -m pip install lmdeploy

II. fp16 Inference

Convert the model to the lmdeploy inference format. This assumes the huggingface version of the LLaMA2 model has been downloaded to the /models/llama-2-7b-chat directory; the result is stored in the workspace folder.

python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat

Test the chat from the command line:

python3 -m lmdeploy.turbomind.chat ./workspace
..
double enter to end input >>> who are you

..
Hello! I'm just an AI assistant ..

You can also launch a Gradio WebUI to chat:

python3 -m lmdeploy.serve.gradio.app ./workspace

lmdeploy also supports the original Facebook model format and distributed inference for the 70B model. For usage, please refer to the official lmdeploy documentation.

III. kv cache int8 Quantization

lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.

First, obtain the quantization parameters. The results are saved under workspace/triton_models/weights, which was produced by the fp16 conversion. The 7B model does not need tensor parallelism.

# --work_dir:      huggingface-format model
# --turbomind_dir: directory where the results are saved
# --kv_sym False:  use asymmetric quantization
# --num_tp:        number of tensor-parallel GPUs
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir /models/llama-2-7b-chat \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1
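
To make the `--kv_sym False` choice concrete, here is a minimal pure-Python sketch that compares asymmetric and symmetric int8 quantization on a skewed range. It is a generic illustration and not lmdeploy's kernel code: the function names are hypothetical, and the midpoint zero point with a `(max - min) / 255` scale is an assumption about one common formulation.

```python
def quantize_asymmetric(values):
    """Quantize floats to int8 with a zero point at the middle of the observed range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant input
    zero_point = (hi + lo) / 2
    q = [max(-128, min(127, round((v - zero_point) / scale))) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [x * scale + zero_point for x in q]


def quantize_symmetric(values):
    """Quantize floats to int8 with the range centred on zero (no zero point)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def max_error(original, restored):
    """Largest absolute round-trip error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# All-positive sample: the symmetric scheme spends half of its codes on values
# that never occur, so its step size is larger.
sample = [2.0, 2.1, 2.5, 2.9, 3.0]
q_a, s_a, zp = quantize_asymmetric(sample)
q_s, s_s = quantize_symmetric(sample)
err_asym = max_error(sample, dequantize_asymmetric(q_a, s_a, zp))
err_sym = max_error(sample, dequantize_symmetric(q_s, s_s))
print(f"asymmetric max error: {err_asym:.5f}")
print(f"symmetric  max error: {err_sym:.5f}")
```

On this sample the asymmetric scheme has the smaller round-trip error, because its scale covers only the observed range.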

Then modify the inference configuration to enable kv cache int8 by editing workspace/triton_models/weights/config.ini:

  • Change use_context_fmha to 0 to turn off flash attention
  • Set quant_policy to 4 to enable kv cache quantization
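
If you prefer to script the two config changes instead of editing by hand, the following is a minimal sketch. The helper names `set_ini_options` and `enable_kv_int8` are hypothetical, and the `[llama]` section in the example text is an assumption; the code only rewrites `key = value` lines it finds and raises an error if a key is missing.

```python
import re
from pathlib import Path


def set_ini_options(text, updates):
    """Return `text` with every `key = value` line named in `updates` rewritten."""
    missing = set(updates)
    out = []
    for line in text.splitlines():
        m = re.match(r"^(\s*)([A-Za-z0-9_]+)(\s*=\s*)(.*)$", line)
        if m and m.group(2) in updates:
            key = m.group(2)
            # Keep the original indentation and spacing around "=".
            line = f"{m.group(1)}{key}{m.group(3)}{updates[key]}"
            missing.discard(key)
        out.append(line)
    if missing:
        raise KeyError(f"keys not found in config: {sorted(missing)}")
    return "\n".join(out) + "\n"


def enable_kv_int8(config_path):
    """Apply the two changes described above to a config.ini file on disk."""
    path = Path(config_path)
    new_text = set_ini_options(
        path.read_text(), {"use_context_fmha": 0, "quant_policy": 4}
    )
    path.write_text(new_text)


# Hypothetical excerpt; a real config.ini contains many more keys.
example = "[llama]\nuse_context_fmha = 1\nquant_policy = 0\n"
updated = set_ini_options(example, {"use_context_fmha": 0, "quant_policy": 4})
print(updated)
```

Calling `enable_kv_int8("workspace/triton_models/weights/config.ini")` would apply the same rewrite to the file, leaving all other lines untouched.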

Finally, run the test:

python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the kv cache int8 quantization formulas, accuracy, and memory test report.

IV. weight int4 Quantization

lmdeploy implements weight int4 quantization based on the AWQ algorithm. Compared with the fp16 version, it runs 3.16 times as fast, and GPU memory drops from 16G to 6.3G.
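
For intuition about where the memory saving comes from, here is a back-of-the-envelope sketch of the weight footprint alone. It rests on stated assumptions: a nominal 7 billion parameters, every weight quantized, and 4 bytes of scale and zero-point metadata per group of 128 weights. The 16G and 6.3G figures above are measured totals, which also include the kv cache, activations, and any layers left in fp16, so the estimate is expected to come out lower.

```python
GIB = 1024 ** 3


def weight_bytes_fp16(num_params):
    """Each fp16 weight takes 2 bytes."""
    return num_params * 2


def weight_bytes_int4(num_params, group_size=128, meta_bytes_per_group=4):
    """Each int4 weight takes half a byte, plus per-group metadata (assumed size)."""
    packed = num_params / 2
    meta = (num_params / group_size) * meta_bytes_per_group
    return packed + meta


num_params = 7_000_000_000  # nominal "7B"; an assumption, not an exact count
fp16_gib = weight_bytes_fp16(num_params) / GIB
int4_gib = weight_bytes_int4(num_params) / GIB
print(f"fp16 weights: {fp16_gib:.2f} GiB")
print(f"int4 weights: {int4_gib:.2f} GiB")
print(f"ratio:        {fp16_gib / int4_gib:.2f}x")
```

Under these assumptions the weights shrink from roughly 13 GiB to roughly 3.5 GiB.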

An AWQ-optimized version of the original llama2 model is available for direct download:

git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4

For your own model, you can use the auto_awq tool to optimize it. This assumes your huggingface model is saved in /models/llama-2-7b-chat:

# --w_bits:       number of bits for weight quantization
# --w_group_size: group size used for the weight quantization statistics
# --work_dir:     directory where the quantization parameters are saved
python3 -m lmdeploy.lite.apis.auto_awq \
  --model /models/llama-2-7b-chat \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./llama2-chat-7b-w4
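
To illustrate what a bit width of 4 and a group size of 128 mean, here is a minimal sketch of plain round-to-nearest group-wise quantization. This is not the AWQ algorithm, which additionally rescales weights using activation statistics before quantizing, and it is not lmdeploy's implementation; the function names are hypothetical and the random weights are synthetic.

```python
import random


def quantize_groups(weights, w_bits=4, w_group_size=128):
    """Quantize a flat list of floats, one (codes, scale, minimum) tuple per group."""
    qmax = 2 ** w_bits - 1  # 15 for 4-bit codes
    groups = []
    for start in range(0, len(weights), w_group_size):
        chunk = weights[start:start + w_group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant group
        codes = [max(0, min(qmax, round((w - lo) / scale))) for w in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groups(groups):
    """Rebuild approximate float weights from the per-group codes."""
    return [c * scale + lo for codes, scale, lo in groups for c in codes]


random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(512)]  # synthetic stand-in data
groups = quantize_groups(weights)
restored = dequantize_groups(groups)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, worst round-trip error: {worst:.6f}")
```

Each group stores its own scale and minimum, so an outlier only widens the step size for the 128 weights that share its group.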

Run the following commands to chat with the model in the terminal:

## Convert the model's layout and store it in the default path ./workspace
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name llama2 \
    --model-path ./llama2-chat-7b-w4 \
    --model-format awq \
    --group-size 128

## Inference
python3 -m lmdeploy.turbomind.chat ./workspace

Click here to view the memory and speed test results of weight int4 quantization.

Additionally, weight int4 and kv cache int8 do not conflict and can be turned on at the same time to save more memory.

@tpoisonooo
Author

@ymcui please review.

@ymcui
Owner

ymcui commented Aug 18, 2023

Thanks for your contribution. We'll schedule a PR review asap.
Note that we might make necessary modifications to README.md to satisfy our editing policies.

@GoGoJoestar
Collaborator

I tried inference with TurboMind as follows, but it did not output any response.

[screenshot]

Inference with PyTorch works:

[screenshot]

Does TurboMind have any GPU requirements? I ran it on a P40 GPU.

@tpoisonooo
Author

> I tried inference with TurboMind as follows, but it did not output any response.
>
> [screenshot]
>
> Inference with PyTorch works:
>
> [screenshot]
>
> Does TurboMind have any GPU requirements? I ran it on a P40 GPU.

The P40 does not support fp16 precision, so it does not work. We have tested on 3080/4090/A100/A10. Let me update the doc.

@tpoisonooo changed the title from "docs(README): add lmdeploy" to "WIP docs(README): add lmdeploy" on Aug 23, 2023
@tpoisonooo
Author

For the Alpaca model, lmdeploy still needs a chat template, so this PR is WIP.
I will update the PR status once that is finished.

cc @ymcui @GoGoJoestar


3 participants