Add speed benchmark examples (#1068)
* add qwen2.5 perf report

* update readme

* rebuild docs and fix format issue

* remove fuzzy in speed_benchmark.po

* fix issue

* recover function_call.po

* update

* remove unused code in speed_benchmark.po

* add example

* add readme for speed benchmark scripts

* update readme

* update readme

* update

* refine code

* fix pr issue

* fix some issue for PR

* update installation

* add --generate_length param

* update

* update requirements

* Update README_zh.md

* Update README.md

---------

Co-authored-by: Ren Xuancheng <[email protected]>
wangxingjun778 and jklj077 authored Nov 18, 2024
1 parent f45f6b4 commit 8205ba2
Showing 6 changed files with 658 additions and 0 deletions.
106 changes: 106 additions & 0 deletions examples/speed-benchmark/README.md
@@ -0,0 +1,106 @@
## Speed Benchmark

This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).

### 1. Model Collections

For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).

### 2. Environment Installation


For inference using HuggingFace transformers:

```shell
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/[email protected]
pip install git+https://github.com/Dao-AILab/[email protected]
pip install -r requirements-perf-transformers.txt
```

> [!Important]
> - For `flash-attention`, you can use the prebuilt wheels from [GitHub Releases](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8) or install from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention`: it has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install it from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be installed automatically. If not, run `pip install autoawq-kernels`.
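
As an optional sanity check, you can confirm that `torch`'s `sdpa` has the flash-attention backend available (a minimal sketch, not part of the benchmark scripts):

```python
import torch

# Both flags default to True on CUDA builds of torch; if the flash backend is
# unavailable on your GPU, sdpa silently falls back to another implementation.
print("flash SDPA enabled:", torch.backends.cuda.flash_sdp_enabled())
print("memory-efficient SDPA enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
```
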
For inference using vLLM:

```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```


### 3. Run Experiments

#### 3.1 Inference using HuggingFace Transformers

- Use HuggingFace hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Use ModelScope hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```

Parameters:

- `--model_id_or_path`: The model path or ID on the ModelScope or HuggingFace hub.
- `--context_length`: Input length in tokens; supported values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
- `--generate_length`: Output length in tokens; default is 2048.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
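
At its core, this kind of benchmark times `model.generate` and divides the number of generated tokens by the elapsed time. The following is a minimal, self-contained sketch of that measurement with HuggingFace transformers; it is not the actual `speed_benchmark_transformers.py`, and the prompt handling and dtype choices here are assumptions:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# A short dummy prompt; in the real script, `--context_length` controls the input length precisely.
inputs = tokenizer("hello", return_tensors="pt").to(model.device)

generate_length = 2048  # mirrors the default of `--generate_length`
start = time.time()
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=generate_length,
        min_new_tokens=generate_length,  # force a fixed output length for a stable measurement
        do_sample=False,
    )
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.2f} tokens/s")
```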


#### 3.2 Inference using vLLM

- Use HuggingFace hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```


- Use ModelScope hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```


Parameters:

- `--model_id_or_path`: The model ID on the ModelScope or HuggingFace hub.
- `--context_length`: Input length in tokens; supported values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
- `--generate_length`: Output length in tokens; default is 2048.
- `--max_model_len`: Maximum model length in tokens; default is 32768; supported values are 4096, 8192, 32768, 65536, and 131072.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is `outputs/vllm`.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
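
Similarly, a minimal sketch of the same measurement with vLLM's offline `LLM` API follows; it is not the actual `speed_benchmark_vllm.py`, and the model choice and sampling settings are assumptions:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_model_len=32768,         # mirrors `--max_model_len`
    gpu_memory_utilization=0.9,  # mirrors `--gpu_memory_utilization`
)
sampling = SamplingParams(
    temperature=0.0,
    max_tokens=2048,  # mirrors `--generate_length`
    ignore_eos=True,  # keep generating to a fixed length for a stable measurement
)

start = time.time()
outputs = llm.generate(["hello"], sampling)
elapsed = time.time() - start

new_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.2f} tokens/s")
```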



#### 3.3 Tips

- Run each experiment multiple times and average the results; three runs is typical (see the sketch below).
- Make sure the GPU is idle before running experiments.
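
For example, a simple way to collect three runs of the same configuration (a sketch using the flags documented above; average the reported speeds afterwards):

```shell
for i in 1 2 3; do
  python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 --gpus 0 --outputs_dir outputs/transformers
done
```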


### 4. Results

Please check the `outputs` directory, which by default includes two subdirectories, `transformers` and `vllm`, containing the experiment results for HuggingFace transformers and vLLM, respectively.
110 changes: 110 additions & 0 deletions examples/speed-benchmark/README_zh.md
@@ -0,0 +1,110 @@
## Speed Benchmark

This document describes the speed benchmark testing process for the Qwen2.5 series models (original and quantized). For the detailed report, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).

### 1. Model Collections

For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).


### 2. Environment Installation


For inference using HuggingFace transformers, set up the environment as follows:

```shell
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/[email protected]
pip install git+https://github.com/Dao-AILab/[email protected]
pip install -r requirements-perf-transformers.txt
```

> [!Important]
> - For `flash-attention`, you can install the prebuilt wheels from [GitHub Releases](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8) or build from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention` separately: it has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install it from source, because the prebuilt wheels depend on incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be installed automatically. If not, run `pip install autoawq-kernels`.

For inference using vLLM, set up the environment as follows:

```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```


### 3. Run Experiments

#### 3.1 Inference using HuggingFace Transformers

- Use HuggingFace hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers

# Specify HF_ENDPOINT (e.g. a mirror site)
HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Use ModelScope hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```

Parameters:

- `--model_id_or_path`: The model ID or local path; see the `Model Collections` section for available models.
- `--context_length`: Input length in tokens; supported values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
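
Since `--model_id_or_path` also accepts a local path, you can pre-download a checkpoint from ModelScope and point the script at the resulting directory. A minimal sketch (the exact cache location depends on your environment):

```python
from modelscope import snapshot_download

# Downloads the checkpoint (or reuses a cached copy) and returns its local directory.
model_dir = snapshot_download("Qwen/Qwen2.5-0.5B-Instruct")
print(model_dir)  # pass this path via --model_id_or_path
```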


#### 3.2 Inference using vLLM

- Use HuggingFace hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm

# Specify HF_ENDPOINT (e.g. a mirror site)
HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

- Use ModelScope hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

Parameters:

- `--model_id_or_path`: The model ID or local path; see the `Model Collections` section for available models.
- `--context_length`: Input length in tokens; supported values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--max_model_len`: Maximum model length in tokens; default is 32768.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is `outputs/vllm`.
- `--enforce_eager`: Whether to enforce eager mode; default is False.


#### 3.3 Tips

1. Run each experiment multiple times and average the results; three runs is typical.
2. Make sure the GPU is idle before running experiments, so that other tasks do not affect the results.

### 4. Results

The results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM, respectively.
10 changes: 10 additions & 0 deletions examples/speed-benchmark/requirements-perf-transformers.txt
@@ -0,0 +1,10 @@
# Note: install the following requirements separately
# pip install torch==2.3.1
# pip install git+https://github.com/AutoGPTQ/[email protected]
# pip install git+https://github.com/Dao-AILab/[email protected]

transformers==4.46.0
autoawq==0.2.6
modelscope[framework]
accelerate
optimum>=1.20.0
4 changes: 4 additions & 0 deletions examples/speed-benchmark/requirements-perf-vllm.txt
@@ -0,0 +1,4 @@
vllm==0.6.3.post1
torch==2.4.0
modelscope[framework]
accelerate