Add speed benchmark examples (#1068)
* add qwen2.5 perf report
* update readme
* rebuild docs and fix format issue
* remove fuzzy in speed_benchmark.po
* fix issue
* recover function_call.po
* update
* remove unused code in speed_benchmark.po
* add example
* add readme for speed benchmark scripts
* update readme
* update readme
* update
* refine code
* fix pr issue
* fix some issue for PR
* update installation
* add --generate_length param
* update
* update requirements
* Update README_zh.md
* Update README.md

Co-authored-by: Ren Xuancheng <[email protected]>
1 parent f45f6b4, commit 8205ba2.
Showing 6 changed files with 658 additions and 0 deletions.

@@ -0,0 +1,106 @@
## Speed Benchmark

This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).

### 1. Model Collections

For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).

### 2. Environment Installation

For inference using HuggingFace transformers:

```shell
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/[email protected]
pip install git+https://github.com/Dao-AILab/[email protected]
pip install -r requirements-perf-transformers.txt
```

> [!Important]
> - For `flash-attention`, you can use the prebuilt wheels from [GitHub Releases](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8) or install it from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention`; it has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install it from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be installed automatically. If not, run `pip install autoawq-kernels`.
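
As an optional sanity check of the transformers environment (a sketch, not part of the benchmark scripts), you can confirm that `torch` sees the GPU, that the flash SDPA backend is enabled, and that the quantization packages import:

```shell
# Optional sanity check for the transformers environment set up above
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.backends.cuda.flash_sdp_enabled())"
python -c "import auto_gptq, awq, transformers; print(transformers.__version__)"
```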

For inference using vLLM:

```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```

### 3. Run Experiments

#### 3.1 Inference using HuggingFace Transformers

- Use HuggingFace hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Use ModelScope hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```

Parameters:

`--model_id_or_path`: The model path or ID on the ModelScope or HuggingFace hub.
`--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark` report.
`--generate_length`: Output length in tokens; default is 2048.
`--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g. `0,1,2,3`, `4,5`.
`--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
`--outputs_dir`: Output directory; default is `outputs/transformers`.
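
For example, a full sweep over the supported input lengths can be scripted as below (a sketch; the flags are those documented above, and the model ID and output directory are only placeholders):

```shell
# Sketch: sweep all supported input lengths for one model on GPU 0
for len in 1 6144 14336 30720 63488 129024; do
    python speed_benchmark_transformers.py \
        --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
        --context_length ${len} \
        --gpus 0 \
        --outputs_dir outputs/transformers
done
```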

#### 3.2 Inference using vLLM

- Use HuggingFace hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

- Use ModelScope hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

Parameters:

`--model_id_or_path`: The model path or ID on the ModelScope or HuggingFace hub.
`--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark` report.
`--generate_length`: Output length in tokens; default is 2048.
`--max_model_len`: Maximum model length in tokens; default is 32768. Optional values are 4096, 8192, 32768, 65536, and 131072.
`--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g. `0,1,2,3`, `4,5`.
`--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
`--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
`--outputs_dir`: Output directory; default is `outputs/vllm`.
`--enforce_eager`: Whether to enforce eager mode; default is False.
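
As a sketch of how these flags combine for a long-context run, note that `--max_model_len` must cover the input plus the generated tokens (here 63488 + 2048 = 65536); the model ID and GPU list are placeholders:

```shell
# Sketch: long-context vLLM run; max_model_len >= context_length + generate_length
python speed_benchmark_vllm.py \
    --model_id_or_path Qwen/Qwen2.5-7B-Instruct \
    --context_length 63488 \
    --generate_length 2048 \
    --max_model_len 65536 \
    --gpus 0,1 \
    --gpu_memory_utilization 0.9 \
    --outputs_dir outputs/vllm
```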

#### 3.3 Tips

- Run each experiment multiple times (typically three) and compute the average result; see the sketch below.
- Make sure the GPU is idle before running experiments.
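
One simple way to repeat a run (a sketch, assuming `--outputs_dir` accepts any writable path) is to issue the same command several times and keep each run's output separate:

```shell
# Sketch: three repetitions of the same experiment, one output directory per run
for i in 1 2 3; do
    python speed_benchmark_transformers.py \
        --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
        --context_length 1 \
        --gpus 0 \
        --outputs_dir outputs/transformers/run_${i}
done
```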

### 4. Results

Please check the `outputs` directory, which by default includes two subdirectories, `transformers` and `vllm`, containing the experiment results for HuggingFace transformers and vLLM, respectively.
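
To see what was produced (the exact file names and formats are defined by the benchmark scripts themselves), a simple listing is enough:

```shell
# List all result files written by the two backends
find outputs/transformers outputs/vllm -type f | sort
```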

@@ -0,0 +1,110 @@
## Speed Benchmark

This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For the detailed report, please refer to the [Qwen2.5 Speed Benchmark Report](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).

### 1. Model Resources

For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).

### 2. Environment Installation

For inference using HuggingFace transformers, set up the environment as follows:

```shell
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/[email protected]
pip install git+https://github.com/Dao-AILab/[email protected]
pip install -r requirements-perf-transformers.txt
```

> [!Important]
> - For `flash-attention`, you can install the prebuilt wheels from the [GitHub releases page](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8), or install from source, which requires a compatible CUDA compiler.
> - You do not actually need to install `flash-attention` separately; it has already been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install it from source, because the prebuilt wheels depend on incompatible `torch` versions. Installing from source likewise requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be installed automatically. If it is not, run `pip install autoawq-kernels` to install it manually.
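
Before building `auto_gptq` or `flash-attention` from source, it is worth checking that a CUDA compiler is available and roughly matches the CUDA version `torch` was built against (a sketch, not part of the benchmark scripts):

```shell
# Check the local CUDA compiler and the CUDA version torch was built against
nvcc --version
python -c "import torch; print(torch.version.cuda)"
```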

For inference using vLLM, set up the environment as follows:

```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```

### 3. Run Experiments

#### 3.1 Inference using HuggingFace transformers

- Use the HuggingFace hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers

# Specify HF_ENDPOINT (e.g. a mirror)
HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Use the ModelScope hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```

Parameters:

`--model_id_or_path`: Model ID or local path; see the `Model Resources` section for available options.
`--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark Report` for details.
`--generate_length`: Number of tokens to generate; default is 2048.
`--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g. `0,1,2,3`, `4,5`.
`--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
`--outputs_dir`: Output directory; default is `outputs/transformers`.
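
Since `--model_id_or_path` also accepts a local path, one option (a sketch using the `modelscope` package from the requirements; the download location is whatever `snapshot_download` returns) is to pre-download the model and benchmark it from disk:

```shell
# Sketch: pre-download the model from ModelScope, then pass the local path
MODEL_DIR=$(python -c "from modelscope import snapshot_download; print(snapshot_download('Qwen/Qwen2.5-0.5B-Instruct'))")
python speed_benchmark_transformers.py --model_id_or_path "${MODEL_DIR}" --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```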

#### 3.2 Inference using vLLM

- Use the HuggingFace hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm

# Specify HF_ENDPOINT (e.g. a mirror)
HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

- Use the ModelScope hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

Parameters:

`--model_id_or_path`: Model ID or local path; see the `Model Resources` section for available options.
`--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark Report` for details.
`--generate_length`: Number of tokens to generate; default is 2048.
`--max_model_len`: Maximum model length in tokens; default is 32768.
`--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g. `0,1,2,3`, `4,5`.
`--use_modelscope`: Load the model from ModelScope when this flag is set; otherwise, use HuggingFace.
`--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
`--outputs_dir`: Output directory; default is `outputs/vllm`.
`--enforce_eager`: Whether to enforce eager mode; default is False.
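
As an illustration of the memory-related flags (a sketch; it assumes `--enforce_eager` is a switch-style flag, and the values are placeholders), a memory-constrained GPU can be accommodated by lowering the memory fraction and shrinking the model length:

```shell
# Sketch: reduce memory pressure by lowering gpu_memory_utilization, shrinking max_model_len,
# and forcing eager mode (assumes --enforce_eager is a boolean switch)
python speed_benchmark_vllm.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --max_model_len 4096 \
    --gpus 0 \
    --gpu_memory_utilization 0.7 \
    --enforce_eager \
    --outputs_dir outputs/vllm
```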

#### 3.3 Notes

1. Run each test multiple times and take the average; three runs is typical.
2. Make sure the GPU is idle before testing, so that other workloads do not skew the results; a quick check is sketched below.
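
A quick way to confirm that the GPUs are idle (a sketch using `nvidia-smi`, which is outside the benchmark scripts):

```shell
# Show per-GPU utilization and memory use; both should be near zero before a run
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```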

### 4. Test Results

See the files under the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the test results for HuggingFace transformers and vLLM, respectively.

examples/speed-benchmark/requirements-perf-transformers.txt (10 additions, 0 deletions)

@@ -0,0 +1,10 @@
# Note: install the following requirements separately
# pip install torch==2.3.1
# pip install git+https://github.com/AutoGPTQ/[email protected]
# pip install git+https://github.com/Dao-AILab/[email protected]

transformers==4.46.0
autoawq==0.2.6
modelscope[framework]
accelerate
optimum>=1.20.0

@@ -0,0 +1,4 @@
vllm==0.6.3.post1
torch==2.4.0
modelscope[framework]
accelerate