Merge branch 'meta-llama:main' into main
himanshushukla12 authored Oct 27, 2024
2 parents 597e44e + 26f10a6 commit 77f929f
Showing 59 changed files with 8,884 additions and 413 deletions.
39 changes: 39 additions & 0 deletions .github/scripts/spellcheck_conf/wordlist.txt
@@ -1466,3 +1466,42 @@ OCRVQA
OCRVQADataCollator
ocrvqa
langchain
GiB
Terraform
gb
TPOT
ctrl
finetunes
llmcompressor
prefill
qps
terraform
tf
tmux
tpot
ttft
uv
8xL40S
xL
EDA
DeepLearningai
NotebookLM
NotebookLlama
Parler
TTS
parler
suno
tts
Hifigan
MeloTTS
Metavoice
Parler
Parler's
Reddit
Suno
VALL
WhisperSpeech
locallama
myshell
parler
xTTS
19 changes: 6 additions & 13 deletions .github/workflows/pytest_cpu_gha_runner.yaml
@@ -1,16 +1,10 @@
name: "[GHA][CPU] llama-recipes Pytest tests on CPU GitHub hosted runner."
on:
pull_request:
branches:
- 'main'
paths:
- 'src/llama-recipes/configs/*.py'
- 'src/llama-recipes/utils/*.py'
- 'src/llama-recipes/datasets/*.py'
- 'src/llama-recipes/data/*.py'
- 'src/llama-recipes/*.py'

# triggers workflow manually for debugging purposes.
workflow_dispatch:
inputs:
runner:
@@ -23,8 +17,8 @@ on:
required: false
default: "true"

env:
PYTORCH_WHEEL_URL: https://download.pytorch.org/whl/test/cu118

jobs:
execute_workflow:
@@ -63,19 +57,18 @@ jobs:
id: install_llama_recipes_package
run: |
echo "Installing 'llama-recipes' project (re: https://github.com/facebookresearch/llama-recipes?tab=readme-ov-file#install-with-optional-dependencies)"
pip install --extra-index-url ${PYTORCH_WHEEL_URL} -e '.[tests]'
- name: "Running PyTest tests on GHA CPU Runner"
id: pytest
run: |
echo "Running PyTest tests at 'GITHUB_WORKSPACE' path: ${GITHUB_WORKSPACE}"
cd $GITHUB_WORKSPACE && python3 -m pytest --junitxml="$GITHUB_WORKSPACE/result.xml"
- name: Publish Test Summary
id: test_summary
uses: test-summary/action@v2
with:
paths: "**/*.xml"
if: always()

2 changes: 1 addition & 1 deletion docs/multi_gpu.md
@@ -4,7 +4,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:

1. [PEFT](https://huggingface.co/blog/peft) methods and in particular using the Hugging Face [PEFT](https://github.com/huggingface/peft)library.

2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning.md).

Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 8B model on multiple GPUs in one node.
For big models like 405B we will need to fine-tune in a multi-node setup even if 4bit quantization is enabled.
11 changes: 11 additions & 0 deletions recipes/3p_integrations/crusoe/README.md
@@ -0,0 +1,11 @@
Below are recipes for deploying common Llama workflows on [Crusoe's](https://crusoe.ai) high-performance, sustainable cloud. Each workflow corresponds to a subfolder with its own README and supplemental materials. Please refer to the table below for hardware requirements.

| Workflow | Model(s) | VM type | Storage |
|:----: | :----: | :----:| :----: |
| [Serving Llama3.1 in FP8 with vLLM](vllm-fp8/) | [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct), [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | l40s-48gb.8x | 256 GiB Persistent Disk |

# Requirements
First, ensure that you have a Crusoe account (you can sign up [here](https://console.crusoecloud.com/)). We will provision resources using Terraform, so make sure your environment is configured for it; refer to the Crusoe [docs](https://github.com/crusoecloud/terraform-provider-crusoe?tab=readme-ov-file#getting-started) for guidance.

# Serving Models
Some recipes in this repo require firewall rules to expose ports in order to reach the inference server. To manage firewall rules, please refer to our [networking documentation](https://docs.crusoecloud.com/networking/firewall-rules/managing-firewall-rules).
85 changes: 85 additions & 0 deletions recipes/3p_integrations/crusoe/vllm-fp8/README.md
@@ -0,0 +1,85 @@
In this article, we will show how to benchmark FP8 models on L40S GPUs using the vLLM inference engine. By the end, you should understand how to use `llm-compressor` to quantize existing higher-precision Llama3 finetunes to FP8, benchmark throughput and latency to compare performance, and serve the models using `vllm`.

# Provisioning Resources
First, navigate to this repository on your local machine. Update the variables in `locals` inside `main.tf` to match your environment (e.g. the path to your SSH key), then initialize the Terraform project with `terraform init` and provision resources with `terraform apply`. Note that this creates a VM equipped with 8xL40S and a 256 GiB persistent disk. After the VM has been created, Terraform outputs its public IP address.

## Mount Storage
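`ssh` into your VM. Then run the commands below to format the attached disk and mount it at `/scratch`.
```bash
# Format the attached disk as ext4 (this erases any data already on /dev/vdb)
mkfs.ext4 /dev/vdb
# Create the mount point and mount the disk there
mkdir -p /scratch
mount -t ext4 /dev/vdb /scratch
cd /scratch
```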

# Install Dependencies
We'll use [uv](https://github.com/astral-sh/uv) to install dependencies. First, install `curl`, `tmux`, and uv itself:
```bash
apt-get update && apt-get install -y curl tmux
# Download and run the uv installer
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add uv to PATH for the current shell
source $HOME/.cargo/env
```

Now, clone the recipes and navigate to this tutorial. Initialize the virtual environment and install dependencies:
```bash
git clone https://github.com/meta-llama/llama-recipes.git
cd llama-recipes/recipes/3p_integrations/crusoe/vllm-fp8/
uv add vllm setuptools
```

# Run Benchmarks
Before starting the vLLM server, we'll point the Hugging Face cache at the mounted disk, specify the model tag, and set tensor parallelism to 1.
```bash
export HF_HOME=/scratch/
export MODEL=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic
export TP_SIZE=1
```
Now we'll use tmux to run the server inside a detachable session.
```bash
tmux new -s server
uv run vllm serve $MODEL --enable-chunked-prefill --disable-log-requests --tensor-parallel-size $TP_SIZE
```
vLLM will download the model from Hugging Face and serve it on port 8000. Detach from the tmux session (`ctrl+b`, then `d`), then start a second session to simulate a client.
```bash
tmux new -s client
chmod +x run_benchmark.sh
./run_benchmark.sh
```
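The server may still be downloading or loading the model when the client starts, so it can help to confirm that it is answering before running the benchmark. The sketch below is not part of this recipe: `server_is_ready` is a hypothetical helper, and it assumes the server exposes the OpenAI-compatible `/v1/models` route at `http://localhost:8000`.

```python
import urllib.error
import urllib.request


def server_is_ready(base_url="http://localhost:8000", timeout_s=2.0):
    """Return True if GET <base_url>/v1/models answers with HTTP 200.

    Assumption: the server exposes an OpenAI-compatible /v1/models route.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: treat as not ready yet.
        return False
```

Calling `server_is_ready()` in a loop with a short sleep until it returns `True` is one simple way to gate the benchmark on the server being up.
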
Let's inspect the benchmark script to see what's going on.
```bash
TOTAL_SECONDS=120
QPS_RATES=("1" "3" "5" "7" "9")

for QPS in "${QPS_RATES[@]}"; do
    NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
    echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="

    uv run benchmarks/benchmark_serving.py \
        --model "$MODEL" \
        --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 --dataset-path benchmarks/sonnet.txt \
        --num-prompts "$NUM_PROMPTS" --request-rate "$QPS" --save-result
done
```
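This is a convenience wrapper that reruns vLLM's `benchmarks/benchmark_serving.py` at queries-per-second (QPS) rates of 1, 3, 5, 7, and 9 and saves the results. Because `NUM_PROMPTS` is `TOTAL_SECONDS * QPS`, each run sends about 120 seconds of traffic at its target rate (120 prompts at 1 QPS, 1080 at 9 QPS). After each run completes, a JSON file containing inference statistics appears in the same directory.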
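
To compare runs side by side, the saved JSON files can be collected into one table. The following is a minimal sketch, not part of the recipe: `summarize_results` is a hypothetical helper, and the field names `request_rate`, `mean_ttft_ms`, and `mean_tpot_ms` are assumptions about the result format, so check them against one of your own result files.

```python
import json
from pathlib import Path


def summarize_results(result_dir="."):
    """Return one (request_rate, mean TTFT ms, mean TPOT ms) row per result file."""
    rows = []
    for path in sorted(Path(result_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        # Skip JSON files that do not look like benchmark results.
        # The key names below are assumed, not taken from the recipe.
        if not isinstance(data, dict) or "mean_tpot_ms" not in data:
            continue
        rows.append((
            float(data.get("request_rate", 0.0)),
            data.get("mean_ttft_ms"),
            data["mean_tpot_ms"],
        ))
    # Order rows by request rate so they line up with the QPS sweep.
    return sorted(rows, key=lambda row: row[0])
```

Running `summarize_results()` from the directory that holds the result files returns the rows in increasing QPS order, ready to print or plot.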

# Results
We repeated the above benchmark across the fp8 and fp16 versions of both Llama3.1 8B and 70B.

![TPOT vs QPS](assets/tpot_vs_qps_chart.png "TPOT vs QPS")
In the above chart, we compare time-per-output-token (TPOT) across different QPS rates. The fp16 70B model runs across 8 GPUs, while the fp8 version uses only 4 and still stays in the same TPOT range. Both 8B models run on 1 GPU, and the fp8 version is noticeably faster.
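
A rough back-of-the-envelope calculation suggests why half as many GPUs can be enough for the fp8 70B model. The sketch below counts weights only, assuming a nominal 70e9 parameters, 2 bytes per fp16 weight, 1 byte per fp8 weight, and even sharding across the tensor-parallel group. It ignores the KV cache, activations, and quantization scales, so treat it as an estimate rather than a sizing guide.

```python
GIB = 1024 ** 3


def weight_gib(num_params, bytes_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def per_gpu_weight_gib(num_params, bytes_per_param, tp_size):
    """Weights held by each GPU when sharded evenly across tp_size GPUs."""
    return weight_gib(num_params, bytes_per_param) / tp_size


PARAMS_70B = 70e9  # nominal count (assumption); the real checkpoint is slightly larger

print(round(per_gpu_weight_gib(PARAMS_70B, 2, 8), 1))  # fp16 across 8 GPUs
print(round(per_gpu_weight_gib(PARAMS_70B, 1, 4), 1))  # fp8 across 4 GPUs
```

Both calls give the same per-GPU figure, about 16.3 GiB, because halving the bytes per weight offsets halving the number of GPUs. Under these assumptions, each 48 GB L40S keeps the same headroom for the KV cache in either configuration.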

![TTFT vs QPS](assets/ttft_vs_qps_chart.png "TTFT vs QPS")
Looking at time-to-first-token (TTFT), we observe the same trends. Even though the fp8 70B model runs across half as many GPUs, its TTFT is roughly the same as that of the fp16 version on 8 GPUs.
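
To make the two metrics concrete, here is a small sketch of how they are commonly defined for a single streamed request. The function names and timestamp arguments are hypothetical, and whether the benchmark script computes exactly these quantities is an assumption.

```python
def ttft_ms(request_sent_s, first_token_s):
    """Time-to-first-token: delay between sending the request and the first token."""
    return (first_token_s - request_sent_s) * 1000.0


def tpot_ms(first_token_s, last_token_s, num_output_tokens):
    """Time-per-output-token: average gap between tokens after the first one."""
    if num_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    decode_s = last_token_s - first_token_s
    return decode_s / (num_output_tokens - 1) * 1000.0
```

With these definitions, TTFT mostly reflects queueing plus prefill of the prompt, while TPOT reflects steady-state decoding speed, which is why the two charts are reported separately.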

# Converting Llama3 models to FP8
If you wish to convert your existing finetunes to FP8, you can do so using [llmcompressor](https://github.com/vllm-project/llm-compressor).
```bash
uv add llmcompressor
uv run convert_hf_to_fp8.py NousResearch/Hermes-3-Llama-3.1-70B
```

To use the converted model, set `$MODEL` to the absolute path of the converted version, then rerun `uv run vllm serve $MODEL --enable-chunked-prefill --disable-log-requests --tensor-parallel-size $TP_SIZE`. This brings up a vLLM server with the converted finetune, and you can rerun the previous benchmarks to verify performance.

# Cleaning up
To clean up the resources we've provisioned, run `terraform destroy` from within this repository on your local machine.