Merge pull request #3 from software-mansion/fix/export-script
Fix export script
jakmro authored Nov 8, 2024
2 parents ff66622 + 02b1ca8 commit ad8df24
Showing 5 changed files with 50 additions and 29 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ https://docs.swmansion.com/react-native-executorch

## Examples 📲

We currently host a single example demonstrating a chat app built with the latest **LLaMa 3.2 1B/3B** model. If you'd like to run it, navigate to `examples/llama` from the repository root and install the dependencies with:
We currently host a single example demonstrating a chat app built with the latest **Llama 3.2 1B/3B** model. If you'd like to run it, navigate to `examples/llama` from the repository root and install the dependencies with:

```bash
yarn
42 changes: 29 additions & 13 deletions docs/docs/guides/exporting-llama.mdx
@@ -1,23 +1,39 @@
---
title: Exporting LLaMa
title: Exporting Llama
sidebar_position: 2
---

To make the export process as simple as possible, we created a script that runs a Docker container and exports the model.

1. Get a [HuggingFace](https://huggingface.co/) account. This will allow you to download needed files. You can also use the [official LLaMa website](https://www.llama.com/llama-downloads/).
2. Pick the model that suits your needs. Before you download it, you'll need to accept a license. For best performance, we recommend using Spin-Quant or QLoRA versions of the model:
- [LLaMa 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/tree/main/original)
- [LLaMa 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main/original)
- [LLaMa 3.2 3B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [LLaMa 3.2 1B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [LLaMa 3.2 3B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8/tree/main)
- [LLaMa 3.2 1B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8/tree/main)
3. Download the `consolidated.00.pth`, `params.json` and `tokenizer.model` files. If you can't see them, make sure to check the `original` directory. Sometimes the files might have other names, for example `original_params.json`.
4. Run `mv tokenizer.model tokenizer.bin`. The library expects the tokenizers to have .bin extension.
5. Run `./build_llama_binary.sh --model-path /path/to/consolidated.00.pth --params-path /path/to/params.json script that's located in the `llama-export` directory.
6. The script will pull a Docker image from docker hub, and then run it to export the model. By default the output (llama3_2.pte file) will be saved in the `llama-export/outputs` directory. However, you can override that behavior with the `--output-path [path]` flag.
## Steps to export Llama
### 1. Create an Account:
Get a [HuggingFace](https://huggingface.co/) account. This will allow you to download needed files. You can also use the [official Llama website](https://www.llama.com/llama-downloads/).

### 2. Select a Model:
Pick the model that suits your needs. Before you download it, you'll need to accept a license. For best performance, we recommend using Spin-Quant or QLoRA versions of the model:
- [Llama 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/tree/main/original)
- [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main/original)
- [Llama 3.2 3B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [Llama 3.2 1B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [Llama 3.2 3B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8/tree/main)
- [Llama 3.2 1B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8/tree/main)

### 3. Download Files:
Download the `consolidated.00.pth`, `params.json` and `tokenizer.model` files. If you can't see them, make sure to check the `original` directory.
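If you prefer the command line, one way to fetch these files is with the `huggingface-cli` tool from the `huggingface_hub` package (this is just a sketch, not part of the export script: the repo ID below is an example, so substitute the model you picked in step 2, and remember that some repos keep the files under the `original` subdirectory):
```bash
# Sketch only: download the checkpoint files with huggingface-cli.
# The repo ID is an example; use the model you selected in step 2.
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 \
  consolidated.00.pth params.json tokenizer.model \
  --local-dir ./llama-files
```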

### 4. Rename the Tokenizer File:
Rename the `tokenizer.model` file to `tokenizer.bin` as required by the library:
```bash
mv tokenizer.model tokenizer.bin
```

### 5. Run the Export Script:
Navigate to the `llama_export` directory and run the following command:
```bash
./build_llama_binary.sh --model-path /path/to/consolidated.00.pth --params-path /path/to/params.json
```

The script will pull a Docker image from Docker Hub and then run it to export the model. By default, the output (`llama3_2.pte` file) will be saved in the `llama_export/outputs` directory. However, you can override that behavior with the `--output-path [path]` flag.
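For example, a run that writes the exported model to a custom location might look like this (the paths are placeholders):
```bash
# Example: save the exported llama3_2.pte to a custom directory
# instead of the default llama_export/outputs (all paths are placeholders).
./build_llama_binary.sh \
  --model-path /path/to/consolidated.00.pth \
  --params-path /path/to/params.json \
  --output-path /path/to/custom/outputs
```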

:::note[Note]
This Docker image was tested on macOS with an ARM chip. It might not work in other environments.
6 changes: 3 additions & 3 deletions docs/docs/guides/running-llms.md
@@ -3,11 +3,11 @@ title: Running LLMs
sidebar_position: 1
---

React Native ExecuTorch supports LLaMa 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:
React Native ExecuTorch supports Llama 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:

- For your convenience, it's best to use the models exported by us; you can get them from our Hugging Face repository. You can also use [constants](https://github.com/software-mansion/react-native-executorch/tree/main/src/modelUrls.ts) shipped with our library.
- If you want to export model by yourself,you can use a Docker image that we've prepared. To see how it works, check out [exporting LLaMa](./exporting-llama.mdx)
- Follow the official [tutorial](https://github.com/pytorch/executorch/blob/cbfdf78f8/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md) made by ExecuTorch team to build the model and tokenizer yourself
- If you want to export the model yourself, you can use a Docker image that we've prepared. To see how it works, check out [exporting Llama](./exporting-llama.mdx)
- Follow the official [tutorial](https://github.com/pytorch/executorch/blob/fe20be98c/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md) made by the ExecuTorch team to build the model and tokenizer yourself

## Initializing

2 changes: 1 addition & 1 deletion llama_export/Dockerfile
@@ -50,7 +50,7 @@ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 &
# Install ExecuTorch
RUN git clone https://github.com/pytorch/executorch.git
WORKDIR /executorch
RUN git checkout cbfdf78f8
RUN git checkout fe20be98c
RUN git submodule sync
RUN git submodule update --init

27 changes: 16 additions & 11 deletions llama_export/scripts/export_llama.sh
@@ -2,29 +2,34 @@

set -eu

# The quantized versions of LLaMa should cointain a quantization_args key in params.json
QUANTIZED=$(grep "lora_args" /model/params.json)

export_cmd="python -m examples.models.llama.export_llama \
--checkpoint /model/consolidated.00.pth \
--params /model/params.json \
-kv \
--use_sdpa_with_kv_cache \
-X \
-d bf16 \
--max_seq_length 2048 \
--metadata '{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}' \
--output_name=/outputs/llama3_2.pte"

if [ -n "$QUANTIZED" ]; then
# The quantized versions of Llama should contain a quantization_args key in params.json
if grep -q "quantization_args" /model/params.json; then
export_cmd="${export_cmd//-d bf16/-d fp32}"
export_cmd+=" \
-qat \
-lora 16 \
--preq_mode 8da4w_output_8da8w \
--preq_group_size 32 \
--max_seq_length 2048 \
--xnnpack-extended-ops \
--preq_embedding_quantize 8,0"
--preq_mode 8da4w_output_8da8w \
--preq_group_size 32 \
--xnnpack-extended-ops \
--preq_embedding_quantize 8,0"

if grep -q "lora_args" /model/params.json; then
export_cmd+=" \
-qat \
-lora 16"
else # SpinQuant
export_cmd+=" \
--use_spin_quant native"
fi
fi

if ! eval "$export_cmd"; then
