Merge pull request #3 from software-mansion/fix/export-script
Fix export script
jakmro authored Nov 8, 2024
2 parents ff66622 + 02b1ca8 commit ad8df24
Showing 5 changed files with 50 additions and 29 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ https://docs.swmansion.com/react-native-executorch

## Examples 📲

We currently host a single example demonstrating a chat app built with the latest **LLaMa 3.2 1B/3B** model. If you'd like to run it, navigate to `examples/llama` from the repository root and install the dependencies with:
We currently host a single example demonstrating a chat app built with the latest **Llama 3.2 1B/3B** model. If you'd like to run it, navigate to `examples/llama` from the repository root and install the dependencies with:

```bash
yarn
42 changes: 29 additions & 13 deletions docs/docs/guides/exporting-llama.mdx
@@ -1,23 +1,39 @@
---
title: Exporting LLaMa
title: Exporting Llama
sidebar_position: 2
---

To make the export process as simple as possible, we created a script that runs a Docker container and exports the model.

1. Get a [HuggingFace](https://huggingface.co/) account. This will allow you to download needed files. You can also use the [official LLaMa website](https://www.llama.com/llama-downloads/).
2. Pick the model that suits your needs. Before you download it, you'll need to accept a license. For best performance, we recommend using Spin-Quant or QLoRA versions of the model:
- [LLaMa 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/tree/main/original)
- [LLaMa 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main/original)
- [LLaMa 3.2 3B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [LLaMa 3.2 1B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [LLaMa 3.2 3B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8/tree/main)
- [LLaMa 3.2 1B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8/tree/main)
3. Download the `consolidated.00.pth`, `params.json` and `tokenizer.model` files. If you can't see them, make sure to check the `original` directory. Sometimes the files might have other names, for example `original_params.json`.
4. Run `mv tokenizer.model tokenizer.bin`. The library expects the tokenizers to have .bin extension.
5. Run `./build_llama_binary.sh --model-path /path/to/consolidated.00.pth --params-path /path/to/params.json script that's located in the `llama-export` directory.
6. The script will pull a Docker image from docker hub, and then run it to export the model. By default the output (llama3_2.pte file) will be saved in the `llama-export/outputs` directory. However, you can override that behavior with the `--output-path [path]` flag.
## Steps to export Llama
### 1. Create an Account:
Get a [HuggingFace](https://huggingface.co/) account. This will allow you to download needed files. You can also use the [official Llama website](https://www.llama.com/llama-downloads/).

### 2. Select a Model:
Pick the model that suits your needs. Before you download it, you'll need to accept a license. For best performance, we recommend using Spin-Quant or QLoRA versions of the model:
- [Llama 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/tree/main/original)
- [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main/original)
- [Llama 3.2 3B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [Llama 3.2 1B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/tree/main)
- [Llama 3.2 3B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8/tree/main)
- [Llama 3.2 1B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8/tree/main)

### 3. Download Files:
Download the `consolidated.00.pth`, `params.json` and `tokenizer.model` files. If you can't see them, make sure to check the `original` directory.
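If you prefer the command line, one way to fetch these files is with the `huggingface-cli` tool from the `huggingface_hub` package (this is just a sketch, not part of the export script: the repo ID below is an example, so substitute the model you picked in step 2, and remember that some repos keep the files under the `original` subdirectory):
```bash
# Sketch only: download the checkpoint files with huggingface-cli.
# The repo ID is an example; use the model you selected in step 2.
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 \
  consolidated.00.pth params.json tokenizer.model \
  --local-dir ./llama-files
```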

### 4. Rename the Tokenizer File:
Rename the `tokenizer.model` file to `tokenizer.bin` as required by the library:
```bash
mv tokenizer.model tokenizer.bin
```

### 5. Run the Export Script:
Navigate to the `llama_export` directory and run the following command:
```bash
./build_llama_binary.sh --model-path /path/to/consolidated.00.pth --params-path /path/to/params.json
```

The script will pull a Docker image from Docker Hub and then run it to export the model. By default, the output (`llama3_2.pte` file) will be saved in the `llama_export/outputs` directory. However, you can override that behavior with the `--output-path [path]` flag.
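For example, a run that writes the exported model to a custom location might look like this (the paths are placeholders):
```bash
# Example: save the exported llama3_2.pte to a custom directory
# instead of the default llama_export/outputs (all paths are placeholders).
./build_llama_binary.sh \
  --model-path /path/to/consolidated.00.pth \
  --params-path /path/to/params.json \
  --output-path /path/to/custom/outputs
```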

:::note[Note]
This Docker image was tested on macOS with an ARM chip. It might not work in other environments.
6 changes: 3 additions & 3 deletions docs/docs/guides/running-llms.md
@@ -3,11 +3,11 @@ title: Running LLMs
sidebar_position: 1
---

React Native ExecuTorch supports LLaMa 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:
React Native ExecuTorch supports Llama 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:

- For your convenience, it's best to use the models exported by us; you can get them from our Hugging Face repository. You can also use [constants](https://github.com/software-mansion/react-native-executorch/tree/main/src/modelUrls.ts) shipped with our library.
- If you want to export model by yourself,you can use a Docker image that we've prepared. To see how it works, check out [exporting LLaMa](./exporting-llama.mdx)
- Follow the official [tutorial](https://github.com/pytorch/executorch/blob/cbfdf78f8/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md) made by ExecuTorch team to build the model and tokenizer yourself
- If you want to export the model yourself, you can use a Docker image that we've prepared. To see how it works, check out [exporting Llama](./exporting-llama.mdx)
- Follow the official [tutorial](https://github.com/pytorch/executorch/blob/fe20be98c/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md) made by the ExecuTorch team to build the model and tokenizer yourself

## Initializing

2 changes: 1 addition & 1 deletion llama_export/Dockerfile
@@ -50,7 +50,7 @@ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 &
# Install ExecuTorch
RUN git clone https://github.com/pytorch/executorch.git
WORKDIR /executorch
RUN git checkout cbfdf78f8
RUN git checkout fe20be98c
RUN git submodule sync
RUN git submodule update --init

27 changes: 16 additions & 11 deletions llama_export/scripts/export_llama.sh
@@ -2,29 +2,34 @@

set -eu

# The quantized versions of LLaMa should cointain a quantization_args key in params.json
QUANTIZED=$(grep "lora_args" /model/params.json)

export_cmd="python -m examples.models.llama.export_llama \
--checkpoint /model/consolidated.00.pth \
--params /model/params.json \
-kv \
--use_sdpa_with_kv_cache \
-X \
-d bf16 \
--max_seq_length 2048 \
--metadata '{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}' \
--output_name=/outputs/llama3_2.pte"

if [ -n "$QUANTIZED" ]; then
# The quantized versions of Llama should contain a quantization_args key in params.json
if grep -q "quantization_args" /model/params.json; then
export_cmd="${export_cmd//-d bf16/-d fp32}"
export_cmd+=" \
-qat \
-lora 16 \
--preq_mode 8da4w_output_8da8w \
--preq_group_size 32 \
--max_seq_length 2048 \
--xnnpack-extended-ops \
--preq_embedding_quantize 8,0"
--preq_mode 8da4w_output_8da8w \
--preq_group_size 32 \
--xnnpack-extended-ops \
--preq_embedding_quantize 8,0"

if grep -q "lora_args" /model/params.json; then
export_cmd+=" \
-qat \
-lora 16"
else # SpinQuant
export_cmd+=" \
--use_spin_quant native"
fi
fi

if ! eval "$export_cmd"; then
