Merge pull request #43 from pjlab-sys4nlp/data_mix
PUBLISH: filename refactors and readme preparation
DaizeDong authored Dec 24, 2023
2 parents 8245759 + 96f04b8 commit 251405e
Showing 84 changed files with 219 additions and 50,200 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -173,3 +173,4 @@ smoe/utils/gpu_diag.py
/logs/
/logs-cpt/
/tensorboard/
models/
36 changes: 29 additions & 7 deletions README.md
@@ -4,7 +4,7 @@
<span style="color:red">📢 <strong><i>A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!</i></strong></span>
<div>
<a href="https://huggingface.co/llama-moe" target="_blank">🤗 Model Weights</a> | <a href="#" target="_blank">📃 Technical Report</a> | <a href="#quick-start">🚀 Quick Start</a><br />
<a href="docs/Installation.md">⚙️ Installation Guide</a> | <a href="#expert-construction">🚧 Expert Construction</a> | <a href="#continual-pretraining">🚅 Continual Pre-training</a> | <a href="#evaluation">💎 Evaluation</a>
<a href="#installation">⚙️ Installation Guide</a> | <a href="#expert-construction">🚧 Expert Construction</a> | <a href="#continual-pretraining">🚅 Continual Pre-training</a> | <a href="#evaluation">💎 Evaluation</a>
</div>
</div>

@@ -19,7 +19,7 @@ We build LLaMA-MoE with the following two steps:

<h2 id="features">🔥 Features</h2>

1. **Lightweight Models**: The total number of model parameters is only 6.7B, which is friendly for deployment and research usage.
1. **Lightweight Models**: The number of activated model parameters is only 3.0~3.5B, which is friendly for deployment and research usage.
2. **Multiple Expert Construction Methods**:
1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient ([Zhang et al., 2022](http://arxiv.org/abs/2110.01786), [Zuo et al., 2022](http://arxiv.org/abs/2204.07675))
2. Neuron-Sharing: Inner, Inter (residual)
@@ -42,6 +42,8 @@ We build LLaMA-MoE with the following two steps:
<h2 id="quick-start">🚀 QuickStart</h2>

```python
# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -60,6 +62,26 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three
```

<h2 id="installation">⚙️ Installation</h2>

1. Prepare the conda environment: `conda create -n smoe python=3.11` (If your environment name is not `smoe`, you may need to change the environment name in the launch scripts.)
2. Add the proper environment variables to `~/.bashrc` (`gcc` is set to a newer version for installing `flash-attn`), e.g.:
```bash
export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
```
3. Apply the variables: `source ~/.bashrc`
4. Install PyTorch (CUDA-11.8): `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
5. Install dependencies: `pip install -r requirements.txt`
6. Install `flash-attn`: `pip install flash-attn==2.0.1 --no-build-isolation`. You may need to follow the [flash-attn installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to avoid some errors.
7. Install the latest Git: `conda install git`
8. Clone the repo: `git clone git@github.com:pjlab-sys4nlp/llama-moe.git` (If you haven't set up an SSH key for GitHub, you may not be able to clone via SSH. Check the [docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) for details.)
9. Change current directory: `cd llama-moe`
10. Install `smoe` in [editable mode](https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-e): `pip install -e .[dev]`
11. Set up `pre-commit` hooks: `pre-commit install`
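
After installation, it is worth verifying that the CUDA build of PyTorch and `flash-attn` actually import. A quick sanity check we suggest (not an official step of the repo):

```python
# Hypothetical post-install sanity check (not part of the repo).
import torch

print(torch.__version__, torch.version.cuda)  # expect a cu118 build
assert torch.cuda.is_available(), "no CUDA device visible"

import flash_attn  # raises ImportError if the build failed

print(flash_attn.__version__)  # expect 2.0.1
```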
<h2 id="performance">📊 Model Performance</h2>
| Model | \#Activated Experts | \#Experts | \#Activated Params | Links |
@@ -83,13 +105,13 @@
<h2 id="expert-construction">🚧 Expert Construction</h2>
- Neuron-Independent
- Independent<sub>Random</sub>: `bash ./scripts/moefication/split/run_split_random.sh`
- Independent<sub>Clustering</sub>: `bash ./scripts/moefication/split/run_split_clustering.sh`
- Independent<sub>Random</sub>: `bash ./scripts/expert_construction/split/run_split_random.sh`
- Independent<sub>Clustering</sub>: `bash ./scripts/expert_construction/split/run_split_clustering.sh`
- Neuron-Sharing
- Sharing<sub>Inner</sub>: `bash ./scripts/moefication/split/run_split_gradient.sh`
- Sharing<sub>Inter</sub>: `bash ./scripts/moefication/split/run_split_gradient_residual.sh`
- Sharing<sub>Inner</sub>: `bash ./scripts/expert_construction/split/run_split_gradient.sh`
- Sharing<sub>Inter</sub>: `bash ./scripts/expert_construction/split/run_split_gradient_residual.sh`
For more information, please refer to [Expert Construction docs](docs/moefication/README.md).
For more information, please refer to [Expert Construction docs](docs/expert_construction/README.md).
<h2 id="continual-pretraining">🚅 Continual Pre-training</h2>
4 changes: 2 additions & 2 deletions docs/Installation.md
@@ -13,7 +13,7 @@
5. Install dependencies: `pip install -r requirements.txt`
6. Install `flash-attn`: `pip install flash-attn==2.0.1 --no-build-isolation`. You may need to follow the [flash-attn installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to avoid some errors.
7. Install the latest Git: `conda install git`
8. Clone the repo: `git clone git@github.com:pjlab-sys4nlp/train-moe.git` (If you haven't set up an SSH key for GitHub, you may not be able to clone via SSH. Check the [docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) for details.)
9. Change current directory: `cd train-moe`
8. Clone the repo: `git clone git@github.com:pjlab-sys4nlp/llama-moe.git` (If you haven't set up an SSH key for GitHub, you may not be able to clone via SSH. Check the [docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) for details.)
9. Change current directory: `cd llama-moe`
10. Install `smoe` in [editable mode](https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-e): `pip install -e .[dev]`
11. Set up `pre-commit` hooks: `pre-commit install`
2 changes: 1 addition & 1 deletion docs/continual_pretraining/README.md
@@ -58,7 +58,7 @@ model_type="llama"
pretrained_model=/mnt/petrelfs/share_data/quxiaoye/models/llama_7B
```

For LLaMA with MoEfication, use the following settings:
For LLaMA-MoE, use the following settings:
```bash
model_type="llama_moe"
pretrained_model=/mnt/petrelfs/share_data/quxiaoye/models/llama_7B_MoE_16Select4-l2_norm
49 changes: 22 additions & 27 deletions docs/moefication/README.md → docs/expert_construction/README.md
@@ -1,4 +1,4 @@
# MoEfication of LLaMA Model
# Expert Construction of LLaMA-MoE

This documentation provides the procedures to convert a LLaMA model to LLaMA-MoE.

@@ -20,15 +20,15 @@ The conversion from LLaMA to LLaMA-MoE consists of two steps:
To randomly split the intermediate neurons in FFNs, you can run:

```shell
bash ./scripts/moefication/split/run_split_random.sh
bash ./scripts/expert_construction/split/run_split_random.sh
```

Remember to change the following variables:

```shell
num_experts="" # number of experts in each MoE layer

model_path="" # path to the LLaMA checkpoint
model_path="" # path to the LLaMA checkpoint
save_path="" # path to save the indices sets
```
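
For intuition, the random split amounts to shuffling the FFN intermediate neuron indices and slicing them into `num_experts` equal groups. A minimal sketch (our illustration, not the repo's exact code):

```python
import numpy as np

def random_split(intermediate_size: int, num_experts: int, seed: int = 0):
    # Shuffle all intermediate neuron indices, then slice them into equal
    # groups; each group becomes the neuron index set of one expert.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(intermediate_size)
    return np.array_split(indices, num_experts)

# LLaMA-7B FFNs have 11008 intermediate neurons; 16 experts -> 688 each.
experts = random_split(11008, 16)
```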

@@ -39,15 +39,15 @@ save_path="" # path to save the indices sets
To split the intermediate neurons in FFNs by k-means clustering, you can run:

```shell
bash ./scripts/moefication/split/run_split_clustering.sh
bash ./scripts/expert_construction/split/run_split_clustering.sh
```
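
Conceptually, this treats each intermediate neuron's weight vector as a feature and groups similar neurons into the same expert. A rough sketch, assuming scikit-learn and omitting the size balancing the actual script performs:

```python
import torch
from sklearn.cluster import KMeans

def cluster_split(up_proj: torch.Tensor, num_experts: int):
    # up_proj: [intermediate_size, hidden_size]; row i holds neuron i's weights.
    feats = up_proj.detach().float().numpy()
    labels = KMeans(n_clusters=num_experts, n_init=10).fit_predict(feats)
    # Neurons sharing a cluster label form one expert (sizes may be uneven).
    return [(labels == e).nonzero()[0] for e in range(num_experts)]
```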

Remember to change the following variables:

```shell
num_experts="" # number of experts in each MoE layer

model_path="" # path to the LLaMA checkpoint
model_path="" # path to the LLaMA checkpoint
save_path="" # path to save the indices sets

metric="" # metric for clustering, choices: `l2` `cos`
@@ -65,15 +65,15 @@ We also implemented the co-activation graph based method in [MoEfication](https
You need to install [METIS](http://glaros.dtc.umn.edu/gkhome/metis/metis/download) first. Then you can run the following script to perform splitting:

```shell
bash ./scripts/moefication/split/run_split_graph.sh
bash ./scripts/expert_construction/split/run_split_graph.sh
```
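
The graph here has one node per intermediate neuron, with edge weights counting how often two neurons activate on the same token; METIS then partitions it into experts. A sketch of the graph-construction step (illustrative only; names and shapes are assumptions):

```python
import torch

def coactivation_graph(acts: torch.Tensor, threshold: float = 0.0):
    # acts: [num_tokens, intermediate_size] FFN activations collected offline.
    active = (acts > threshold).float()
    # adj[i, j] = number of tokens on which neurons i and j fire together;
    # this adjacency matrix is what METIS partitions.
    return active.T @ active
```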

Remember to change the following variables:

```shell
num_experts="" # number of experts in each MoE layer

model_path="" # path to the LLaMA checkpoint
model_path="" # path to the LLaMA checkpoint
save_path="" # path to save the indices sets

metric="" # metric to measure the sparsity, choices: `l1_norm` `l2_norm` `plain`
@@ -82,7 +82,7 @@ proj_type="" # weights to perform clustering, choices: `up_proj` `gate_proj`



#### Gradient Split
#### Gradient Split

Before performing gradient-based splitting (Eq. 8 in the technical report), you need to prepare a bunch of pretraining data and group them into different clusters by running:

@@ -93,15 +93,15 @@ python smoe/entrypoint/text_clustering.py
Then, you need to run the following script to get the importance vector $v$ for the intermediate neurons in each layer:

```shell
bash scripts/moefication/split/run_split_gradient_get_grads.sh
bash scripts/expert_construction/split/run_split_gradient_get_grads.sh
```
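
For the exact definition of $v$, see Eq. 8 in the technical report; as a rough intuition, it is a first-order (Taylor-style) estimate of each neuron's contribution to the loss, in the spirit of:

```python
import torch

def neuron_importance(acts: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    # acts, grads: [num_tokens, intermediate_size] — the FFN intermediate
    # activations and the gradients w.r.t. them.
    # |activation * gradient| approximates each neuron's effect on the loss.
    return (acts * grads).abs().sum(dim=0)
```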

Remember to change the following variables:

```shell
dataset_dir="" # path to clustered data
pretrained_model="" # path to the LLaMA checkpoint
tokenizer_path="" # path to the LLaMA tokenizer
tokenizer_path="" # path to the LLaMA tokenizer
save_path="" # path to save the indices sets

accumulate_level="" # should be set to `sample`
@@ -111,14 +111,14 @@ importance_type="" # should be set to `feature_change`



##### Neuron Independent
##### Neuron Independent

> This part is not included in our technical report.
You can also split the intermediate neurons in a neuron-independent manner by treating the expert split as a task assignment problem. To perform the split, you can run:

```shell
bash ./scripts/moefication/split/run_split_gradient.sh
bash ./scripts/expert_construction/split/run_split_gradient.sh
```
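
One simple way to realize the task assignment is a greedy balancing heuristic: repeatedly hand the most important unassigned neuron to the expert with the smallest accumulated importance. The sketch below illustrates the idea only and is not necessarily the repo's algorithm:

```python
import heapq

def assign_neurons(importance: list[float], num_experts: int):
    # Min-heap of (accumulated importance, expert id, member indices).
    heap = [(0.0, e, []) for e in range(num_experts)]
    heapq.heapify(heap)
    for idx in sorted(range(len(importance)), key=lambda i: -importance[i]):
        load, e, members = heapq.heappop(heap)
        members.append(idx)
        heapq.heappush(heap, (load + importance[idx], e, members))
    return [sorted(m) for _, _, m in sorted(heap, key=lambda t: t[1])]
```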

Remember to change the following variables:
@@ -128,7 +128,7 @@ expert_num="" # number of experts in each MoE layer
expert_size="" # intermediate neurons in each expert
share_neurons="False" ######### SET AS FLASE TO BE NEURON-INDEPENDENT #########

model_path="" # path to the LLaMA checkpoint
model_path="" # path to the LLaMA checkpoint
score_file_path="" # path to the score files generated above
save_path="" # path to save the indices sets
visualization_path="" # path to save the visualization results
@@ -144,7 +144,7 @@ proj_type="" # weights to perform clustering, choices: `up_proj` `gate_proj`
Here we use the same entry script as the **Neuron Independent** strategy above for gradient splitting.

```shell
bash ./scripts/moefication/split/run_split_gradient.sh
bash ./scripts/expert_construction/split/run_split_gradient.sh
```

Remember to change the following variables:
@@ -154,7 +154,7 @@ expert_num="" # number of experts in each MoE layer
expert_size="" # intermediate neurons in each expert
share_neurons="True" ######### SET AS TRUE TO BE INNER-SHARING #########

model_path="" # path to the LLaMA checkpoint
model_path="" # path to the LLaMA checkpoint
score_file_path="" # path to the score files generated above
save_path="" # path to save the indices sets
visualization_path="" # path to save the visualization results
@@ -170,7 +170,7 @@ proj_type="" # weights to perform clustering, choices: `up_proj` `gate_proj`
You can run the following script to perform inter-sharing split:

```shell
bash ./scripts/moefication/split/run_split_gradient_residual.sh
bash ./scripts/expert_construction/split/run_split_gradient_residual.sh
```
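
In the residual design, the residual experts are always active while the remaining experts are routed top-$k$. A per-token sketch of the resulting forward pass (a simplified assumption, not the actual model code):

```python
import torch

def moe_ffn_forward(x, residual_experts, routed_experts, router, k: int = 2):
    # x: [hidden_size] for one token; experts are callables (small FFNs).
    out = sum(expert(x) for expert in residual_experts)  # always-on experts
    scores = torch.softmax(router(x), dim=-1)            # [num_routed]
    weights, indices = scores.topk(k)                    # pick top-k experts
    for w, i in zip(weights, indices):
        out = out + w * routed_experts[int(i)](x)
    return out
```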

Remember to change the following variables:
@@ -181,7 +181,7 @@ expert_num_residual="" # number of residual experts
expert_size="" # intermediate neurons in each expert
share_neurons="" # Whether to share neurons in non-residual experts

model_path="" # path to the LLaMA checkpoint
model_path="" # path to the LLaMA checkpoint
score_file_path="" # path to the score files generated above
save_path="" # path to save the indices sets
visualization_path="" # path to save the visualization results
@@ -199,7 +199,7 @@ proj_type="" # weights to perform clustering, choices: `up_proj` `gate_proj`
Run the following script:

```shell
bash ./scripts/moefication/convert/run_convert.sh
bash ./scripts/expert_construction/convert/run_convert.sh
```
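
Conversion essentially slices the dense FFN weight matrices along the neuron index sets produced by splitting: the selected rows of `gate_proj`/`up_proj` and the matching columns of `down_proj`. A schematic sketch with assumed shapes:

```python
import torch

def build_expert(gate_proj, up_proj, down_proj, neuron_idx):
    # gate_proj, up_proj: [intermediate_size, hidden_size]
    # down_proj:          [hidden_size, intermediate_size]
    # neuron_idx: indices of the intermediate neurons owned by this expert.
    return {
        "gate_proj": gate_proj[neuron_idx, :].clone(),
        "up_proj": up_proj[neuron_idx, :].clone(),
        "down_proj": down_proj[:, neuron_idx].clone(),
    }
```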


@@ -209,7 +209,7 @@ bash ./scripts/moefication/convert/run_convert.sh
Run the following script:

```shell
bash ./scripts/moefication/convert/run_convert_gradient.sh
bash ./scripts/expert_construction/convert/run_convert_gradient.sh
```


@@ -219,7 +219,7 @@ bash ./scripts/moefication/convert/run_convert_gradient.sh
Run the following script:

```shell
bash ./scripts/moefication/convert/run_convert_gradient_residual.sh
bash ./scripts/expert_construction/convert/run_convert_gradient_residual.sh
```


@@ -229,18 +229,13 @@ bash ./scripts/moefication/convert/run_convert_gradient_residual.sh
```
--smoe
-- scripts
-- moefication
-- expert_construction
-- convert
-- get_hidden_features (deprecated)
-- prune (deprecated)
-- select (deprecated)
-- split
-- smoe
-- entrypoint
-- moefication
-- expert_construction
```





15 changes: 9 additions & 6 deletions example.py
@@ -1,11 +1,14 @@
import torch
from transformers import AutoTokenizer
# python>=3.10

from smoe.models.llama_moe import LlamaMoEForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/mnt/petrelfs/share_data/quxiaoye/runs/llama2_random_split_112gpus_16_2/outputs/cpt-llama2_random_split_112gpus_16_2_scale_factor_8-2342244/checkpoint-13600/"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = LlamaMoEForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
1 change: 0 additions & 1 deletion models/124M/encoder.json

This file was deleted.

