Merge pull request #3 from khulnasoft-lab/dev
init commit
gitworkflows authored Feb 2, 2024
2 parents 6230da1 + dac361d commit 0b502bf
Showing 26 changed files with 69 additions and 68 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -42,7 +42,7 @@ StartChat's core features include:
### Method 1: With pip

```bash
pip3 install "fschat[model_worker,webui]"
pip3 install "startchat[model_worker,webui]"
```

### Method 2: From source
@@ -120,7 +120,7 @@ python3 -m startchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
```

#### Multiple GPUs
You can use model parallelism to aggregate GPU memory from multiple GPUs on the same machine.
```
python3 -m startchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --num-gpus 2
```
@@ -270,7 +270,7 @@ See [startchat/serve/huggingface_api.py](startchat/serve/huggingface_api.py).
See [docs/langchain_integration](docs/langchain_integration.md).

## Evaluation
We use MT-bench, a set of challenging multi-turn open-ended questions to evaluate models.
To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
See instructions for running MT-bench at [startchat/llm_judge](startchat/llm_judge).
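The full instructions live in that directory; as a hedged sketch of the typical flow (script names and flags are assumed to mirror upstream FastChat's `llm_judge` tooling, and the model ID is illustrative):

```bash
# Assumed to mirror upstream FastChat's llm_judge scripts; verify against startchat/llm_judge.
cd startchat/llm_judge

# 1. Generate model answers to the MT-bench questions
python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5

# 2. Grade the answers with GPT-4 acting as the judge (requires an OpenAI key)
export OPENAI_API_KEY=<your-key>
python gen_judgment.py --model-list vicuna-7b-v1.5 --parallel 2

# 3. Display the aggregated scores
python show_result.py --model-list vicuna-7b-v1.5
```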

4 changes: 2 additions & 2 deletions docker/Dockerfile
@@ -3,5 +3,5 @@ FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04
RUN apt-get update -y && apt-get install -y python3.9 python3.9-distutils curl
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python3.9 get-pip.py
-RUN pip3 install fschat
-RUN pip3 install fschat[model_worker,webui] pydantic==1.10.13
+RUN pip3 install startchat
+RUN pip3 install startchat[model_worker,webui] pydantic==1.10.13
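For context, a typical build-and-run sequence for this image might look like the sketch below; the image tag, exposed port, and serve module are illustrative assumptions rather than part of this commit:

```bash
# Build the image from the repository root (tag is illustrative)
docker build -f docker/Dockerfile -t startchat:latest .

# Launch the web UI from the container; module name and port are assumptions
docker run --gpus all -p 7860:7860 startchat:latest \
    python3 -m startchat.serve.gradio_web_server --host 0.0.0.0 --port 7860
```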
6 changes: 3 additions & 3 deletions docs/awq.md
@@ -16,7 +16,7 @@ git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e . # install awq package

cd awq/kernels
python setup.py install # install awq CUDA kernels
```

@@ -32,12 +32,12 @@ git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq
python3 -m startchat.serve.cli \
--model-path models/vicuna-7b-v1.3-4bit-g128-awq \
--awq-wbits 4 \
--awq-groupsize 128
```

## Benchmark

* Through **4-bit weight quantization**, AWQ helps to run larger language models within the device memory restriction and prominently accelerates token generation. All benchmarks are done with group_size 128.

* Benchmark on NVIDIA RTX A6000:

1 change: 0 additions & 1 deletion docs/commands/conv_release.md
@@ -35,4 +35,3 @@ python3 conv_release_scripts/sample.py


## Prompt distribution

2 changes: 1 addition & 1 deletion docs/exllama_v2.md
@@ -43,7 +43,7 @@ python3 -m startchat.serve.model_worker \

`--exllama-cache-8bit` can be used to enable 8-bit caching with exllama and save some VRAM.
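For example, a worker launch using that flag might look like the following sketch; the model path is illustrative, and `--enable-exllama` is assumed from the surrounding ExLlamaV2 instructions:

```bash
python3 -m startchat.serve.model_worker \
    --model-path models/vicuna-7b-v1.5-exl2 \
    --enable-exllama \
    --exllama-cache-8bit
```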

## Performance

Reference: https://github.com/turboderp/exllamav2#performance

4 changes: 2 additions & 2 deletions docs/openai_api.md
@@ -105,7 +105,7 @@ curl http://localhost:8000/v1/embeddings \
}'
```

### Running multiple

If you want to run multiple models on the same machine and in the same process,
you can replace the `model_worker` step above with a multi model variant:
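The exact command is collapsed in this view; a minimal sketch, assuming the multi-model worker mirrors upstream FastChat's interface (module name and model choices are illustrative):

```bash
python3 -m startchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 --model-names vicuna-7b-v1.5 \
    --model-path lmsys/longchat-7b-v1.5-32k --model-names longchat-7b-v1.5-32k
```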
@@ -130,7 +130,7 @@ This OpenAI-compatible API server supports LangChain. See [LangChain Integration
## Adjusting Environment Variables

### Timeout
By default, a timeout error will occur if a model worker does not respond within 100 seconds. If your model/hardware is slower, you can change this timeout through an environment variable:

```bash
export STARTCHAT_WORKER_API_TIMEOUT=<larger timeout in seconds>
4 changes: 2 additions & 2 deletions docs/training.md
@@ -24,7 +24,7 @@ torchrun --nproc_per_node=4 --master_port=9778 startchat/train/train_flant5.py \
--tf32 True \
--model_max_length 2048 \
--preprocessed_path ./preprocessed_data/processed.json \
--gradient_checkpointing True
```

After training, please use our post-processing [function](https://github.com/khulnasoft-lab/StartChat/blob/55051ad0f23fef5eeecbda14a2e3e128ffcb2a98/startchat/utils.py#L166-L185) to update the saved model weight. Additional discussions can be found [here](https://github.com/khulnasoft-lab/StartChat/issues/643).
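In practice this is a one-off call against the saved checkpoint; a hedged sketch follows — the helper name `clean_flant5_ckpt` and the checkpoint path are assumptions, so check the linked `startchat/utils.py` lines for the actual function:

```bash
# Helper name and checkpoint path are assumptions; verify against startchat/utils.py#L166-L185
python3 -c "from startchat.utils import clean_flant5_ckpt; clean_flant5_ckpt('./checkpoints_flant5_3b')"
```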
@@ -85,7 +85,7 @@ deepspeed startchat/train/train_lora_t5.py \
--gradient_checkpointing True \
--q_lora True \
--deepspeed playground/deepspeed_config_s2.json

```

### Fine-tuning Vicuna-7B with Local NPUs
2 changes: 1 addition & 1 deletion docs/vicuna_weights_version.md
@@ -55,7 +55,7 @@ You can add our delta to the original LLaMA weights to obtain the Vicuna weights
2. Use the following scripts to get Vicuna weights by applying our delta. They will automatically download delta weights from our Hugging Face [account](https://huggingface.co/lmsys).

**NOTE**:
-Weights v1.1 are only compatible with ```transformers>=4.28.0``` and ``fschat >= 0.2.0``.
+Weights v1.1 are only compatible with ```transformers>=4.28.0``` and ``startchat >= 0.2.0``.
Please update your local packages accordingly. If you follow the above commands to do a fresh install, then you should get all the correct versions.
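The per-model commands follow below; as a hedged sketch of the general pattern, assuming the tool follows the `startchat` rename of upstream FastChat's `apply_delta` module (local paths are illustrative):

```bash
python3 -m startchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path /path/to/output/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1
```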

#### Vicuna-7B
12 changes: 6 additions & 6 deletions docs/xFasterTransformer.md
@@ -21,7 +21,7 @@ python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
--enable-xft to enable xfastertransformer in Startchat
--xft-max-seq-len to set the max token length the model can process. max token length include input token length.
--xft-dtype to set datatype used in xFasterTransformer for computation. xFasterTransformer can support fp32, fp16, int8, bf16 and hybrid data types like : bf16_fp16, bf16_int8. For datatype details please refer to [this link](https://github.com/intel/xFasterTransformer/wiki/Data-Type-Support-Platform)
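Taken together, these options are typically combined on a single launch; a minimal sketch with illustrative values (the "Chat with the CLI" block below shows the project's own example):

```bash
python3 -m startchat.serve.cli \
    --model-path ${OUTPUT_DIR} \
    --enable-xft \
    --xft-max-seq-len 4096 \
    --xft-dtype bf16_fp16
```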


Chat with the CLI:
```bash
@@ -45,7 +45,7 @@ or using MPI to run inference on 2 sockets for better performance
#run inference on numanode 0 and 1 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc \
python -m startchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16 : \
@@ -63,15 +63,15 @@ Start model worker:
python3 -m startchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
or with numactl on multi-socket server for better performance
```bash
#run inference on numanode 0 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
numactl -N 0 --localalloc python3 -m startchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
or using MPI to run inference on 2 sockets for better performance
```bash
@@ -84,7 +84,7 @@ OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 1 --localalloc python -m startchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```

For more details, please refer to [this link](https://github.com/intel/xFasterTransformer#how-to-run)
54 changes: 27 additions & 27 deletions fine_tune_requirements.txt
@@ -1,28 +1,28 @@
peft==0.5.0
transformers==4.37.1
transformers-stream-generator==0.0.4
deepspeed==0.9.5
accelerate==0.26.1
gunicorn==20.1.0
-flask==2.2.5
+flask==2.1.2
flask_api==3.1
langchain==0.1.4
fastapi==0.89.1
uvicorn==0.19.0
jinja2==3.1.2
huggingface_hub==0.20.3
grpcio-tools==1.48.2
bitsandbytes==0.42.0
sentencepiece==0.1.99
safetensors==0.4.2
datasets==2.16.1
texttable==1.7.0
toml==0.10.2
numpy==1.24.4
scikit-learn==1.3.0
loguru==0.7.0
protobuf==3.20.3
pydantic==1.10.7
python-dotenv==1.0.0
tritonclient[all]==2.41.1
sse-starlette==2.0.0
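These pins are intended for a clean environment set up before fine-tuning; a minimal sketch (the environment name is illustrative, and Python 3.9 is assumed to match the Dockerfile in this commit):

```bash
conda create -n startchat-finetune python=3.9 -y
conda activate startchat-finetune
pip install -r fine_tune_requirements.txt
```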
4 changes: 2 additions & 2 deletions playground/deepspeed_config_s2.json
@@ -6,10 +6,10 @@
},
"contiguous_gradients": true,
"overlap_comm": true
},
"fp16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto"
}
2 changes: 1 addition & 1 deletion playground/deepspeed_config_s3.json
@@ -29,4 +29,4 @@
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto"
}
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "fschat"
name = "startchat"
version = "0.2.35"
description = "An open platform for training, serving, and evaluating large language model based chatbots."
readme = "README.md"
2 changes: 1 addition & 1 deletion replit.nix
@@ -4,4 +4,4 @@
pkgs.docker-compose_1
pkgs.docker-compose
];
}
4 changes: 2 additions & 2 deletions scripts/build-api.sh
@@ -7,7 +7,7 @@ PROJECT_DIR="$(pwd)"
CONDA_ENV_NAME="startchat" #

MODEL_PATH="HuggingFaceH4/zephyr-7b-beta" #beta is better than the alpha version, base model w/o quantization
MODEL_PATH="lmsys/vicuna-7b-v1.5"
MODEL_PATH="lmsys/vicuna-7b-v1.5"

API_HOST="0.0.0.0"
API_PORT_NUMBER=8000
@@ -47,7 +47,7 @@ for screen in "${SCREENNAMES[@]}"; do
# also activate the conda compute environment for these
screen -DRRS "$screen" -X stuff "conda deactivate \r"
screen -DRRS "$screen" -X stuff "conda activate $CONDA_ENV_NAME \r"

done


1 change: 0 additions & 1 deletion scripts/train_vicuna_13b.sh
@@ -23,4 +23,3 @@ torchrun --nproc_per_node=8 --master_port=20001 startchat/train/train_mem.py \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True

1 change: 0 additions & 1 deletion scripts/train_vicuna_7b.sh
@@ -23,4 +23,3 @@ torchrun --nproc_per_node=4 --master_port=20001 startchat/train/train_mem.py \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True

2 changes: 1 addition & 1 deletion startchat/conversation.py
@@ -1397,7 +1397,7 @@ def get_conv_template(name: str) -> Conversation:
Conversation(
name="metharme",
system_template="<|system|>{system_message}",
system_message="""Enter RP mode. You shall reply to the user while staying
system_message="""Enter RP mode. You shall reply to the user while staying
in character. Your responses must be detailed, creative, immersive, and drive the scenario
forward.""",
roles=("<|user|>", "<|model|>"),
2 changes: 1 addition & 1 deletion startchat/llm_judge/README.md
@@ -164,7 +164,7 @@ This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8
Please cite the following paper if you find the code or datasets helpful.
```
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
5 changes: 4 additions & 1 deletion startchat/llm_judge/gen_api_answer.py
@@ -21,7 +21,10 @@
chat_completion_palm,
)
from startchat.llm_judge.gen_model_answer import reorg_answer_file
-from startchat.model.model_adapter import get_conversation_template, ANTHROPIC_MODEL_LIST
+from startchat.model.model_adapter import (
+    get_conversation_template,
+    ANTHROPIC_MODEL_LIST,
+)


def get_answer(
2 changes: 1 addition & 1 deletion startchat/model/model_adapter.py
@@ -704,7 +704,7 @@ def raise_warning_for_old_weights(self, model):
"current startchat.\nYou can try one of the following methods:\n"
"1. Upgrade your weights to the new Vicuna-v1.3: https://github.com/khulnasoft-lab/StartChat#vicuna-weights.\n"
"2. Use the old conversation template by `python3 -m startchat.serve.cli --model-path /path/to/vicuna-v0 --conv-template one_shot`\n"
"3. Downgrade fschat to fschat==0.1.10 (Not recommended).\n"
"3. Downgrade startchat to startchat==0.1.10 (Not recommended).\n"
)


2 changes: 1 addition & 1 deletion startchat/serve/gateway/nginx.conf
@@ -86,7 +86,7 @@ http {
proxy_set_header Connection "Upgrade"; # set the Connection header to Upgrade to enable WebSocket communication
}
}

# the following block routes all HTTP traffic to HTTPS via nginx
server {
listen 80;
6 changes: 3 additions & 3 deletions startchat/serve/launch_all_serve.py
@@ -1,9 +1,9 @@
"""
Usage: python launch_all_serve_by_shell.py --model-path-address "THUDM/chatglm2-6b@localhost@2021" "huggyllama/llama-7b@localhost@2022"
Workers are listed in format of `model-path`@`host`@`port`
The key mechanism behind this script is:
1, execute shell cmd to launch the controller/worker/openai-api-server;
2, check the log of controller/worker/openai-api-server to ensure that the serve is launched properly.
Note that a few of non-critical `startchat.serve` cmd options are not supported currently.
@@ -20,4 +20,3 @@ python3 compute_stats.py --in $BASE.s1.json --scale $SCALE
# Copy figures
scp "atlas:/data/lmzheng/StartChat/startchat/serve/monitor/dataset_release_scripts/lmsys_chat_1m/*.pdf" .
```

2 changes: 1 addition & 1 deletion startchat/serve/shutdown_serve.py
@@ -1,7 +1,7 @@
"""
Usage:
python shutdown_serve.py --down all
options: "all","controller","model_worker","openai_api_server", `all` means to stop all related servers
options: "all","controller","model_worker","openai_api_server", `all` means to stop all related servers
"""

import argparse
4 changes: 3 additions & 1 deletion startchat/train/llama2_flash_attn_monkey_patch.py
@@ -142,7 +142,9 @@ def replace_llama_attn_with_flash_attn():


def test():
-    from startchat.train.llama_flash_attn_monkey_patch import forward as startchat_forward
+    from startchat.train.llama_flash_attn_monkey_patch import (
+        forward as startchat_forward,
+    )
from transformers.models.llama.configuration_llama import LlamaConfig

config = LlamaConfig(
