Merge pull request #3 from khulnasoft-lab/dev
init commit
gitworkflows authored Feb 2, 2024
2 parents 6230da1 + dac361d commit 0b502bf
Showing 26 changed files with 69 additions and 68 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -42,7 +42,7 @@ StartChat's core features include:
### Method 1: With pip

```bash
pip3 install "fschat[model_worker,webui]"
pip3 install "startchat[model_worker,webui]"
```

### Method 2: From source
@@ -120,7 +120,7 @@ python3 -m startchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
```

#### Multiple GPUs
You can use model parallelism to aggregate GPU memory from multiple GPUs on the same machine.
```
python3 -m startchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --num-gpus 2
```
@@ -270,7 +270,7 @@ See [startchat/serve/huggingface_api.py](startchat/serve/huggingface_api.py).
See [docs/langchain_integration](docs/langchain_integration.md).

## Evaluation
We use MT-bench, a set of challenging multi-turn open-ended questions to evaluate models.
To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
See instructions for running MT-bench at [startchat/llm_judge](startchat/llm_judge).
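The full instructions live in that directory; as a hedged sketch of the typical flow (script names and flags are assumed to mirror upstream FastChat's `llm_judge` tooling, and the model ID is illustrative):

```bash
# Assumed to mirror upstream FastChat's llm_judge scripts; verify against startchat/llm_judge.
cd startchat/llm_judge

# 1. Generate model answers to the MT-bench questions
python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5

# 2. Grade the answers with GPT-4 acting as the judge (requires an OpenAI key)
export OPENAI_API_KEY=<your-key>
python gen_judgment.py --model-list vicuna-7b-v1.5 --parallel 2

# 3. Display the aggregated scores
python show_result.py --model-list vicuna-7b-v1.5
```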

4 changes: 2 additions & 2 deletions docker/Dockerfile
@@ -3,5 +3,5 @@ FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04
RUN apt-get update -y && apt-get install -y python3.9 python3.9-distutils curl
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python3.9 get-pip.py
-RUN pip3 install fschat
-RUN pip3 install fschat[model_worker,webui] pydantic==1.10.13
+RUN pip3 install startchat
+RUN pip3 install startchat[model_worker,webui] pydantic==1.10.13
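For context, a typical build-and-run sequence for this image might look like the sketch below; the image tag, exposed port, and serve module are illustrative assumptions rather than part of this commit:

```bash
# Build the image from the repository root (tag is illustrative)
docker build -f docker/Dockerfile -t startchat:latest .

# Launch the web UI from the container; module name and port are assumptions
docker run --gpus all -p 7860:7860 startchat:latest \
    python3 -m startchat.serve.gradio_web_server --host 0.0.0.0 --port 7860
```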
6 changes: 3 additions & 3 deletions docs/awq.md
@@ -16,7 +16,7 @@ git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e . # install awq package

cd awq/kernels
python setup.py install # install awq CUDA kernels
```

@@ -32,12 +32,12 @@ git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq
python3 -m startchat.serve.cli \
--model-path models/vicuna-7b-v1.3-4bit-g128-awq \
--awq-wbits 4 \
--awq-groupsize 128
```

## Benchmark

* Through **4-bit weight quantization**, AWQ helps to run larger language models within the device memory restriction and prominently accelerates token generation. All benchmarks are done with group_size 128.

* Benchmark on NVIDIA RTX A6000:

1 change: 0 additions & 1 deletion docs/commands/conv_release.md
@@ -35,4 +35,3 @@ python3 conv_release_scripts/sample.py


## Prompt distribution

2 changes: 1 addition & 1 deletion docs/exllama_v2.md
@@ -43,7 +43,7 @@ python3 -m startchat.serve.model_worker \

`--exllama-cache-8bit` can be used to enable 8-bit caching with exllama and save some VRAM.
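For example, a worker launch using that flag might look like the following sketch; the model path is illustrative, and `--enable-exllama` is assumed from the surrounding ExLlamaV2 instructions:

```bash
python3 -m startchat.serve.model_worker \
    --model-path models/vicuna-7b-v1.5-exl2 \
    --enable-exllama \
    --exllama-cache-8bit
```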

## Performance

Reference: https://github.com/turboderp/exllamav2#performance

4 changes: 2 additions & 2 deletions docs/openai_api.md
@@ -105,7 +105,7 @@ curl http://localhost:8000/v1/embeddings \
}'
```

### Running multiple

If you want to run multiple models on the same machine and in the same process,
you can replace the `model_worker` step above with a multi model variant:
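The exact command is collapsed in this view; a minimal sketch, assuming the multi-model worker mirrors upstream FastChat's interface (module name and model choices are illustrative):

```bash
python3 -m startchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 --model-names vicuna-7b-v1.5 \
    --model-path lmsys/longchat-7b-v1.5-32k --model-names longchat-7b-v1.5-32k
```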
@@ -130,7 +130,7 @@ This OpenAI-compatible API server supports LangChain. See [LangChain Integration
## Adjusting Environment Variables

### Timeout
By default, a timeout error will occur if a model worker does not respond within 100 seconds. If your model/hardware is slower, you can change this timeout through an environment variable:

```bash
export STARTCHAT_WORKER_API_TIMEOUT=<larger timeout in seconds>
4 changes: 2 additions & 2 deletions docs/training.md
@@ -24,7 +24,7 @@ torchrun --nproc_per_node=4 --master_port=9778 startchat/train/train_flant5.py \
--tf32 True \
--model_max_length 2048 \
--preprocessed_path ./preprocessed_data/processed.json \
--gradient_checkpointing True
```

After training, please use our post-processing [function](https://github.com/khulnasoft-lab/StartChat/blob/55051ad0f23fef5eeecbda14a2e3e128ffcb2a98/startchat/utils.py#L166-L185) to update the saved model weight. Additional discussions can be found [here](https://github.com/khulnasoft-lab/StartChat/issues/643).
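In practice this is a one-off call against the saved checkpoint; a hedged sketch follows — the helper name `clean_flant5_ckpt` and the checkpoint path are assumptions, so check the linked `startchat/utils.py` lines for the actual function:

```bash
# Helper name and checkpoint path are assumptions; verify against startchat/utils.py#L166-L185
python3 -c "from startchat.utils import clean_flant5_ckpt; clean_flant5_ckpt('./checkpoints_flant5_3b')"
```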
@@ -85,7 +85,7 @@ deepspeed startchat/train/train_lora_t5.py \
--gradient_checkpointing True \
--q_lora True \
--deepspeed playground/deepspeed_config_s2.json

```

### Fine-tuning Vicuna-7B with Local NPUs
2 changes: 1 addition & 1 deletion docs/vicuna_weights_version.md
@@ -55,7 +55,7 @@ You can add our delta to the original LLaMA weights to obtain the Vicuna weights
2. Use the following scripts to get Vicuna weights by applying our delta. They will automatically download delta weights from our Hugging Face [account](https://huggingface.co/lmsys).

**NOTE**:
-Weights v1.1 are only compatible with ```transformers>=4.28.0``` and ``fschat >= 0.2.0``.
+Weights v1.1 are only compatible with ```transformers>=4.28.0``` and ``startchat >= 0.2.0``.
Please update your local packages accordingly. If you follow the above commands to do a fresh install, then you should get all the correct versions.
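The per-model commands follow below; as a hedged sketch of the general pattern, assuming the tool follows the `startchat` rename of upstream FastChat's `apply_delta` module (local paths are illustrative):

```bash
python3 -m startchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path /path/to/output/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1
```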

#### Vicuna-7B
12 changes: 6 additions & 6 deletions docs/xFasterTransformer.md
@@ -21,7 +21,7 @@ python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
--enable-xft to enable xfastertransformer in Startchat
--xft-max-seq-len to set the max token length the model can process. max token length include input token length.
--xft-dtype to set datatype used in xFasterTransformer for computation. xFasterTransformer can support fp32, fp16, int8, bf16 and hybrid data types like : bf16_fp16, bf16_int8. For datatype details please refer to [this link](https://github.com/intel/xFasterTransformer/wiki/Data-Type-Support-Platform)
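Taken together, these options are typically combined on a single launch; a minimal sketch with illustrative values (the "Chat with the CLI" block below shows the project's own example):

```bash
python3 -m startchat.serve.cli \
    --model-path ${OUTPUT_DIR} \
    --enable-xft \
    --xft-max-seq-len 4096 \
    --xft-dtype bf16_fp16
```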


Chat with the CLI:
```bash
@@ -45,7 +45,7 @@ or using MPI to run inference on 2 sockets for better performance
#run inference on numanode 0 and 1 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc \
python -m startchat.serve.cli \
--model-path /path/to/models/chatglm2_6b_cpu/ \
--enable-xft \
--xft-dtype bf16_fp16 : \
@@ -63,15 +63,15 @@ Start model worker:
python3 -m startchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
or with numactl on multi-socket server for better performance
```bash
#run inference on numanode 0 and with data type bf16_fp16 (first token uses bfloat16, and rest tokens use float16)
numactl -N 0 --localalloc python3 -m startchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```
or using MPI to run inference on 2 sockets for better performance
```bash
@@ -84,7 +84,7 @@ OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 1 --localalloc python -m startchat.serve.model_worker \
--model-path /path/to/models \
--enable-xft \
--xft-dtype bf16_fp16
```

For more details, please refer to [this link](https://github.com/intel/xFasterTransformer#how-to-run)
54 changes: 27 additions & 27 deletions fine_tune_requirements.txt
@@ -1,28 +1,28 @@
peft==0.5.0
transformers==4.37.1
transformers-stream-generator==0.0.4
deepspeed==0.9.5
accelerate==0.26.1
gunicorn==20.1.0
-flask==2.2.5
+flask==2.1.2
flask_api==3.1
langchain==0.1.4
fastapi==0.89.1
uvicorn==0.19.0
jinja2==3.1.2
huggingface_hub==0.20.3
grpcio-tools==1.48.2
bitsandbytes==0.42.0
sentencepiece==0.1.99
safetensors==0.4.2
datasets==2.16.1
texttable==1.7.0
toml==0.10.2
numpy==1.24.4
scikit-learn==1.3.0
loguru==0.7.0
protobuf==3.20.3
pydantic==1.10.7
python-dotenv==1.0.0
tritonclient[all]==2.41.1
sse-starlette==2.0.0
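These pins are intended for a clean environment set up before fine-tuning; a minimal sketch (the environment name is illustrative, and Python 3.9 is assumed to match the Dockerfile in this commit):

```bash
conda create -n startchat-finetune python=3.9 -y
conda activate startchat-finetune
pip install -r fine_tune_requirements.txt
```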
4 changes: 2 additions & 2 deletions playground/deepspeed_config_s2.json
@@ -6,10 +6,10 @@
},
"contiguous_gradients": true,
"overlap_comm": true
},
"fp16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto"
}
2 changes: 1 addition & 1 deletion playground/deepspeed_config_s3.json
@@ -29,4 +29,4 @@
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto"
}
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "fschat"
name = "startchat"
version = "0.2.35"
description = "An open platform for training, serving, and evaluating large language model based chatbots."
readme = "README.md"
2 changes: 1 addition & 1 deletion replit.nix
@@ -4,4 +4,4 @@
pkgs.docker-compose_1
pkgs.docker-compose
];
}
4 changes: 2 additions & 2 deletions scripts/build-api.sh
@@ -7,7 +7,7 @@ PROJECT_DIR="$(pwd)"
CONDA_ENV_NAME="startchat" #

MODEL_PATH="HuggingFaceH4/zephyr-7b-beta" #beta is better than the alpha version, base model w/o quantization
MODEL_PATH="lmsys/vicuna-7b-v1.5"
MODEL_PATH="lmsys/vicuna-7b-v1.5"

API_HOST="0.0.0.0"
API_PORT_NUMBER=8000
@@ -47,7 +47,7 @@ for screen in "${SCREENNAMES[@]}"; do
# also activate the conda compute environment for these
screen -DRRS "$screen" -X stuff "conda deactivate \r"
screen -DRRS "$screen" -X stuff "conda activate $CONDA_ENV_NAME \r"

done


1 change: 0 additions & 1 deletion scripts/train_vicuna_13b.sh
@@ -23,4 +23,3 @@ torchrun --nproc_per_node=8 --master_port=20001 startchat/train/train_mem.py \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True

1 change: 0 additions & 1 deletion scripts/train_vicuna_7b.sh
@@ -23,4 +23,3 @@ torchrun --nproc_per_node=4 --master_port=20001 startchat/train/train_mem.py \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True

2 changes: 1 addition & 1 deletion startchat/conversation.py
@@ -1397,7 +1397,7 @@ def get_conv_template(name: str) -> Conversation:
Conversation(
name="metharme",
system_template="<|system|>{system_message}",
system_message="""Enter RP mode. You shall reply to the user while staying
system_message="""Enter RP mode. You shall reply to the user while staying
in character. Your responses must be detailed, creative, immersive, and drive the scenario
forward.""",
roles=("<|user|>", "<|model|>"),
2 changes: 1 addition & 1 deletion startchat/llm_judge/README.md
@@ -164,7 +164,7 @@ This Colab [notebook](https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8
Please cite the following paper if you find the code or datasets helpful.
```
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
5 changes: 4 additions & 1 deletion startchat/llm_judge/gen_api_answer.py
@@ -21,7 +21,10 @@
chat_completion_palm,
)
from startchat.llm_judge.gen_model_answer import reorg_answer_file
-from startchat.model.model_adapter import get_conversation_template, ANTHROPIC_MODEL_LIST
+from startchat.model.model_adapter import (
+    get_conversation_template,
+    ANTHROPIC_MODEL_LIST,
+)


def get_answer(
2 changes: 1 addition & 1 deletion startchat/model/model_adapter.py
@@ -704,7 +704,7 @@ def raise_warning_for_old_weights(self, model):
"current startchat.\nYou can try one of the following methods:\n"
"1. Upgrade your weights to the new Vicuna-v1.3: https://github.com/khulnasoft-lab/StartChat#vicuna-weights.\n"
"2. Use the old conversation template by `python3 -m startchat.serve.cli --model-path /path/to/vicuna-v0 --conv-template one_shot`\n"
"3. Downgrade fschat to fschat==0.1.10 (Not recommended).\n"
"3. Downgrade startchat to startchat==0.1.10 (Not recommended).\n"
)


2 changes: 1 addition & 1 deletion startchat/serve/gateway/nginx.conf
@@ -86,7 +86,7 @@ http {
proxy_set_header Connection "Upgrade"; # set the Connection header to Upgrade to enable WebSocket communication
}
}

# the following block routes all HTTP traffic to HTTPS via nginx
server {
listen 80;
6 changes: 3 additions & 3 deletions startchat/serve/launch_all_serve.py
@@ -1,9 +1,9 @@
"""
Usage: python launch_all_serve_by_shell.py --model-path-address "THUDM/chatglm2-6b@localhost@2021" "huggyllama/llama-7b@localhost@2022"
Workers are listed in format of `model-path`@`host`@`port`
The key mechanism behind this script is:
1, execute shell cmd to launch the controller/worker/openai-api-server;
2, check the log of controller/worker/openai-api-server to ensure that the serve is launched properly.
Note that a few of non-critical `startchat.serve` cmd options are not supported currently.
@@ -20,4 +20,3 @@ python3 compute_stats.py --in $BASE.s1.json --scale $SCALE
# Copy figures
scp "atlas:/data/lmzheng/StartChat/startchat/serve/monitor/dataset_release_scripts/lmsys_chat_1m/*.pdf" .
```

2 changes: 1 addition & 1 deletion startchat/serve/shutdown_serve.py
@@ -1,7 +1,7 @@
"""
Usage:
python shutdown_serve.py --down all
options: "all","controller","model_worker","openai_api_server", `all` means to stop all related servers
options: "all","controller","model_worker","openai_api_server", `all` means to stop all related servers
"""

import argparse
4 changes: 3 additions & 1 deletion startchat/train/llama2_flash_attn_monkey_patch.py
@@ -142,7 +142,9 @@ def replace_llama_attn_with_flash_attn():


def test():
-    from startchat.train.llama_flash_attn_monkey_patch import forward as startchat_forward
+    from startchat.train.llama_flash_attn_monkey_patch import (
+        forward as startchat_forward,
+    )
from transformers.models.llama.configuration_llama import LlamaConfig

config = LlamaConfig(
