## ServerlessLLM CLI Documentation

### Overview
`sllm-cli` is a command-line interface (CLI) tool designed for managing and interacting with ServerlessLLM models. This document provides an overview of the available commands and their usage.

### Getting Started

Before using the `sllm-cli` commands, you need to start the ServerlessLLM cluster. Follow the guides below to set up your cluster:

- [Installation Guide](../getting_started/installation.md)
- [Docker Quickstart Guide](../getting_started/docker_quickstart.md)
- [Quickstart Guide](../getting_started/quickstart.md)

After setting up the ServerlessLLM cluster, you can use the commands listed below to manage and interact with your models.

### sllm-cli deploy
Deploy a model using a configuration file or model name.

##### Usage
```bash
sllm-cli deploy [OPTIONS]
```

##### Options
- `--model <model_name>`
  - Model name to deploy with default configuration.

- `--config <config_path>`
  - Path to the JSON configuration file.

##### Example
```bash
sllm-cli deploy --model facebook/opt-1.3b
sllm-cli deploy --config /path/to/config.json
```

##### Example Configuration File (`config.json`)
```json
{
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "num_gpus": 1,
    "auto_scaling_config": {
        "metric": "concurrency",
        "target": 1,
        "min_instances": 0,
        "max_instances": 10
    },
    "backend_config": {
        "pretrained_model_name_or_path": "facebook/opt-1.3b",
        "device_map": "auto",
        "torch_dtype": "float16"
    }
}
```
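
For convenience, here is one way to create the example configuration above and deploy it in a single shell session. This is only a sketch; the values mirror the example config, and the file path can be adjusted as needed.

```bash
# Write the example configuration shown above to config.json, then deploy it.
cat > config.json <<'EOF'
{
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "num_gpus": 1,
    "auto_scaling_config": {
        "metric": "concurrency",
        "target": 1,
        "min_instances": 0,
        "max_instances": 10
    },
    "backend_config": {
        "pretrained_model_name_or_path": "facebook/opt-1.3b",
        "device_map": "auto",
        "torch_dtype": "float16"
    }
}
EOF
sllm-cli deploy --config config.json
```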

### sllm-cli delete
Delete deployed models by name.

##### Usage
```bash
sllm-cli delete [MODELS]
```

##### Arguments
- `MODELS`
  - Space-separated list of model names to delete.

##### Example
```bash
sllm-cli delete facebook/opt-1.3b facebook/opt-2.7b meta/llama2
```

### sllm-cli generate
Generate outputs using the deployed model.

##### Usage
```bash
sllm-cli generate [OPTIONS] <input_path>
```

##### Options
- `-t`, `--threads <num_threads>`
  - Number of parallel generation processes. Default is 1.

##### Arguments
- `input_path`
  - Path to the JSON input file.

##### Example
```bash
sllm-cli generate --threads 4 /path/to/request.json
```

##### Example Request File (`request.json`)
```json
{
    "model": "facebook/opt-1.3b",
    "messages": [
        {
            "role": "user",
            "content": "Please introduce yourself."
        }
    ],
    "temperature": 0.3,
    "max_tokens": 50
}
```
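
The request schema mirrors the OpenAI chat-completions format. If your ServerlessLLM cluster also exposes an OpenAI-compatible HTTP endpoint, an equivalent request can be sent directly with `curl`; the address and port below are assumptions, so adjust them to your deployment:

```bash
# Hypothetical endpoint address; replace with your cluster's actual gateway URL.
curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "facebook/opt-1.3b",
        "messages": [
          {"role": "user", "content": "Please introduce yourself."}
        ],
        "temperature": 0.3,
        "max_tokens": 50
      }'
```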

### sllm-cli replay
Replay requests based on a workload file and a dataset file.

##### Usage
```bash
sllm-cli replay [OPTIONS]
```

##### Options
- `--workload <workload_path>`
  - Path to the JSON workload file.

- `--dataset <dataset_path>`
  - Path to the JSON dataset file.

- `--output <output_path>`
  - Path to the output JSON file for latency results. Default is `latency_results.json`.

##### Example
```bash
sllm-cli replay --workload /path/to/workload.json --dataset /path/to/dataset.json --output /path/to/output.json
```
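
The exact schema of the latency results file is not described here; a quick way to inspect whatever the replay produced is to pretty-print it:

```bash
# Pretty-print the replay results (default file name shown; use your --output path if set).
python -m json.tool latency_results.json
```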

### sllm-cli update
Update a deployed model using a configuration file or model name.

##### Usage
```bash
sllm-cli update [OPTIONS]
```

##### Options
- `--model <model_name>`
  - Model name to update with default configuration.

- `--config <config_path>`
  - Path to the JSON configuration file.

##### Example
```bash
sllm-cli update --model facebook/opt-1.3b
sllm-cli update --config /path/to/config.json
```

### Example Workflow

1. **Deploy a Model**
    ```bash
    sllm-cli deploy --model facebook/opt-1.3b
    ```

2. **Generate Output**
    ```bash
    echo '{
      "model": "facebook/opt-1.3b",
      "messages": [
        {
          "role": "user",
          "content": "Please introduce yourself."
        }
      ],
      "temperature": 0.7,
      "max_tokens": 50
    }' > input.json
    sllm-cli generate input.json
    ```

3. **Delete a Model**
    ```bash
    sllm-cli delete facebook/opt-1.3b
    ```

---
sidebar_position: 0
---

# Installations

## Requirements

---
sidebar_position: 0
---

# Quickstart Guide

ServerlessLLM Store (`sllm-store`) is a Python library that supports fast model checkpoint loading from multi-tier storage (i.e., DRAM, SSD, HDD) into GPUs.

ServerlessLLM Store provides a model manager and two key functions:
- `save_model`: Convert a HuggingFace model into a loading-optimized format and save it to a local path.
- `load_model`: Load a model onto the given GPUs.

## Requirements
- OS: Ubuntu 20.04
- Python: 3.10
- GPU: compute capability 7.0 or higher

## Installations

### Create a virtual environment
```bash
conda create -n sllm-store python=3.10 -y
conda activate sllm-store
```

### Install with pip
```bash
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ serverless_llm_store==0.0.1.dev3
```
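
A quick sanity check that the package is importable in the active environment (a sketch; it only verifies the import, not GPU or server functionality):

```bash
python -c "import serverless_llm_store; print('serverless_llm_store imported successfully')"
```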

## Usage Examples
:::tip
We highly recommend using a fast storage device (e.g., NVMe SSD) to store the model files for the best experience.
For example, create a directory `models` on the NVMe SSD and link it to the local path.
```bash
mkdir -p /mnt/nvme/models # Replace '/mnt/nvme' with your NVMe SSD path.
ln -s /mnt/nvme/models ./models
```
:::

1. Convert a model to ServerlessLLM format and save it to a local path:
```python
from serverless_llm_store import save_model

# Load a model from HuggingFace model hub.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b', torch_dtype=torch.float16)

# Replace './models' with your local path.
save_model(model, './models/facebook/opt-1.3b')
```
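
Optionally, confirm that the converted checkpoint was written before starting the server. The exact file names depend on the `sllm-store` version, so treat this as a sanity check rather than a fixed layout:

```bash
ls -lh ./models/facebook/opt-1.3b
```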

2. Launch the checkpoint store server in a separate process:
```bash
# 'mem_pool_size' is the maximum size of the memory pool in GB. It should be larger than the model size.
sllm-store-server --storage_path $PWD/models --mem_pool_size 32
```
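
As a rough rule of thumb for sizing the pool (an estimate, not an official formula), the checkpoint size is roughly the parameter count times the bytes per parameter:

```bash
# facebook/opt-1.3b in float16: ~1.3e9 params x 2 bytes/param ≈ 2.6 GB,
# so --mem_pool_size 32 leaves ample headroom.
python3 -c "print(f'{1.3e9 * 2 / 1e9:.1f} GB')"
```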

<!-- Running the server using a container:
```bash
docker build -t checkpoint_store_server -f Dockerfile .
# Make sure the models have been downloaded using examples/save_model.py script
docker run -it --rm -v $PWD/models:/app/models checkpoint_store_server
``` -->

3. Load the model in your project and run inference:
```python
import time
import torch
from serverless_llm_store import load_model

# Warm up the GPU.
for _ in range(torch.cuda.device_count()):
    torch.randn(1).cuda()

start = time.time()
model = load_model("facebook/opt-1.3b", device_map="auto", torch_dtype=torch.float16, storage_path="./models/", fully_parallel=True)
# Please note the loading time depends on the model size and the hardware bandwidth.
print(f"Model loading time: {time.time() - start:.2f}s")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')
inputs = tokenizer('Hello, my dog is cute', return_tensors='pt').to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

4. Clean up by pressing `Ctrl+C` to stop the server process.