update: manual sync
Chivier committed Jul 10, 2024
1 parent 713eed2 commit 95c3fc8
Showing 7 changed files with 286 additions and 1 deletion.
180 changes: 180 additions & 0 deletions docs/stable/cli/SLLM-CLI Documentation.md
@@ -0,0 +1,180 @@
## ServerlessLLM CLI Documentation

### Overview
`sllm-cli` is a command-line interface (CLI) tool designed for managing and interacting with ServerlessLLM models. This document provides an overview of the available commands and their usage.

### Getting Started

Before using the `sllm-cli` commands, you need to start the ServerlessLLM cluster. Follow the guides below to set up your cluster:

- [Installation Guide](../getting_started/installation.md)
- [Docker Quickstart Guide](../getting_started/docker_quickstart.md)
- [Quickstart Guide](../getting_started/quickstart.md)

After setting up the ServerlessLLM cluster, you can use the commands listed below to manage and interact with your models.

### sllm-cli deploy
Deploy a model using a configuration file or model name.

##### Usage
```bash
sllm-cli deploy [OPTIONS]
```

##### Options
- `--model <model_name>`
- Model name to deploy with default configuration.

- `--config <config_path>`
- Path to the JSON configuration file.

##### Example
```bash
sllm-cli deploy --model facebook/opt-1.3b
sllm-cli deploy --config /path/to/config.json
```

##### Example Configuration File (`config.json`)
```json
{
"model": "facebook/opt-1.3b",
"backend": "transformers",
"num_gpus": 1,
"auto_scaling_config": {
"metric": "concurrency",
"target": 1,
"min_instances": 0,
"max_instances": 10
},
"backend_config": {
"pretrained_model_name_or_path": "facebook/opt-1.3b",
"device_map": "auto",
"torch_dtype": "float16"
}
}
```
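
Once deployed, you can also query the model directly over the cluster's OpenAI-compatible chat endpoint. This is a minimal sketch for a quick check: the address and port (`127.0.0.1:8343`) are assumptions for a default local setup and may differ in your deployment, so consult the Quickstart Guide for the actual server address.
```bash
# The server address below is an assumption for a default local setup; adjust as needed.
curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "facebook/opt-1.3b",
        "messages": [
          {"role": "user", "content": "Please introduce yourself."}
        ]
      }'
```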

### sllm-cli delete
Delete deployed models by name.

##### Usage
```bash
sllm-cli delete [MODELS]
```

##### Arguments
- `MODELS`
- Space-separated list of model names to delete.

##### Example
```bash
sllm-cli delete facebook/opt-1.3b facebook/opt-2.7b meta/llama2
```

### sllm-cli generate
Generate outputs using the deployed model.

##### Usage
```bash
sllm-cli generate [OPTIONS] <input_path>
```

##### Options
- `-t`, `--threads <num_threads>`
- Number of parallel generation processes. Default is 1.

##### Arguments
- `input_path`
- Path to the JSON input file.

##### Example
```bash
sllm-cli generate --threads 4 /path/to/request.json
```

##### Example Request File (`request.json`)
```json
{
"model": "facebook/opt-1.3b",
"messages": [
{
"role": "user",
"content": "Please introduce yourself."
}
],
"temperature": 0.3,
"max_tokens": 50
}
```

### sllm-cli replay
Replay requests based on a workload file and a dataset file.

##### Usage
```bash
sllm-cli replay [OPTIONS]
```

##### Options
- `--workload <workload_path>`
- Path to the JSON workload file.

- `--dataset <dataset_path>`
- Path to the JSON dataset file.

- `--output <output_path>`
- Path to the output JSON file for latency results. Default is `latency_results.json`.

##### Example
```bash
sllm-cli replay --workload /path/to/workload.json --dataset /path/to/dataset.json --output /path/to/output.json
```

### sllm-cli update
Update a deployed model using a configuration file or model name.

##### Usage
```bash
sllm-cli update [OPTIONS]
```

##### Options
- `--model <model_name>`
- Model name to update with default configuration.

- `--config <config_path>`
- Path to the JSON configuration file.

##### Example
```bash
sllm-cli update --model facebook/opt-1.3b
sllm-cli update --config /path/to/config.json
```

### Example Workflow

1. **Deploy a Model**
```bash
sllm-cli deploy --model facebook/opt-1.3b
```

2. **Generate Output**
```bash
echo '{
"model": "facebook/opt-1.3b",
"messages": [
{
"role": "user",
"content": "Please introduce yourself."
}
],
"temperature": 0.7,
"max_tokens": 50
}' > input.json
sllm-cli generate input.json
```

3. **Delete a Model**
```bash
sllm-cli delete facebook/opt-1.3b
```
4 changes: 4 additions & 0 deletions docs/stable/getting_started/docker_quickstart.md
@@ -1,3 +1,7 @@
---
sidebar_position: 2
---

# Docker Quickstart Guide

This guide will help you get started with the basics of using ServerlessLLM with Docker. Please make sure you have Docker installed on your system and have installed the ServerlessLLM CLI by following the [installation guide](./installation.md).
4 changes: 4 additions & 0 deletions docs/stable/getting_started/installation.md
@@ -1,3 +1,7 @@
---
sidebar_position: 0
---

# Installation

## Requirements
4 changes: 4 additions & 0 deletions docs/stable/getting_started/quickstart.md
@@ -1,3 +1,7 @@
---
sidebar_position: 1
---

# Quickstart Guide

This guide will help you get started with the basics of using ServerlessLLM. Please make sure you have installed ServerlessLLM by following the [installation guide](./installation.md).
3 changes: 2 additions & 1 deletion docs/stable/store/_category_.json
@@ -2,6 +2,7 @@
  "label": "ServerlessLLM Store",
  "position": 5,
  "link": {
    "type": "generated-index",
    "description": "`sllm-store` is an internal library of ServerlessLLM that provides high-performance model loading from local storage into GPU memory. You can also install and use this library in your own projects."
  }
}
Empty file.
92 changes: 92 additions & 0 deletions docs/stable/store/quickstart.md
@@ -0,0 +1,92 @@
---
sidebar_position: 0
---

# Quickstart Guide

ServerlessLLM Store (`sllm-store`) is a Python library that supports fast model checkpoint loading from multi-tier storage (i.e., DRAM, SSD, HDD) into GPUs.

ServerlessLLM Store provides a model manager and two key functions:
- `save_model`: Convert a HuggingFace model into a loading-optimized format and save it to a local path.
- `load_model`: Load a model into the given GPUs.


## Requirements
- OS: Ubuntu 20.04
- Python: 3.10
- GPU: compute capability 7.0 or higher

## Installation

### Create a virtual environment
```bash
conda create -n sllm-store python=3.10 -y
conda activate sllm-store
```

### Install with pip
```bash
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ serverless_llm_store==0.0.1.dev3
```
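
To verify the installation, you can try importing the two key functions. This is only a quick sanity check that the package is importable:
```bash
python -c "from serverless_llm_store import save_model, load_model; print('serverless_llm_store is installed')"
```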

## Usage Examples
:::tip
We highly recommend using a fast storage device (e.g., NVMe SSD) to store the model files for the best experience.
For example, create a directory `models` on the NVMe SSD and link it to the local path.
```bash
mkdir -p /mnt/nvme/models # Replace '/mnt/nvme' with your NVMe SSD path.
ln -s /mnt/nvme/models ./models
```
:::

1. Convert a model to ServerlessLLM format and save it to a local path:
```python
from serverless_llm_store import save_model

# Load a model from HuggingFace model hub.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b', torch_dtype=torch.float16)

# Replace './models' with your local path.
save_model(model, './models/facebook/opt-1.3b')
```
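
Optionally, you can save the tokenizer next to the converted checkpoint so that later inference does not need to reach the Hugging Face Hub. This is a sketch using the standard `transformers` API; co-locating the tokenizer with the converted model is an assumption for convenience, not a requirement of `sllm-store`:
```python
from transformers import AutoTokenizer

# Optional: store the tokenizer files alongside the converted checkpoint.
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')
tokenizer.save_pretrained('./models/facebook/opt-1.3b')
```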

2. Launch the checkpoint store server in a separate process:
```bash
# 'mem_pool_size' is the maximum size of the memory pool in GB. It should be larger than the model size.
sllm-store-server --storage_path $PWD/models --mem_pool_size 32
```
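
If you prefer not to keep a dedicated terminal open, you can run the server in the background and keep its output in a log file. This is a generic shell pattern rather than an `sllm-store` feature, and the log file name is arbitrary:
```bash
# Run the checkpoint store server in the background and capture its output.
nohup sllm-store-server --storage_path $PWD/models --mem_pool_size 32 > sllm_store_server.log 2>&1 &

# Follow the log to confirm the server has started.
tail -f sllm_store_server.log
```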

<!-- Running the server using a container:
```bash
docker build -t checkpoint_store_server -f Dockerfile .
# Make sure the models have been downloaded using examples/save_model.py script
docker run -it --rm -v $PWD/models:/app/models checkpoint_store_server
``` -->

3. Load the model in your project and run inference:
```python
import time
import torch
from serverless_llm_store import load_model

# warm up the GPU
for _ in range(torch.cuda.device_count()):
torch.randn(1).cuda()

start = time.time()
model = load_model("facebook/opt-1.3b", device_map="auto", torch_dtype=torch.float16, storage_path="./models/", fully_parallel=True)
# Please note the loading time depends on the model size and the hardware bandwidth.
print(f"Model loading time: {time.time() - start:.2f}s")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')
inputs = tokenizer('Hello, my dog is cute', return_tensors='pt').to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
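
By default, `generate` returns a fairly short continuation; the usual `transformers` generation arguments can be passed to control the output. A brief sketch with illustrative parameter values:
```python
# Generate a longer, sampled continuation (parameter values are illustrative).
outputs = model.generate(
    **inputs,
    max_new_tokens=64,  # upper bound on newly generated tokens
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.7,    # sampling temperature
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```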

4. Clean up by pressing `Ctrl+C` to stop the server process.
