llama stack distributions / templates / docker refactor #266

Merged · 32 commits · Oct 21, 2024

**Commits**
- `293d8f2` docker compose ollama (yanxi0830, Oct 18, 2024)
- `542ffbe` comment (yanxi0830, Oct 18, 2024)
- `dcac9e4` update compose file (yanxi0830, Oct 18, 2024)
- `a3f748a` readme for distributions (yanxi0830, Oct 18, 2024)
- `fd90d2a` readme (yanxi0830, Oct 18, 2024)
- `b4aca0a` move distribution folders (yanxi0830, Oct 19, 2024)
- `cbb423a` move distribution/templates to distributions/ (yanxi0830, Oct 19, 2024)
- `c830235` rename (yanxi0830, Oct 19, 2024)
- `955743b` kill distribution/templates (yanxi0830, Oct 19, 2024)
- `100b5fe` readme (yanxi0830, Oct 19, 2024)
- `f58441c` readme (yanxi0830, Oct 19, 2024)
- `302fa5c` build/developer cookbook/new api provider (yanxi0830, Oct 21, 2024)
- `d4caab3` developer cookbook (yanxi0830, Oct 21, 2024)
- `5ea36b0` readme (yanxi0830, Oct 21, 2024)
- `29c8edb` readme (yanxi0830, Oct 21, 2024)
- `2f5c410` [bugfix] fix case for agent when memory bank registered without speci… (yanxi0830, Oct 18, 2024)
- `a90ab58` Add an option to not use elastic agents for meta-reference inference … (ashwinb, Oct 18, 2024)
- `6f4537b` Allow overridding checkpoint_dir via config (ashwinb, Oct 18, 2024)
- `92aca57` Small rename (ashwinb, Oct 18, 2024)
- `5863f65` Make all methods `async def` again; add completion() for meta-referen… (ashwinb, Oct 19, 2024)
- `89759a0` Improve an important error message (Oct 20, 2024)
- `391dedd` update ollama for llama-guard3 (Oct 20, 2024)
- `74e6356` Add vLLM inference provider for OpenAI compatible vLLM server (#178) (terrytangyuan, Oct 21, 2024)
- `af52c22` Create .readthedocs.yaml (raghotham, Oct 21, 2024)
- `8ef3d3d` Update event_logger.py (#275) (nehal-a2z, Oct 21, 2024)
- `ca2e7f5` vllm (yanxi0830, Oct 21, 2024)
- `3ca822f` build templates (yanxi0830, Oct 21, 2024)
- `202667f` delete templates (yanxi0830, Oct 21, 2024)
- `acfcbca` tmp add back build to avoid merge conflicts (yanxi0830, Oct 21, 2024)
- `88187bc` vllm (yanxi0830, Oct 21, 2024)
- `8a50426` vllm (yanxi0830, Oct 21, 2024)
- `8593c94` Merge branch 'main' into ollama_docker (yanxi0830, Oct 21, 2024)
**Files changed**
**.readthedocs.yaml** — 32 additions, 0 deletions (new file)

```yaml
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"
    # You can also specify other tool versions:
    # nodejs: "19"
    # rust: "1.64"
    # golang: "1.19"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
#   - pdf
#   - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
# python:
#   install:
#     - requirements: docs/requirements.txt
```
**distributions/README.md** — 11 additions, 0 deletions (new file)
# Llama Stack Distribution

A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix and match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally while choosing a cloud provider for a large model. Either way, the higher-level APIs your app works with don't need to change at all. You can even imagine moving across the server / mobile-device boundary while always using the same uniform set of APIs for developing Generative AI applications.


## Quick Start Llama Stack Distributions Guide
| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Meta Reference | llamastack/distribution-meta-reference-gpu | [Guide](./meta-reference-gpu/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Ollama | llamastack/distribution-ollama | [Guide](./ollama/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| TGI | llamastack/distribution-tgi | [Guide](./tgi/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
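
As a concrete illustration of how a distribution assembles providers, here is a minimal sketch of a `build.yaml` in the style of the templates changed in this PR (the name and provider mix are illustrative, not an actual distribution in this repository):

```yaml
name: my-distribution                # illustrative name
distribution_spec:
  description: Example mixing a remote inference provider with local reference providers
  providers:
    inference: remote::ollama        # or meta-reference for local inference
    memory:
      - meta-reference
      - remote::pgvector
    safety: meta-reference
    agents: meta-reference
    telemetry: meta-reference
image_type: conda                    # environment type consumed by `llama stack build`
```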
**Build template:** `local-bedrock-conda-example` → `bedrock`

```diff
@@ -1,4 +1,4 @@
-name: local-bedrock-conda-example
+name: bedrock
 distribution_spec:
   description: Use Amazon Bedrock APIs.
   providers:
```
**Build template:** `local-databricks` → `databricks`

```diff
@@ -1,4 +1,4 @@
-name: local-databricks
+name: databricks
 distribution_spec:
   description: Use Databricks for running LLM inference
   providers:
@@ -7,4 +7,4 @@ distribution_spec:
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
-image_type: conda
+image_type: conda
```
**Build template:** `local-fireworks` → `fireworks`

```diff
@@ -1,4 +1,4 @@
-name: local-fireworks
+name: fireworks
 distribution_spec:
   description: Use Fireworks.ai for running LLM inference
   providers:
```
**Build template:** `local-hf-endpoint` → `hf-endpoint`

```diff
@@ -1,4 +1,4 @@
-name: local-hf-endpoint
+name: hf-endpoint
 distribution_spec:
   description: "Like local, but use Hugging Face Inference Endpoints for running LLM inference.\nSee https://hf.co/docs/api-endpoints."
   providers:
```
**Build template:** `local-hf-serverless` → `hf-serverless`

```diff
@@ -1,4 +1,4 @@
-name: local-hf-serverless
+name: hf-serverless
 distribution_spec:
   description: "Like local, but use Hugging Face Inference API (serverless) for running LLM inference.\nSee https://hf.co/docs/api-inference."
   providers:
```
**distributions/meta-reference-gpu/README.md** — 33 additions, 0 deletions (new file)
# Meta Reference Distribution

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.


| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |


### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU for running meta-reference inference locally.

> [!NOTE]
> For GPU inference, set the following environment variable to point to the local directory containing your model checkpoints, and enable GPU access when starting the docker container.
```
export LLAMA_CHECKPOINT_DIR=~/.llama
```

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.


To download and run a pre-built docker container, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
```
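
If your checkpoints live somewhere other than `~/.llama`, a sketch of the same command that mounts the directory from `LLAMA_CHECKPOINT_DIR` instead (assuming the variable was exported as above):

```bash
# Mount the exported checkpoint directory into the container's default checkpoint path
docker run -it -p 5000:5000 \
  -v $LLAMA_CHECKPOINT_DIR:/root/.llama \
  --gpus=all llamastack/llamastack-local-gpu
```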

### Alternative (Build and start the distribution locally via conda)

You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on starting up a meta-reference distribution.
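
If you go the conda route, a minimal sketch of the flow (mirroring the Ollama distribution's commands later in this PR; the relative paths are illustrative):

```bash
# Build a conda environment from the distribution's build template, then start the server
llama stack build --config ./build.yaml
llama stack run ./run.yaml
```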
**Build template:** `local-gpu` → `distribution-meta-reference-gpu`

```diff
@@ -1,9 +1,12 @@
-name: local-gpu
+name: distribution-meta-reference-gpu
 distribution_spec:
   description: Use code from `llama_stack` itself to serve all llama stack APIs
   providers:
     inference: meta-reference
-    memory: meta-reference
+    memory:
+      - meta-reference
+      - remote::chromadb
+      - remote::pgvector
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
```
**Run config:** provider ids renamed from `meta-reference` to `meta0`

```diff
@@ -13,7 +13,7 @@ apis:
 - safety
 providers:
   inference:
-  - provider_id: meta-reference
+  - provider_id: meta0
     provider_type: meta-reference
     config:
       model: Llama3.1-8B-Instruct
@@ -22,7 +22,7 @@ providers:
       max_seq_len: 4096
       max_batch_size: 1
   safety:
-  - provider_id: meta-reference
+  - provider_id: meta0
     provider_type: meta-reference
     config:
       llama_guard_shield:
@@ -33,18 +33,18 @@ providers:
       prompt_guard_shield:
         model: Prompt-Guard-86M
   memory:
-  - provider_id: meta-reference
+  - provider_id: meta0
     provider_type: meta-reference
     config: {}
   agents:
-  - provider_id: meta-reference
+  - provider_id: meta0
     provider_type: meta-reference
     config:
       persistence_store:
         namespace: null
         type: sqlite
         db_path: ~/.llama/runtime/kvstore.db
   telemetry:
-  - provider_id: meta-reference
+  - provider_id: meta0
     provider_type: meta-reference
     config: {}
```
**distributions/ollama/README.md** — 91 additions, 0 deletions (new file)
# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |---------------- |---------------- |---------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference |


### Start a Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU so the Ollama server can use it for inference.

```
$ cd llama-stack/distribution/ollama/gpu
$ ls
compose.yaml run.yaml
$ docker compose up
```

You will see output similar to the following:
```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```

To stop the server:
```
docker compose down
```

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This will start an Ollama server in CPU-only mode; see the [Ollama documentation](https://github.com/ollama/ollama) for details on serving models with CPU only.

```
$ cd llama-stack/distribution/ollama/cpu
$ ls
compose.yaml run.yaml
$ docker compose up
```

### (Alternative) ollama run + llama stack run

If you wish to spin up an Ollama server separately and connect it to Llama Stack, you may use the following commands.

#### Start the Ollama server
Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```
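
For example, to pull and serve a Llama 3.1 Instruct model (the tag below is illustrative -- pick the Ollama tag that corresponds to the model configured in your Llama Stack run.yaml):

```bash
# Pulls the model on first run, then starts serving it
ollama run llama3.1:8b-instruct-fp16
```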

#### Start Llama Stack server pointing to Ollama server

**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./ollama-run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/llamastack-local-cpu --yaml_config /root/llamastack-run-ollama.yaml
```

Make sure the inference provider in your `ollama-run.yaml` file points to the correct Ollama endpoint, e.g.
```
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434
```
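
To confirm that the endpoint is reachable before starting Llama Stack, you can query Ollama's API directly (the `/api/ps` route is the same one that appears in the server logs above):

```bash
# Returns the list of currently loaded models as JSON
curl http://127.0.0.1:11434/api/ps
```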

**Via Conda**

```
llama stack build --config ./build.yaml
llama stack run ./gpu/run.yaml
```
**Build template:** `local-ollama` → `distribution-ollama`

```diff
@@ -1,9 +1,12 @@
-name: local-ollama
+name: distribution-ollama
 distribution_spec:
-  description: Like local, but use ollama for running LLM inference
+  description: Use ollama for running LLM inference
   providers:
     inference: remote::ollama
-    memory: meta-reference
+    memory:
+      - meta-reference
+      - remote::chromadb
+      - remote::pgvector
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
```
**distributions/ollama/cpu/compose.yaml** — 30 additions, 0 deletions (new file)

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    network_mode: "host"
    volumes:
      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
    ports:
      - "11434:11434"
    command: []
  llamastack:
    depends_on:
      - ollama
    image: llamastack/llamastack-local-cpu
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to ollama run.yaml file
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    # Hack: wait for ollama server to start before starting docker
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
volumes:
  ollama:
```
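
With this file in place, the standard docker compose workflow applies; a quick sketch:

```bash
docker compose up -d                 # start the ollama and llamastack services in the background
docker compose logs -f llamastack    # follow the Llama Stack server logs
docker compose down                  # stop and remove the containers
```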
**distributions/ollama/cpu/run.yaml** — 46 additions, 0 deletions (new file)

```yaml
version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      # Ollama listens on port 11434 by default
      url: http://127.0.0.1:11434
  safety:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield:
        model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
```