# Run llama.cpp on RunPod

## Description

RunPod provides a cheap serverless GPU service that makes it simple to serve AI models. They handle queuing and auto-scaling.

You just have to provide a Docker image. This repository contains instructions to build your own image for any model.

## Steps

1. Clone this repository.

2. Choose a model and download it to the `workspace` directory. Here we use this model with 13B parameters:

   ```bash
   wget -P workspace https://huggingface.co/TheBloke/WizardLM-1.0-Uncensored-Llama2-13B-GGML/resolve/main/wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q4_K_M.bin
   ```

3. Build the Docker image. Create a `llama-runpod` repository on Docker Hub and replace `your-docker-hub-login` with your login:

   ```bash
   docker build -t llama-runpod .
   docker tag llama-runpod your-docker-hub-login/llama-runpod:latest
   docker push your-docker-hub-login/llama-runpod:latest
   ```

4. Go to RunPod's serverless console and create a template:

*(screenshot: RunPod template)*

You can pass arguments to `llama_cpp` in the `LLAMA_ARGS` environment variable. Here are mine:

```json
{"model_path": "wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q4_K_M.bin", "n_gpu_layers": -1}
```

`n_gpu_layers` is set to -1 to offload all layers to the GPU.
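Behind the scenes, the handler reads this variable and passes it to `llama_cpp`. As a rough, illustrative sketch of that wiring (the actual handle.py in this repository may differ in names and details):

```python
# Illustrative sketch only -- the real handle.py may differ.
import json
import os

import runpod
from llama_cpp import Llama

# LLAMA_ARGS holds the JSON shown above, e.g. {"model_path": "...", "n_gpu_layers": -1}
llm = Llama(**json.loads(os.environ["LLAMA_ARGS"]))

def handler(job):
    # job["input"] is the "input" object from the request payload
    prompt = job["input"]["prompt"]
    return llm(prompt)  # returns the llama_cpp completion dict

runpod.serverless.start({"handler": handler})
```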

5. Create the endpoint:

*(screenshot: RunPod endpoint)*

6. Profit!

Replace `ENDPOINT_ID` and `API_KEY` with your own values. You can get `API_KEY` on that page.

```python
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID"
headers = {"Authorization": "API_KEY"}

payload = {"input": {"prompt": "Me: Hello, what is your purpose?\nAI:"}}

# sync (blocking)
r = requests.post(url + "/runsync", json=payload, headers=headers)
r.json()

# async (non-blocking)
r = requests.post(url + "/run", json=payload, headers=headers)
id_ = r.json()["id"]

# get async result
r = requests.get(url + f"/status/{id_}", headers=headers)
r.json()
```

You can also pass `llama_cpp` keyword arguments in the payload; see the llama_cpp docs for other arguments.
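For example, assuming the handler forwards extra fields from `input` to the model call (whether and how it does so depends on the handler implementation), a payload with some standard `llama_cpp` completion arguments could look like this:

```python
# Assumes the handler forwards extra "input" fields as llama_cpp kwargs.
# Reuses `url` and `headers` from the snippet above.
payload = {
    "input": {
        "prompt": "Me: Hello, what is your purpose?\nAI:",
        "max_tokens": 128,    # maximum number of tokens to generate
        "temperature": 0.7,   # sampling temperature
        "stop": ["Me:"],      # stop generation at these sequences
    }
}

r = requests.post(url + "/runsync", json=payload, headers=headers)
print(r.json())
```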

## Additional details and tips

- clean Docker after a build or if you get into trouble: `docker system prune -a`
- debug your Docker image with `docker run -it llama-runpod`
- we froze `llama-cpp-python==0.1.78` in the Dockerfile because the model format changed from ggmlv3 to gguf in version 0.1.79 and the conversion script in llama.cpp is not fully working
- you can test handle.py locally with `python handle.py` (a related standalone sanity check is sketched below)
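Separately from `python handle.py`, if you just want to check that `llama_cpp` can load the downloaded model with your `LLAMA_ARGS`, a small standalone script works. This is a sketch under assumptions: the local model path and the prompt are illustrative, and the path inside the Docker image may differ from the one in `workspace`.

```python
# Standalone sanity check: load the model with the same LLAMA_ARGS and run one prompt.
import json
import os

from llama_cpp import Llama

# Illustrative local path; adjust to where you downloaded the model.
os.environ.setdefault(
    "LLAMA_ARGS",
    '{"model_path": "workspace/wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q4_K_M.bin", "n_gpu_layers": -1}',
)

llm = Llama(**json.loads(os.environ["LLAMA_ARGS"]))
result = llm("Me: Hello, what is your purpose?\nAI:", max_tokens=64)
print(result["choices"][0]["text"])
```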