Run llama.cpp on RunPod
RunPod provides a cheap serverless GPU service that makes it simple to serve AI models. They handle queuing and auto-scaling.
You just have to provide a Docker image. This repository contains instructions to build your own image for any model.
- Clone this repository
- Choose a model and download it to the `workspace` directory. Here we use this model with 13B parameters:

```
wget -P workspace https://huggingface.co/TheBloke/WizardLM-1.0-Uncensored-Llama2-13B-GGML/resolve/main/wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q4_K_M.bin
```
- Build the Docker image. Create a `llama-runpod` repository on Docker Hub and replace `your-docker-hub-login` with your login.

```
docker build -t llama-runpod .
docker tag llama-runpod your-docker-hub-login/llama-runpod:latest
docker push your-docker-hub-login/llama-runpod:latest
```
- Go to RunPod's serverless console and create a template:
You can pass the arguments for `llama_cpp` in the `LLAMA_ARGS` environment variable. Here are mine:

```json
{"model_path": "wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q4_K_M.bin", "n_gpu_layers": -1}
```

`n_gpu_layers` is set to -1 to offload all layers to the GPU.
- Create the endpoint:
- Profit!
Replace `ENDPOINT_ID` and `API_KEY` with your own values. You can get an `API_KEY` from your RunPod account settings.
```python
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID"
headers = {"Authorization": "API_KEY"}
payload = {"input": {"prompt": "Me: Hello, what is your purpose?\nAI:"}}

# sync (blocking)
r = requests.post(url + "/runsync", json=payload, headers=headers)
print(r.json())

# async (non-blocking)
r = requests.post(url + "/run", json=payload, headers=headers)
id_ = r.json()["id"]

# get async result
r = requests.get(url + f"/status/{id_}", headers=headers)
print(r.json())
```
You can pass keyword arguments for LLaMA in the payload. See the llama_cpp docs for the other arguments.
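For instance, a payload along these lines adds generation options to the prompt. This is a sketch that reuses `url` and `headers` from the snippet above; `max_tokens`, `temperature`, and `stop` are standard llama_cpp generation arguments, and whether extra keys reach the model depends on how `handle.py` forwards the input:

```python
# sketch: payload with extra llama_cpp generation arguments (assuming handle.py
# forwards everything under "input" to the Llama call)
payload = {
    "input": {
        "prompt": "Me: Hello, what is your purpose?\nAI:",
        "max_tokens": 128,      # cap the length of the completion
        "temperature": 0.7,     # sampling temperature
        "stop": ["Me:"],        # stop when the next user turn starts
    }
}
r = requests.post(url + "/runsync", json=payload, headers=headers)
print(r.json())
```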
- clean Docker after a build or if you get into trouble: `docker system prune -a`
- debug your Docker image with `docker run -it llama-runpod`
- we froze `llama-cpp-python==0.1.78` in `Dockerfile` because the model format changed from `ggmlv3` to `gguf` in version 0.1.79, but the conversion script in llama.cpp is not fully working
- you can test `handle.py` locally with `python handle.py` (a minimal sketch of such a handler is shown below)
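For reference, here is a minimal sketch of what a handler like `handle.py` can look like. It is an illustration of the moving parts, not the actual file from this repository: it reads `LLAMA_ARGS` from the environment, builds the `Llama` object once at startup, and hands requests to the `runpod` serverless worker. The `handler` name, the key-forwarding logic, and the `__main__` guard are assumptions for the sake of the example.

```python
# minimal sketch of a RunPod handler for llama-cpp-python -- an illustration,
# not the actual handle.py from this repository
import json
import os

import runpod
from llama_cpp import Llama

# LLAMA_ARGS comes from the RunPod template, e.g.
# {"model_path": "wizardlm-1.0-uncensored-llama2-13b.ggmlv3.q4_K_M.bin", "n_gpu_layers": -1}
llama_args = json.loads(os.environ.get("LLAMA_ARGS", "{}"))

# load the model once at container start, not on every request
llm = Llama(**llama_args)


def handler(event):
    # treat everything under "input" except the prompt as llama_cpp kwargs
    params = dict(event["input"])
    prompt = params.pop("prompt")
    return llm(prompt, **params)


if __name__ == "__main__":
    # `python handle.py` starts the RunPod serverless worker
    runpod.serverless.start({"handler": handler})
```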