- ⚠️ Important Warning
- 📖 Introduction
- 🚀 One-Click Launch in Colab
- 📊 Example Usage
- 🛠️ Prerequisites
- 📋 Step-by-Step Explanation
- 🔧 Testing the API
- ❓ FAQ
  - Q1: What is the purpose of mounting Google Drive?
  - Q2: Do I need a powerful GPU?
  - Q3: I encountered the error "Failed to infer device type." What does it mean?
  - Q4: How do I troubleshoot API connection issues?
  - Q5: Can I modify the model?
  - Q6: What is the API Key and model name?
  - Q7: Why doesn’t the public API URL work with the Sakura workspace?
  - Q8: Is it safe to use Colab for this project?
  - Q9: Is there a one-click setup?
- 🙌 Acknowledgments
🚨 Kaggle Ban Notice: Kaggle has officially banned all SakuraLLM models. Using them on Kaggle will result in a permanent account ban.
👉 Alternative Options: Use GPU rental services or community compute-sharing platforms.
🔗 For more details, see the Issue Report.
This repository offers an easy-to-use Google Colab notebook to deploy the Sakura-14B-Qwen2.5-v1.0-GGUF model:
- Backend: Uses vLLM for OpenAI-style API compatibility.
- Applications: Translation tools, GPT-based custom bots, and other AI utilities.
- Visualization: See example output below!
✔️ Easy one-click setup in Colab.
✔️ Supports OpenAI API for effortless integration.
✔️ Multiple API forwarding options (ngrok or Cloudflare Tunnel).
✔️ Beginner-friendly guidance, with rich examples and troubleshooting steps.
Below is a visualization of how the API can be used. Enter any text in the prompt for dynamic results!
The vLLM backend supports serving multiple translators at the same time (concurrent requests), as illustrated in the sketch below.
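For instance, several translation requests can be in flight at once (a minimal sketch using Python's standard concurrent.futures; the placeholder URL, API key, and model name match the values set up later in this guide):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

API_ENDPOINT = "<YOUR_API_URL>"  # Public URL from the API-forwarding step
HEADERS = {"Authorization": "Bearer token-abc123"}

def translate(text):
    # Each worker sends an independent completion request; vLLM batches them concurrently
    data = {
        "model": "Qwen2.5-14B-Instruct",
        "prompt": f"Translate this sentence to Japanese: '{text}'",
        "max_tokens": 50,
    }
    r = requests.post(f"{API_ENDPOINT}/v1/completions", json=data, headers=HEADERS)
    return r.json()["choices"][0]["text"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(translate, ["Good morning!", "Good night!", "Thank you!"]):
        print(result.strip())
```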
To get started, you’ll need:
- Google Account: Access Colab.
- Python Libraries (installed automatically in the notebook): transformers, tokenizers, vllm, huggingface-hub, flask, pyngrok, triton, torch.
- ngrok Token (highly recommended): Sign up at ngrok, log in, and copy your token for API forwarding.
Mounting Google Drive ensures you can save models persistently across sessions.
Paste this code into a Colab cell:
from google.colab import drive
drive.mount('/content/gdrive')
ROOT_PATH = "/content/gdrive/MyDrive"
If you choose not to mount Google Drive, use the ephemeral /content directory instead (files there are erased when the runtime disconnects):
ROOT_PATH = "/content"
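As an optional sanity check (a minimal sketch; the models subdirectory name is an assumption that matches the download step later in this guide), you can verify the chosen path and pre-create the model folder:

```python
import os

# Verify the chosen root path exists (Drive mount or ephemeral /content)
assert os.path.isdir(ROOT_PATH), f"{ROOT_PATH} not found - did the Drive mount succeed?"

# Pre-create the folder that the model download step will use
models_dir = os.path.join(ROOT_PATH, "models")
os.makedirs(models_dir, exist_ok=True)
print("Models will be stored in:", models_dir)
```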
To install all necessary libraries, run the following command in your Colab notebook:
!pip install --upgrade pip
!pip install transformers tokenizers vllm huggingface-hub flask pyngrok triton torch
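If you want to confirm the installation before continuing, a quick optional check (not part of the original notebook) is to import the key packages and print their versions:

```python
# Confirm the key packages import cleanly and a GPU is visible
import torch
import transformers
import vllm

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```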
Sometimes, during the installation process, you may encounter errors or warnings similar to the one shown below:
What to do:
This is a normal occurrence, especially in Colab environments. Simply follow the instructions provided by the error message. In most cases, you can resolve the issue by clicking:
- Runtime (in the top Colab menu).
- Select "Run all" to restart the setup process.
If it prompts you to re-enable certain permissions or runtime settings, accept those changes.
For further reassurance, here's another example of such an error you might see:
By following these simple steps, you'll ensure that all dependencies are properly installed and your environment is correctly configured. If you continue to face issues, consider resetting the Colab runtime and re-running the commands.
This step helps you configure API forwarding so that your application can be accessed over the internet, even if you don't have a static domain. We'll guide you through two options:
- Using Ngrok - For static domain forwarding with a custom domain.
- Using Cloudflare Tunnel - For temporary API access.
Ngrok simplifies API forwarding by providing a secure and reliable tunnel. If you need a static domain for consistent access, follow these steps:
1. Get Your Ngrok Authentication Token
   - Visit Ngrok's Dashboard.
   - Sign up or log in to your account.
   - Copy your authentication token from the page.

2. Obtain a Free Static Domain
   - Navigate to the Domains section on the Ngrok Dashboard.
   - Click "New Domain" to create a free static domain for your API.
Add the following Python code to your project to set up Ngrok with your static domain:
# Set the ngrok authentication token (get it from https://dashboard.ngrok.com/get-started/your-authtoken)
ngrokToken = ""  # Add your token here

if ngrokToken:
    from pyngrok import conf, ngrok

    # Configure the ngrok authentication token
    conf.get_default().auth_token = ngrokToken
    conf.get_default().monitor_thread = False

    # Start an ngrok tunnel bound to your free static domain
    try:
        # Set the ngrok free static domain (from https://dashboard.ngrok.com/domains)
        ssh_tunnel = ngrok.connect(8001, bind_tls=True, hostname="")
        public_url = ssh_tunnel.public_url
        print('Custom Domain Address: ' + public_url)
    except Exception as e:
        print(f"Error starting ngrok tunnel: {e}")
- The script will display your public API URL on the first line of the output.
- Save this URL; you’ll use it for making API requests.
If you don’t have a static domain or an Ngrok authentication token, Cloudflare’s cloudflared provides a quick, temporary URL for API access.
1. Download the cloudflared Binary

   Run the following commands to download and prepare cloudflared for use:

   # Navigate to the root path of your project
   %cd $ROOT_PATH
   # Download the latest cloudflared binary
   !wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared
   !chmod a+x cloudflared
2. Start the Tunnel

   Execute this command to open a Cloudflare tunnel:

   !./cloudflared tunnel --url localhost:8001
- A temporary Cloudflare URL will appear in the output.
- Use this URL for testing API requests during your development session.
Ngrok vs. Cloudflare:
- Use Ngrok if you need a persistent, custom domain for reliable API access.
- Use Cloudflare for a quick, temporary solution without account setup.
Best Practices:
- Save the displayed URLs immediately after running the commands.
- Test the connection by accessing the URL in your browser or API client, as in the sketch below.
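For example, once the vLLM server from the next section is running, you can verify the tunnel end to end with Python's requests library (a minimal sketch; the /v1/models path and token-abc123 key come from the server command below, so adjust them if you change either):

```python
import requests

PUBLIC_URL = "<YOUR_PUBLIC_URL>"  # The ngrok or Cloudflare URL printed above

# A 200 response means the tunnel is forwarding and the API key is accepted
resp = requests.get(
    f"{PUBLIC_URL}/v1/models",
    headers={"Authorization": "Bearer token-abc123"},
)
print("Status:", resp.status_code)
```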
By following these steps, you’ll successfully set up API forwarding for your project using either Ngrok or Cloudflare.
Once the tunnel is set up, proceed to download the model and run the API server:
# Navigate to the root path
%cd $ROOT_PATH
# Download the model from Hugging Face
!HF_ENDPOINT=https://huggingface.co huggingface-cli download SakuraLLM/Sakura-14B-Qwen2.5-v1.0-GGUF --local-dir models --include sakura-14b-qwen2.5-v1.0-q6k.gguf
# Start the API server with vLLM
!RAY_memory_monitor_refresh_ms="0" HF_ENDPOINT=https://huggingface.co OMP_NUM_THREADS=36 \
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve ./models/sakura-14b-qwen2.5-v1.0-q6k.gguf \
--tokenizer Qwen/Qwen2.5-14B-Instruct --dtype float16 --api-key token-abc123 \
--kv-cache-dtype auto --max-model-len 4096 --tensor-parallel-size 1 \
--gpu-memory-utilization 0.99 --disable-custom-all-reduce --enforce-eager \
--use-v2-block-manager --disable-log-requests --host 0.0.0.0 --port 8001 \
--served-model-name "Qwen2.5-14B-Instruct" &
⏳ Output: Your API backend will start, and you can use the displayed public URL from the previous step for making requests.
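Because the server is launched in the background (the trailing &), the model can take several minutes to download and load. Below is a minimal readiness probe (a sketch assuming vLLM's standard /health endpoint on the port configured above):

```python
import time
import requests

# Poll vLLM's health endpoint until the model has finished loading
for _ in range(120):
    try:
        if requests.get("http://localhost:8001/health", timeout=2).status_code == 200:
            print("vLLM server is ready.")
            break
    except requests.exceptions.RequestException:
        pass  # Server not up yet; keep waiting
    time.sleep(5)
else:
    print("Server did not become ready in time; check the cell output for errors.")
```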
Use the following script to send requests to the API. Replace <YOUR_API_URL> with the public API endpoint.
import requests

API_ENDPOINT = "<YOUR_API_URL>"  # Replace with your API URL
API_KEY = "token-abc123"  # Replace with your API key

headers = {"Authorization": f"Bearer {API_KEY}"}
data = {
    "model": "Qwen2.5-14B-Instruct",  # Must match --served-model-name from the server command
    "prompt": "Translate this sentence to Japanese: 'Good morning!'",
    "max_tokens": 50
}

response = requests.post(f"{API_ENDPOINT}/v1/completions", json=data, headers=headers)
print(response.json())
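Because vLLM exposes an OpenAI-compatible API, the official openai Python client works as well (a sketch assuming openai>=1.0; install it with pip install openai if it isn't already present):

```python
from openai import OpenAI

# Point the client at the vLLM server instead of api.openai.com
client = OpenAI(base_url="<YOUR_API_URL>/v1", api_key="token-abc123")

completion = client.completions.create(
    model="Qwen2.5-14B-Instruct",  # Must match --served-model-name
    prompt="Translate this sentence to Japanese: 'Good morning!'",
    max_tokens=50,
)
print(completion.choices[0].text)
```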
Mounting Drive ensures that all downloaded models and intermediate data are preserved between sessions. Without this, data is erased when Colab disconnects.
No, the notebook is designed for use with free-tier Colab GPUs. However, performance and speed may improve if you use Colab Pro, Colab Pro+, or rented GPUs with higher memory capacity.
This typically happens when:
- Insufficient GPU memory is available for the model.
- The notebook is configured to use GPU, but Colab assigns a CPU-only runtime.
- There is a dependency conflict in the environment.

To resolve it:
- Check that the assigned runtime is GPU-enabled (Runtime > Change runtime type > Hardware accelerator > GPU).
- Verify that the model's size fits within your GPU's memory capacity (15 GB is usually sufficient for most tasks).
- Restart the runtime, clear storage if needed, and re-run the notebook.
- Ensure your ngrokToken is correctly set and matches your ngrok account.
- Confirm that the generated public URL appears in the output after Step 3.
- Verify the connection by opening the public URL in a browser to check the API's availability.
- Ensure the API Key specified during setup (e.g., --api-key token-abc123) matches the key used in Sakura.
- Double-check the model name (e.g., --served-model-name "Qwen2.5-14B-Instruct") for consistency.
- For Cloudflare users, remember that the URL is temporary and may require reconfiguration between sessions.
Yes! The notebook supports fine-tuning or modifications to the model using tools like Hugging Face's transformers library. For large models, ensure you have sufficient GPU resources. Cloud environments may not always support extensive fine-tuning, but you can use your local machine or rented servers for advanced customization.
- The API Key is manually set during the vLLM backend setup, for example: --api-key token-abc123. Use the same API Key (token-abc123) in the Sakura workspace or other applications.
- The model name is defined in the setup as --served-model-name "Qwen2.5-14B-Instruct". Ensure you input this exact model name in the application using the API; you can confirm it with the sketch below.
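To confirm the exact name the server advertises, you can list the served models (a minimal sketch; substitute your own URL and key):

```python
import requests

# List the model names the server exposes; use one of these IDs in your client
resp = requests.get(
    "<YOUR_API_URL>/v1/models",
    headers={"Authorization": "Bearer token-abc123"},
)
for model in resp.json()["data"]:
    print(model["id"])
```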
This is commonly due to:
- Incorrect API Key: Ensure the key matches what was set during the vLLM server startup.
- Model name mismatch: The model name used in the backend must match the one configured in the Sakura workspace.
- Dynamic URLs: If you use a non-static URL (e.g., without a custom ngrok domain), the URL changes with each session.
To fix:
- Double-check the API Key and model name used in the workspace.
- For a persistent URL, use a static ngrok domain or migrate to a local hosting setup for stability.
Using Colab for personal or research projects is generally acceptable. However, avoid engaging in activities that violate Google's usage policies, such as excessive resource usage or sharing Colab-generated APIs publicly without restrictions. Always respect the terms of service to prevent potential limitations or bans.
Yes! The Colab notebook includes a one-click launcher link for streamlined operation. Simply open the link and run all cells without modifying the code. This setup is tested to work seamlessly as long as runtime conditions are met.
- SakuraLLM: For providing the model.
- SakuraLLM-Notebooks: For inspiring this repository.