# mm_server Model Service Installation

English | 简体中文

## Environment Installation

You can use miniconda for a lightweight installation of the Python conda environment.

```bash
# Install the virtual environment
conda create -n qllm python==3.11
source activate qllm

cd mm_server

# Install python dependencies
pip install -r requirements.txt
```

## Modify the Configuration File

Edit the model configuration items:

```bash
cp config.py.local config.py
```

Modify the config.py configuration file according to your needs:

- clip_model_name: visual (image) embedding model
- ocr_model_path: OCR text recognition model
- hf_embedding_model_name: text embedding model
- audio_model_name: video-to-text (transcription) model

```python
@dataclass
class Configer:
    # mm server host
    server_host = "localhost"
    server_port = 50110

    # image encoder model name
    clip_model_name = "ViT-B/32"
    # model keep alive time (s)
    clip_keep_alive = 60

    # ocr model path
    ocr_model_path = str(TEMPLATE_PATH / "services/ocr_server/ocr_models")
    # model keep alive time (s)
    ocr_keep_alive = 60

    # text embedding model name
    hf_embedding_model_name = "BAAI/bge-small-en-v1.5"
    # model keep alive time (s)
    hf_keep_alive = 60

    # video transcription model name
    audio_model_name = "small"
    # model keep alive time (s)
    video_keep_alive = 1
```

## Start the Service

```bash
source activate qllm
python main.py

# Reload mode: hot-reloads the service after code changes
# uvicorn main:app --reload --host localhost --port 50110
```
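
To confirm the service came up, you can hit the FastAPI docs endpoint it exposes (see API Documentation below). A minimal check, assuming the default server_host/server_port from config.py:

```python
# Quick liveness check for the mm_server service (illustrative sketch).
# Assumes the default server_host/server_port from config.py (localhost:50110).
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:50110/docs", timeout=5) as resp:
        print("mm_server is up, HTTP status:", resp.status)
except OSError as exc:
    print("mm_server is not reachable:", exc)
```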

## Local Ollama Model Deployment

Refer to the official Ollama installation instructions.

### Starting the Local Model

Choose the 8b local model, though there will be some loss in performance.

```bash
# Local version:
ollama run llama3:8b-instruct-q4_0
```

### (Optional) Local 70B Large Model

On servers, you can choose the 70b large model version. You also need to update model_name in mmrag_server/config.py accordingly (see the sketch after the command below).

```bash
# Remote server version:
ollama run llama3:70b-instruct
```
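
The model_name change in mmrag_server/config.py might then look like the following. This is only an illustrative sketch; the actual layout of that file may differ in your checkout:

```python
# mmrag_server/config.py (illustrative excerpt, not the full file)
# Point the RAG server at the larger Ollama model pulled above.
model_name = "llama3:70b-instruct"
```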

### (Optional) Model Lifecycle Management

Ollama automatically unloads the model after 5 minutes of inactivity and reloads it on the next request. If you need to keep it loaded for a longer period, you can configure the keep_alive parameter with one of the following values:

- Duration string (e.g., "10m" or "24h")
- Number of seconds (e.g., 3600)
- Keep the model loaded permanently: any negative number (e.g., -1 or "-1m")
- Unload the model: "0" unloads the model immediately after generating a response

```bash
# Permanently keep the model loaded
curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "keep_alive": -1}'

# Unload the model
curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "keep_alive": 0}'

# Set auto-release time (in seconds)
curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "keep_alive": 3600}'
```
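
If you prefer to set the keep-alive policy from Python rather than curl, a small helper along these lines mirrors the calls above; it assumes Ollama is listening on its default port 11434, and the model name in the example is only a placeholder:

```python
# Adjust Ollama's keep_alive policy for a model (illustrative sketch).
import json
import urllib.request

def set_keep_alive(model: str, keep_alive) -> None:
    # keep_alive may be a duration string ("10m", "24h"), a number of seconds,
    # a negative number (keep loaded permanently), or 0 (unload immediately).
    payload = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())

if __name__ == "__main__":
    set_keep_alive("llama3:8b-instruct-q4_0", -1)  # keep loaded permanently
```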

## Model Selection


## API Documentation

http://localhost:50110/docs
