A lightweight & fully customizable API server for Contextual Retrieval-Augmented Generation (RAG) operations, supporting document chunking with context generation, multi-embedding semantic search, and reranking.
This service provides endpoints for implementing contextual RAG workflows:
- Stateless RAG: operations where you provide both documents and chunks in the request
  - Process documents into chunks and generate embeddings
  - Query against provided chunks with reranking
- Database RAG: a complete contextual RAG pipeline using PostgreSQL (pgvector)
  - Document chunking with context awareness
  - Hybrid semantic search with context embeddings
  - Flexible context generation using OpenAI or local models
- 🔍 Text chunking with configurable size and overlap
- 🧠 Optional context generation using OpenAI or local models
- 📈 Flexible embedding model selection:
  - Choose models per request in stateless operations
  - Configure a default model for database operations
- 🎯 Hybrid semantic search with configurable weights (60/40 content/context)
- 🔄 Cross-encoder reranking for better relevance
- 📊 Highly configurable parameters for all operations
- 🚀 Efficient model management with auto-unloading
- 💾 Choose between stateless or database-backed operation
The search pipeline consists of two stages:

1. Initial Retrieval
   - Generates an embedding for the query
   - Calculates cosine similarity against both content and context embeddings
   - Combines the similarities with a weighted average (60% content, 40% context)
   - Applies a similarity threshold (if specified)
   - Selects the `top_k` most similar chunks

2. Reranking
   - Uses a cross-encoder model for more accurate relevance scoring
   - Reranks the initial candidates
   - Returns the final ordered results
The two-stage approach combines the efficiency of embedding-based retrieval with the accuracy of cross-encoder reranking.
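For concreteness, the first stage reduces to a few lines of arithmetic. The TypeScript sketch below shows that scoring step; the function and field names are illustrative assumptions, not the server's internals.

```typescript
// Sketch of stage-1 scoring: cosine similarity, 60/40 weighting,
// threshold filter, then top-k selection. Names are illustrative.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedChunk {
  content: string;
  content_embedding: number[];
  context_embedding: number[];
}

function initialRetrieval(
  queryEmbedding: number[],
  chunks: EmbeddedChunk[],
  topK: number,
  threshold = 0
) {
  return chunks
    .map((chunk) => ({
      chunk,
      // Weighted average: 60% content similarity, 40% context similarity
      combined:
        0.6 * cosine(queryEmbedding, chunk.content_embedding) +
        0.4 * cosine(queryEmbedding, chunk.context_embedding),
    }))
    .filter((scored) => scored.combined >= threshold)
    .sort((a, b) => b.combined - a.combined)
    .slice(0, topK);
}
```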
1. Clone and set up:

   ```bash
   git clone https://github.com/jiaweing/localRAG-api.git
   cd localRAG-api
   pnpm install
   ```

2. Set up the PostgreSQL database:

   - Install PostgreSQL if not already installed
   - Create a new database for the application
   - Run migrations with drizzle-kit (coming soon)
   - Configure the database connection in the `.env` file

3. Configure environment variables:

   ```bash
   cp .env.example .env
   ```
   Required environment variables:

   ```env
   # Server Configuration
   PORT=57352

   # OpenAI Configuration (optional)
   OPENAI_API_KEY=your_api_key_here
   OPENAI_MODEL_NAME=gpt-4o-mini # or any other OpenAI model

   # Default Models Configuration
   EMBEDDING_MODEL=all-MiniLM-L6-v2.Q4_K_M # Model used for database RAG operations

   # Database Configuration
   DATABASE_URL=postgresql://postgres:password@localhost:5432/rag
   ```
4. Place your GGUF models in the appropriate directories under `models/`:

   ```
   localRAG-api/
   ├── models/
   │   ├── embedding/   # Embedding models (e.g., all-MiniLM-L6-v2)
   │   ├── reranker/    # Cross-encoder reranking models (e.g., bge-reranker)
   │   └── chat/        # Chat models for local context generation
   ```
   Windows:

   ```bat
   scripts\download-models.bat
   ```

   Linux/macOS:

   ```bash
   chmod +x scripts/download-models.sh
   ./scripts/download-models.sh
   ```
   Download the following models and place them in their respective directories:

   - `Llama-3.2-1B-Instruct-Q4_K_M.gguf`: small instruction-tuned chat model for context generation (`models/chat/`)
   - `all-MiniLM-L6-v2.Q4_K_M.gguf`: efficient text embedding model for semantic search (`models/embedding/`)
   - `bge-reranker-v2-m3-q8_0.gguf`: cross-encoder model for accurate result reranking (`models/reranker/`)
   Expected directory structure after download:

   ```
   models/
   ├── chat/
   │   └── Llama-3.2-1B-Instruct-Q4_K_M.gguf
   ├── embedding/
   │   └── all-MiniLM-L6-v2.Q4_K_M.gguf
   └── reranker/
       └── bge-reranker-v2-m3-q8_0.gguf
   ```
5. Start the services:

   ```bash
   docker compose up --build
   ```

   This will start:

   - PostgreSQL with the pgvector extension at localhost:5432
   - The API server at http://localhost:57352 (configurable via the PORT environment variable)

To run without Docker, start the server directly:

```bash
pnpm dev    # development
pnpm start  # production
```
The project includes Docker configuration for easy deployment:

- `docker-compose.yml`: defines services for PostgreSQL with pgvector and the API server
- `Dockerfile`: multi-stage build for the Node.js API service using pnpm
- `.dockerignore`: excludes unnecessary files from the Docker build context

Environment variables and the database connection are configured automatically when using Docker.
The application uses PostgreSQL with the following schema:

```sql
CREATE TABLE dataset (
  id SERIAL PRIMARY KEY,
  file_id VARCHAR(32) NOT NULL,
  folder_id VARCHAR(32),
  context TEXT NOT NULL,
  context_embedding vector(384),
  content TEXT NOT NULL,
  content_embedding vector(384)
);

-- Create HNSW vector indexes for similarity search
CREATE INDEX context_embedding_idx ON dataset USING hnsw (context_embedding vector_cosine_ops);
CREATE INDEX content_embedding_idx ON dataset USING hnsw (content_embedding vector_cosine_ops);

-- Create indexes for file and folder lookups
CREATE INDEX file_id_idx ON dataset (file_id);
CREATE INDEX folder_id_idx ON dataset (folder_id);
```
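As an illustration of how the hybrid search maps onto this schema, the sketch below issues a weighted cosine-similarity query with the `pg` client and pgvector's `<=>` (cosine distance) operator. Treat it as an assumption about the query shape, not the server's actual implementation.

```typescript
import { Client } from "pg";

// Hypothetical hybrid-search query against the `dataset` table above.
async function hybridSearch(queryEmbedding: number[], topK = 3) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // pgvector accepts the bracketed literal form, e.g. "[0.1,0.2,...]"
  const vec = `[${queryEmbedding.join(",")}]`;

  // Note: ordering by a weighted expression bypasses the HNSW indexes;
  // this is shown for clarity of the 60/40 scoring, not performance.
  const { rows } = await client.query(
    `SELECT content, context,
            0.6 * (1 - (content_embedding <=> $1)) +
            0.4 * (1 - (context_embedding <=> $1)) AS combined
     FROM dataset
     ORDER BY combined DESC
     LIMIT $2`,
    [vec, topK]
  );

  await client.end();
  return rows;
}
```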
`POST /v1/chunk`: Process document chunks and generate embeddings without persistence.

Request:

```jsonc
{
  "text": "your document text",
  "model": "embedding-model-name",
  "chunkSize": 500, // optional, default: 500
  "overlap": 50, // optional, default: 50
  "generateContexts": true, // optional, default: false
  "useOpenAI": false // optional, default: false
}
```
Response:

```json
{
  "chunks": [
    {
      "content": "chunk text",
      "context": "generated context",
      "content_embedding": [...],
      "context_embedding": [...],
      "metadata": {
        "file_id": "",
        "folder_id": null,
        "has_context": true
      }
    }
  ]
}
```
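The `chunkSize` and `overlap` parameters behave like a sliding window: with the defaults, consecutive chunks start 450 characters apart and share 50. A minimal character-based sketch of that behavior follows (an assumption; the server's actual splitter may differ, for example by respecting token or sentence boundaries):

```typescript
// Sliding-window chunking sketch (assumes character-based splitting).
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const step = chunkSize - overlap; // with defaults: a new chunk every 450 chars
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final window reached the end
  }
  return chunks;
}
```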
`POST /v1/query`: Search across provided chunks with optional reranking.

Request:

```jsonc
{
  "query": "your search query",
  "chunks": [], // Array of chunks with embeddings from /chunk endpoint
  "embeddingModel": "model-name", // Required: model to use for query embedding
  "rerankerModel": "model-name", // Optional: model to use for reranking
  "topK": 4, // Optional: number of results to return
  "shouldRerank": true // Optional: whether to apply reranking
}
```
Response:

```json
{
  "results": [
    {
      "content": "chunk text",
      "context": "chunk context",
      "metadata": {
        "file_id": "",
        "folder_id": null
      },
      "scores": {
        "content": 0.95,
        "context": 0.88,
        "combined": 0.92,
        "reranked": 0.96
      }
    }
  ]
}
```
`POST /v1/store`: Store a document in the database. The document is automatically chunked, with optional context generation.

Request:

```jsonc
{
  "document": "full document text",
  "folder_id": "optional-folder-id", // optional
  "chunkSize": 500, // optional, default: 500
  "overlap": 50, // optional, default: 50
  "generateContexts": true, // optional, default: false
  "useOpenAI": false // optional, default: false
}
```
Response:

```json
{
  "message": "Document chunks processed successfully",
  "file_id": "generated-file-id",
  "folder_id": "optional-folder-id",
  "chunks": [
    {
      "content": "chunk text",
      "context": "generated context",
      "content_embedding": [...],
      "context_embedding": [...],
      "metadata": {
        "document": "document name/id",
        "timestamp": "2024-02-05T06:15:21.000Z"
      }
    }
  ]
}
```
`POST /v1/retrieve`: Search across stored chunks with hybrid semantic search.

Request:

```jsonc
{
  "query": "your search query",
  "folder_id": "optional-folder-id",
  "top_k": 3, // Optional: default is 3
  "threshold": 0.0 // Optional: similarity threshold 0-1, default is 0.0
}
```
Response:

```json
{
  "message": "Chunks retrieved successfully",
  "results": [
    {
      "content": "chunk text",
      "context": "chunk context",
      "metadata": {
        "file_id": "file-id",
        "folder_id": "folder-id"
      },
      "scores": {
        "content": 0.95,
        "context": 0.88,
        "combined": 0.92,
        "reranked": 0.96
      }
    }
  ]
}
```
The search uses a hybrid approach combining both content and context similarity:
- Content similarity (60% weight): How well the chunk's content matches the query
- Context similarity (40% weight): How well the chunk's context matches the query
- Combined score: Weighted average of content and context similarities
- Reranked score: Cross-encoder reranking applied to initial results
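Using the sample scores above as a worked example: combined = 0.6 × 0.95 + 0.4 × 0.88 = 0.922, which appears (rounded) as the 0.92 in the response.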
`GET /v1/documents`: List stored documents with paginated results; each entry includes a preview of the document's first chunk.

Parameters (passed as a query string; see the curl example below):

```jsonc
{
  "page": 1, // Optional: default is 1
  "pageSize": 10, // Optional: default is 10, max is 100
  "folder_id": "optional-folder-id", // Optional: filter by folder
  "file_id": "optional-file-id" // Optional: filter by file
}
```
Response:

```json
{
  "message": "Documents retrieved successfully",
  "data": [
    {
      "file_id": "unique-file-id",
      "folder_id": "optional-folder-id",
      "content_preview": "first chunk content",
      "context_preview": "first chunk context"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 50,
    "page_size": 10
  }
}
```
Response fields:

- `data`: array of documents with previews and metadata
- `pagination`: information about the current page and total results
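A small client-side helper (hypothetical; not part of the API) can walk every page by following the pagination block:

```typescript
// Hypothetical helper that pages through GET /v1/documents.
async function listAllDocuments(folderId?: string) {
  const API_URL = "http://localhost:57352/v1";
  const all: unknown[] = [];
  let page = 1;
  let totalPages = 1;

  do {
    const params = new URLSearchParams({ page: String(page), pageSize: "100" });
    if (folderId) params.set("folder_id", folderId);

    const res = await fetch(`${API_URL}/documents?${params}`);
    const body = await res.json();

    all.push(...body.data);
    totalPages = body.pagination.total_pages;
    page++;
  } while (page <= totalPages);

  return all;
}
```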
`POST /v1/delete`: Delete all chunks associated with a specific `file_id`.

Request:

```json
{
  "file_id": "file_id_to_delete"
}
```
Response:

```json
{
  "message": "Chunks deleted successfully",
  "file_id": "file_id_that_was_deleted"
}
```
Pre-load a model into memory.

Request:

```json
{
  "model": "model-name",
  "type": "embedding | reranker | chat"
}
```

Response:

```json
{
  "message": "Model loaded successfully"
}
```
Unload a model from memory.

Request:

```json
{
  "model": "model-name"
}
```

Response:

```json
{
  "message": "Model unloaded successfully"
}
```

or, if the model is not found:

```json
{
  "error": "Model not found or not loaded"
}
```
List all available models.

Response:

```json
[
  {
    "name": "model-name",
    "type": "embedding | reranker | chat",
    "loaded": true
  }
]
```
All endpoints return appropriate HTTP status codes:
- 200: Success
- 400: Bad Request (missing/invalid parameters)
- 404: Not Found (model not found)
- 500: Internal Server Error
Error response format:

```json
{
  "error": "Error description"
}
```
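Because every failure carries a single `error` field, a thin client wrapper can normalize error handling. Here is one possible sketch (the wrapper itself is illustrative, not part of the API):

```typescript
// Hypothetical wrapper that surfaces the documented { "error": ... } payload.
async function callApi<T>(path: string, body: unknown): Promise<T> {
  const res = await fetch(`http://localhost:57352/v1${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const json = await res.json();
  if (!res.ok) {
    // 400 / 404 / 500 responses all use the format shown above
    throw new Error(`HTTP ${res.status}: ${json.error ?? "Unknown error"}`);
  }
  return json as T;
}
```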
Example (Stateless RAG):

```typescript
async function searchChunks(text: string, query: string) {
  const API_URL = "http://localhost:57352/v1";

  // 1. Process document into chunks and get embeddings
  const chunkResponse = await fetch(`${API_URL}/chunk`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text,
      model: "all-MiniLM-L6-v2",
      generateContexts: true,
      chunkSize: 500,
      overlap: 50,
    }),
  });
  const { chunks: processedChunks } = await chunkResponse.json();

  // 2. Search across chunks with reranking
  const queryResponse = await fetch(`${API_URL}/query`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query,
      chunks: processedChunks,
      embeddingModel: "all-MiniLM-L6-v2",
      rerankerModel: "bge-reranker-base",
      topK: 4,
      shouldRerank: true,
    }),
  });
  const { results } = await queryResponse.json();
  return results;
}
```
Equivalent curl requests:

```bash
# List documents with pagination and filters
curl -X GET "http://localhost:57352/v1/documents?page=1&pageSize=10&folder_id=optional-folder-id"

# Process document into chunks (Stateless RAG)
curl -X POST http://localhost:57352/v1/chunk \
  -H "Content-Type: application/json" \
  -d '{
    "text": "your document text",
    "model": "all-MiniLM-L6-v2",
    "generateContexts": true,
    "chunkSize": 500,
    "overlap": 50
  }'

# Search across chunks with reranking (Stateless RAG)
curl -X POST http://localhost:57352/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "your search query",
    "chunks": [],
    "embeddingModel": "all-MiniLM-L6-v2",
    "rerankerModel": "bge-reranker-base",
    "topK": 4,
    "shouldRerank": true
  }'
```
Example (Database RAG):

```typescript
async function storeAndSearch(document: string, query: string) {
  const API_URL = "http://localhost:57352/v1";

  // 1. Store document in database (it will be automatically chunked)
  const storeResponse = await fetch(`${API_URL}/store`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      document,
      folder_id: "optional-folder-id", // Optional: for organizing documents
      chunkSize: 500, // Optional: customize chunk size
      overlap: 50, // Optional: customize overlap
      generateContexts: true, // Optional: enable context generation
      useOpenAI: false, // Optional: use OpenAI for context generation
    }),
  });
  const { file_id, chunks: processedChunks } = await storeResponse.json();

  // 2. Search across stored chunks
  const queryResponse = await fetch(`${API_URL}/retrieve`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query,
      folder_id: "optional-folder-id", // Optional: search within folder
      top_k: 3,
      threshold: 0.7, // Only return matches with similarity > 0.7
    }),
  });
  const { results } = await queryResponse.json();
  return { results, file_id };
}

// Example: Delete stored chunks
async function deleteStoredChunks(fileId: string) {
  const API_URL = "http://localhost:57352/v1";
  const response = await fetch(`${API_URL}/delete`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      file_id: fileId,
    }),
  });
  const result = await response.json();
  console.log(`Deleted chunks for file ${result.file_id}`);
}
```
The same flow with curl:

```bash
# 1. Store document in database (Database RAG)
curl -X POST http://localhost:57352/v1/store \
  -H "Content-Type: application/json" \
  -d '{
    "document": "full document text",
    "folder_id": "optional-folder-id",
    "chunkSize": 500,
    "overlap": 50,
    "generateContexts": true,
    "useOpenAI": false
  }'

# 2. Search stored chunks
curl -X POST http://localhost:57352/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "query": "search query",
    "folder_id": "optional-folder-id",
    "top_k": 3,
    "threshold": 0.7
  }'

# 3. Delete chunks using file_id
curl -X POST http://localhost:57352/v1/delete \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_id_from_store_response"
  }'
```