A lightweight & fully customizable API server for Contextual Retrieval-Augmented Generation (RAG) operations, supporting document chunking with context generation, multi-embedding semantic search, and reranking.
This service provides endpoints for implementing contextual RAG workflows:
- Stateless RAG: operations where you provide both documents and chunks in the request
  - Process documents into chunks and generate embeddings
  - Query against provided chunks with reranking
- Database RAG: a complete contextual RAG pipeline using PostgreSQL (pgvector)
  - Document chunking with context awareness
  - Hybrid semantic search with context embeddings
  - Flexible context generation using OpenAI or local models
- 🔍 Text chunking with configurable size and overlap
- 🧠 Optional context generation using OpenAI or local models
- 📈 Flexible embedding model selection:
  - Choose models per request in stateless operations
  - Configure a default model for database operations
- 🎯 Hybrid semantic search with configurable weights (60/40 content/context)
- 🔄 Cross-encoder reranking for better relevance
- 📊 Highly configurable parameters for all operations
- 🚀 Efficient model management with auto-unloading
- 💾 Choose between stateless or database-backed operation
The search pipeline consists of two stages:

1. Initial Retrieval
   - Generates an embedding for the query
   - Calculates cosine similarity against both content and context embeddings
   - Combines the similarities with a weighted average (60% content, 40% context)
   - Applies a similarity threshold (if specified)
   - Selects the `top_k` most similar chunks

2. Reranking
   - Uses a cross-encoder model for more accurate relevance scoring
   - Reranks the initial candidates
   - Returns the final ordered results
The two-stage approach combines the efficiency of embedding-based retrieval with the accuracy of cross-encoder reranking.
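For concreteness, the first stage reduces to a few lines of arithmetic. The TypeScript sketch below shows that scoring step; the function and field names are illustrative assumptions, not the server's internals.

```typescript
// Sketch of stage-1 scoring: cosine similarity, 60/40 weighting,
// threshold filter, then top-k selection. Names are illustrative.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedChunk {
  content: string;
  content_embedding: number[];
  context_embedding: number[];
}

function initialRetrieval(
  queryEmbedding: number[],
  chunks: EmbeddedChunk[],
  topK: number,
  threshold = 0
) {
  return chunks
    .map((chunk) => ({
      chunk,
      // Weighted average: 60% content similarity, 40% context similarity
      combined:
        0.6 * cosine(queryEmbedding, chunk.content_embedding) +
        0.4 * cosine(queryEmbedding, chunk.context_embedding),
    }))
    .filter((scored) => scored.combined >= threshold)
    .sort((a, b) => b.combined - a.combined)
    .slice(0, topK);
}
```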
1. Clone and set up:

   ```bash
   git clone https://github.com/jiaweing/localRAG-api.git
   cd localRAG-api
   pnpm install
   ```

2. Set up the PostgreSQL database:

   - Install PostgreSQL if not already installed
   - Create a new database for the application
   - Run migrations with drizzle-kit (coming soon)
   - Configure the database connection in the `.env` file

3. Configure environment variables:

   ```bash
   cp .env.example .env
   ```
   Required environment variables:

   ```env
   # Server Configuration
   PORT=57352

   # OpenAI Configuration (optional)
   OPENAI_API_KEY=your_api_key_here
   OPENAI_MODEL_NAME=gpt-4o-mini # or any other OpenAI model

   # Default Models Configuration
   EMBEDDING_MODEL=all-MiniLM-L6-v2.Q4_K_M # Model used for database RAG operations

   # Database Configuration
   DATABASE_URL=postgresql://postgres:password@localhost:5432/rag
   ```
4. Place your GGUF models in the appropriate directories under `models/`:

   ```
   localRAG-api/
   ├── models/
   │   ├── embedding/   # Embedding models (e.g., all-MiniLM-L6-v2)
   │   ├── reranker/    # Cross-encoder reranking models (e.g., bge-reranker)
   │   └── chat/        # Chat models for local context generation
   ```
   Windows:

   ```bat
   scripts\download-models.bat
   ```

   Linux/macOS:

   ```bash
   chmod +x scripts/download-models.sh
   ./scripts/download-models.sh
   ```
   Download the following models and place them in their respective directories:

   - `Llama-3.2-1B-Instruct-Q4_K_M.gguf`: small instruction-tuned chat model for context generation (`models/chat/`)
   - `all-MiniLM-L6-v2.Q4_K_M.gguf`: efficient text embedding model for semantic search (`models/embedding/`)
   - `bge-reranker-v2-m3-q8_0.gguf`: cross-encoder model for accurate result reranking (`models/reranker/`)
   Expected directory structure after download:

   ```
   models/
   ├── chat/
   │   └── Llama-3.2-1B-Instruct-Q4_K_M.gguf
   ├── embedding/
   │   └── all-MiniLM-L6-v2.Q4_K_M.gguf
   └── reranker/
       └── bge-reranker-v2-m3-q8_0.gguf
   ```
5. Start the services:

   ```bash
   docker compose up --build
   ```

   This will start:

   - PostgreSQL with the pgvector extension at localhost:5432
   - The API server at http://localhost:57352 (configurable via the PORT environment variable)

To run without Docker, start the server directly:

```bash
pnpm dev    # development
pnpm start  # production
```
The project includes Docker configuration for easy deployment:

- `docker-compose.yml`: defines services for PostgreSQL with pgvector and the API server
- `Dockerfile`: multi-stage build for the Node.js API service using pnpm
- `.dockerignore`: excludes unnecessary files from the Docker build context

Environment variables and the database connection are configured automatically when using Docker.
The application uses PostgreSQL with the following schema:

```sql
CREATE TABLE dataset (
  id SERIAL PRIMARY KEY,
  file_id VARCHAR(32) NOT NULL,
  folder_id VARCHAR(32),
  context TEXT NOT NULL,
  context_embedding vector(384),
  content TEXT NOT NULL,
  content_embedding vector(384)
);

-- Create HNSW vector indexes for similarity search
CREATE INDEX context_embedding_idx ON dataset USING hnsw (context_embedding vector_cosine_ops);
CREATE INDEX content_embedding_idx ON dataset USING hnsw (content_embedding vector_cosine_ops);

-- Create indexes for file and folder lookups
CREATE INDEX file_id_idx ON dataset (file_id);
CREATE INDEX folder_id_idx ON dataset (folder_id);
```
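As an illustration of how the hybrid search maps onto this schema, the sketch below issues a weighted cosine-similarity query with the `pg` client and pgvector's `<=>` (cosine distance) operator. Treat it as an assumption about the query shape, not the server's actual implementation.

```typescript
import { Client } from "pg";

// Hypothetical hybrid-search query against the `dataset` table above.
async function hybridSearch(queryEmbedding: number[], topK = 3) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // pgvector accepts the bracketed literal form, e.g. "[0.1,0.2,...]"
  const vec = `[${queryEmbedding.join(",")}]`;

  // Note: ordering by a weighted expression bypasses the HNSW indexes;
  // this is shown for clarity of the 60/40 scoring, not performance.
  const { rows } = await client.query(
    `SELECT content, context,
            0.6 * (1 - (content_embedding <=> $1)) +
            0.4 * (1 - (context_embedding <=> $1)) AS combined
     FROM dataset
     ORDER BY combined DESC
     LIMIT $2`,
    [vec, topK]
  );

  await client.end();
  return rows;
}
```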
`POST /v1/chunk`: Process document chunks and generate embeddings without persistence.

Request:

```jsonc
{
  "text": "your document text",
  "model": "embedding-model-name",
  "chunkSize": 500, // optional, default: 500
  "overlap": 50, // optional, default: 50
  "generateContexts": true, // optional, default: false
  "useOpenAI": false // optional, default: false
}
```
Response:

```json
{
  "chunks": [
    {
      "content": "chunk text",
      "context": "generated context",
      "content_embedding": [...],
      "context_embedding": [...],
      "metadata": {
        "file_id": "",
        "folder_id": null,
        "has_context": true
      }
    }
  ]
}
```
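The `chunkSize` and `overlap` parameters behave like a sliding window: with the defaults, consecutive chunks start 450 characters apart and share 50. A minimal character-based sketch of that behavior follows (an assumption; the server's actual splitter may differ, for example by respecting token or sentence boundaries):

```typescript
// Sliding-window chunking sketch (assumes character-based splitting).
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const step = chunkSize - overlap; // with defaults: a new chunk every 450 chars
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final window reached the end
  }
  return chunks;
}
```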
`POST /v1/query`: Search across provided chunks with optional reranking.

Request:

```jsonc
{
  "query": "your search query",
  "chunks": [], // Array of chunks with embeddings from /chunk endpoint
  "embeddingModel": "model-name", // Required: model to use for query embedding
  "rerankerModel": "model-name", // Optional: model to use for reranking
  "topK": 4, // Optional: number of results to return
  "shouldRerank": true // Optional: whether to apply reranking
}
```
Response:

```json
{
  "results": [
    {
      "content": "chunk text",
      "context": "chunk context",
      "metadata": {
        "file_id": "",
        "folder_id": null
      },
      "scores": {
        "content": 0.95,
        "context": 0.88,
        "combined": 0.92,
        "reranked": 0.96
      }
    }
  ]
}
```
`POST /v1/store`: Store a document in the database. The document is automatically chunked, with optional context generation.

Request:

```jsonc
{
  "document": "full document text",
  "folder_id": "optional-folder-id", // optional
  "chunkSize": 500, // optional, default: 500
  "overlap": 50, // optional, default: 50
  "generateContexts": true, // optional, default: false
  "useOpenAI": false // optional, default: false
}
```
Response:

```json
{
  "message": "Document chunks processed successfully",
  "file_id": "generated-file-id",
  "folder_id": "optional-folder-id",
  "chunks": [
    {
      "content": "chunk text",
      "context": "generated context",
      "content_embedding": [...],
      "context_embedding": [...],
      "metadata": {
        "document": "document name/id",
        "timestamp": "2024-02-05T06:15:21.000Z"
      }
    }
  ]
}
```
`POST /v1/retrieve`: Search across stored chunks with hybrid semantic search.

Request:

```jsonc
{
  "query": "your search query",
  "folder_id": "optional-folder-id",
  "top_k": 3, // Optional: default is 3
  "threshold": 0.0 // Optional: similarity threshold 0-1, default is 0.0
}
```
Response:

```json
{
  "message": "Chunks retrieved successfully",
  "results": [
    {
      "content": "chunk text",
      "context": "chunk context",
      "metadata": {
        "file_id": "file-id",
        "folder_id": "folder-id"
      },
      "scores": {
        "content": 0.95,
        "context": 0.88,
        "combined": 0.92,
        "reranked": 0.96
      }
    }
  ]
}
```
The search uses a hybrid approach combining both content and context similarity:
- Content similarity (60% weight): How well the chunk's content matches the query
- Context similarity (40% weight): How well the chunk's context matches the query
- Combined score: Weighted average of content and context similarities
- Reranked score: Cross-encoder reranking applied to initial results
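Using the sample scores above as a worked example: combined = 0.6 × 0.95 + 0.4 × 0.88 = 0.922, which appears (rounded) as the 0.92 in the response.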
`GET /v1/documents`: List stored documents with paginated results; each entry includes a preview of the document's first chunk.

Parameters (passed as a query string; see the curl example below):

```jsonc
{
  "page": 1, // Optional: default is 1
  "pageSize": 10, // Optional: default is 10, max is 100
  "folder_id": "optional-folder-id", // Optional: filter by folder
  "file_id": "optional-file-id" // Optional: filter by file
}
```
Response:

```json
{
  "message": "Documents retrieved successfully",
  "data": [
    {
      "file_id": "unique-file-id",
      "folder_id": "optional-folder-id",
      "content_preview": "first chunk content",
      "context_preview": "first chunk context"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 50,
    "page_size": 10
  }
}
```
Response fields:

- `data`: array of documents with previews and metadata
- `pagination`: information about the current page and total results
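A small client-side helper (hypothetical; not part of the API) can walk every page by following the pagination block:

```typescript
// Hypothetical helper that pages through GET /v1/documents.
async function listAllDocuments(folderId?: string) {
  const API_URL = "http://localhost:57352/v1";
  const all: unknown[] = [];
  let page = 1;
  let totalPages = 1;

  do {
    const params = new URLSearchParams({ page: String(page), pageSize: "100" });
    if (folderId) params.set("folder_id", folderId);

    const res = await fetch(`${API_URL}/documents?${params}`);
    const body = await res.json();

    all.push(...body.data);
    totalPages = body.pagination.total_pages;
    page++;
  } while (page <= totalPages);

  return all;
}
```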
`POST /v1/delete`: Delete all chunks associated with a specific `file_id`.

Request:

```json
{
  "file_id": "file_id_to_delete"
}
```
Response:

```json
{
  "message": "Chunks deleted successfully",
  "file_id": "file_id_that_was_deleted"
}
```
Pre-load a model into memory.

Request:

```json
{
  "model": "model-name",
  "type": "embedding | reranker | chat"
}
```

Response:

```json
{
  "message": "Model loaded successfully"
}
```
Unload a model from memory.

Request:

```json
{
  "model": "model-name"
}
```

Response:

```json
{
  "message": "Model unloaded successfully"
}
```

or, if the model is not found:

```json
{
  "error": "Model not found or not loaded"
}
```
List all available models.

Response:

```json
[
  {
    "name": "model-name",
    "type": "embedding | reranker | chat",
    "loaded": true
  }
]
```
All endpoints return appropriate HTTP status codes:
- 200: Success
- 400: Bad Request (missing/invalid parameters)
- 404: Not Found (model not found)
- 500: Internal Server Error
Error response format:

```json
{
  "error": "Error description"
}
```
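Because every failure carries a single `error` field, a thin client wrapper can normalize error handling. Here is one possible sketch (the wrapper itself is illustrative, not part of the API):

```typescript
// Hypothetical wrapper that surfaces the documented { "error": ... } payload.
async function callApi<T>(path: string, body: unknown): Promise<T> {
  const res = await fetch(`http://localhost:57352/v1${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const json = await res.json();
  if (!res.ok) {
    // 400 / 404 / 500 responses all use the format shown above
    throw new Error(`HTTP ${res.status}: ${json.error ?? "Unknown error"}`);
  }
  return json as T;
}
```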
Example (Stateless RAG):

```typescript
async function searchChunks(text: string, query: string) {
  const API_URL = "http://localhost:57352/v1";

  // 1. Process document into chunks and get embeddings
  const chunkResponse = await fetch(`${API_URL}/chunk`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text,
      model: "all-MiniLM-L6-v2",
      generateContexts: true,
      chunkSize: 500,
      overlap: 50,
    }),
  });
  const { chunks: processedChunks } = await chunkResponse.json();

  // 2. Search across chunks with reranking
  const queryResponse = await fetch(`${API_URL}/query`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query,
      chunks: processedChunks,
      embeddingModel: "all-MiniLM-L6-v2",
      rerankerModel: "bge-reranker-base",
      topK: 4,
      shouldRerank: true,
    }),
  });
  const { results } = await queryResponse.json();
  return results;
}
```
Equivalent curl requests:

```bash
# List documents with pagination and filters
curl -X GET "http://localhost:57352/v1/documents?page=1&pageSize=10&folder_id=optional-folder-id"

# Process document into chunks (Stateless RAG)
curl -X POST http://localhost:57352/v1/chunk \
  -H "Content-Type: application/json" \
  -d '{
    "text": "your document text",
    "model": "all-MiniLM-L6-v2",
    "generateContexts": true,
    "chunkSize": 500,
    "overlap": 50
  }'

# Search across chunks with reranking (Stateless RAG)
curl -X POST http://localhost:57352/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "your search query",
    "chunks": [],
    "embeddingModel": "all-MiniLM-L6-v2",
    "rerankerModel": "bge-reranker-base",
    "topK": 4,
    "shouldRerank": true
  }'
```
Example (Database RAG):

```typescript
async function storeAndSearch(document: string, query: string) {
  const API_URL = "http://localhost:57352/v1";

  // 1. Store document in database (it will be automatically chunked)
  const storeResponse = await fetch(`${API_URL}/store`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      document,
      folder_id: "optional-folder-id", // Optional: for organizing documents
      chunkSize: 500, // Optional: customize chunk size
      overlap: 50, // Optional: customize overlap
      generateContexts: true, // Optional: enable context generation
      useOpenAI: false, // Optional: use OpenAI for context generation
    }),
  });
  const { file_id, chunks: processedChunks } = await storeResponse.json();

  // 2. Search across stored chunks
  const queryResponse = await fetch(`${API_URL}/retrieve`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query,
      folder_id: "optional-folder-id", // Optional: search within folder
      top_k: 3,
      threshold: 0.7, // Only return matches with similarity > 0.7
    }),
  });
  const { results } = await queryResponse.json();
  return { results, file_id };
}

// Example: Delete stored chunks
async function deleteStoredChunks(fileId: string) {
  const API_URL = "http://localhost:57352/v1";
  const response = await fetch(`${API_URL}/delete`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      file_id: fileId,
    }),
  });
  const result = await response.json();
  console.log(`Deleted chunks for file ${result.file_id}`);
}
```
The same flow with curl:

```bash
# 1. Store document in database (Database RAG)
curl -X POST http://localhost:57352/v1/store \
  -H "Content-Type: application/json" \
  -d '{
    "document": "full document text",
    "folder_id": "optional-folder-id",
    "chunkSize": 500,
    "overlap": 50,
    "generateContexts": true,
    "useOpenAI": false
  }'

# 2. Search stored chunks
curl -X POST http://localhost:57352/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "query": "search query",
    "folder_id": "optional-folder-id",
    "top_k": 3,
    "threshold": 0.7
  }'

# 3. Delete chunks using file_id
curl -X POST http://localhost:57352/v1/delete \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_id_from_store_response"
  }'
```