Rag pdf 2 #955

Open · wants to merge 9 commits into base: dev
@@ -1,4 +1,4 @@
input*/
output*/
final_output/
storage/
@@ -8,29 +8,31 @@ RAG consists of two phases
![](media/rag-overview-2.png)


### Step 1 (Ingest): Extract text from PDFs

We will extract text in markdown format from PDFs.

### Step 2 (Ingest): Perform de-duplication

Eliminate any duplicate documents.
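Exact de-duplication can be done by hashing each document's text and keeping only the first copy of each hash. A minimal sketch (the function name is our own, and this catches only exact duplicates; near-duplicates need fuzzier techniques):

```python
import hashlib

def dedupe(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each exact-duplicate document."""
    seen = set()
    unique = []
    for doc in documents:
        # Identical (whitespace-trimmed) texts hash to the same digest
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedupe(["Report A", "Report B", "Report A"]))  # ['Report A', 'Report B']
```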

### Step 3 (Ingest): Split into chunks

Split the documents into manageable chunks or segments. There are various chunking strategies: documents can be split into pages, paragraphs, or sections. The right chunking strategy depends on the document types being processed.
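As an illustration, a simple paragraph-based strategy greedily packs paragraphs into size-bounded chunks. This is a minimal sketch of the idea, not the chunker used in the notebooks:

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_by_paragraph(doc, max_chars=40))
```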


### Step 4 (Ingest): Vectorize / Calculate Embeddings

In order to make text searchable, we need to 'vectorize' it. This is done using **embedding models**. We will feature a variety of embedding models, both open source and API based.
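To illustrate the idea only: an embedding maps text to a vector, and similar texts land close together under cosine similarity. The toy bag-of-words 'embedding' below is a stand-in for a real model, and its vocabulary is invented for the example:

```python
import math
from collections import Counter

VOCAB = ["cat", "dog", "pet", "car", "engine"]  # invented toy vocabulary

def embed(text: str) -> list[float]:
    """Toy 'embedding': one dimension (a word count) per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

v_cat = embed("my pet cat")
v_dog = embed("a pet dog")
v_car = embed("car engine repair")
# The two pet sentences score higher with each other than with the car sentence
print(cosine(v_cat, v_dog), cosine(v_cat, v_car))
```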



### Step 5 (Ingest): Saving Data into Vector Database

In order to effectively retrieve relevant documents, we use [Milvus](https://milvus.io/), a popular open source vector database.
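Conceptually, this step stores each chunk together with its embedding so it can be searched later. The stand-in 'store' below is plain Python purely for illustration; the real pymilvus client API looks different:

```python
# Stand-in vector store: a list of records pairing each chunk with its embedding.
# A real vector database (like Milvus) persists these records and builds an
# index so that searches do not have to scan every record.
store: list[dict] = []

def save_chunk(chunk_id: int, text: str, vector: list[float]) -> None:
    store.append({"id": chunk_id, "text": text, "vector": vector})

save_chunk(0, "Milvus is a vector database.", [0.1, 0.9])
save_chunk(1, "Llamas are domesticated animals.", [0.8, 0.2])
print(len(store))  # 2
```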


### Step 6 (Query): Vectorize Question

When the user asks a question, we vectorize the question so we can fetch documents that **may** contain the answer.

@@ -40,12 +42,12 @@ So we want to retrieve the relevant documents first.



### Step 7 (Query): Vector Search

We send the 'vectorized query' to the vector database to retrieve the relevant documents.
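Under the hood, a vector search scores every stored embedding against the query vector and returns the best matches (production databases use approximate indexes to avoid this full scan). A brute-force sketch, assuming each record is a dict with 'text' and 'vector' keys:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(query_vec: list[float], records: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k records most similar to the query vector."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]

records = [
    {"text": "Milvus is a vector database.", "vector": [0.1, 0.9]},
    {"text": "Llamas are domesticated animals.", "vector": [0.8, 0.2]},
    {"text": "Vector search finds similar items.", "vector": [0.2, 0.8]},
]
hits = vector_search([0.15, 0.85], records)
print([h["text"] for h in hits])
```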


### Step 8 (Query): Retrieve Relevant Documents

The vector database takes our query (in vectorized form), searches through the documents, and returns the documents matching the query.

@@ -54,14 +56,14 @@ This is an important step, because it **cuts down the 'search space'**. For exa
The search has to be accurate, as these are the documents sent to the LLM as **'context'**. The LLM will look through these documents searching for the answer to our question.


### Step 9 (Query): Send relevant documents and query LLM

We send the relevant documents (returned by the vector DB in the step above) and our query to the LLM.

LLMs can be accessed via an API, or we can run one locally.
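The retrieved documents become the LLM's 'context', typically by pasting them into the prompt. A minimal sketch of assembling such a prompt (the template wording is our own, and the actual LLM call, e.g. via an API client, is omitted):

```python
def build_rag_prompt(question: str, context_docs: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into one LLM prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context_docs))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt("What is Milvus?", ["Milvus is an open source vector database."]))
```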


### Step 10 (Query): Answer from LLM

Now we get to see the answer provided by the LLM 👏

@@ -24,40 +24,40 @@ Here is the overall workflow. For details see [RAG-explained](./RAG-explained.

![](media/rag-overview-2.png)

## Step-2: Process Input Documents (RAG stage 1, 2, 3 & 4)

This code uses DPK to:

- Extract text from PDFs (RAG stage-1)
- Perform de-dupes (RAG stage-2)
- Split the documents into chunks (RAG stage-3)
- Vectorize the chunks (RAG stage-4)

Here is the code:

- Python version: [rag_1_dpk_process_python.ipynb](rag_1_dpk_process_python.ipynb)
- Ray version: [rag_1A_dpk_process_ray.ipynb](rag_1A_dpk_process_ray.ipynb)


## Step-3: Load data into vector database (RAG stage 5)

Our vector database is [Milvus](https://milvus.io/)

Run the code: [rag_2_load_data_into_milvus.ipynb](rag_2_load_data_into_milvus.ipynb)

Be sure to [shut down the notebook](#tips-close-the-notebook-kernels-to-release-the-dblock) before proceeding to the next step.


## Step-4: Perform vector search (RAG stage 6, 7 & 8)

Let's do a few searches on our data.

Code: [rag_3_vector_search.ipynb](rag_3_vector_search.ipynb)

Be sure to [shut down the notebook](#tips-close-the-notebook-kernels-to-release-the-dblock) before proceeding to the next step.


## Step-5: Query the documents using LLM (RAG steps 9 & 10)

We will use **Llama** as our LLM running on [Replicate](https://replicate.com/) service.

@@ -76,24 +76,24 @@ REPLICATE_API_TOKEN=your REPLICATE token goes here

### 5.2 - Run the query code

Code: [rag_4_query_replicate.ipynb](rag_4_query_replicate.ipynb)



## Step 6 (Optional): Llama Index

For comparison, we can use the [Llama-index](https://docs.llamaindex.ai/) framework to process PDFs and run queries.

### Step 6.1 - Process documents and save the index into vector DB

Code: [rag_llamaindex_1_process.ipynb](rag_llamaindex_1_process.ipynb)

Be sure to [shut down the notebook](#tips-close-the-notebook-kernels-to-release-the-dblock) before proceeding to the next step.


### Step 6.2 - Query documents with LLM

Code: [rag_llamaindex_2_query.ipynb](rag_llamaindex_2_query.ipynb)


## Tips: Close the notebook kernels to release the db.lock