Introduce MinIO Object Store event based dataprep and retriever using Milvus and LanceDB #846

Open: 34 commits to be merged into base: main

@dilverse dilverse commented Nov 3, 2024

Description

MinIO is a high-performance, fully S3-compatible object storage solution that makes it easier to build highly scalable applications. This PR adds support for the following:

  • Dataprep stores all uploaded documents directly in MinIO.
  • The dataprep process is modularized.
  • Once a document is uploaded to the MinIO bucket, an event notification is sent to the dataprep service, which chunks the document and stores the chunked metadata as msgpack in a MinIO bucket.
  • Once the msgpack chunked metadata is stored in the MinIO bucket, another notification is sent to run the embedding process and store the chunks, metadata, and embeddings in a vector database such as Milvus or LanceDB.
  • A MinIO LanceDB based retriever is also added.
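
The two-stage, event-driven flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the service receives S3-style bucket notification payloads (a `Records` list with `eventName`, `s3.bucket.name`, and a URL-encoded `s3.object.key`), and it assumes chunk metadata objects can be recognized by a `.msgpack` suffix. The `route_event` function and the `CHUNK_SUFFIX` constant are hypothetical names used only for this example.

```python
from urllib.parse import unquote_plus

# Assumption: chunk-metadata objects written by the chunking stage end in ".msgpack".
CHUNK_SUFFIX = ".msgpack"


def route_event(event: dict) -> list[tuple[str, str, str]]:
    """Map each object-created record to a (stage, bucket, key) tuple.

    stage is "chunk" for a newly uploaded document and "embed" for a
    chunk-metadata object produced by the chunking stage.
    """
    routed = []
    for record in event.get("Records", []):
        # Only react to object creation; skip deletes and other event types.
        if not record.get("eventName", "").startswith("s3:ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3-style notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        stage = "embed" if key.endswith(CHUNK_SUFFIX) else "chunk"
        routed.append((stage, bucket, key))
    return routed


if __name__ == "__main__":
    sample = {
        "Records": [
            {
                "eventName": "s3:ObjectCreated:Put",
                "s3": {"bucket": {"name": "docs"}, "object": {"key": "my+report.pdf"}},
            }
        ]
    }
    for stage, bucket, key in route_event(sample):
        print(f"{stage}: {bucket}/{key}")
```

A real handler would call the chunking or embedding code for each routed record; routing by suffix is only one option, and using separate buckets or prefixes per stage would work equally well.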

Issues

n/a

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

List the newly introduced 3rd party dependency if exists.

@mkbhanda
Collaborator

mkbhanda commented Nov 5, 2024

@lvliang-intel would you kindly review?

Collaborator

@mkbhanda mkbhanda left a comment

Looks good to me. @chensuyue do we need any tests?

comps/dataprep/minio/milvus/langchain/README.md (outdated, resolved)
"parent": ""
},
{
"name": "uploaded_file_2.txt",
Collaborator

It would be good to change one of these to a file path or URL. What is the name of a file when it is a path? Is id == name?

@chensuyue
Collaborator

Yes, we need one test for each microservice:
test_retrievers_minio_lancedb_langchain.sh
test_dataprep_minio_lancedb_langchain.sh
test_dataprep_minio_milvus_langchain.sh

Contributor

@eero-t eero-t left a comment

Below are a few minor comments on the docs and Python code.

As to Dockerfiles, see: opea-project/GenAIExamples#225

comps/dataprep/minio/lancedb/langchain/README.md (4 comments; outdated, resolved)
comps/dataprep/minio/milvus/langchain/README.md (2 comments; outdated, resolved)
comps/dataprep/minio/minio_schema.py (outdated, resolved)

@lvliang-intel
Collaborator

@dilverse,
Please add comps/dataprep/minio/milvus/langchain/Dockerfile, comps/dataprep/minio/lancedb/langchain/Dockerfile and comps/retrievers/minio/lancedb/langchain/Dockerfile to .github/workflows/docker/compose/dataprep-compose.yaml. This YAML file is used to build the release images.

@dilverse
Author

@dilverse, Please add comps/dataprep/minio/milvus/langchain/Dockerfile, comps/dataprep/minio/lancedb/langchain/Dockerfile and comps/retrievers/minio/lancedb/langchain/Dockerfile to .github/workflows/docker/compose/dataprep-compose.yaml. This YAML file is used to build the release images.

@lvliang-intel Updated the github workflow

6 participants