Poc new release (#1)
* deleted requirements file

Signed-off-by: Gianluca Capuzzi <[email protected]>

* deleted main file

Signed-off-by: Gianluca Capuzzi <[email protected]>

* deleted env.example file

Signed-off-by: Gianluca Capuzzi <[email protected]>

* modified .gitignore file

Signed-off-by: Gianluca Capuzzi <[email protected]>

* deleted image folder files

Signed-off-by: Gianluca Capuzzi <[email protected]>

* deleted ingest folder files

Signed-off-by: Gianluca Capuzzi <[email protected]>

* modified LICENSE and README file

Signed-off-by: Gianluca Capuzzi <[email protected]>

* added NOTICE file

Signed-off-by: Gianluca Capuzzi <[email protected]>

* added deps and src folders

Signed-off-by: Gianluca Capuzzi <[email protected]>

* added images files

Signed-off-by: Gianluca Capuzzi <[email protected]>

---------

Signed-off-by: Gianluca Capuzzi <[email protected]>
gcapuzzi authored Mar 3, 2024
1 parent 69baa50 commit 1dd1854
Showing 30 changed files with 11,202 additions and 1,005 deletions.
2 changes: 0 additions & 2 deletions .gitignore
@@ -1,2 +0,0 @@
.env
saved_models
26 changes: 23 additions & 3 deletions LICENSE.md
@@ -1,6 +1,6 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

@@ -127,6 +127,26 @@ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

This product bundles torch, which is available under a
"3-clause BSD" license. For details, see deps/torch/

This product bundles bitsandbytes, which is available under a
"MIT" license. For details, see deps/bitsandbytes/

This product bundles langchain, which is available under a
"MIT" license. For details, see deps/langchain/

This product bundles bs4, which is available under a
"MIT" license. For details, see deps/bs4/

This product bundles transformers, which is available under a
"Apache 2.0" license. For details, see deps/transformers/

This product bundles sentence-transformers, which is available under a
"Apache 2.0" license. For details, see deps/sentence-transformers/

This product bundles transformers, peft, accelerate, safetensors, sentencepiece, chromadb, sentence-transformers, gradio, sentence-transformers/all-mpnet-base-v2 model, filipealmeida/Mistral-7B-Instruct-v0.1-sharded model, which are available under an "Apache 2.0" license.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
40 changes: 40 additions & 0 deletions NOTICE.md
@@ -0,0 +1,40 @@
The bitsandbytes dependency has a NOTICE file containing the text below:
"The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.

We thank Fabio Cannizzo for this work on FastBinarySearch which is included in this project."

The sentence-transformers dependency has a NOTICE file containing the text below:

"Copyright 2019
Ubiquitous Knowledge Processing (UKP) Lab
Technische Universität Darmstadt"

The bs4 dependency has a NOTICE file containing the text below:

"Beautiful Soup is made available under the MIT license:

Copyright (c) 2004-2015 Leonard Richardson

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Beautiful Soup incorporates code from the html5lib library, which is
also made available under the MIT license. Copyright (c) 2006-2013
James Graham and other contributors"
122 changes: 95 additions & 27 deletions README.md
@@ -1,46 +1,114 @@
# Hyperledger QA PoC
# Hyperledger QA PoC version 2

This is a Proof-of-Concept application that allows you to ask questions to a python script chatbot, fine-tuned with Hyperledger Standard Documents.
I implemented this first version, as mentee, during the Hyperledger Mentorship Program 2023.
This Hyperledger Labs project aims to support users (end users, developers, etc.) in their work, so they do not have to wade through oceans of documents to find the information they are looking for. We are implementing an open source conversational AI tool that answers questions related to a specific context. This is proof-of-concept software that lets you create a chatbot using Google Colab (or a local notebook, which requires a GPU). Here is the official Wiki page: [Hyperledger Labs aifaq](https://labs.hyperledger.org/labs/aifaq.html). Please also read the [Antitrust Policy and the Code of Conduct](https://wiki.hyperledger.org/pages/viewpage.action?pageId=41587043).

## Use case
## Background

This NLP application allows people to access to the Hyperledger Standard Documentation.
The scope of the lab is to support Hyperledger users (end users, developers, etc.) in their work, so they do not have to wade through oceans of documents to find the information they are looking for. Large Language Models, both paid and open source, have yielded remarkable results. Today we can implement a conversational AI tool that answers questions related to a specific context.
The system is an open source Jupyter Notebook (derived from this article: [medium.com](https://levelup.gitconnected.com/building-a-private-ai-chatbot-2c071f6715ad)) that implements an AI chatbot. The idea is to provide an open source framework/template that other communities can use as an example. Recent results with open LLMs make good performance achievable on common hardware resources.\
The application architecture is shown below:

## Architecture
![LLM chatbot schema](/images/poc_schema_v2.png)

The model is XLM-R, pre-trained ([HuggingFace deepset/xlm-roberta-large-squad2](https://huggingface.co/deepset/xlm-roberta-large-squad2)) on the SQuAD dataset. The architecture of the model is shown below:\
![alt text](./images/xlm_r_architecture.drawio.png)
We use RAG (Retrieval Augmented Generation, [arxiv.org](https://arxiv.org/abs/2312.10997)) for the question answering use case. This technique aims to improve LLM answers by incorporating knowledge from an external database (e.g. a vector database).

## Pipeline
The image depicts two workflows:

In this PoC I use Haystack ([Haystack by Deepset](https://haystack.deepset.ai/)) to Build the QA pipeline.
Below an image of the architecture:\
![alt text](./images/architecture_modern_qa.drawio.png)
1. The data ingestion workflow
2. The chat workflow

I use Elastic Search ([Elastic Search website](https://www.elastic.co/)) as Retriever component.
During the ingestion phase, the system loads context documents and creates a vector database. In our case, the document sources are:

- An online software guide (readthedocs template)
- The GitHub issues and pull requests

After the first phase, the system is ready to reply to user questions.
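
The two workflows above can be illustrated with a minimal, dependency-free sketch. It is not the notebook's implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and all function names and the `chunk_size` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': term counts (assumption: stands in for a real embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents, chunk_size=40):
    """Ingestion workflow: split documents into word chunks and index each chunk's vector."""
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append((chunk, embed(chunk)))
    return index


def retrieve(index, question, top_k=2):
    """Chat workflow, step 1: return the chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(index, question, top_k=2):
    """Chat workflow, step 2: put the retrieved context in front of the question for the LLM."""
    context = "\n".join(retrieve(index, question, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In a real deployment the prompt returned by `build_prompt` would be sent to the LLM; here the sketch stops at prompt construction so that it stays runnable without a GPU.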

Currently, we use the open source [HuggingFace Zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model, but in the future we want to investigate other open source models. The user interface is built with [Gradio](https://www.gradio.app/).

## Open Source Version

The software is released under the Apache 2.0 License (please check the included LICENSE and NOTICE files). With respect to its dependencies, it complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html): the **LICENSE** file contains "pointers" to the dependencies' licenses and a list of Apache 2.0-licensed dependencies ([Assembling LICENSE and NOTICE files](https://infra.apache.org/licensing-howto.html#mod-notice)).

## Installation

For the installation instructions, read the links below:\
[Haystack installation](https://haystack.deepset.ai/integrations/elasticsearch-document-store)
The main steps to set up the system are:

1. Download the **hyperledger_aifaq_poc_v3.ipynb** notebook file from the **src** folder
2. Create a new Google Colab notebook
3. Load the downloaded notebook file
4. Set up the runtime GPU
5. Set the URL and GitHub repo document sources
6. Create a new GitHub personal token
7. Add the token, as a new secret, to the Google Colab notebook

The first step is straightforward: just click the **src** folder to open it, then click the **hyperledger_aifaq_poc_v3.ipynb** file and then click the button below:

![download button](/images/download_notebook_file.png)

Now, in Google Drive, click the **New** button -> **Other** -> **Google Colaboratory**:

![new Google Colab notebook](/images/new_colab_notebook.png)

Inside the new notebook, open the **File** menu, select **Load notebook**, click the **Browse** button, and select the downloaded file (hyperledger_aifaq_poc_v3.ipynb).

The notebook needs a GPU to run. Set it up from the **Runtime** menu by changing the runtime type:

![set up the runtime](/images/runtime_type.png)

With a free account, only the T4 GPU is available.

The notebook takes the documents for RAG from two sources:

1. An online website
2. A GitHub repository

The image below shows how to set them up:

![document sources](/images/document_sources.png)

In our case, we use the **Hyperledger Iroha** readthedocs guide and its GitHub repository (from which we fetch issues and pull requests).
The **url** string specifies the website, while the **repo** string specifies the GitHub repository.
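
As a rough illustration of the two settings, the sketch below defines `url` and `repo` and checks their shape before the notebook is run. The example values and the `owner/name` format for `repo` are assumptions made for illustration, and `validate_sources` is a hypothetical helper that is not part of the notebook.

```python
from urllib.parse import urlparse

# Example values (assumptions for illustration; replace them with your own sources).
url = "https://iroha.readthedocs.io/"
repo = "hyperledger/iroha"


def validate_sources(url, repo):
    """Check that url is an absolute http(s) URL and repo looks like 'owner/name'."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")
    owner, sep, name = repo.partition("/")
    if not (owner and sep and name) or "/" in name:
        raise ValueError(f"repo must look like 'owner/name', got {repo!r}")
    return url, repo
```

Calling `validate_sources(url, repo)` before ingestion surfaces a mistyped source early, rather than several minutes into the run.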

In your personal GitHub account, open the profile settings and select the developer settings:

![developer settings](/images/developer_settings.png)

Then select the **fine-grained token**

![fine-grained token](/images/fine_grained_token.png)

and click the generate button, then copy the token.
In the Google Colab notebook, select the **secret key** and add a new secret, as in the image below:

![github personal token](/images/github_personal_token.png)

- The token must have access to the notebook
- The name should be **GITHUB_PERSONAL_ACCESS_TOKEN**
- Paste the token into the **Value** field
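
For reference, a secret stored this way can be read from notebook code roughly as sketched below. The sketch uses Colab's `google.colab.userdata.get` when it is available and falls back to an environment variable of the same name elsewhere. `load_github_token` is a hypothetical helper, and the environment-variable fallback is an assumption for local runs, not something the notebook is known to do.

```python
import os


def load_github_token(name="GITHUB_PERSONAL_ACCESS_TOKEN"):
    """Return the GitHub token from Colab secrets, or from the environment outside Colab."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        # Fails if the secret is missing or notebook access is not enabled for it.
        return userdata.get(name)

    # Assumption: local runs export the token as an environment variable.
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"Secret {name} is not set")
    return token
```

Keeping the token in a secret or environment variable, rather than pasting it into a cell, avoids committing it by accident when the notebook is shared.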

## Usage

Now we can test the PoC by executing the notebook: in the Google Colab notebook, open the **Runtime** menu and select **Execute all**:

- Execution takes 5-15 minutes, depending on the GPU and the documents
- When execution finishes, the notebook loads a UI where you can ask questions; it replies in around 30 seconds

[Elastic Search Windows installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/zip-windows.html)
An example is shown below:

## Ingestion files
![UI Gradio example](/images/ui_gradio_question.png)
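
A question-and-answer UI like the one pictured can be wired up with Gradio in a few lines. The sketch below is only an assumption about the general shape, not the notebook's code: `answer` is a placeholder where the real RAG chain would be called, and `build_ui` and the title string are hypothetical.

```python
def answer(question):
    """Placeholder for the chat workflow; a real notebook would query the RAG chain here."""
    question = question.strip()
    if not question:
        return "Please enter a question."
    return f"(placeholder answer for: {question})"


def build_ui():
    """Build a simple text-in, text-out Gradio interface around answer()."""
    # Imported lazily so answer() can be used and tested without Gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=answer,
        inputs="text",
        outputs="text",
        title="Hyperledger AIFAQ PoC",
    )
```

In a notebook cell, `build_ui().launch()` would start the interface.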

In ingest folder, you can find two kinds of files:
For any questions or issues, please contact us on Discord:

1. es format (Elastic Search) which contains data for the unstructured documents
2. one squad format file ([Stanford Question Answering Dataset](https://huggingface.co/datasets/squad_v2)) for the fine-tuning process
- Server: Hyperledger
- Channel: #aifaq

## Current version notes

That is the first version of a PoC. Below a list of improvements that will be applied soon:
This is a proof-of-concept. Planned future improvements are listed below:

1. Model: more sophisticated model (e.g. Zephyr 7B alpha)
2. Dataset: currently I implemented only 2 documents as example, but real systems work with hundreds of documents
3. Retriever: more sophisticated techniques use embeddings
4. QA type: I will use generative (RAG) instead of extractive QA
5. Hardware: now the system requires 10 minutes to ingest the files, GPU can help to save much time
1. We want to implement a prototype starting from this PoC: a container architecture installed on a GPU cloud server
2. At the same time, we'd like to move to the next step: the Hyperledger Incubation Stage
3. We will investigate other open source models
4. We will evaluate the system using standard metrics
5. We would like to improve the system; some ideas are fine-tuning, Corrective RAG, and Decomposed LoRA
6. Add "guardrails", which are specific ways of controlling the output of an LLM, such as avoiding specific topics, responding in a particular way to specific user requests, etc.
21 changes: 21 additions & 0 deletions deps/bitsandbytes/LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Facebook, Inc. and its affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
26 changes: 26 additions & 0 deletions deps/bs4/LICENSE.txt
@@ -0,0 +1,26 @@
Beautiful Soup is made available under the MIT license:

Copyright (c) 2004-2012 Leonard Richardson

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE, DAMMIT.

Beautiful Soup incorporates code from the html5lib library, which is
also made available under the MIT license.
21 changes: 21 additions & 0 deletions deps/langchain/LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.