How to create a powerful knowledgebase chatbot with unstructured data

Chatbots aren’t new, but better chatbots are finally here thanks to AI services. Vector search makes searching documentation far more effective than traditional keyword search. And while training models on unstructured data may seem daunting, with RAG (Retrieval Augmented Generation) we can create a savvier chatbot that picks up new information by ingesting documents instead of retraining a model.

The purpose of this solution is to show how to train a chatbot on a rich knowledge base built from several types of documents, such as PDF files and data stored in database tables. This tutorial shows how to transform raw documents containing unstructured data into structured data, store them in an OCI Object Storage bucket, and use advanced AI models to generate contextual responses to natural language queries. Best of all, it’s highly modular, so you can use a variety of models.

Table of contents

  1. Getting Access to Generative AI Agents
  2. Create an OCI Object Storage Bucket and Knowledge Repository.
  3. Create Knowledge Base
  4. Create Data Source
  5. Start Ingestion Job
  6. Create Generative AI Agent
  7. Training PDFs and Chat Interface
  8. How do we get data from the Oracle database, generate a PDF and train it?
  9. Uploading the PDF into OCI Object storage
  10. Training new documents in the knowledge repository
  11. How to upload .CSV file to object storage using dbms_cloud.export_data
  12. Creating Web User interface with Streamlit & Python
  13. Combining Internal Knowledge Base search with Oracle Generative AI LLM and Oracle Database
  14. Conclusion

01. Getting Access to Generative AI Agents

You can get access to Generative AI Agents resources with OCI Identity and Access Management (IAM) policies.

By default, only users in the Administrators group have access to all OCI resources, including Generative AI Agents resources. If you're a member of another group, ask your administrator to assign you the least privileges that are required to perform your responsibilities by reviewing the following sections.

For this, the tenancy needs to be subscribed to Oracle Generative AI Agents (Beta). If you do not find the service in your other cloud regions, please check the Chicago region.

Architecture



Demo Video for this Article

Demo Video


02. Create an OCI Object Storage Bucket and Knowledge Repository.

Log in to cloud.oracle.com and, from the top-left navigation menu, select Storage and then Buckets.

Click on Create Bucket

Provide a bucket name, keep the Standard storage tier, leave the other options at their defaults, and click on Create.

Upload PDF files to this object storage bucket.
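The console upload can also be scripted. The following is a minimal sketch, assuming the OCI Python SDK (`oci`) is installed and a working `~/.oci/config` file exists. The folder `./pdfs` and the bucket name `knowledge-base-docs` are placeholders for illustration, not values used in this article.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files in a local folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def upload_pdfs(folder, bucket_name):
    """Upload every PDF found in `folder` to the given bucket."""
    import oci  # imported lazily so find_pdfs() works without the SDK

    config = oci.config.from_file()  # reads ~/.oci/config, DEFAULT profile
    client = oci.object_storage.ObjectStorageClient(config)
    namespace = client.get_namespace().data
    for pdf in find_pdfs(folder):
        with pdf.open("rb") as body:
            # The object name is the file name; adjust if you use prefixes.
            client.put_object(namespace, bucket_name, pdf.name, body)
        print(f"uploaded {pdf.name}")


# Example call (placeholder names, requires OCI credentials):
# upload_pdfs("./pdfs", "knowledge-base-docs")
```

Keeping the file discovery separate from the upload makes it easy to check which files would be sent before touching the bucket.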


Under the Top Navigation Menu, select Analytics and AI > AI Services > Generative AI Agents (new Beta)


Click on Knowledge Bases and Create a Knowledge Base


03. Create Knowledge Base

Under Create knowledge base, provide a name, select the compartment, and select Object storage. You can also choose OCI OpenSearch; however, this article covers only Object storage.


04. Create Data Source

Click on the Create data source button.

Provide a data source name; the type will be Object storage. Select the bucket in your compartment, then either select everything in the bucket or select only the required PDFs.

05. Start Ingestion Job

Select the option to automatically start the ingestion job for the above data sources. Please note that you can add multiple data sources here. Click on the Create button.


06. Create Generative AI Agent

From the left navigation, click on Create agent

Provide the agent name, select the knowledge base, and provide a chat welcome message.

Click on the Create button

Now, our chat agent is ready and trained on internal PDF data from OCI object storage.


07. Training PDFs and Chat Interface

From the left navigation, click on the chat button, and now you are all set to ask questions or have a conversation with the chatbot.

I have trained my chatbot on the following PDF: Breast cancer facts & figures.

Let's ask some questions. We are first greeted by the welcome message.

Chatbot: Hi user, I am your friend Ask AI how can i help you today?

User: what is Breast cancer?


User: How is breast cancer diagnosed?

Let us now check another PDF on COVID-19 Corona Virus FAQs

Disclaimer: I have provided the source to download these PDFs, and I don't own the content of any of these PDF files. They are used only for demonstration and training purposes in this article.

User: What is coronavirus and COVID-19?

Click on View citation if you would like to know which PDF was used to answer the question.

User: How to prevent covid-19 infection?


08. How do we get data from the Oracle database, generate a PDF and train it?

We can quickly spin up an Oracle APEX instance, or use an existing one.

Create an Oracle APEX page and create a report using the example SQL query shown below.

select ID, CATEGORY, STATES_NAME_EN,
REGION_EN,   NAME_EN, SHORT_DESCRIPTION_EN 
from UNESCO_SITES where rownum < 100

You can download the UNESCO CSV file used to create the table from my GitHub repo. (PDF file)

Run the Oracle APEX page with Interactive report as shown below.

Download the PDF report as shown. Alternatively, you can write a PL/SQL procedure that does the same thing, that is, creates a PDF and uploads it to Object storage.

The PDF will be as shown below


09. Uploading the PDF into OCI Object storage

You can use Oracle APEX to create a connection to OCI and upload the PDF directly to OCI Object storage; this is not covered in this article.

Please refer to my LiveLabs on AI for Healthcare on how to upload PDF files into Object storage using PL/SQL procedure.


10. Training new documents in the knowledge repository

Upload our new PDFs into the Object storage bucket (of the Data source that has been selected)

Click on our data source and create a new ingestion job.

Provide the job name and click on the Create button.

The chatbot can now draw on the new PDF, so let's search again.

User: describe Kakadu National Park

User: please tell me about Monasteries of Haghpat and Sanahin

User: where is Quebrada de Humachuaca

This looks great with all the search results scanning our repositories, which consist of scanned PDF files and data coming from Oracle Database tables exported using Oracle APEX.


Important Note (reference)

  • PDF and txt files are the only supported bucket objects in Generative AI Agents.
  • If your data is not ready, you can point the data source to empty folders in a bucket and later, populate the folders with data. After you populate the folders with data, you can ingest the data into the data source.
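Since only PDF and txt objects are supported, it can help to check a list of object names before starting an ingestion job. The sketch below is illustrative only: the helper name `split_by_support` and the sample file names are made up for this example.

```python
from pathlib import PurePosixPath

# File types listed as supported in the note above.
SUPPORTED_SUFFIXES = {".pdf", ".txt"}


def split_by_support(object_names):
    """Split bucket object names into (supported, unsupported) lists."""
    supported, unsupported = [], []
    for name in object_names:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in SUPPORTED_SUFFIXES:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


ok, skipped = split_by_support(
    ["unesco_sites.pdf", "notes.TXT", "unesco_sites.csv"]
)
print("supported:", ok)
print("unsupported:", skipped)
```

The comparison is case-insensitive, so `notes.TXT` counts as a text file, while the `.csv` object ends up in the unsupported list.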

11. (Optional) How to upload .CSV file to object storage using dbms_cloud.export_data

This is purely optional; you can skip this section as well.

What if you want to create a .CSV file from a SQL query and directly upload the file into OCI Object storage using PL/SQL?

For example, suppose we want to create a file from the following SQL:

select NAME_EN, SHORT_DESCRIPTION_EN, CATEGORY from DEMOUSER.UNESCO_SITES where rownum < 100

DBMS_CLOUD.EXPORT_DATA is a handy package to know and can be very helpful.


Log in to SQL Developer Web as the ADMIN user and grant the following privileges, assuming DEMOUSER owns the table from which you want to create .csv files.

-- Login as ADMIN User 

grant execute on DBMS_CLOUD to DEMOUSER;
grant execute on DBMS_CLOUD_AI to DEMOUSER;

Create Credential with OCI API key using DBMS_CLOUD.CREATE_CREDENTIAL

-- replace the values based on your OCI cloud tenancy and User settings

BEGIN                                                                         
  DBMS_CLOUD.CREATE_CREDENTIAL(                                               
    credential_name => '<credential-name>',                                          
    user_ocid       => '<replace with your OCI user OCID>',
    tenancy_ocid    => '<replace with your OCI tenancy OCID>',
    private_key     => '<replace with your OCI private key>',
    fingerprint     => '<replace with your fingerprint>'
  );                                                                          
END;                                                                         
/

From PL/SQL, create a .csv and directly upload the file to OCI Object storage using DBMS_CLOUD.EXPORT_DATA

-- Replace tenancy namespace, bucket name and file name as per your requirements

BEGIN
  DBMS_CLOUD.EXPORT_DATA (
    credential_name => '<credential-name>',
    file_uri_list => 'https://objectstorage.<region-identifier>.oraclecloud.com/n/<tenancy-namespace>/b/<bucket-name>/o/<file-name>.csv',
    format => '{"type":"CSV","delimiter":",","maxfilesize":536870912,"header":true,"compression":null,"escape":"true","quote":"\""}',
    query => 'select NAME_EN, SHORT_DESCRIPTION_EN,CATEGORY from DEMOUSER.UNESCO_SITES where rownum < 100');
END;
/

This not only creates the .csv file but also uploads it to the OCI Object storage bucket.
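The `file_uri_list` and `format` arguments are easy to mistype by hand. As a rough sketch, they could be assembled in Python and then pasted into the PL/SQL block. The helper names and the sample region, namespace, and bucket values are assumptions for illustration, and the format options mirror only a subset of those used above.

```python
import json


def object_uri(region, namespace, bucket, file_name):
    """Build the Object Storage URI in the layout used for file_uri_list."""
    return (
        f"https://objectstorage.{region}.oraclecloud.com"
        f"/n/{namespace}/b/{bucket}/o/{file_name}"
    )


def csv_format(max_file_size=536870912):
    """Serialize CSV export options as the JSON string passed to format."""
    options = {
        "type": "CSV",
        "delimiter": ",",
        "maxfilesize": max_file_size,
        "header": True,
        "quote": '"',
    }
    return json.dumps(options)


# Placeholder values; replace them with your own tenancy details.
uri = object_uri("us-chicago-1", "mynamespace", "mybucket", "unesco_sites.csv")
print(uri)
print(csv_format())
```

Letting `json.dumps` produce the string takes care of escaping the double quote in the `quote` option.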


12. Creating Web User interface with Streamlit & Python

Download the source code zip from Oracle Generative AI Playground and extract the genai_playground-main.zip file.

Edit secrets.toml so that it looks like the example below, changing the values to match your tenancy.

# secrets.toml file

endpoint = "https://agent-runtime.generativeai.us-chicago-1.oci.oraclecloud.com"  
agent_endpoint_id = "ocid1.genaiagentendpoint.oc1.us-chicago-1.<your-agent-ocid>"  
llm_endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"  
compartment_id = "<your-compartment-ocid>"  
logo = "Oracle.png"  
user_avatar =  ":material/record_voice_over:"  
assisstant_avatar =  "o.png" 
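Before starting the app, it may be worth checking that no placeholder values are left in the settings. The following is a small sketch that works on any dict-like object holding the values. The key names come from the sample above, but treating these four keys as required is an assumption, not something taken from the playground code.

```python
# Keys from the sample secrets.toml that are assumed to be mandatory.
REQUIRED_KEYS = ("endpoint", "agent_endpoint_id", "llm_endpoint", "compartment_id")


def check_secrets(secrets):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    for key in REQUIRED_KEYS:
        value = str(secrets.get(key, ""))
        if not value:
            problems.append(f"{key} is missing")
        elif "<" in value or ">" in value:
            problems.append(f"{key} still contains a <placeholder>")
    for key in ("endpoint", "llm_endpoint"):
        url = str(secrets.get(key, ""))
        if url and not url.startswith("https://"):
            problems.append(f"{key} should be an https:// URL")
    return problems


sample = {
    "endpoint": "https://agent-runtime.generativeai.us-chicago-1.oci.oraclecloud.com",
    "agent_endpoint_id": "ocid1.genaiagentendpoint.oc1.us-chicago-1.<your-agent-ocid>",
    "llm_endpoint": "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    "compartment_id": "<your-compartment-ocid>",
}
for problem in check_secrets(sample):
    print(problem)
```

Run against the unedited sample, this reports the two values that still need to be replaced with real OCIDs.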

Upgrade pip if required (I am running macOS with Python 3.10).

[notice] A new release of pip is available: 24.1.1 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
madhusudhanrao@MadhuMac genai_playground-main % pip install --upgrade pip

Install the dependencies listed in requirements.txt

pip install -r requirements.txt

Install the OCI Command Line Interface - OCI CLI (Official Guide)

You will also need to complete the OCI CLI configuration task; please refer to this article if required.

Run the application

madhusudhanrao@MadhuMac genai_playground-main % streamlit run Home.py

Open network port 8501 if you are running this on an external cloud server

sudo iptables -I INPUT 6 -m state --state NEW -p tcp --dport 8501 -j ACCEPT
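To confirm that the port is reachable once the rule is in place, a quick TCP check can be run from another machine. This is a minimal sketch using only the standard library; the function name `port_is_open` is made up for this example, and `127.0.0.1` should be replaced with the server's public IP address when testing from outside.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Checks the local machine here; use the server's address for a remote test.
print(port_is_open("127.0.0.1", 8501))
```

A `False` result only shows that the connection failed. On a cloud server, the cause may also be a network rule outside the host firewall.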

Another example

To create a dark theme UI, update config.toml (Streamlit reads it from the .streamlit folder)

[theme]
primaryColor="white" #ocean #2C5967
backgroundColor="black" #neutral 1 #F5F4F2
secondaryBackgroundColor="#DFDCD8" #neutral 2
textColor="white" #obark #312D2A
font="sans serif"

and run the server again

streamlit run Home.py