GraphMinds: Leveraging Large Language Models and Knowledge Graphs for Transparent and Efficient AI Systems
"I am convinced that the crux of the problem of learning is recognising relationships and being able to use them."
— Christopher Strachey in a letter to Alan Turing, 1954
GraphMinds addresses the challenges of processing unstructured data and leveraging indirect relationships within sensitive domains. Traditional AI systems, particularly those reliant on cloud infrastructures, present security risks, making local, secure analysis increasingly critical.
Initially developed as a Local Retrieval-Augmented Generation (RAG) system, GraphMinds evolved to tackle the limitations of handling unstructured data by utilising knowledge graphs (KGs) to structure information. This enables the system to infer and represent indirect relationships, enhancing the capabilities of Large Language Models (LLMs) in processing fragmented and complex datasets.
GraphMinds integrates advanced graph-based techniques with LLMs, facilitating the capture of indirect relationships within a knowledge graph. This unique approach improves the system's ability to analyse unstructured data, offering deeper insights and enhanced security for knowledge-intensive tasks.
Evaluations demonstrate GraphMinds' strengths in analysing large, unstructured datasets, particularly in fields requiring comprehensive analysis, such as criminal investigations. These results underscore its potential as a powerful tool for secure and transparent data analysis.
- Graph-Based Relationship Mapping: Extracts direct and indirect relationships between entities from unstructured data and represents them in a knowledge graph.
- Secure Local AI Processing: Designed to operate securely in local environments, ensuring data confidentiality without reliance on cloud infrastructure.
- Embeddings and Similarity Matching: Uses sentence embeddings to compute similarities between user queries and document relationships.
- LLM Integration for Comprehensive Analysis: Integrates with advanced LLMs to generate human-readable answers from relationships and contextual data.
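The "Graph-Based Relationship Mapping" feature above can be illustrated with a minimal sketch: once relationships are extracted as triples, indirect links between entities fall out of a simple path search over the graph. The triples, entity names, and helper functions below are hypothetical, not the project's actual implementation:

```python
from collections import deque

# Hypothetical extracted triples: (subject, relation, object)
triples = [
    ("Alice", "works_at", "Acme Corp"),
    ("Acme Corp", "located_in", "Berlin"),
    ("Bob", "knows", "Alice"),
]

def build_adjacency(triples):
    """Index triples as an undirected adjacency map for path search."""
    adj = {}
    for subj, rel, obj in triples:
        adj.setdefault(subj, []).append((obj, rel))
        adj.setdefault(obj, []).append((subj, rel))
    return adj

def indirect_path(adj, start, goal):
    """Breadth-first search for a chain of entities linking start to goal."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbour, _rel in adj.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [neighbour]))
    return None

adj = build_adjacency(triples)
print(indirect_path(adj, "Bob", "Berlin"))  # ['Bob', 'Alice', 'Acme Corp', 'Berlin']
```

Here the link between Bob and Berlin is never stated directly; it emerges only by chaining three separate relationships, which is the kind of inference the knowledge graph enables.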
- Sentence Transformers: For generating sentence embeddings.
- NetworkX and PyVis: For graph representation and visualization.
- SciPy: For calculating cosine similarity between embeddings.
- Ollama Client: For interacting with the LLM.
- Pandas: For data manipulation and handling relationships.
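SciPy's `scipy.spatial.distance.cosine` returns a cosine *distance*, so similarity is `1 - distance`. The dependency-free sketch below computes the same similarity directly in plain Python (standing in for the SciPy call) to show what the matching step measures:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

In the project this comparison runs over sentence-embedding vectors produced by Sentence Transformers, not the toy vectors above.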
```shell
git clone https://github.com/tirth8205/GraphMinds.git
cd GraphMinds
conda env create -f environment.yml
conda activate graphminds
conda list
```
The Ollama API provides the necessary interface for interacting with the LLMs (e.g., `mistral-openorca` and `zephyr:7b`). Zephyr is a series of fine-tuned versions of the Mistral and Mixtral models that are trained to act as helpful assistants. Zephyr is a 7B-parameter model, distributed under the Apache license, available in both instruction-following and text-completion variants.
To use the Ollama API:

- Download and install the Ollama 5.1.4 API from the official Ollama website.
- After installing the API, verify the installation by checking the version:

  ```shell
  ollama --version
  ```

- Once installed, download the required models (`mistral-openorca` and `zephyr:7b`) using Ollama's built-in commands:

  ```shell
  ollama pull mistral-openorca
  ollama pull zephyr:7b
  ```

- Ensure that the models are correctly installed and ready for interaction:

  ```shell
  ollama list
  ```

  This will list all the available models that are ready to use with the project.
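Once the models are pulled, they can be called from Python. The snippet below sketches how a prompt might be assembled from retrieved relationship triples and sent via the `ollama` Python client; the `build_prompt` helper and the relationship data are hypothetical, and the network call is commented out so the example stands alone without a running server:

```python
def build_prompt(relationships, question):
    """Assemble an LLM prompt from retrieved relationship triples (hypothetical format)."""
    context = "\n".join(f"- {s} {r} {o}" for s, r, o in relationships)
    return (
        "Answer the question using only the relationships below.\n"
        f"Relationships:\n{context}\n"
        f"Question: {question}"
    )

relationships = [("Alice", "works_at", "Acme Corp")]  # hypothetical retrieved triples
prompt = build_prompt(relationships, "Where does Alice work?")

# With a local Ollama server running, the prompt could be sent like this:
# import ollama
# reply = ollama.chat(model="zephyr:7b",
#                     messages=[{"role": "user", "content": prompt}])
# print(reply["message"]["content"])
print(prompt)
```

Keeping the model call local is what preserves the confidentiality guarantees described earlier: the prompt, including any sensitive relationship data, never leaves the machine.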
If you're planning to work in JupyterLab, you can start it with:

```shell
jupyter lab
```

Once you're done, you can deactivate the Conda environment by running:

```shell
conda deactivate
```
- If you need to install additional packages, you can do so within the activated environment using `conda install` or `pip install`.
- The Python version is not pinned in the environment file, so the latest compatible version of Python will be installed when the environment is created.
- Set Up the Environment: After setting up the environment, open the `extract_graph.ipynb` Jupyter notebook and ensure the kernel is set to Knowledge Graph (the environment you just created).
- Prepare Data:
  - Place your PDF file in the `input/` folder (PDFs only at the moment).
  - Open the notebook and update the file name in the script:

    ```python
    # Load the PDF document
    loader = PyPDFLoader("input/#FileName")  # Replace #FileName with the name of your PDF file
    documents = loader.load()  # Load the content of the PDF
    ```

- Run the Notebook: Execute the cells to process the PDF and generate the knowledge graph. Once the processing is done, the system will generate an interactive HTML file for querying the relationships within the document.
- Example Query Script: You can also run a predefined script to query the document:

  ```python
  # Example query
  query = "#ENTER YOUR QUERY HERE"  # Replace with your query
  response = answer_query_with_all_relationships(query, contentreplacedforchunk_dfg, df)
  # Print the response
  print(response)
  ```

- Alternatively, Start a Chat: You can initiate an interactive chat as defined in the last cell of the notebook:

  ```python
  while True:
      # Get the user's query
      query = input("Ask your question: ").strip()
      # Check if the user wants to exit
      if query.lower() in ['exit', 'quit']:
          print("Ending the interaction. Goodbye!")
          break
      # Answer the query using the extracted relationships
      response = answer_query_with_all_relationships(query, contentreplacedforchunk_dfg, df)
      print(response)
  ```

  The script will continuously prompt for questions until you type `exit` or `quit`.
- Embedding Generation: Generates sentence embeddings for each relationship in the dataset by combining node, edge, and context data.
- Cosine Similarity Matching: Matches user queries with the relationships in the dataset using cosine similarity.
- Graph-Based Inference: Extracts direct and indirect relationships from the knowledge graph to provide context-rich answers.
- LLM-Powered Answer Generation: Uses an LLM to create natural language answers based on the relationships and context.
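The retrieval side of the pipeline above (embed, then rank by cosine similarity) can be sketched end to end. A trivial bag-of-words vector stands in for a sentence embedding here, and the relationship sentences are hypothetical; the real system uses Sentence Transformers for embeddings and an LLM for the final answer:

```python
import math

def embed(text, vocab):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer vector."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    """Cosine similarity, guarding against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

relationships = [  # hypothetical node-edge-node sentences from the knowledge graph
    "Alice works at Acme Corp",
    "Acme Corp is located in Berlin",
]
vocab = sorted({w for r in relationships for w in r.lower().split()})

def best_match(query):
    """Embed the query and every relationship, then return the closest relationship."""
    q = embed(query, vocab)
    return max(relationships, key=lambda r: cosine(q, embed(r, vocab)))

print(best_match("Where does Alice work?"))  # Alice works at Acme Corp
```

The top-ranked relationships (plus their graph neighbours, for indirect context) would then be passed to the LLM to generate the natural-language answer.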
The core of GraphMinds lies in its ability to leverage knowledge graphs and integrate indirect relationships into its analysis. By combining LLMs with structured data from knowledge graphs, GraphMinds enhances the accuracy of insights generated from unstructured and fragmented information.
This project is licensed under the MIT License. See the LICENSE file for details.
This project is developed by Tirth Kanani under the supervision of Prof. Christopher Baber as part of the MSc program in Human-Computer Interaction at the University of Birmingham. Special thanks to the developers of tools such as Sentence Transformers, NetworkX, and PyVis for their invaluable contributions.
For more detailed insights and background on this project, you can access the full project report here.