diff --git a/examples/multimodal/multimodal_contextual_retrieval_rag.ipynb b/examples/multimodal/multimodal_contextual_retrieval_rag.ipynb new file mode 100644 index 0000000..34e7281 --- /dev/null +++ b/examples/multimodal/multimodal_contextual_retrieval_rag.ipynb @@ -0,0 +1,1009 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "93ae9bad-b8cc-43de-ba7d-387e0155674c", + "metadata": {}, + "source": [ + "# Contextual Retrieval for Multimodal RAG\n", + "\n", + "\"Open\n", + "\n", + "In this cookbook, we show you how to build a multimodal RAG pipeline with **contextual retrieval**.\n", + "\n", + "Contextual retrieval was initially introduced in this Anthropic [blog post](https://www.anthropic.com/news/contextual-retrieval). The high-level intuition is that every chunk is given a concise summary of how it fits into the overall document. This inserts high-level concepts/keywords into the chunk, enabling it to be retrieved by a wider range of queries.\n", + "\n", + "These per-chunk LLM calls are expensive, so contextual retrieval depends on **prompt caching** to be cost-efficient.\n", + "\n", + "In this notebook, we use Claude 3.5 Sonnet to generate contextual summaries. We cache the full document as prefix text tokens and generate each contextual summary by feeding in only the parsed text chunk.\n", + "\n", + "We feed both the text and image chunks into the final multimodal RAG pipeline to generate the response.\n", + "\n", + "![mm_rag_diagram](./multimodal_contextual_retrieval_rag_img.png)" + ] + }, + { + "cell_type": "markdown", + "id": "54e8d9a7-5036-4d32-818f-00b2e888521f", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70ccdd53-e68a-4199-aacb-cfe71ad1ff0b", + "metadata": {}, + "outputs": [], + "source": [ + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply()" + ] + }, + { + "cell_type": "markdown", + "id": "225c5556-a789-4386-a1ee-cce01dbeb6cf", + "metadata": {}, + "source": [ + "### Setup Observability\n", + "\n", + "We set up an integration with LlamaTrace (hosted Arize Phoenix).\n", + "\n", + "If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the `PHOENIX_API_KEY` variable below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0eabee1f-290a-4c85-b362-54f45c8559ae", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -U llama-index-callbacks-arize-phoenix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aaeb245c-730b-4c34-ad68-708fdde0e6cb", + "metadata": {}, + "outputs": [], + "source": [ + "# setup Arize Phoenix for logging/observability\n", + "import llama_index.core\n", + "import os\n", + "\n", + "PHOENIX_API_KEY = \"\"\n", + "os.environ[\"OTEL_EXPORTER_OTLP_HEADERS\"] = f\"api_key={PHOENIX_API_KEY}\"\n", + "llama_index.core.set_global_handler(\n", + " \"arize_phoenix\", endpoint=\"https://llamatrace.com/v1/traces\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fbb362db-b1b1-4eea-be1a-b1f78b0779d7", + "metadata": {}, + "source": [ + "### Load Data\n", + "\n", + "Here we load the [ICONIQ 2024 State of AI Report](https://cdn.prod.website-files.com/65e1d7fb19a3e64b5c36fb38/66eb856e019e59758ef73759_ICONIQ%20Analytics%20%2B%20Insights%20-%20State%20of%20AI%20Sep24.pdf)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8bce3407-a7d2-47e8-9eaf-ab297a94750c", + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir data\n", + "!mkdir data_images_iconiq\n", + "!wget \"https://cdn.prod.website-files.com/65e1d7fb19a3e64b5c36fb38/66eb856e019e59758ef73759_ICONIQ%20Analytics%20%2B%20Insights%20-%20State%20of%20AI%20Sep24.pdf\" -O data/iconiq_report.pdf" + ] + }, + { + "cell_type": "markdown", + "id": "246ba6b0-51af-42f9-b1b2-8d3e721ef782", + "metadata": {}, + "source": [ + "### Model Setup\n", + "\n", + "Set up the models that will be used for downstream orchestration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2957a512-215e-4ba2-ae1a-95e612e0f72b", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# replace with your Anthropic API key\n", + "os.environ[\"ANTHROPIC_API_KEY\"] = \"sk-...\"\n", + "# replace with your VoyageAI key\n", + "os.environ[\"VOYAGE_API_KEY\"] = \"\"\n", + "# replace with your OpenAI API key (used by the multimodal LLM for response synthesis later on)\n", + "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16e2071d-bbc2-4707-8ae7-cb4e1fecafd3", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.llms.anthropic import Anthropic\n", + "from llama_index.embeddings.voyageai import VoyageEmbedding\n", + "from llama_index.core import Settings\n", + "\n", + "\n", + "llm = Anthropic(model=\"claude-3-5-sonnet-20240620\")\n", + "embed_model = VoyageEmbedding(model_name=\"voyage-3\")\n", + "\n", + "Settings.llm = llm\n", + "Settings.embed_model = embed_model" + ] + }, + { + "cell_type": "markdown", + "id": "e3f6416f-f580-4722-aaa9-7f3500408547", + "metadata": {}, + "source": [ + "## Use LlamaParse to Parse Text and Images\n", + "\n", + "In this example, we use LlamaParse to parse both the text and images from the document.\n", + "\n", + "We parse the text with LlamaParse premium mode.\n", + "\n", + "**NOTE**: The report has 40 pages, and at ~$0.05 per page, parsing will cost about $2. Just a heads up!"
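+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3c1d2e4f-aa01-4b2c-9d3e-5f6a7b8c9d0e",
+ "metadata": {},
+ "source": [
+ "As an optional sanity check before kicking off a paid parse, you can confirm the page count (and hence the approximate cost) locally. This is a minimal sketch that assumes `pypdf` is installed (`pip install pypdf`); it is not required by the rest of this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9e8d7c6b-5a4f-4e3d-8c2b-1a0f9e8d7c6b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# optional: confirm the page count (and approximate parsing cost) before parsing\n",
+ "from pypdf import PdfReader\n",
+ "\n",
+ "num_pages = len(PdfReader(\"data/iconiq_report.pdf\").pages)\n",
+ "print(f\"{num_pages} pages -> ~${num_pages * 0.05:.2f} at ~$0.05/page\")"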
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "570089e5-238a-4dcc-af65-96e7393c2b4d", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_parse import LlamaParse\n", + "\n", + "\n", + "parser = LlamaParse(\n", + " result_type=\"markdown\",\n", + " premium_mode=True,\n", + " # invalidate_cache=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ef82a985-4088-4bb7-9a21-0318e1b9207d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Parsing text...\n", + "Started parsing the file under job_id a578c42a-706c-4fc8-8f60-231bc2fca434\n" + ] + } + ], + "source": [ + "print(f\"Parsing text...\")\n", + "md_json_objs = parser.get_json_result(\"data/iconiq_report.pdf\")\n", + "md_json_list = md_json_objs[0][\"pages\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5318fb7b-fe6a-4a8a-b82e-4ed7b4512c37", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "# A Decision-Making Framework\n", + "\n", + "When making decisions around GenAI investments, we believe it will be important to assess organization readiness, put in place a framework and processes for use case evaluation, and proactively mitigate risks\n", + "\n", + "## Accelerate Value\n", + "Find synergies between organizational readiness, use cases, and risk mitigation when making GenAI investment decisions\n", + "\n", + "### Use Case Identification & Evaluation\n", + "When determining use cases for GenAI, we believe stakeholders will need to assess business value, the fluency vs. accuracy of solutions, and the level of risk associated. Given the risks involved with using GenAI to build new products, many organizations are first starting with use cases for internal productivity.\n", + "\n", + "It is also important to implement feedback loops and a system for measuring ROI to evaluate use cases.\n", + "\n", + "### Organizational Readiness\n", + "For enterprises adopting GenAI solutions for the first time, we believe it will be important to ensure various components of the organization are ready to support the development and integration needs involved. Organizational readiness components to assess could include:\n", + "\n", + "- Employee readiness and training\n", + "- IT / data team expertise\n", + "- Security\n", + "- Governance structure and policies\n", + "- Data ecosystem maturity\n", + "\n", + "### Risk Mitigation\n", + "We believe enterprises will need to account for various risks like data security and privacy concerns, algorithm accuracy / bias, integration complexity, etc. when evaluating GenAI solutions.\n", + "\n", + "Organizations can employ various strategies to mitigate some of these risks. 
For example, it may make sense to invest in fine-tuning or retrieval augmented generation (RAG) techniques to mitigate concerns of model accuracy.\n", + "\n", + "Source: Perspectives from the ICONIQ Growth GenAI Survey (June 2024) and perspectives from the ICONIQ Growth team and network of AI leaders consisting of our community of CIO/CDOs overseeing AI initiatives in enterprises, CTOs, our Technical Advisory Board, and others in our network\n", + "\n", + "Private & Strictly Confidential\n" + ] + } + ], + "source": [ + "print(md_json_list[10][\"md\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eeadb16c-97eb-4622-9551-b34d7f90d72f", + "metadata": {}, + "outputs": [], + "source": [ + "image_dicts = parser.get_images(md_json_objs, download_path=\"data_images_iconiq\")" + ] + }, + { + "cell_type": "markdown", + "id": "fd3e098b-0606-4429-b48d-d4fe0140fc0e", + "metadata": {}, + "source": [ + "## Build Multimodal Index\n", + "\n", + "In this section we build the multimodal index over the parsed deck. \n", + "\n", + "We do this by creating **text** nodes from the document that contain metadata referencing the original image path.\n", + "\n", + "In this example we're indexing the text node for retrieval. The text node has a reference to both the parsed text as well as the image screenshot." + ] + }, + { + "cell_type": "markdown", + "id": "3aae2dee-9d85-4604-8a51-705d4db527f7", + "metadata": {}, + "source": [ + "#### Get Text Nodes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18c24174-05ce-417f-8dd2-79c3f375db03", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core.schema import TextNode\n", + "from typing import Optional" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e331dfe-a627-4e23-8c57-70ab1d9342e4", + "metadata": {}, + "outputs": [], + "source": [ + "# get pages loaded through llamaparse\n", + "import re\n", + "\n", + "\n", + "def get_page_number(file_name):\n", + " match = re.search(r\"-page_(\\d+)\\.jpg$\", str(file_name))\n", + " if match:\n", + " return int(match.group(1))\n", + " return 0\n", + "\n", + "\n", + "def _get_sorted_image_files(image_dir):\n", + " \"\"\"Get image files sorted by page.\"\"\"\n", + " raw_files = [\n", + " f for f in list(Path(image_dir).iterdir()) if f.is_file() and \"-page\" in str(f)\n", + " ]\n", + " sorted_files = sorted(raw_files, key=get_page_number)\n", + " return sorted_files" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "346fe5ef-171e-4a54-9084-7a7805103a13", + "metadata": {}, + "outputs": [], + "source": [ + "from copy import deepcopy\n", + "from pathlib import Path\n", + "\n", + "\n", + "# attach image metadata to the text nodes\n", + "def get_text_nodes(image_dir, json_dicts):\n", + " \"\"\"Split docs into nodes, by separator.\"\"\"\n", + " nodes = []\n", + "\n", + " image_files = _get_sorted_image_files(image_dir)\n", + " md_texts = [d[\"md\"] for d in json_dicts]\n", + "\n", + " for idx, md_text in enumerate(md_texts):\n", + " chunk_metadata = {\"page_num\": idx + 1}\n", + " chunk_metadata[\"image_path\"] = str(image_files[idx])\n", + " chunk_metadata[\"parsed_text_markdown\"] = md_texts[idx]\n", + " node = TextNode(\n", + " text=\"\",\n", + " metadata=chunk_metadata,\n", + " )\n", + " nodes.append(node)\n", + "\n", + " return nodes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f591669c-5a8e-491d-9cef-0b754abbf26f", + "metadata": {}, + "outputs": [], + "source": [ + "# this will split into 
pages\n", + "text_nodes = get_text_nodes(image_dir=\"data_images_iconiq\", json_dicts=md_json_list)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32c13950-c1db-435f-b5b4-89d62b8b7744", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "page_num: 1\n", + "image_path: data_images_iconiq/11f19cc3-c02e-4271-a84f-9a043457fd69-page_1.jpg\n", + "parsed_text_markdown: September 2024\n", + "\n", + "# The State of AI\n", + "\n", + "Navigating the present and promise\n", + "of Generative AI\n", + "\n", + "ICONIQ | Growth\n", + "\n", + "Private and Strictly Confidential\n", + "Copyright © 2024 ICONIQ Capital, LLC. All Rights Reserved\n" + ] + } + ], + "source": [ + "print(text_nodes[0].get_content(metadata_mode=\"all\"))" + ] + }, + { + "cell_type": "markdown", + "id": "5b57d0fd-e026-45d9-bbe0-c5058be24b6b", + "metadata": {}, + "source": [ + "#### Add Contextual Summaries\n", + "\n", + "In this section we implement the key step in contextual retrieval - attaching metadata to each chunk that situates it within the overall document context.\n", + "\n", + "We take advantage of prompt caching by feeding in the static document as prefix tokens, and only swap out the \"header\" tokens." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b96842f9-5c2d-4dd4-9574-c0b56d93ec27", + "metadata": {}, + "outputs": [], + "source": [ + "from copy import deepcopy\n", + "from llama_index.core.llms import ChatMessage\n", + "from llama_index.core.prompts import ChatPromptTemplate\n", + "import time\n", + "\n", + "\n", + "whole_doc_text = \"\"\"\\\n", + "Here is the entire document.\n", + "\n", + "{WHOLE_DOCUMENT}\n", + "\"\"\"\n", + "\n", + "chunk_text = \"\"\"\\\n", + "Here is the chunk we want to situate within the whole document\n", + "\n", + "{CHUNK_CONTENT}\n", + "\n", + "Please give a short succinct context to situate this chunk within the overall document for \\\n", + "the purposes of improving search retrieval of the chunk. 
Answer only with the succinct context and nothing else.\"\"\"\n", + "\n", + "\n", + "def create_contextual_nodes(nodes, llm):\n", + " \"\"\"Function to create contextual nodes for a list of nodes\"\"\"\n", + " nodes_modified = []\n", + "\n", + " # get overall doc_text string\n", + " doc_text = \"\\n\".join([n.get_content(metadata_mode=\"all\") for n in nodes])\n", + "\n", + " for idx, node in enumerate(nodes):\n", + " start_time = time.time()\n", + " new_node = deepcopy(node)\n", + "\n", + " messages = [\n", + " ChatMessage(role=\"system\", content=\"You are a helpful AI Assistant.\"),\n", + " ChatMessage(\n", + " role=\"user\",\n", + " content=[\n", + " {\n", + " \"text\": whole_doc_text.format(WHOLE_DOCUMENT=doc_text),\n", + " \"type\": \"text\",\n", + " \"cache_control\": {\"type\": \"ephemeral\"},\n", + " },\n", + " {\n", + " \"text\": chunk_text.format(\n", + " CHUNK_CONTENT=node.get_content(metadata_mode=\"all\")\n", + " ),\n", + " \"type\": \"text\",\n", + " },\n", + " ],\n", + " ),\n", + " ]\n", + "\n", + " new_response = llm.chat(\n", + " messages, extra_headers={\"anthropic-beta\": \"prompt-caching-2024-07-31\"}\n", + " )\n", + " new_node.metadata[\"context\"] = str(new_response)\n", + "\n", + " nodes_modified.append(new_node)\n", + " print(f\"Completed node {idx}, {time.time() - start_time}\")\n", + "\n", + " return nodes_modified" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50c9cfe9-091f-41b7-a7b0-1bec18ac3ea6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Completed node 0, 3.079681158065796\n", + "Completed node 1, 2.306105136871338\n", + "Completed node 2, 2.9272632598876953\n", + "Completed node 3, 2.7051072120666504\n", + "Completed node 4, 2.5174269676208496\n", + "Completed node 5, 2.593230962753296\n", + "Completed node 6, 17.79446506500244\n", + "Completed node 7, 2.357940912246704\n", + "Completed node 8, 22.41524910926819\n", + "Completed node 9, 2.3640670776367188\n", + "Completed node 10, 24.634361743927002\n", + "Completed node 11, 3.069308042526245\n", + "Completed node 12, 23.27754497528076\n", + "Completed node 13, 3.3801419734954834\n", + "Completed node 14, 22.186962842941284\n", + "Completed node 15, 2.9594428539276123\n", + "Completed node 16, 22.680989027023315\n", + "Completed node 17, 2.8793280124664307\n", + "Completed node 18, 22.91075611114502\n", + "Completed node 19, 2.824723958969116\n", + "Completed node 20, 23.572262287139893\n", + "Completed node 21, 2.9115028381347656\n", + "Completed node 22, 22.8908531665802\n", + "Completed node 23, 2.2966439723968506\n", + "Completed node 24, 23.58935308456421\n", + "Completed node 25, 2.6247501373291016\n", + "Completed node 26, 22.399968147277832\n", + "Completed node 27, 3.0899431705474854\n", + "Completed node 28, 22.961134910583496\n", + "Completed node 29, 3.1315767765045166\n", + "Completed node 30, 22.38727903366089\n", + "Completed node 31, 2.507817268371582\n", + "Completed node 32, 23.75781512260437\n", + "Completed node 33, 3.65451717376709\n", + "Completed node 34, 22.2336208820343\n", + "Completed node 35, 2.84831166267395\n", + "Completed node 36, 23.35297417640686\n", + "Completed node 37, 3.027301073074341\n", + "Completed node 38, 22.720845937728882\n", + "Completed node 39, 2.849353313446045\n", + "Completed node 40, 24.094517946243286\n" + ] + } + ], + "source": [ + "new_text_nodes = create_contextual_nodes(text_nodes, llm)" + ] + }, + { + "cell_type": "markdown", + "id": 
"4f404f56-db1e-4ed7-9ba1-ead763546348", + "metadata": {}, + "source": [ + "#### Build Index\n", + "\n", + "Once the text nodes are ready, we feed into our vector store index abstraction, which will index these nodes into a simple in-memory vector store (of course, you should definitely check out our 40+ vector store integrations!)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ea53c31-0e38-421c-8d9b-0e3adaa1677e", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from llama_index.core import (\n", + " StorageContext,\n", + " VectorStoreIndex,\n", + " load_index_from_storage,\n", + ")\n", + "\n", + "if not os.path.exists(\"storage_nodes_iconiq\"):\n", + " index = VectorStoreIndex(new_text_nodes, embed_model=embed_model)\n", + " # save index to disk\n", + " index.set_index_id(\"vector_index\")\n", + " index.storage_context.persist(\"./storage_nodes_iconiq\")\n", + "else:\n", + " # rebuild storage context\n", + " storage_context = StorageContext.from_defaults(persist_dir=\"storage_nodes_iconiq\")\n", + " # load index\n", + " index = load_index_from_storage(storage_context, index_id=\"vector_index\")\n", + "\n", + "retriever = index.as_retriever()" + ] + }, + { + "cell_type": "markdown", + "id": "7ea66d70-df0d-42a2-adb3-b20c9d29878a", + "metadata": {}, + "source": [ + "#### Build Baseline Index\n", + "\n", + "Build a baseline index with the text nodes without summarized context." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b1f0efa-77ca-4e5a-9d48-17dc442615f3", + "metadata": {}, + "outputs": [], + "source": [ + "if not os.path.exists(\"storage_nodes_iconiq_base\"):\n", + " base_index = VectorStoreIndex(text_nodes, embed_model=embed_model)\n", + " # save index to disk\n", + " base_index.set_index_id(\"vector_index\")\n", + " base_index.storage_context.persist(\"./storage_nodes_iconiq_base\")\n", + "else:\n", + " # rebuild storage context\n", + " storage_context = StorageContext.from_defaults(\n", + " persist_dir=\"storage_nodes_iconiq_base\"\n", + " )\n", + " # load index\n", + " base_index = load_index_from_storage(storage_context, index_id=\"vector_index\")" + ] + }, + { + "cell_type": "markdown", + "id": "5f0e33a4-9422-498d-87ee-d917bdf74d80", + "metadata": {}, + "source": [ + "## Build Multimodal Query Engine\n", + "\n", + "We now use LlamaIndex abstractions to build a **custom query engine**. In contrast to a standard RAG query engine that will retrieve the text node and only put that into the prompt (response synthesis module), this custom query engine will also load the image document, and put both the text and image document into the response synthesis module." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35a94be2-e289-41a6-92e4-d3cb428fb0c8", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index.core.query_engine import CustomQueryEngine\n", + "from llama_index.core.retrievers import BaseRetriever\n", + "from llama_index.multi_modal_llms.openai import OpenAIMultiModal\n", + "from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode\n", + "from llama_index.core.prompts import PromptTemplate\n", + "from llama_index.core.base.response.schema import Response\n", + "from typing import Optional\n", + "\n", + "\n", + "gpt_4o = OpenAIMultiModal(model=\"gpt-4o\", max_new_tokens=4096)\n", + "\n", + "QA_PROMPT_TMPL = \"\"\"\\\n", + "Below we give parsed text from slides in two different formats, as well as the image.\n", + "\n", + "---------------------\n", + "{context_str}\n", + "---------------------\n", + "Given the context information and not prior knowledge, answer the query. Explain whether you got the answer\n", + "from the parsed markdown or raw text or image, and if there's discrepancies, and your reasoning for the final answer.\n", + "\n", + "Query: {query_str}\n", + "Answer: \"\"\"\n", + "\n", + "QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)\n", + "\n", + "\n", + "class MultimodalQueryEngine(CustomQueryEngine):\n", + " \"\"\"Custom multimodal Query Engine.\n", + "\n", + " Takes in a retriever to retrieve a set of document nodes.\n", + " Also takes in a prompt template and multimodal model.\n", + "\n", + " \"\"\"\n", + "\n", + " qa_prompt: PromptTemplate\n", + " retriever: BaseRetriever\n", + " multi_modal_llm: OpenAIMultiModal\n", + "\n", + " def __init__(self, qa_prompt: Optional[PromptTemplate] = None, **kwargs) -> None:\n", + " \"\"\"Initialize.\"\"\"\n", + " super().__init__(qa_prompt=qa_prompt or QA_PROMPT, **kwargs)\n", + "\n", + " def custom_query(self, query_str: str):\n", + " # retrieve text nodes\n", + " nodes = self.retriever.retrieve(query_str)\n", + " # create ImageNode items from text nodes\n", + " image_nodes = [\n", + " NodeWithScore(node=ImageNode(image_path=n.metadata[\"image_path\"]))\n", + " for n in nodes\n", + " ]\n", + "\n", + " # create context string from text nodes, dump into the prompt\n", + " context_str = \"\\n\\n\".join(\n", + " [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]\n", + " )\n", + " fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)\n", + "\n", + " # synthesize an answer from formatted text and images\n", + " llm_response = self.multi_modal_llm.complete(\n", + " prompt=fmt_prompt,\n", + " image_documents=[image_node.node for image_node in image_nodes],\n", + " )\n", + " return Response(\n", + " response=str(llm_response),\n", + " source_nodes=nodes,\n", + " metadata={\"text_nodes\": nodes, \"image_nodes\": image_nodes},\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0890be59-fb12-4bb5-959b-b2d9600f7774", + "metadata": {}, + "outputs": [], + "source": [ + "query_engine = MultimodalQueryEngine(\n", + " retriever=index.as_retriever(similarity_top_k=3), multi_modal_llm=gpt_4o\n", + ")\n", + "base_query_engine = MultimodalQueryEngine(\n", + " retriever=base_index.as_retriever(similarity_top_k=3), multi_modal_llm=gpt_4o\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "1f94ef26-0df5-4468-a156-903d686f02ce", + "metadata": {}, + "source": [ + "## Try out Queries\n", + "\n", + "Let's try out some questions 
against the slide deck in this multimodal RAG pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fd1aae3-1f8a-4797-a24a-17e563a7165e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The departments that use generative AI the most are:\n", + "\n", + "1. **AI, Machine Learning, and Data Science**: With a score of 4.5, this department leads in generative AI usage. They likely use AI for advanced data analysis, model development, and improving AI algorithms.\n", + "\n", + "2. **IT**: Scoring 4.0, IT teams use generative AI for ticket management, chatbots, customer support, troubleshooting, and knowledge management.\n", + "\n", + "3. **Engineering / R&D**: With a score of 3.9, they use AI to improve coding velocity, refactor code, augment test cases, summarize business requirements, accelerate code reviews, conduct user research, and prototype.\n", + "\n", + "These insights are derived from the parsed markdown text, which provides detailed scores and use cases for each department. The image confirms this information, showing the same scores and use cases. There are no discrepancies between the parsed text and the image.\n" + ] + } + ], + "source": [ + "response = query_engine.query(\n", + " \"which departments/teams use genAI the most and how are they using it?\"\n", + ")\n", + "print(str(response))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9cc48ee-481b-40b1-91b3-c69220e9dfb0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Based on the parsed text from the slides:\n", + "\n", + "1. **Departments/Teams Using GenAI the Most:**\n", + " - **AI, Machine Learning, and Data Science**: Highest usage with a score of 4.5.\n", + " - **IT**: Score of 4.0.\n", + " - **Engineering/R&D**: Score of 3.9.\n", + "\n", + "2. **How They Are Using GenAI:**\n", + " - **AI, Machine Learning, and Data Science**: Likely using GenAI for advanced analytics and model development.\n", + " - **IT**: Utilizes GenAI for internal productivity, IT operations, and software code development.\n", + " - **Engineering/R&D**: Uses GenAI for improving coding velocity, code refactoring, augmenting test cases, and accelerating code reviews.\n", + "\n", + "The information was derived from the parsed markdown text. There are no discrepancies between the parsed text and the images provided. The parsed text clearly outlines the departments with the highest GenAI usage and their specific applications.\n" + ] + } + ], + "source": [ + "base_response = base_query_engine.query(\n", + " \"which departments/teams use genAI the most and how are they using it?\"\n", + ")\n", + "print(str(base_response))" + ] + }, + { + "cell_type": "markdown", + "id": "7b906cb8-07ba-4a8c-9ff8-5162869ad408", + "metadata": {}, + "source": [ + "**NOTE**: the relevant page numbers are 32-38. The response with contextual retrieval retrieves the slide detailing IT use cases, hence giving a more detailed response on the IT side." 
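+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b7e6d5c4-3a2b-4c1d-9e8f-7a6b5c4d3e2f",
+ "metadata": {},
+ "source": [
+ "To compare what each index retrieved, we use a small helper that prints the page numbers of the retrieved source nodes. This is a minimal sketch: it assumes each source node carries the `page_num` metadata we attached earlier, and it prints the page numbers as a comma-separated list (matching the outputs below)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1a2b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# print the page numbers of the retrieved source nodes (comma-separated)\n",
+ "def get_source_page_nums(response):\n",
+ "    print(\",\".join([str(n.metadata[\"page_num\"]) for n in response.source_nodes]))"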
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b7a8c5f-39fc-4d04-8c56-3642f5718437", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "32,33,34\n", + "32,21,33\n" + ] + } + ], + "source": [ + "get_source_page_nums(response)\n", + "get_source_page_nums(base_response)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85a2e748-cc40-4b9f-9401-2ea912839502", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "page_num: 32\n", + "image_path: data_images_iconiq/11f19cc3-c02e-4271-a84f-9a043457fd69-page_32.jpg\n", + "parsed_text_markdown: # AI Usage by Function\n", + "\n", + "Technical teams lead in adoption of generative AI for internal productivity, while HR and legal functions lag, likely hindered by data privacy and quality concerns\n", + "\n", + "For each department / function in your company, please indicate their level of generative AI usage on a scale of 1-5.\n", + "Weighted Average Score by % of Respondents (N = 143)\n", + "\n", + "| Department/Function | Score |\n", + "|---------------------|-------|\n", + "| AI, Machine Learning, and Data Science | 4.5 |\n", + "| IT | 4.0 |\n", + "| Engineering / R&D | 3.9 |\n", + "| Product Development & Management | 3.5 |\n", + "| Marketing | 3.4 |\n", + "| Operations | 3.3 |\n", + "| Strategy and Competitive Intelligence | 3.3 |\n", + "| Sales | 3.2 |\n", + "| Finance | 3.0 |\n", + "| Administration | 2.9 |\n", + "| Human Resources | 2.7 |\n", + "| Legal | 2.7 |\n", + "\n", + "> We are creating a sense of artificial FOMO among our workforce to encourage participation in pilot groups that will have early access to new GenAI tools\n", + "> \n", + "> Chief Information Officer, Technology Company\n", + "\n", + "Source: Perspectives from the ICONIQ Growth GenAI Survey (June 2024) and perspectives from the ICONIQ Growth team and network of AI leaders consisting of our community of CIO/CDOs overseeing AI initiatives in enterprises, CTOs, our Technical Advisory Board, and others in our network\n", + "\n", + "Private & Strictly Confidential\n", + "context: assistant: This chunk is part of the \"Deep Dive on Applications\" section of the report, providing data on AI adoption across different business functions. It shows which departments are leading in generative AI usage, with technical teams at the forefront and HR/legal lagging behind.\n" + ] + } + ], + "source": [ + "# look at an example retrieved source node\n", + "print(response.source_nodes[0].get_content(metadata_mode=\"all\"))" + ] + }, + { + "cell_type": "markdown", + "id": "a9462a82-960a-4c42-bbca-a1e71c2c1e5c", + "metadata": {}, + "source": [ + "In this next question, contextual retrieval surfaces the cloud deployment slide (page 26) that the baseline index misses, so only the contextual pipeline returns a detailed answer. LlamaParse Premium's ability to comprehend graphs is what makes the chart data on that slide available as parsed text." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8a0c8b1-3a3e-41c1-9916-01fdfb0dd8e9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The \"Deep Dive on Infrastructure\" section provides insights into deployment environments and infrastructure tooling for generative AI models:\n", + "\n", + "1. 
**Deployment Environments**:\n", + " - Enterprises primarily use cloud or hybrid approaches for hosting generative AI workloads.\n", + " - 56% of respondents prefer cloud deployment, while 42% use a hybrid method.\n", + " - AWS (68%) and Azure (61%) are the most utilized cloud service providers, with Google Cloud at 40%.\n", + "\n", + "2. **Infrastructure Tooling**:\n", + " - Enterprises are investing in infrastructure tools for data observability, database augmentation, and data pre-processing.\n", + " - Key areas for infrastructure tooling include observability, evaluation, and security (50%), databases (48%), and data pre-processing (47%).\n", + "\n", + "These insights were derived from the parsed markdown text, which provides detailed information on deployment preferences and infrastructure investments. There are no discrepancies between the parsed text and the images provided.\n" + ] + } + ], + "source": [ + "query = \"what are relevant insights from the 'deep dive on infrastructure' section in terms of model preferences, cost, deployment environments?\"\n", + "\n", + "response = query_engine.query(query)\n", + "print(str(response))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f1638c6-ca29-462b-a21f-a2941968259c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The \"Deep Dive on Infrastructure\" section does not provide specific insights on model preferences, cost, or deployment environments based on the parsed text. The slide titled \"Deep Dive on Infrastructure\" only contains the title and copyright information, without any detailed content or data.\n", + "\n", + "This conclusion is drawn from the parsed markdown text, which lacks any specific information on model preferences, cost, or deployment environments in that section. 
The image confirms this, as it only shows the title and a graphic without additional details.\n", + "\n", + "If you need insights on these topics, you might want to refer to other sections or slides that specifically address model preferences, costs, or deployment environments.\n" + ] + } + ], + "source": [ + "base_response = base_query_engine.query(query)\n", + "print(str(base_response))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9d6eb745-b3d3-4e37-bb2d-d2d649d77d01", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "24,30,26\n", + "30,17,24\n" + ] + } + ], + "source": [ + "get_source_page_nums(response)\n", + "get_source_page_nums(base_response)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc741ad9-47da-47e7-b1b2-540d686c0bf4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "page_num: 26\n", + "image_path: data_images_iconiq/11f19cc3-c02e-4271-a84f-9a043457fd69-page_26.jpg\n", + "parsed_text_markdown: # Cloud Deployment Method\n", + "\n", + "Enterprises are primarily hosting generative AI workloads on the cloud or via a hybrid approach; AWS and Azure are the most utilized cloud service providers\n", + "\n", + "## Preferred Deployment Method for GenAI Models\n", + "% of Respondents (N = 126)\n", + "\n", + "| Method | Percentage |\n", + "|----------|------------|\n", + "| On-prem | 2% |\n", + "| Hybrid | 42% |\n", + "| Cloud | 56% |\n", + "\n", + "## CSP Used for GenAI Products\n", + "Multi-Select, % of Respondents (N = 218)\n", + "\n", + "| Cloud Service Provider | Percentage |\n", + "|----------------------------|------------|\n", + "| Amazon Web Services (AWS) | 68% |\n", + "| Microsoft Azure | 61% |\n", + "| Google Cloud (GCP) | 40% |\n", + "| Other | 3% |\n", + "\n", + "While Azure has captured mindshare with its OpenAI, Amazon remains ahead in terms of cloud usage given the dominant market share AWS has in cloud¹\n", + "\n", + "Notes: (1) Statista Worldwide Market Share of Leading Cloud Infrastructure Service Providers (May 2024)\n", + "\n", + "Source: Perspectives from the ICONIQ Growth GenAI Survey (June 2024) and perspectives from the ICONIQ Growth team and network of AI leaders consisting of our community of CIO/CDOs overseeing AI initiatives in enterprises, CTOs, our Technical Advisory Board, and others in our network\n", + "\n", + "Private & Strictly Confidential\n", + "context: assistant: This chunk is part of the \"Deep Dive on Infrastructure\" section of the report, discussing cloud deployment methods and cloud service providers used for generative AI workloads by enterprises. 
It follows sections on key purchasing criteria for AI models and precedes information on proprietary vs open source models.\n" + ] + } + ], + "source": [ + "# look at an example retrieved source node\n", + "print(response.source_nodes[2].get_content(metadata_mode=\"all\"))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "llama_parse", + "language": "python", + "name": "llama_parse" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/multimodal/multimodal_contextual_retrieval_rag_img.png b/examples/multimodal/multimodal_contextual_retrieval_rag_img.png new file mode 100644 index 0000000..1f37e52 Binary files /dev/null and b/examples/multimodal/multimodal_contextual_retrieval_rag_img.png differ