Merge branch 'main' into patch-2
mreso committed Dec 8, 2023
2 parents 7358dad + 1f527a8 commit 2216dd4
Showing 31 changed files with 2,638 additions and 71 deletions.
23 changes: 11 additions & 12 deletions README.md
@@ -1,12 +1,10 @@
# Llama 2 Fine-tuning / Inference Recipes, Examples and Demo Apps

**[Update Nov. 16, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama (locally, in the cloud, or on-prem), how to ask Llama questions in general or about custom data (PDF, DB, or live), how to integrate Llama with WhatsApp, and how to implement an end-to-end chatbot with RAG (Retrieval Augmented Generation).**

The 'llama-recipes' repository is a companion to the [Llama 2 model](https://github.com/facebookresearch/llama). The goal of this repository is to provide examples for quickly getting started with fine-tuning for domain adaptation and for running inference on the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models. See steps for conversion of the model [here](#model-conversion-to-hugging-face).

In addition, we also provide a number of demo apps to showcase Llama 2 usage along with other ecosystem solutions to run Llama 2 locally, in the cloud, and on-prem.

Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the [Responsible Use Guide](https://github.com/facebookresearch/llama/blob/main/Responsible-Use-Guide.pdf). More details can be found in our research paper as well. For downloading the models, follow the instructions on [Llama 2 repo](https://github.com/facebookresearch/llama).

@@ -23,8 +21,6 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
6. [Repository Organization](#repository-organization)
7. [License and Acceptable Use Policy](#license)



# Quick Start

[Llama 2 Jupyter Notebook](./examples/quickstart.ipynb): This Jupyter notebook steps you through how to fine-tune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset. The notebook uses parameter-efficient fine-tuning (PEFT) and int8 quantization to fine-tune a 7B model on a single GPU, such as an A10 with 24GB of GPU memory.
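As a rough sketch of what the notebook does under the hood (the model id and LoRA hyperparameters below are illustrative, not the notebook's exact values):

```python
# Minimal sketch: load a 7B model in int8 and attach a LoRA adapter with PEFT.
# Assumes Hugging Face-converted Llama 2 weights and the bitsandbytes package.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; use your converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 quantization keeps the 7B model within ~24GB
    device_map="auto",
)

lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights get trained
```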
@@ -184,14 +180,17 @@ You can read more about our fine-tuning strategies [here](./docs/LLM_finetuning.
# Demo Apps
This folder contains a series of Llama 2-powered apps:
* Quickstart Llama deployments and basic interactions with Llama
  1. Llama on your Mac and ask Llama general questions
  2. Llama on Google Colab
  3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
  4. Llama on-prem with vLLM and TGI
  5. Llama chatbot with RAG (Retrieval Augmented Generation)

* Specialized Llama use cases:
  1. Ask Llama to summarize video content
  2. Ask Llama questions about structured data in a DB
  3. Ask Llama questions about live data on the web
  4. Build a Llama-enabled WhatsApp chatbot

# Repository Organization
This repository is organized in the following way:
717 changes: 717 additions & 0 deletions demo_apps/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb

6 changes: 6 additions & 0 deletions demo_apps/RAG_Chatbot_example/requirements.txt
@@ -0,0 +1,6 @@
gradio
pypdf
langchain
sentence-transformers
faiss-cpu
text-generation
40 changes: 24 additions & 16 deletions demo_apps/README.md
@@ -1,25 +1,24 @@
# Llama 2 Demo Apps

This folder contains a series of Llama 2-powered apps:
* Quickstart Llama deployments and basic interactions with Llama
1. Llama on your Mac and ask Llama general questions
2. Llama on Google Colab
3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
4. Llama on-prem with vLLM and TGI
5. Llama chatbot with RAG (Retrieval Augmented Generation)

* Specialized Llama use cases:
  1. Ask Llama to summarize video content
2. Ask Llama questions about structured data in a DB
3. Ask Llama questions about live data on the web
4. Build a Llama-enabled WhatsApp chatbot

We also show how to build a quick web UI for Llama 2 demo apps using Streamlit and Gradio.

If you need a general understanding of GenAI, Llama 2, prompt engineering and RAG (Retrieval Augmented Generation), be sure to first check the [Getting to know Llama 2 notebook](https://github.com/facebookresearch/llama-recipes/blob/main/examples/Getting_to_know_Llama.ipynb) and its Meta Connect video [here](https://www.facebook.com/watch/?v=662153709222699).

More advanced Llama 2 demo apps will be coming soon.

## Setting Up Environment

@@ -30,7 +29,7 @@ conda activate llama-demo-apps
pip install jupyter
cd <your_work_folder>
git clone https://github.com/facebookresearch/llama-recipes
cd llama-recipes/demo-apps
jupyter notebook
```

@@ -41,7 +40,7 @@ You can also upload the notebooks to Google Colab.
The first three demo apps show:
* how to run Llama2 locally on a Mac, in the Google Colab notebook, and in the cloud using Replicate;
* how to use [LangChain](https://github.com/langchain-ai/langchain), an open-source framework for building LLM apps, to ask Llama general questions in different ways;
* how to use LangChain to load a recent PDF doc - the Llama 2 paper PDF - and ask questions about it. This is the well-known RAG method, which lets an LLM such as Llama 2 answer questions about data that was not publicly available when the model was trained, or about your own data. RAG is one way to reduce an LLM's hallucination.
* how to ask follow up questions to Llama by sending previous questions and answers as the context along with the new question, hence performing multi-turn chat or conversation with Llama.
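A minimal sketch of the first and last bullets, assuming a Replicate-hosted Llama 2 with `REPLICATE_API_TOKEN` set in the environment (the model id is illustrative; Replicate ids typically pin a version hash):

```python
# Ask Llama a question via LangChain, then a follow-up whose answer depends on
# the previous turn being passed back as context.
from langchain.llms import Replicate
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = Replicate(
    model="meta/llama-2-13b-chat",   # illustrative id
    model_kwargs={"temperature": 0.1, "max_length": 500},
)

chat = ConversationChain(llm=llm, memory=ConversationBufferMemory())
print(chat.run("Who wrote the book Innovator's Dilemma?"))
print(chat.run("Did he write anything else?"))  # the memory supplies the prior turn
```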

### [Running Llama2 Locally on Mac](HelloLlamaLocal.ipynb)
@@ -57,22 +56,28 @@ python convert.py <path_to_your_downloaded_llama-2-13b_model>
### [Running Llama2 Hosted in the Cloud](HelloLlamaCloud.ipynb)
The HelloLlama cloud version uses LangChain with Llama2 hosted in the cloud on [Replicate](https://replicate.com). The demo shows how to ask Llama general questions and follow up questions, and how to use LangChain to ask Llama2 questions about **unstructured** data stored in a PDF.

**<a id="replicate_note">Note on using Replicate</a>**
To run some of the demo apps here, you'll need to first sign in to Replicate with your GitHub account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama 2 hosted on Replicate. According to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) page for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 (0.000725 × 10 ≈ $0.007) if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 Locally on Mac" above or "Running Llama2 in Google Colab" below.
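For reference, a direct Replicate call without LangChain looks roughly like this (same caveats about the model id and token):

```python
# Sketch of a raw Replicate call; assumes REPLICATE_API_TOKEN is exported.
import replicate

output = replicate.run(
    "meta/llama-2-13b-chat",   # illustrative id; real ids often pin a version hash
    input={"prompt": "What is the capital of France?"},
)
print("".join(output))         # replicate.run streams text back in chunks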

### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing)
To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-13b-chat model `ggml-model-q4_0.gguf` [here](https://drive.google.com/file/d/1afPv3HOy73BE2MoYCgYJvBDeQNa9rZbj/view?usp=sharing), or follow the instructions above to build it, before uploading it to your Google Drive. Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on an M1 MBP takes about 20 seconds.
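Once the `.gguf` file is on your Drive, the call pattern is roughly this (the path below is illustrative and assumes Drive is mounted in Colab):

```python
# Minimal llama-cpp-python sketch: load the quantized model and run a completion.
from llama_cpp import Llama

llm = Llama(model_path="/content/drive/MyDrive/ggml-model-q4_0.gguf")  # illustrative path
result = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
print(result["choices"][0]["text"])
```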

## [Running Llama2 On-Prem with vLLM and TGI](llama-on-prem.md)
This tutorial shows how to use Llama 2 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 2 on-prem apps.
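As a taste of the TGI side, querying a running server with the `text-generation` client (the same package the RAG example's requirements.txt lists) looks roughly like this, assuming TGI is already serving a Llama 2 model on port 8080 as in the tutorial:

```python
# Sketch of a TGI query; server address and generation parameters are illustrative.
from text_generation import Client

client = Client("http://localhost:8080")
response = client.generate("What is good about San Francisco?", max_new_tokens=200)
print(response.generated_text)
```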

\* To run a quantized Llama2 model on iOS and Android, you can use the open source [MLC LLM](https://github.com/mlc-ai/mlc-llm) or [llama.cpp](https://github.com/ggerganov/llama.cpp). You can even make a Linux OS that boots to Llama2 ([repo](https://github.com/trholding/llama2.c)).

## [VideoSummary](VideoSummary.ipynb): Ask Llama2 to Summarize a YouTube Video
This demo app uses Llama2 to return a text summary of a YouTube video. It shows how to retrieve the caption of a YouTube video and how to ask Llama to summarize the content in four different ways, from the simplest naive way that works for short text to more advanced methods using LangChain's map_reduce and refine chain types to overcome the 4096-token limit on Llama's input size.
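The map_reduce variant looks roughly like this (splitter sizes and the model id are illustrative; `transcript.txt` stands in for the fetched captions):

```python
# Summarize text longer than Llama's context window by splitting it into chunks,
# summarizing each, then summarizing the summaries (map_reduce).
from langchain.llms import Replicate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = Replicate(model="meta/llama-2-13b-chat")  # illustrative id
text = open("transcript.txt").read()            # stand-in for the YouTube captions
docs = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100).create_documents([text])

chain = load_summarize_chain(llm, chain_type="map_reduce")  # or "refine"
print(chain.run(docs))
```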

## [NBA2023-24](StructuredLlama.ipynb): Ask Llama2 about Structured Data
This demo app shows how to use LangChain and Llama2 to let users ask questions about **structured** data stored in a SQL DB. As the 2023-24 NBA season is around the corner, we use the NBA roster info saved in a SQLite DB to show you how to ask Llama2 questions about your favorite teams or players.
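A rough sketch of natural-language querying over a SQLite DB (the DB filename and model id are assumptions; note that `SQLDatabaseChain` lives in `langchain_experimental` in recent releases):

```python
# Let Llama translate a question into SQL, run it, and phrase the answer.
from langchain.llms import Replicate
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

llm = Replicate(model="meta/llama-2-13b-chat")        # illustrative id
db = SQLDatabase.from_uri("sqlite:///nba_roster.db")  # assumed DB filename
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
db_chain.run("Which team is Klay Thompson on?")
```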

## [LiveData](LiveData.ipynb): Ask Llama2 about Live Data
This demo app shows how to perform live data augmented generation tasks with Llama2 and [LlamaIndex](https://github.com/run-llama/llama_index), another leading open-source framework for building LLM apps: it uses the [You.com search API](https://documentation.you.com/quickstart) to get live search results and asks Llama2 about them.
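A bare-bones sketch of the idea (the notebook uses LlamaIndex; here the search call is a placeholder, so check You.com's quickstart for the real endpoint, auth header, and response shape):

```python
# Live-data augmented generation in miniature: fetch search results, then stuff
# them into the prompt so Llama can answer about events after its training data.
import requests
from langchain.llms import Replicate

resp = requests.get(
    "https://api.you.example/search",              # hypothetical endpoint
    params={"query": "latest news about the NBA"},
    headers={"X-API-Key": "<your_api_key>"},       # hypothetical auth header
)
snippets = "\n".join(hit["snippet"] for hit in resp.json()["hits"])  # assumed shape

llm = Replicate(model="meta/llama-2-13b-chat")     # illustrative id
print(llm(f"Based on these search results:\n{snippets}\n\nSummarize today's NBA news."))
```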

## [WhatsApp Chatbot](whatsapp_llama2.md): Building a Llama-enabled WhatsApp Chatbot
This step-by-step tutorial shows how to use the [WhatsApp Business API](https://developers.facebook.com/docs/whatsapp/cloud-api/overview), LangChain and Replicate to build a Llama-enabled WhatsApp chatbot.
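In outline, the webhook side of such a bot might look like this (payload handling follows WhatsApp's Cloud API docs but should be verified against them; `ask_llama` is a hypothetical helper wrapping one of the Llama calls shown earlier):

```python
# Bare-bones Flask webhook: receive a WhatsApp message, ask Llama, reply.
import os
import requests
from flask import Flask, request

app = Flask(__name__)
PHONE_NUMBER_ID = os.environ["WHATSAPP_PHONE_NUMBER_ID"]  # from your Meta app setup
TOKEN = os.environ["WHATSAPP_TOKEN"]

@app.route("/webhook", methods=["POST"])
def webhook():
    msg = request.json["entry"][0]["changes"][0]["value"]["messages"][0]  # assumed payload shape
    answer = ask_llama(msg["text"]["body"])  # hypothetical helper: your Llama call
    requests.post(
        f"https://graph.facebook.com/v18.0/{PHONE_NUMBER_ID}/messages",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"messaging_product": "whatsapp", "to": msg["from"],
              "text": {"body": answer}},
    )
    return "ok"
```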

## Quick Web UI for Llama2 Chat
If you prefer to see Llama2 in action in a web UI, instead of the notebooks above, you can try one of the two methods:
@@ -98,4 +103,7 @@ To see how to query Llama2 and get answers with the Gradio UI both from the note

Then enter your question and click Submit. In the notebook, or in a browser at http://127.0.0.1:7860, you'll see the following UI:

![](llama2-gradio.png)
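The Gradio wiring behind a UI like this is only a few lines (model id illustrative, reusing the LangChain/Replicate pattern from earlier sections):

```python
# Minimal Gradio front end: one text box in, Llama's answer out.
import gradio as gr
from langchain.llms import Replicate

llm = Replicate(model="meta/llama-2-13b-chat")  # illustrative id

demo = gr.Interface(
    fn=lambda question: llm(question),
    inputs="text",
    outputs="text",
)
demo.launch()  # serves on http://127.0.0.1:7860 by default
```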

### [RAG Chatbot Example](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)
A complete example of building a Llama 2 chatbot, served in your browser, that can answer questions based on your own data.
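A condensed sketch of the RAG pipeline such an example builds, using the packages from its requirements.txt (pypdf, sentence-transformers, faiss-cpu, langchain); file names, chunk sizes, and the model id are illustrative:

```python
# RAG in miniature: load a PDF, chunk it, embed the chunks into a FAISS index,
# then answer questions with retrieved chunks as context.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import Replicate

docs = PyPDFLoader("your_data.pdf").load()  # illustrative file
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

vectorstore = FAISS.from_documents(chunks, HuggingFaceEmbeddings())  # defaults to a sentence-transformers model
qa = RetrievalQA.from_chain_type(
    llm=Replicate(model="meta/llama-2-13b-chat"),  # illustrative id
    retriever=vectorstore.as_retriever(),
)
print(qa.run("What does my data say about X?"))
```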
