
Commit

Merge branch 'main' into main
mreso authored Dec 8, 2023

2 parents 15a1206 + 1b9934e commit b9abee5
Showing 26 changed files with 2,170 additions and 40 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -1,6 +1,6 @@
# Llama 2 Fine-tuning / Inference Recipes, Examples and Demo Apps

**[Update Nov. 14, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama 2 locally, in the cloud, on-prem or with WhatsApp, and how to ask Llama 2 questions in general and about custom data (PDF, DB, or live).**
**[Update Nov. 16, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama (locally, in the cloud, or on-prem), how to ask Llama questions in general or about custom data (PDF, DB, or live), how to integrate Llama with WhatsApp, and how to implement an end-to-end chatbot with RAG (Retrieval Augmented Generation).**

The 'llama-recipes' repository is a companion to the [Llama 2 model](https://github.com/facebookresearch/llama). The goal of this repository is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models. See steps for conversion of the model [here](#model-conversion-to-hugging-face).

@@ -184,6 +184,7 @@ This folder contains a series of Llama2-powered apps:
2. Llama on Google Colab
3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
4. Llama on-prem with vLLM and TGI
5. Llama chatbot with RAG (Retrieval Augmented Generation)

* Specialized Llama use cases:
1. Ask Llama to summarize video content
717 changes: 717 additions & 0 deletions demo_apps/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions demo_apps/RAG_Chatbot_example/requirements.txt
@@ -0,0 +1,6 @@
gradio
pypdf
langchain
sentence-transformers
faiss-cpu
text-generation
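
Taken together, these six dependencies outline the chatbot's pipeline: pypdf and langchain for loading and chunking the PDF, sentence-transformers and faiss-cpu for embedding and vector search, text-generation for calling a TGI-served Llama 2, and gradio for the browser UI. Below is a minimal sketch of how they might fit together; it is not the notebook's actual code, and the file name, endpoint URL, embedding model, and prompt wording are all assumptions.

```python
# Sketch of a RAG chatbot wiring these dependencies together (illustrative only).
import gradio as gr
from langchain.document_loaders import PyPDFLoader          # uses pypdf under the hood
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings      # uses sentence-transformers
from langchain.vectorstores import FAISS                    # uses faiss-cpu
from text_generation import Client                          # TGI client

# Load and chunk the source document (file name assumed).
docs = PyPDFLoader("my_data.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and index them for similarity search (embedding model assumed).
db = FAISS.from_documents(chunks, HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"))

# A TGI server is assumed to be serving Llama 2 locally on port 8080.
llm = Client("http://127.0.0.1:8080")

def answer(question: str) -> str:
    # Retrieve the top chunks and stuff them into the prompt.
    context = "\n".join(d.page_content for d in db.similarity_search(question, k=3))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.generate(prompt, max_new_tokens=256).generated_text

# Serve the chatbot in the browser.
gr.Interface(fn=answer, inputs="text", outputs="text").launch()
```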
4 changes: 4 additions & 0 deletions demo_apps/README.md
@@ -6,6 +6,7 @@ This folder contains a series of Llama 2-powered apps:
2. Llama on Google Colab
3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
4. Llama on-prem with vLLM and TGI
5. Llama chatbot with RAG (Retrieval Augmented Generation)

* Specialized Llama use cases:
1. Ask Llama to summarize video content
@@ -103,3 +104,6 @@ To see how to query Llama2 and get answers with the Gradio UI both from the note
Then enter your question and click Submit. You'll see the following UI, either in the notebook or in a browser at http://127.0.0.1:7860:

![](llama2-gradio.png)

### [RAG Chatbot Example](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)
A complete example of how to build a Llama 2 chatbot, served in your browser, that can answer questions based on your own data.
2 changes: 2 additions & 0 deletions demo_apps/llama-on-prem.md
@@ -22,7 +22,9 @@ pip install vllm

Then run `huggingface-cli login` and copy and paste your Hugging Face access token to complete the login.

<!-- markdown-link-check-disable -->
There are two ways to deploy Llama 2 via vLLM, as a general API server or an OpenAI-compatible server (see [here](https://platform.openai.com/docs/api-reference/authentication) on how the OpenAI API authenticates, but you won't need to provide a real OpenAI API key when running Llama 2 via vLLM in the OpenAI-compatible mode).
<!-- markdown-link-check-enable -->
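
As a concrete illustration of the OpenAI-compatible mode, once the server is up (here assumed at the default http://localhost:8000, serving a model whose name is also an assumption), a completion request can be sent without a real OpenAI key:

```python
# Sketch: query a vLLM OpenAI-compatible server; URL, port, and model name are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"Authorization": "Bearer EMPTY"},  # placeholder key; no real OpenAI key needed
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Explain Retrieval Augmented Generation in one sentence.",
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["text"])
```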

### Deploying Llama 2 as an API Server

14 changes: 12 additions & 2 deletions examples/inference.py
@@ -11,7 +11,7 @@
import torch
from transformers import LlamaTokenizer

from llama_recipes.inference.safety_utils import get_safety_checker
from llama_recipes.inference.safety_utils import get_safety_checker, AgentType
from llama_recipes.inference.model_utils import load_model, load_peft_model


@@ -33,6 +33,8 @@ def main(
    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
    enable_llamaguard_content_safety: bool=False,
    llamaguard_model_name: str=None,
    max_padding_length: int=None, # the max padding length to be used with tokenizer padding the prompts.
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and xFormers memory-efficient kernels
    **kwargs
@@ -48,6 +50,12 @@ def main(
    else:
        print("No user prompt provided. Exiting.")
        sys.exit(1)

    if enable_llamaguard_content_safety:
        if not llamaguard_model_name:
            print("If --enable_llamaguard_content_safety is used, provide the model path with --llamaguard_model_name")
            sys.exit(1)


    # Set the seeds for reproducibility
    torch.cuda.manual_seed(seed)
@@ -77,6 +85,8 @@ def main(
    safety_checker = get_safety_checker(enable_azure_content_safety,
                                        enable_sensitive_topics,
                                        enable_salesforce_content_safety,
                                        enable_llamaguard_content_safety,
                                        guard_lama_path=llamaguard_model_name
                                        )

    # Safety check of the user prompt
@@ -117,7 +127,7 @@ def main(
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Safety check of the model output
    safety_results = [check(output_text) for check in safety_checker]
    safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
    are_safe = all([r[1] for r in safety_results])
    if are_safe:
        print("User input and model output deemed safe.")
19 changes: 19 additions & 0 deletions examples/llama_guard/README.md
@@ -0,0 +1,19 @@
# Llama Guard demo
<!-- markdown-link-check-disable -->
Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard).

This folder contains the files for the safety_checker function used when running the inference script.

## Requirements
1. Llama Guard model weights downloaded. To download, follow the steps shown [here](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard#download)
2. Llama recipes dependencies installed
3. A GPU with at least 21 GB of free memory to load the 7B model. Running both Llama 2 7B and Llama Guard requires either multiple GPUs or a single GPU with additional memory.
<!-- markdown-link-check-enable -->
### Inference Safety Checker
When the regular inference script is run with prompts, Llama Guard is used as a safety checker on both the user prompt and the model output. If both are safe, the result is shown; otherwise an error message is shown containing the word unsafe and a comma-separated list of the infringed categories. Because Llama Guard is not quantized, this setup requires more GPU memory than the plain inference examples: enough to load both the Llama model used for inference and the Llama Guard model used for safety checks. With the Llama 2 7B model quantized, this was run on a machine with four A10G GPUs.
Use this command for testing with a quantized Llama model, modifying the values accordingly:

`RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 python examples/inference.py --model_name <path_to_regular_llama_model> --prompt_file <path_to_prompt_file> --quantization --enable_llamaguard_content_safety --llamaguard_model_name <path_to_model>`



6 changes: 6 additions & 0 deletions examples/llama_guard/__init__.py
@@ -0,0 +1,6 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

from .generation import Llama, Dialog
from .model import ModelArgs, Transformer
from .tokenizer import Tokenizer
