
Commit

Merge branch 'main' into main
mreso authored Dec 8, 2023

2 parents 15a1206 + 1b9934e commit b9abee5
Showing 26 changed files with 2,170 additions and 40 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -1,6 +1,6 @@
# Llama 2 Fine-tuning / Inference Recipes, Examples and Demo Apps

**[Update Nov. 14, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama 2 locally, in the cloud, on-prem or with WhatsApp, and how to ask Llama 2 questions in general and about custom data (PDF, DB, or live).**
**[Update Nov. 16, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). These apps show how to run Llama (locally, in the cloud, or on-prem), how to ask Llama questions in general or about custom data (PDF, DB, or live), how to integrate Llama with WhatsApp, and how to implement an end-to-end chatbot with RAG (Retrieval Augmented Generation).**

The 'llama-recipes' repository is a companion to the [Llama 2 model](https://github.com/facebookresearch/llama). The goal of this repository is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models. See steps for conversion of the model [here](#model-conversion-to-hugging-face).

@@ -184,6 +184,7 @@ This folder contains a series of Llama2-powered apps:
2. Llama on Google Colab
3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
4. Llama on-prem with vLLM and TGI
5. Llama chatbot with RAG (Retrieval Augmented Generation)

* Specialized Llama use cases:
1. Ask Llama to summarize video content
717 changes: 717 additions & 0 deletions demo_apps/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions demo_apps/RAG_Chatbot_example/requirements.txt
@@ -0,0 +1,6 @@
gradio
pypdf
langchain
sentence-transformers
faiss-cpu
text-generation
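
Taken together, these six dependencies outline the chatbot's pipeline: pypdf and langchain for loading and chunking the PDF, sentence-transformers and faiss-cpu for embedding and vector search, text-generation for calling a TGI-served Llama 2, and gradio for the browser UI. Below is a minimal sketch of how they might fit together; it is not the notebook's actual code, and the file name, endpoint URL, embedding model, and prompt wording are all assumptions.

```python
# Sketch of a RAG chatbot wiring these dependencies together (illustrative only).
import gradio as gr
from langchain.document_loaders import PyPDFLoader          # uses pypdf under the hood
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings      # uses sentence-transformers
from langchain.vectorstores import FAISS                    # uses faiss-cpu
from text_generation import Client                          # TGI client

# Load and chunk the source document (file name assumed).
docs = PyPDFLoader("my_data.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and index them for similarity search (embedding model assumed).
db = FAISS.from_documents(chunks, HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"))

# A TGI server is assumed to be serving Llama 2 locally on port 8080.
llm = Client("http://127.0.0.1:8080")

def answer(question: str) -> str:
    # Retrieve the top chunks and stuff them into the prompt.
    context = "\n".join(d.page_content for d in db.similarity_search(question, k=3))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.generate(prompt, max_new_tokens=256).generated_text

# Serve the chatbot in the browser.
gr.Interface(fn=answer, inputs="text", outputs="text").launch()
```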
4 changes: 4 additions & 0 deletions demo_apps/README.md
@@ -6,6 +6,7 @@ This folder contains a series of Llama 2-powered apps:
2. Llama on Google Colab
3. Llama on Cloud and ask Llama questions about unstructured data in a PDF
4. Llama on-prem with vLLM and TGI
5. Llama chatbot with RAG (Retrieval Augmented Generation)

* Specialized Llama use cases:
1. Ask Llama to summarize video content
@@ -103,3 +104,6 @@ To see how to query Llama2 and get answers with the Gradio UI both from the note
Then enter your question and click Submit. You'll see the following UI, either in the notebook or in a browser at http://127.0.0.1:7860:

![](llama2-gradio.png)

### [RAG Chatbot Example](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)
A complete example of how to build a Llama 2 chatbot, served in your browser, that can answer questions based on your own data.
2 changes: 2 additions & 0 deletions demo_apps/llama-on-prem.md
@@ -22,7 +22,9 @@ pip install vllm

Then run `huggingface-cli login` and copy and paste your Hugging Face access token to complete the login.

<!-- markdown-link-check-disable -->
There are two ways to deploy Llama 2 via vLLM, as a general API server or an OpenAI-compatible server (see [here](https://platform.openai.com/docs/api-reference/authentication) on how the OpenAI API authenticates, but you won't need to provide a real OpenAI API key when running Llama 2 via vLLM in the OpenAI-compatible mode).
<!-- markdown-link-check-enable -->
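
As a concrete illustration of the OpenAI-compatible mode, once the server is up (here assumed at the default http://localhost:8000, serving a model whose name is also an assumption), a completion request can be sent without a real OpenAI key:

```python
# Sketch: query a vLLM OpenAI-compatible server; URL, port, and model name are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"Authorization": "Bearer EMPTY"},  # placeholder key; no real OpenAI key needed
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Explain Retrieval Augmented Generation in one sentence.",
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["text"])
```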

### Deploying Llama 2 as an API Server

14 changes: 12 additions & 2 deletions examples/inference.py
@@ -11,7 +11,7 @@
import torch
from transformers import LlamaTokenizer

from llama_recipes.inference.safety_utils import get_safety_checker
from llama_recipes.inference.safety_utils import get_safety_checker, AgentType
from llama_recipes.inference.model_utils import load_model, load_peft_model


@@ -33,6 +33,8 @@ def main(
    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
    enable_llamaguard_content_safety: bool=False,
    llamaguard_model_name: str=None,
    max_padding_length: int=None, # the max padding length to be used with tokenizer padding the prompts.
    use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and xFormers memory-efficient kernels
    **kwargs
@@ -48,6 +50,12 @@ def main(
    else:
        print("No user prompt provided. Exiting.")
        sys.exit(1)

    if enable_llamaguard_content_safety:
        if not llamaguard_model_name:
            print("If --enable_llamaguard_content_safety is used, provide the model path with --llamaguard_model_name")
            sys.exit(1)


    # Set the seeds for reproducibility
    torch.cuda.manual_seed(seed)
@@ -77,6 +85,8 @@ def main(
    safety_checker = get_safety_checker(enable_azure_content_safety,
                                        enable_sensitive_topics,
                                        enable_salesforce_content_safety,
                                        enable_llamaguard_content_safety,
                                        guard_lama_path=llamaguard_model_name
                                        )

    # Safety check of the user prompt
@@ -117,7 +127,7 @@ def main(
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Safety check of the model output
    safety_results = [check(output_text) for check in safety_checker]
    safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
    are_safe = all([r[1] for r in safety_results])
    if are_safe:
        print("User input and model output deemed safe.")
19 changes: 19 additions & 0 deletions examples/llama_guard/README.md
@@ -0,0 +1,19 @@
# Llama Guard demo
<!-- markdown-link-check-disable -->
Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard).

This folder contains the files for the safety_checker function used when running the inference script.

## Requirements
1. Llama Guard model weights downloaded. To download, follow the steps shown [here](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard#download)
2. Llama recipes dependencies installed
3. A GPU with at least 21 GB of free memory to load the 7B model. Running both Llama 2 7B and Llama Guard requires either multiple GPUs or a single GPU with additional memory.
<!-- markdown-link-check-enable -->
### Inference Safety Checker
When the regular inference script is run with prompts, Llama Guard is used as a safety checker on both the user prompt and the model output. If both are safe, the result is shown; otherwise an error message is shown containing the word unsafe and a comma-separated list of the infringed categories. Because Llama Guard is not quantized, this setup requires more GPU memory than the plain inference examples: enough to load both the Llama model used for inference and the Llama Guard model used for safety checks. With the Llama 2 7B model quantized, this was run on a machine with four A10G GPUs.
Use this command for testing with a quantized Llama model, modifying the values accordingly:

`RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 python examples/inference.py --model_name <path_to_regular_llama_model> --prompt_file <path_to_prompt_file> --quantization --enable_llamaguard_content_safety --llamaguard_model_name <path_to_model>`



6 changes: 6 additions & 0 deletions examples/llama_guard/__init__.py
@@ -0,0 +1,6 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

from .generation import Llama, Dialog
from .model import ModelArgs, Transformer
from .tokenizer import Tokenizer
