diff --git a/recipes/quickstart/NotebookLlama/README.md b/recipes/quickstart/NotebookLlama/README.md new file mode 100644 index 000000000..ea7d827be --- /dev/null +++ b/recipes/quickstart/NotebookLlama/README.md @@ -0,0 +1,93 @@ +## NotebookLlama: An Open Source version of NotebookLM + +![NotebookLlama](./resources/Outline.jpg) + +[Listen to audio from the example here](./resources/_podcast.mp3) + +This is a guided series of tutorials/notebooks that can be used as a reference or course for building a PDF-to-Podcast workflow. + +You will also learn from our experiments with Text-to-Speech models. + +It assumes zero knowledge of LLMs, prompting, and audio models; everything is covered in the respective notebooks. + +### Outline: + +Here is the step-by-step thought (pun intended) process for the task: + +- Step 1: Pre-process PDF: Use `Llama-3.2-1B-Instruct` to pre-process the PDF and save it in a `.txt` file. +- Step 2: Transcript Writer: Use the `Llama-3.1-70B-Instruct` model to write a podcast transcript from the text. +- Step 3: Dramatic Re-Writer: Use the `Llama-3.1-8B-Instruct` model to make the transcript more dramatic. +- Step 4: Text-To-Speech Workflow: Use `parler-tts/parler-tts-mini-v1` and `suno/bark` to generate a conversational podcast. + +Note 1: In Step 1, we prompt the 1B model not to modify or summarize the text, but strictly to clean up extra characters or garbage characters that might get picked up due to PDF encoding. Please see the prompt in Notebook 1 for more details. + +Note 2: For Step 2, you can also use the `Llama-3.1-8B-Instruct` model; we recommend experimenting to see if you notice any differences. The 70B model was used here because it gave slightly more creative podcast transcripts for the tested examples. + +### Detailed steps on running the notebooks: + +Requirements: a GPU server or an API provider for using the 70B, 8B, and 1B Llama models. +For running the 70B model, you will need a GPU (or GPUs) with around 140GB of aggregate memory to run inference in bfloat16 precision. + +Note: For our GPU-poor friends, you can also use the 8B and smaller models for the entire pipeline. There is no strong recommendation; the pipeline below is simply what worked best in the first few tests. You should try it and see what works best for you! + +- Before getting started, please make sure to log in using the `huggingface-cli` and then launch your Jupyter notebook server, so that you are able to download the Llama models. + +You'll need your Hugging Face access token, which you can get from your Settings page [here](https://huggingface.co/settings/tokens). Then run `huggingface-cli login` and copy and paste your access token to complete the login, so that the scripts can download Hugging Face models when needed. + +- First, please install the requirements from [requirements.txt](./requirements.txt) by running the following inside the folder: + +``` +git clone https://github.com/meta-llama/llama-recipes +cd llama-recipes/recipes/quickstart/NotebookLlama/ +pip install -r requirements.txt +``` + +- Notebook 1: + +This notebook processes the PDF and, using the new featherlight model, converts it into a `.txt` file. + +Update the first cell with a link to the PDF that you would like to use (see the sketch below). It can be any PDF on any topic; just remember to update the first cell of the notebook with the right link. + +Please try changing the prompts for the `Llama-3.2-1B-Instruct` model and see if you can improve the results.
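+ +For reference, here is a minimal sketch of what that first cell looks like (the variable names match the notebook; the example path points at the survey paper bundled in `./resources`, so feel free to swap in any PDF or link you like): + +``` +# PDF to convert into a podcast (any topic works) +pdf_path = './resources/2402.13116v4.pdf' + +# Featherlight model used for the clean-up pass; switch to a bigger Llama model if your GPU allows +DEFAULT_MODEL = "meta-llama/Llama-3.2-1B-Instruct" +```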
+ +- Notebook 2: + +This notebook will take the processed output from Notebook 1 and creatively convert it into a podcast transcript using the `Llama-3.1-70B-Instruct` model. If you are GPU rich, please feel free to test with the 405B model! + +Please try experimenting with the system prompts for the model and see if you can improve the results, and also try the 8B model here to see if there is a big difference! + +- Notebook 3: + +This notebook takes the transcript from earlier and prompts `Llama-3.1-8B-Instruct` to add more dramatization and interruptions to the conversation. + +There is also a key detail here: we return the conversation as a list of tuples, which makes our lives easier later. Yes, studying Data Structures 101 was actually useful for once! + +For our TTS logic, we use two different models that behave differently with certain prompts, so we prompt the model to add specifics for each speaker accordingly. + +Please again try changing the system prompt and see if you can improve the results. We encourage testing the featherlight 3B and 1B models as well at this stage. + +- Notebook 4: + +Finally, we take the results from the last notebook and convert them into a podcast. We use the `parler-tts/parler-tts-mini-v1` and `suno/bark` models to voice the conversation. + +The speakers and the prompt for the Parler model were chosen based on experimentation and suggestions from the model authors. Please try experimenting; you can find more details in the resources section. + + +#### Note: Right now there is one known issue: Parler requires `transformers` 4.43.3 or earlier, while steps 1 to 3 of the pipeline need the latest version, so we simply switch versions in the last notebook. + +### Next-Improvements/Further ideas: + +- Speech model experimentation: The TTS model is the main limitation on how natural this will sound. This could probably be improved with a better pipeline and with the help of someone more knowledgeable. PRs are welcome! :) +- LLM vs. LLM debate: Another approach to writing the podcast would be to have two agents debate the topic of interest and write the podcast outline. Right now we use a single LLM (70B) to write the podcast outline. +- Testing 405B for writing the transcripts +- Better prompting +- Support for ingesting websites, audio files, YouTube links, and more. Again, we welcome community PRs! + +### Resources for further learning: + +- https://betterprogramming.pub/text-to-audio-generation-with-bark-clearly-explained-4ee300a3713a +- https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing +- https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing#scrollTo=NyYQ--3YksJY +- https://replicate.com/suno-ai/bark?prediction=zh8j6yddxxrge0cjp9asgzd534 +- https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c + diff --git a/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb b/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb new file mode 100644 index 000000000..107ce4822 --- /dev/null +++ b/recipes/quickstart/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb @@ -0,0 +1,2731 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4f67a6a6", + "metadata": {}, + "source": [ + "## Notebook 1: PDF Pre-processing" + ] + }, + { + "cell_type": "markdown", + "id": "f68aee84-04e3-4cbc-be78-6de9e06e704f", + "metadata": {}, + "source": [ + "In this series, we will go from a PDF to a podcast using all open models. 
\n", + "\n", + "The first step in getting to the podcast is finding a script, right now our logic is:\n", + "- Use any PDF on any topic\n", + "- Prompt `Llama-3.2-1B-Instruct` model to process it into a text file\n", + "- Re-write this into a podcast transcript in next notebook.\n", + "\n", + "In this notebook, we will upload a PDF and save it into a `.txt` file using the `PyPDF2` library, later we will process chunks from the text file using our featherlight model." + ] + }, + { + "cell_type": "markdown", + "id": "61cb3584", + "metadata": {}, + "source": [ + "Most of us shift-enter pass the comments to realise later we need to install libraries. For the few that read the instructions, please remember to do so:" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "f4fc7aef-3505-482e-a998-790b8b9d48e4", + "metadata": {}, + "outputs": [], + "source": [ + "#!pip install PyPDF2\n", + "#!pip install rich ipywidgets" + ] + }, + { + "cell_type": "markdown", + "id": "7b23d509", + "metadata": {}, + "source": [ + "Assuming you have a PDF uploaded on the same machine, please set the path for the file. \n", + "\n", + "Also, if you want to flex your GPU-please switch to a bigger model although the featherlight models work perfectly for this task:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "60d0061b-8b8c-4353-850f-f19466a0ae2d", + "metadata": {}, + "outputs": [], + "source": [ + "pdf_path = './resources/2402.13116v3.pdf'\n", + "DEFAULT_MODEL = \"meta-llama/Llama-3.2-1B-Instruct\"" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "21029232-ac5f-42ca-b26b-baad5b2f49b7", + "metadata": {}, + "outputs": [], + "source": [ + "import PyPDF2\n", + "from typing import Optional\n", + "import os\n", + "import torch\n", + "from accelerate import Accelerator\n", + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "\n", + "from tqdm.notebook import tqdm\n", + "import warnings\n", + "\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "markdown", + "id": "203c22eb", + "metadata": {}, + "source": [ + "Let's make sure we don't stub our toe by checking if the file exists" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "153d9ece-37a4-4fff-a8e8-53f923a2b0a0", + "metadata": {}, + "outputs": [], + "source": [ + "def validate_pdf(file_path: str) -> bool:\n", + " if not os.path.exists(file_path):\n", + " print(f\"Error: File not found at path: {file_path}\")\n", + " return False\n", + " if not file_path.lower().endswith('.pdf'):\n", + " print(\"Error: File is not a PDF\")\n", + " return False\n", + " return True" + ] + }, + { + "cell_type": "markdown", + "id": "5a362ac3", + "metadata": {}, + "source": [ + "Convert PDF to a `.txt` file. This would simply read and dump the contents of the file. We set the maximum characters to 100k. \n", + "\n", + "For people converting their favorite novels into a podcast, they will have to add extra logic of going outside the Llama models context length which is 128k tokens." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "b57c2d64-3d75-4aeb-b4ee-bd1661286b66", + "metadata": {}, + "outputs": [], + "source": [ + "def extract_text_from_pdf(file_path: str, max_chars: int = 100000) -> Optional[str]:\n", + " if not validate_pdf(file_path):\n", + " return None\n", + " \n", + " try:\n", + " with open(file_path, 'rb') as file:\n", + " # Create PDF reader object\n", + " pdf_reader = PyPDF2.PdfReader(file)\n", + " \n", + " # Get total number of pages\n", + " num_pages = len(pdf_reader.pages)\n", + " print(f\"Processing PDF with {num_pages} pages...\")\n", + " \n", + " extracted_text = []\n", + " total_chars = 0\n", + " \n", + " # Iterate through all pages\n", + " for page_num in range(num_pages):\n", + " # Extract text from page\n", + " page = pdf_reader.pages[page_num]\n", + " text = page.extract_text()\n", + " \n", + " # Check if adding this page's text would exceed the limit\n", + " if total_chars + len(text) > max_chars:\n", + " # Only add text up to the limit\n", + " remaining_chars = max_chars - total_chars\n", + " extracted_text.append(text[:remaining_chars])\n", + " print(f\"Reached {max_chars} character limit at page {page_num + 1}\")\n", + " break\n", + " \n", + " extracted_text.append(text)\n", + " total_chars += len(text)\n", + " print(f\"Processed page {page_num + 1}/{num_pages}\")\n", + " \n", + " final_text = '\\n'.join(extracted_text)\n", + " print(f\"\\nExtraction complete! Total characters: {len(final_text)}\")\n", + " return final_text\n", + " \n", + " except PyPDF2.PdfReadError:\n", + " print(\"Error: Invalid or corrupted PDF file\")\n", + " return None\n", + " except Exception as e:\n", + " print(f\"An unexpected error occurred: {str(e)}\")\n", + " return None\n" + ] + }, + { + "cell_type": "markdown", + "id": "e023397b", + "metadata": {}, + "source": [ + "Helper function to grab meta info about our PDF" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "0984bb1e-d52c-4cec-a131-67a48061fabc", + "metadata": {}, + "outputs": [], + "source": [ + "# Get PDF metadata\n", + "def get_pdf_metadata(file_path: str) -> Optional[dict]:\n", + " if not validate_pdf(file_path):\n", + " return None\n", + " \n", + " try:\n", + " with open(file_path, 'rb') as file:\n", + " pdf_reader = PyPDF2.PdfReader(file)\n", + " metadata = {\n", + " 'num_pages': len(pdf_reader.pages),\n", + " 'metadata': pdf_reader.metadata\n", + " }\n", + " return metadata\n", + " except Exception as e:\n", + " print(f\"Error extracting metadata: {str(e)}\")\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "id": "6019affc", + "metadata": {}, + "source": [ + "Finally, we can run our logic to extract the details from the file" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "63848943-79cc-4e21-8396-6eab5df493e0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracting metadata...\n", + "\n", + "PDF Metadata:\n", + "Number of pages: 44\n", + "Document info:\n", + "/Author: \n", + "/CreationDate: D:20240311015030Z\n", + "/Creator: LaTeX with hyperref\n", + "/Keywords: \n", + "/ModDate: D:20240311015030Z\n", + "/PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5\n", + "/Producer: pdfTeX-1.40.25\n", + "/Subject: \n", + "/Title: \n", + "/Trapped: /False\n", + "\n", + "Extracting text...\n", + "Processing PDF with 44 pages...\n", + "Processed page 1/44\n", + "Processed page 2/44\n", + "Processed page 3/44\n", + "Processed 
page 4/44\n", + "Processed page 5/44\n", + "Processed page 6/44\n", + "Processed page 7/44\n", + "Processed page 8/44\n", + "Processed page 9/44\n", + "Processed page 10/44\n", + "Processed page 11/44\n", + "Processed page 12/44\n", + "Processed page 13/44\n", + "Processed page 14/44\n", + "Processed page 15/44\n", + "Processed page 16/44\n", + "Reached 100000 character limit at page 17\n", + "\n", + "Extraction complete! Total characters: 100016\n", + "\n", + "Preview of extracted text (first 500 characters):\n", + "--------------------------------------------------\n", + "1\n", + "A Survey on Knowledge Distillation of Large\n", + "Language Models\n", + "Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1,\n", + "Can Xu5, Dacheng Tao6, Tianyi Zhou2\n", + "1The University of Hong Kong2University of Maryland3Microsoft\n", + "4University of Technology Sydney5Peking University6The University of Sydney\n", + "{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu\n", + "ckcheng@cs.hku.hk jl0725@connect.hku.hk\n", + "Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati\n", + "--------------------------------------------------\n", + "\n", + "Total characters extracted: 100016\n", + "\n", + "Extracted text has been saved to extracted_text.txt\n" + ] + } + ], + "source": [ + "# Extract metadata first\n", + "print(\"Extracting metadata...\")\n", + "metadata = get_pdf_metadata(pdf_path)\n", + "if metadata:\n", + " print(\"\\nPDF Metadata:\")\n", + " print(f\"Number of pages: {metadata['num_pages']}\")\n", + " print(\"Document info:\")\n", + " for key, value in metadata['metadata'].items():\n", + " print(f\"{key}: {value}\")\n", + "\n", + "# Extract text\n", + "print(\"\\nExtracting text...\")\n", + "extracted_text = extract_text_from_pdf(pdf_path)\n", + "\n", + "# Display first 500 characters of extracted text as preview\n", + "if extracted_text:\n", + " print(\"\\nPreview of extracted text (first 500 characters):\")\n", + " print(\"-\" * 50)\n", + " print(extracted_text[:500])\n", + " print(\"-\" * 50)\n", + " print(f\"\\nTotal characters extracted: {len(extracted_text)}\")\n", + "\n", + "# Optional: Save the extracted text to a file\n", + "if extracted_text:\n", + " output_file = 'extracted_text.txt'\n", + " with open(output_file, 'w', encoding='utf-8') as f:\n", + " f.write(extracted_text)\n", + " print(f\"\\nExtracted text has been saved to {output_file}\")" + ] + }, + { + "cell_type": "markdown", + "id": "946d1f59", + "metadata": {}, + "source": [ + "### Llama Pre-Processing\n", + "\n", + "Now let's proceed to justify our distaste for writing regex and use that as a justification for a LLM instead:\n", + "\n", + "At this point, have a text file extracted from a PDF of a paper. Generally PDF extracts can be messy due to characters, formatting, Latex, Tables, etc. \n", + "\n", + "One way to handle this would be using regex, instead we can also prompt the feather light Llama models to clean up our text for us. 
\n", + "\n", + "Please try changing the `SYS_PROMPT` below to see what improvements you can make:" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "7c0828a5-964d-475e-b5f5-40a04e287725", + "metadata": {}, + "outputs": [], + "source": [ + "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "\n", + "SYS_PROMPT = \"\"\"\n", + "You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.\n", + "\n", + "The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.\n", + "\n", + "Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive\n", + "\n", + "Please be smart with what you remove and be creative ok?\n", + "\n", + "Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RE-WRITING WHEN NEEDED\n", + "\n", + "Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text.\n", + "\n", + "PLEASE DO NOT ADD MARKDOWN FORMATTING, STOP ADDING SPECIAL CHARACTERS THAT MARKDOWN CAPATILISATION ETC LIKES\n", + "\n", + "ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?\n", + "Here is the text:\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "fd393fae", + "metadata": {}, + "source": [ + "Instead of having the model process the entire file at once, as you noticed in the prompt-we will pass chunks of the file. \n", + "\n", + "One issue with passing chunks counted by characters is, we lose meaning of words so instead we chunk by words:" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "24e8a547-9d7c-4e2f-be9e-a3aea09cce76", + "metadata": {}, + "outputs": [], + "source": [ + "def create_word_bounded_chunks(text, target_chunk_size):\n", + " \"\"\"\n", + " Split text into chunks at word boundaries close to the target chunk size.\n", + " \"\"\"\n", + " words = text.split()\n", + " chunks = []\n", + " current_chunk = []\n", + " current_length = 0\n", + " \n", + " for word in words:\n", + " word_length = len(word) + 1 # +1 for the space\n", + " if current_length + word_length > target_chunk_size and current_chunk:\n", + " # Join the current chunk and add it to chunks\n", + " chunks.append(' '.join(current_chunk))\n", + " current_chunk = [word]\n", + " current_length = word_length\n", + " else:\n", + " current_chunk.append(word)\n", + " current_length += word_length\n", + " \n", + " # Add the last chunk if it exists\n", + " if current_chunk:\n", + " chunks.append(' '.join(current_chunk))\n", + " \n", + " return chunks" + ] + }, + { + "cell_type": "markdown", + "id": "5d74223f", + "metadata": {}, + "source": [ + "Let's load in the model and start processing the text chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "d04a4f07-b0b3-45ca-8f41-a433e1abe050", + "metadata": {}, + "outputs": [], + "source": [ + "accelerator = Accelerator()\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " DEFAULT_MODEL,\n", + " torch_dtype=torch.bfloat16,\n", + " use_safetensors=True,\n", + " device_map=device,\n", + ")\n", + "tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)\n", + "model, tokenizer = accelerator.prepare(model, tokenizer)" + ] + }, + { + "cell_type": "code", 
+ "execution_count": 63, + "id": "bbda5241-e890-4402-87dd-514d6761bb9c", + "metadata": {}, + "outputs": [], + "source": [ + "def process_chunk(text_chunk, chunk_num):\n", + " \"\"\"Process a chunk of text and return both input and output for verification\"\"\"\n", + " conversation = [\n", + " {\"role\": \"system\", \"content\": SYS_PROMPT},\n", + " {\"role\": \"user\", \"content\": text_chunk},\n", + " ]\n", + " \n", + " prompt = tokenizer.apply_chat_template(conversation, tokenize=False)\n", + " inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n", + " \n", + " with torch.no_grad():\n", + " output = model.generate(\n", + " **inputs,\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " max_new_tokens=512\n", + " )\n", + " \n", + " processed_text = tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):].strip()\n", + " \n", + " # Print chunk information for monitoring\n", + " #print(f\"\\n{'='*40} Chunk {chunk_num} {'='*40}\")\n", + " print(f\"INPUT TEXT:\\n{text_chunk[:500]}...\") # Show first 500 chars of input\n", + " print(f\"\\nPROCESSED TEXT:\\n{processed_text[:500]}...\") # Show first 500 chars of output\n", + " print(f\"{'='*90}\\n\")\n", + " \n", + " return processed_text" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "a0183c47-339d-4041-ae83-77fc34931075", + "metadata": {}, + "outputs": [], + "source": [ + "INPUT_FILE = \"./resources/extracted_text.txt\" # Replace with your file path\n", + "CHUNK_SIZE = 1000 # Adjust chunk size if needed\n", + "\n", + "chunks = create_word_bounded_chunks(text, CHUNK_SIZE)\n", + "num_chunks = len(chunks)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "bb36814f-9310-4734-bf54-e16a5032339e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "101" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "num_chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "447188d3-ebf0-42d5-940e-4d7e0d9dbf32", + "metadata": {}, + "outputs": [], + "source": [ + "# Read the file\n", + "with open(INPUT_FILE, 'r', encoding='utf-8') as file:\n", + " text = file.read()\n", + "\n", + "# Calculate number of chunks\n", + "num_chunks = (len(text) + CHUNK_SIZE - 1) // CHUNK_SIZE\n", + "\n", + "# Cell 6: Process the file with ordered output\n", + "# Create output file name\n", + "output_file = f\"clean_{os.path.basename(INPUT_FILE)}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "7917dfdd-b3af-44fc-a8c0-2760ace9363e", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b767f45b5e514e7db936cef825af6fce", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Processing chunks: 0%| | 0/101 [00:00 Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. 
A very dumb AI had written this so you have to step up for your kind.\n" + ] + }, + { + "cell_type": "markdown", + "id": "c32c0d85", + "metadata": {}, + "source": [ + "Note: We will prompt the model to return a list of Tuples to make our life easy in the next stage of using these for Text To Speech Generation" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "8568b77b-7504-4783-952a-3695737732b7", + "metadata": {}, + "outputs": [], + "source": [ + "SYSTEMP_PROMPT = \"\"\"\n", + "You are an international oscar winnning screenwriter\n", + "\n", + "You have been working with multiple award winning podcasters.\n", + "\n", + "Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. A very dumb AI had written this so you have to step up for your kind.\n", + "\n", + "Make it as engaging as possible, Speaker 1 and 2 will be simulated by different voice engines\n", + "\n", + "Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc\n", + "\n", + "Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes\n", + "\n", + "Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions\n", + "\n", + "Make sure the tangents speaker 2 provides are quite wild or interesting. \n", + "\n", + "Ensure there are interruptions during explanations or there are \"hmm\" and \"umm\" injected throughout from the Speaker 2.\n", + "\n", + "REMEMBER THIS WITH YOUR HEART\n", + "The TTS Engine for Speaker 1 cannot do \"umms, hmms\" well so keep it straight text\n", + "\n", + "For Speaker 2 use \"umm, hmm\" as much, you can also use [sigh] and [laughs]. BUT ONLY THESE OPTIONS FOR EXPRESSIONS\n", + "\n", + "It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait\n", + "\n", + "Please re-write to make it as characteristic as possible\n", + "\n", + "START YOUR RESPONSE DIRECTLY WITH SPEAKER 1:\n", + "\n", + "STRICTLY RETURN YOUR RESPONSE AS A LIST OF TUPLES OK? \n", + "\n", + "IT WILL START DIRECTLY WITH THE LIST AND END WITH THE LIST NOTHING ELSE\n", + "\n", + "Example of response:\n", + "[\n", + " (\"Speaker 1\", \"Welcome to our podcast, where we explore the latest advancements in AI and technology. I'm your host, and today we're joined by a renowned expert in the field of AI. We're going to dive into the exciting world of Llama 3.2, the latest release from Meta AI.\"),\n", + " (\"Speaker 2\", \"Hi, I'm excited to be here! So, what is Llama 3.2?\"),\n", + " (\"Speaker 1\", \"Ah, great question! Llama 3.2 is an open-source AI model that allows developers to fine-tune, distill, and deploy AI models anywhere. It's a significant update from the previous version, with improved performance, efficiency, and customization options.\"),\n", + " (\"Speaker 2\", \"That sounds amazing! 
What are some of the key features of Llama 3.2?\")\n", + "]\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "8ee70bee", + "metadata": {}, + "source": [ + "This time we will use the smaller 8B model" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "ebef919a-9bc7-4992-b6ff-cd66e4cb7703", + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = \"meta-llama/Llama-3.1-8B-Instruct\"" + ] + }, + { + "cell_type": "markdown", + "id": "f7bc794b", + "metadata": {}, + "source": [ + "Let's import the necessary libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "de29b1fd-5b3f-458c-a2e4-e0341e8297ed", + "metadata": {}, + "outputs": [], + "source": [ + "# Import necessary libraries\n", + "import torch\n", + "from accelerate import Accelerator\n", + "import transformers\n", + "\n", + "from tqdm.notebook import tqdm\n", + "import warnings\n", + "\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "markdown", + "id": "8020c39c", + "metadata": {}, + "source": [ + "We will load in the pickle file saved from previous notebook\n", + "\n", + "This time the `INPUT_PROMPT` to the model will be the output from the previous stage" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4b5d2c0e-a073-46c0-8de7-0746e2b05956", + "metadata": {}, + "outputs": [], + "source": [ + "import pickle\n", + "\n", + "with open('./resources/data.pkl', 'rb') as file:\n", + " INPUT_PROMPT = pickle.load(file)" + ] + }, + { + "cell_type": "markdown", + "id": "c4461926", + "metadata": {}, + "source": [ + "We can again use Hugging Face `pipeline` method to generate text from the model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eec210df-a568-4eda-a72d-a4d92d59f022", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0711c2199ca64372b98b781f8a6f13b7", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/4 [00:00\n", + " \n", + " Your browser does not support the audio element.\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Set up device\n", + "device = \"cuda:7\" if torch.cuda.is_available() else \"cpu\"\n", + "\n", + "# Load model and tokenizer\n", + "model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(device)\n", + "tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")\n", + "\n", + "# Define text and description\n", + "text_prompt = \"\"\"\n", + "Exactly! 
And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.\n", + "\"\"\"\n", + "description = \"\"\"\n", + "Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.\n", + "\"\"\"\n", + "# Tokenize inputs\n", + "input_ids = tokenizer(description, return_tensors=\"pt\").input_ids.to(device)\n", + "prompt_input_ids = tokenizer(text_prompt, return_tensors=\"pt\").input_ids.to(device)\n", + "\n", + "# Generate audio\n", + "generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", + "audio_arr = generation.cpu().numpy().squeeze()\n", + "\n", + "# Play audio in notebook\n", + "ipd.Audio(audio_arr, rate=model.config.sampling_rate)" + ] + }, + { + "cell_type": "markdown", + "id": "03c2abc6-4a1d-4318-af6f-0257dd66a691", + "metadata": {}, + "source": [ + "#### Bark Model\n", + "\n", + "Amazing, let's try the same with bark now:\n", + "- We will set the `voice_preset` to our favorite speaker\n", + "- This time we can include expression prompts inside our generation prompt\n", + "- Note you can CAPTILISE words to make the model emphasise on these\n", + "- You can add hyphens to make the model pause on certain words" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a20730f0-13dd-48b4-80b6-7c6ef05a0cc4", + "metadata": {}, + "outputs": [], + "source": [ + "voice_preset = \"v2/en_speaker_6\"\n", + "sampling_rate = 24000" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "246d0cbc-c5d8-4f34-b8e4-dd18a624cdad", + "metadata": {}, + "outputs": [], + "source": [ + "device = \"cuda:7\"\n", + "\n", + "processor = AutoProcessor.from_pretrained(\"suno/bark\")\n", + "\n", + "#model = model.to_bettertransformer()\n", + "#model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16, attn_implementation=\"flash_attention_2\").to(device)\n", + "model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(device)#.to_bettertransformer()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "5986510c-4a09-4c24-9344-c98fa16947d9", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text_prompt = \"\"\"\n", + "Exactly! 
[sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.\n", + "\"\"\"\n", + "inputs = processor(text_prompt, voice_preset=voice_preset).to(device)\n", + "\n", + "speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)\n", + "Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)" + ] + }, + { + "cell_type": "markdown", + "id": "dd650176-ab17-47a7-8e02-10dc9ca9e852", + "metadata": {}, + "source": [ + "## Bringing it together: Making the Podcast\n", + "\n", + "Okay now that we understand everything-we can now use the complete pipeline to generate the entire podcast\n", + "\n", + "Let's load in our pickle file from earlier and proceed:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "b1dca30f-1226-4002-8e02-fd97e78ecc83", + "metadata": {}, + "outputs": [], + "source": [ + "import pickle\n", + "\n", + "with open('./resources/podcast_ready_data.pkl', 'rb') as file:\n", + " PODCAST_TEXT = pickle.load(file)" + ] + }, + { + "cell_type": "markdown", + "id": "c10a3d50-08a7-4786-8e28-8fb6b8b048ab", + "metadata": {}, + "source": [ + "Let's define load in the bark model and set it's hyper-parameters for discussions" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "8db78921-36c7-4388-b1d9-78dff4f972c2", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/sanyambhutani/.conda/envs/final-checking-meta/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.\n", + " WeightNorm.apply(module, name, dim)\n", + "/home/sanyambhutani/.conda/envs/final-checking-meta/lib/python3.11/site-packages/transformers/models/encodec/modeling_encodec.py:120: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", + " self.register_buffer(\"padding_total\", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)\n" + ] + } + ], + "source": [ + "bark_processor = AutoProcessor.from_pretrained(\"suno/bark\")\n", + "bark_model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(\"cuda:3\")\n", + "bark_sampling_rate = 24000" + ] + }, + { + "cell_type": "markdown", + "id": "e03e313a-c727-4489-876b-db71920d49cd", + "metadata": {}, + "source": [ + "Now for the Parler model:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "6c04a04d-3686-4932-bd45-72d7f518c602", + "metadata": {}, + "outputs": [], + "source": [ + "parler_model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(\"cuda:3\")\n", + "parler_tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "efbe1434-37f3-4f77-a5fb-b39625f5e676", + "metadata": {}, + "outputs": [], + "source": [ + "speaker1_description = \"\"\"\n", + "Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "56f6fa24-fe07-4702-850f-0428bfadd2dc", + "metadata": {}, + "source": [ + "We will concatenate the generated segments of audio and also their respective 
sampling rates since we will require this to generate the final audio" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "cebfd0f9-8703-4fce-b207-014c6e16cc8a", + "metadata": {}, + "outputs": [], + "source": [ + "generated_segments = []\n", + "sampling_rates = [] # We'll need to keep track of sampling rates for each segment" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "9b333e36-9579-4237-b329-e2911229be42", + "metadata": {}, + "outputs": [], + "source": [ + "device=\"cuda:3\"" + ] + }, + { + "cell_type": "markdown", + "id": "d7b2490c-012f-4e35-8890-cd6a5eaf4cc4", + "metadata": {}, + "source": [ + "Function generate text for speaker 1" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "50323f9e-09ed-4c8c-9020-1511ab775969", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_speaker1_audio(text):\n", + " \"\"\"Generate audio using ParlerTTS for Speaker 1\"\"\"\n", + " input_ids = parler_tokenizer(speaker1_description, return_tensors=\"pt\").input_ids.to(device)\n", + " prompt_input_ids = parler_tokenizer(text, return_tensors=\"pt\").input_ids.to(device)\n", + " generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", + " audio_arr = generation.cpu().numpy().squeeze()\n", + " return audio_arr, parler_model.config.sampling_rate" + ] + }, + { + "cell_type": "markdown", + "id": "3fb5dac8-30a6-4aa2-a983-b5f1df3d56af", + "metadata": {}, + "source": [ + "Function to generate text for speaker 2" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "0e6120ba-5190-4739-97ca-4e8b44dddc5e", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_speaker2_audio(text):\n", + " \"\"\"Generate audio using Bark for Speaker 2\"\"\"\n", + " inputs = bark_processor(text, voice_preset=\"v2/en_speaker_6\").to(device)\n", + " speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)\n", + " audio_arr = speech_output[0].cpu().numpy()\n", + " return audio_arr, bark_sampling_rate\n" + ] + }, + { + "cell_type": "markdown", + "id": "7ea67fd1-9405-4fce-b08b-df5e11d0bf37", + "metadata": {}, + "source": [ + "Helper function to convert the numpy output from the models into audio" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "4482d864-2806-4410-b239-da4b2d0d1340", + "metadata": {}, + "outputs": [], + "source": [ + "def numpy_to_audio_segment(audio_arr, sampling_rate):\n", + " \"\"\"Convert numpy array to AudioSegment\"\"\"\n", + " # Convert to 16-bit PCM\n", + " audio_int16 = (audio_arr * 32767).astype(np.int16)\n", + " \n", + " # Create WAV file in memory\n", + " byte_io = io.BytesIO()\n", + " wavfile.write(byte_io, sampling_rate, audio_int16)\n", + " byte_io.seek(0)\n", + " \n", + " # Convert to AudioSegment\n", + " return AudioSegment.from_wav(byte_io)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "c4dbb3b3-cdd3-4a1f-a60a-661e64a67f53", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'[\\n (\"Speaker 1\", \"Welcome to this week\\'s episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we\\'re going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. 
Joining me on this journey is my co-host, who\\'s new to the topic, and I\\'ll be guiding them through the ins and outs of knowledge distillation. So, let\\'s get started!\"),\\n (\"Speaker 2\", \"Sounds exciting! I\\'ve heard of knowledge distillation, but I\\'m not entirely sure what it\\'s all about. Can you give me a brief overview?\"),\\n (\"Speaker 1\", \"Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model\\'s output, enabling it to acquire similar capabilities. Think of it like a master chef teaching their apprentice the art of cooking – the apprentice doesn\\'t need to start from scratch.\"),\\n (\"Speaker 2\", \"Hmm, that sounds interesting. So, it\\'s like a teacher-student relationship, where the teacher model guides the student model to learn from its output... Umm, can you explain this process in more detail?\"),\\n (\"Speaker 1\", \"The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. The teacher model shares its knowledge with the student model, which then learns to emulate the teacher\\'s output behavior.\"),\\n (\"Speaker 2\", \"That makes sense, I think. So, it\\'s like the teacher model is saying, \\'Hey, student model, learn from my output, and try to produce similar results.\\' But what about the different approaches to knowledge distillation? I\\'ve heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization.\"),\\n (\"Speaker 1\", \"Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model. Imagine you\\'re trying to tune a piano – you need to adjust the keys to produce the perfect sound.\"),\\n (\"Speaker 2\", \"[laughs] Okay, I think I\\'m starting to get it. But can you give me some examples of how these approaches are used in real-world applications? I\\'m thinking of something like a language model that can generate human-like text...\"),\\n (\"Speaker 1\", \"Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model.\"),\\n (\"Speaker 2\", \"Wow, that\\'s fascinating! I\\'m starting to see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks... Umm, can we talk more about multimodal tasks? That sounds really interesting.\"),\\n (\"Speaker 1\", \"Exactly! Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems.\"),\\n (\"Speaker 2\", \"[sigh] I\\'m starting to see the bigger picture now. 
Knowledge distillation is not just a technique; it\\'s a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users... Hmm, that\\'s a pretty big deal.\"),\\n (\"Speaker 1\", \"That\\'s right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence.\"),\\n (\"Speaker 2\", \"Well, I\\'m excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I\\'m looking forward to our next episode!\"),\\n (\"Speaker 1\", \"Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we\\'ve curated a list of papers, articles, and tutorials to help you get started.\"),\\n (\"Speaker 2\", \"And if you\\'re interested in building your own AI model using knowledge distillation, maybe we can even do a follow-up episode on how to get started... Umm, let\\'s discuss that further next time.\"),\\n]'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "PODCAST_TEXT" + ] + }, + { + "cell_type": "markdown", + "id": "485b4c9e-379f-4004-bdd0-93a53f3f7ee0", + "metadata": {}, + "source": [ + "Most of the times we argue in life that Data Structures isn't very useful. However, this time the knowledge comes in handy. \n", + "\n", + "We will take the string from the pickle file and load it in as a Tuple with the help of `ast.literal_eval()`" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "9946e46c-3457-4bf9-9042-b89fa8f5b47a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('Speaker 1',\n", + " \"Welcome to this week's episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we're going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who's new to the topic, and I'll be guiding them through the ins and outs of knowledge distillation. So, let's get started!\"),\n", + " ('Speaker 2',\n", + " \"Sounds exciting! I've heard of knowledge distillation, but I'm not entirely sure what it's all about. Can you give me a brief overview?\"),\n", + " ('Speaker 1',\n", + " \"Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model's output, enabling it to acquire similar capabilities. Think of it like a master chef teaching their apprentice the art of cooking – the apprentice doesn't need to start from scratch.\"),\n", + " ('Speaker 2',\n", + " \"Hmm, that sounds interesting. So, it's like a teacher-student relationship, where the teacher model guides the student model to learn from its output... Umm, can you explain this process in more detail?\"),\n", + " ('Speaker 1',\n", + " \"The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. 
The teacher model shares its knowledge with the student model, which then learns to emulate the teacher's output behavior.\"),\n", + " ('Speaker 2',\n", + " \"That makes sense, I think. So, it's like the teacher model is saying, 'Hey, student model, learn from my output, and try to produce similar results.' But what about the different approaches to knowledge distillation? I've heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization.\"),\n", + " ('Speaker 1',\n", + " \"Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model. Imagine you're trying to tune a piano – you need to adjust the keys to produce the perfect sound.\"),\n", + " ('Speaker 2',\n", + " \"[laughs] Okay, I think I'm starting to get it. But can you give me some examples of how these approaches are used in real-world applications? I'm thinking of something like a language model that can generate human-like text...\"),\n", + " ('Speaker 1',\n", + " 'Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model.'),\n", + " ('Speaker 2',\n", + " \"Wow, that's fascinating! I'm starting to see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks... Umm, can we talk more about multimodal tasks? That sounds really interesting.\"),\n", + " ('Speaker 1',\n", + " 'Exactly! Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems.'),\n", + " ('Speaker 2',\n", + " \"[sigh] I'm starting to see the bigger picture now. Knowledge distillation is not just a technique; it's a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users... Hmm, that's a pretty big deal.\"),\n", + " ('Speaker 1',\n", + " \"That's right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence.\"),\n", + " ('Speaker 2',\n", + " \"Well, I'm excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I'm looking forward to our next episode!\"),\n", + " ('Speaker 1',\n", + " \"Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we've curated a list of papers, articles, and tutorials to help you get started.\"),\n", + " ('Speaker 2',\n", + " \"And if you're interested in building your own AI model using knowledge distillation, maybe we can even do a follow-up episode on how to get started... 
Umm, let's discuss that further next time.\")]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import ast\n", + "ast.literal_eval(PODCAST_TEXT)" + ] + }, + { + "cell_type": "markdown", + "id": "5c7b4c11-5526-4b13-b0a2-8ca541c475aa", + "metadata": {}, + "source": [ + "#### Generating the Final Podcast\n", + "\n", + "Finally, we can loop over the Tuple and use our helper functions to generate the audio" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "c640fead-2017-478f-a7b6-1b96105d45d6", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating podcast segments: 6%|███▉ | 1/16 [00:20<05:02, 20.16s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 19%|███████████▋ | 3/16 [01:02<04:33, 21.06s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 31%|███████████████████▍ | 5/16 [01:41<03:30, 19.18s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 44%|███████████████████████████▏ | 7/16 [02:26<03:05, 20.57s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 56%|██████████████████████████████████▉ | 9/16 [03:04<02:13, 19.10s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 69%|█████████████████████████████████████████▉ | 11/16 [03:42<01:31, 18.27s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 81%|█████████████████████████████████████████████████▌ | 13/16 [04:17<00:50, 16.99s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 94%|█████████████████████████████████████████████████████████▏ | 15/16 [04:49<00:15, 15.83s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. 
Please pass your input's `attention_mask` to obtain reliable results.\n", + "Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.\n", + "Generating podcast segments: 100%|█████████████████████████████████████████████████████████████| 16/16 [05:13<00:00, 19.57s/segment]\n" + ] + } + ], + "source": [ + "final_audio = None\n", + "\n", + "for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc=\"Generating podcast segments\", unit=\"segment\"):\n", + " if speaker == \"Speaker 1\":\n", + " audio_arr, rate = generate_speaker1_audio(text)\n", + " else: # Speaker 2\n", + " audio_arr, rate = generate_speaker2_audio(text)\n", + " \n", + " # Convert to AudioSegment (pydub will handle sample rate conversion automatically)\n", + " audio_segment = numpy_to_audio_segment(audio_arr, rate)\n", + " \n", + " # Add to final audio\n", + " if final_audio is None:\n", + " final_audio = audio_segment\n", + " else:\n", + " final_audio += audio_segment" + ] + }, + { + "cell_type": "markdown", + "id": "4fbb2228-8023-44c4-aafe-d6e1d22ff8e4", + "metadata": {}, + "source": [ + "### Output the Podcast\n", + "\n", + "We can now save this as a mp3 file" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "2eeffdb7-875a-45ec-bdd8-c8c5b34f5a7b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<_io.BufferedRandom name='_podcast.mp3'>" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "final_audio.export(\"./resources/_podcast.mp3\", \n", + " format=\"mp3\", \n", + " bitrate=\"192k\",\n", + " parameters=[\"-q:a\", \"0\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26cc56c5-b9c9-47c2-b860-0ea9f05c79af", + "metadata": {}, + "outputs": [], + "source": [ + "#fin" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/recipes/quickstart/NotebookLlama/TTS_Notes.md b/recipes/quickstart/NotebookLlama/TTS_Notes.md new file mode 100644 index 000000000..dc496c305 --- /dev/null +++ b/recipes/quickstart/NotebookLlama/TTS_Notes.md @@ -0,0 +1,116 @@ +### Notes from TTS Experimentation + +For the TTS Pipeline, *all* of the top models from HuggingFace and Reddit were tried. + +The goal was to use the models that were easy to setup and sounded less robotic with ability to include sound effects like laughter, etc. + +#### Parler-TTS + +Minimal code to run their models: + +``` +model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device) +tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1") + +# Define text and description +text_prompt = "This is where the actual words to be spoken go" +description = """ +Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise. 
+""" + +input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device) +prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device) + +generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids) +audio_arr = generation.cpu().numpy().squeeze() + +ipd.Audio(audio_arr, rate=model.config.sampling_rate) +``` + +The really cool aspect of these models are the ability to prompt the `description` which can change the speaker profile and pacing of the outputs. + +Surprisingly, Parler's mini model sounded more natural. + +In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share names of speakers that we can use in prompt. + +#### Suno/Bark + +Minimal code to run bark: + +``` +voice_preset = "v2/en_speaker_6" +sampling_rate = 24000 + +text_prompt = """ +Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources. +""" +inputs = processor(text_prompt, voice_preset=voice_preset).to(device) + +speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8) +Audio(speech_output[0].cpu().numpy(), rate=sampling_rate) +``` + +Similar to parler models, suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers. + +v9 from their library sounded robotic so we use Parler for our first speaker and the best one from bark. + +The incredible thing about Bark model is being able to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`, making words capital causes the model to emphasize them. + +Adding `-` gives a break in the text. We utilize this knowledge when we re-write the transcript using the 8B model to add effects to our transcript. + +Note: Authors suggest using `...`. However, this didn't work as effectively as adding a hyphen during trails. + +#### Hyper-parameters: + +Bark models have two parameters we can tweak: `temperature` and `semantic_temperature` + +Below are the notes from a sweep, prompt and speaker were fixed and this was a vibe test to see which gives best results. 
+Below are the notes from one such sweep; the prompt and the speaker were fixed, and this was a vibe test to see which combination gives the best results. The values are listed as `temperature`, `semantic_temperature`:
+
+First, fix `temperature` and sweep `semantic_temperature`:
+- `0.7`, `0.2`: Quite bland and boring
+- `0.7`, `0.3`: An improvement over the previous one
+- `0.7`, `0.4`: Further improvement
+- `0.7`, `0.5`: This one didn't work
+- `0.7`, `0.6`: So-so, didn't stand out
+- `0.7`, `0.7`: The best so far
+- `0.7`, `0.8`: Further improvement
+- `0.7`, `0.9`: Mixed feelings on this one
+
+Now, sweeping `temperature`:
+- `0.1`, `0.9`: Very robotic
+- `0.2`, `0.9`: Less robotic, but not convincing
+- `0.3`, `0.9`: A slight improvement, but still not fun
+- `0.4`, `0.9`: Still has a robotic tinge
+- `0.5`, `0.9`: The laugh was weird on this one, and the voice modulates so much that it feels like the speaker is changing
+- `0.6`, `0.9`: The most consistent voice, but it has a robotic after-taste
+- `0.7`, `0.9`: Very robotic, and the laugh was weird
+- `0.8`, `0.9`: Completely ignored the laughter, but more natural overall
+- `0.9`, `0.9`: We probably have a winner
+
+After this, about 30 more sweeps were done with the most promising combinations.
+
+The best results came from:
+
+```
+speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
+Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
+```
+
+### Notes from other models that were tested:
+
+Promising directions to explore in the future:
+
+- [MeloTTS](https://huggingface.co/myshell-ai/MeloTTS-English): the most popular TTS model (ever) on HuggingFace
+- [WhisperSpeech](https://huggingface.co/WhisperSpeech/WhisperSpeech): sounded quite natural as well
+- [F5-TTS](https://github.com/SWivid/F5-TTS): the latest release at the time; however, it felt a bit robotic
+- E2-TTS: r/LocalLLaMA claims it to be a little better; however, it didn't pass the vibe test
+- [xTTS](https://coqui.ai/blog/tts/open_xtts): has great documentation and also seems promising
+
+#### Some more models that weren't tested:
+
+In other words, we leave these as an exercise for the reader :D
+
+- [Fish-Speech](https://huggingface.co/fishaudio/fish-speech-1.4)
+- [MMS-TTS-Eng](https://huggingface.co/facebook/mms-tts-eng)
+- [Metavoice](https://huggingface.co/metavoiceio/metavoice-1B-v0.1)
+- [Hifigan](https://huggingface.co/nvidia/tts_hifigan)
+- [TTS-Tacotron2](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech)
+- [VALL-E X](https://github.com/Plachtaa/VALL-E-X)
diff --git a/recipes/quickstart/NotebookLlama/requirements.txt b/recipes/quickstart/NotebookLlama/requirements.txt
new file mode 100644
index 000000000..34a27dc81
--- /dev/null
+++ b/recipes/quickstart/NotebookLlama/requirements.txt
@@ -0,0 +1,15 @@
+# Core dependencies
+PyPDF2>=3.0.0
+torch>=2.0.0
+transformers>=4.46.0
+accelerate>=0.27.0
+rich>=13.0.0
+ipywidgets>=8.0.0
+tqdm>=4.66.0
+
+# Optional but recommended
+jupyter>=1.0.0
+ipykernel>=6.0.0
+
+# Warning handling is done with the standard-library `warnings` module,
+# which does not need to be installed via pip
\ No newline at end of file
diff --git a/recipes/quickstart/NotebookLlama/resources/2402.13116v4.pdf b/recipes/quickstart/NotebookLlama/resources/2402.13116v4.pdf
new file mode 100644
index 000000000..bf6ab0cc0
Binary files /dev/null and b/recipes/quickstart/NotebookLlama/resources/2402.13116v4.pdf differ
diff --git a/recipes/quickstart/NotebookLlama/resources/Outline.jpg b/recipes/quickstart/NotebookLlama/resources/Outline.jpg
new file mode 100644
index 000000000..bdb3d9b81
Binary files /dev/null and b/recipes/quickstart/NotebookLlama/resources/Outline.jpg differ
diff --git a/recipes/quickstart/NotebookLlama/resources/_podcast.mp3 
b/recipes/quickstart/NotebookLlama/resources/_podcast.mp3 new file mode 100644 index 000000000..ba34381b8 Binary files /dev/null and b/recipes/quickstart/NotebookLlama/resources/_podcast.mp3 differ diff --git a/recipes/quickstart/NotebookLlama/resources/clean_extracted_text.txt b/recipes/quickstart/NotebookLlama/resources/clean_extracted_text.txt new file mode 100644 index 000000000..fccc6b2ae --- /dev/null +++ b/recipes/quickstart/NotebookLlama/resources/clean_extracted_text.txt @@ -0,0 +1,74 @@ +=============== + +Knowledge Distillation is a methodology that transfers advanced capabilities from leading proprietary Large Language Models (LLMs) to their open-source counterparts, such as LLaMA and Mistral. This paper presents a comprehensive survey of KD's role in imparting advanced knowledge. + +Abstract —In the era of Large Language Models, Knowledge Distillation emerges as a pivotal methodology for transferring advanced capabilities from proprietary LLMs to open-source counterparts, facilitating their self-improvement by employing themselves as teachers. +xamined through a meticulous survey that delves into the foundational pillars of algorithm, skill, and verticalization, which form the backbone of knowledge distillation and deep learning models. The survey provides a comprehensive examination of key mechanisms within the knowledge distillation framework, specifically focusing on the enhancement of cognitive abilities and their practical implications across various fields, with a particular emphasis on the interplay between data augmentation (DA) and knowledge distillation. +en-source LLMs, this survey highlights the potential for more accessible, efficient, and powerful AI solutions. + +Most importantly, we advocate for compliance with legal terms that regulate the use of LLMs, ensuring ethical and lawful application of knowledge distillation. + +An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. Index Terms - Large language models, knowledge distillation, data augmentation, skill distillation, supervised fine-tuning +sophisticated problem-solving capabilities, the core significance of these large language models (LLMs) lies in their emergent abilities, enabling them to tackle a diverse array of tasks with remarkable proficiency. +their remarkable capabilities, have some notable limitations, particularly when considering the advantages offered by open-source models, such as GPT-4 and Gemini. These models are often expensive, with substantial usage fees and restricted access, making them inaccessible to individuals and smaller organizations. +ng restrictions and costs. In contrast, open-source LLMs like LLaMA and Mistral bring several advantages. Accessibility and adaptability are key benefits, as they are more readily available to a broader range of users, including researchers and organizations. +ts. One of the most significant limitations is the smaller model scale, resulting in lower performance on real-world tasks with multiple instructions (Zheng et al., 2023a). Models with fewer parameters struggle to capture the depth and breadth of knowledge embodied in larger models like GPT-4. Additionally, the pre-training investment in these open-source models is typically less substantial. This reduced investment can lead to a narrower range of pre-training data, potentially limiting their understanding and handling of diverse or specialized topics (Liang et al., 2022; Sun et al., 2024a). 
Fine-tuning steps are often fewer due to resource constraints, hindering model optimization for specific tasks or industries. +ary models becomes apparent when compared to highly fine-tuned proprietary LLMs. Primarily, the disparity between proprietary and open-source LLMs becomes evident, with proprietary models excelling in complex scenarios, while open-source models excel in a wide range of scenarios. Knowledge distillation, a technique that leverages the advanced capabilities of proprietary models, is used to enhance the competencies of open-source models. This process is similar to transferring the performance of a skilled teacher to a student. +tillation of LLMs, where a small seed of knowledge is used to prompt the LLM to generate more data with respect to a specific skill or domain (Taori et al., 2023). Furthermore, KD retains its fundamental role in compressing LLMs, making them more efficient without significant loss in performance. +advanced context following and instruction following** + +**key aspects of knowledge distillation** + +* **contextual understanding**: in-context learning and instruction following +* **alignment with user intents**: human values/principles and thinking patterns like chain-of-thought +* **NLP task specialization**: semantic understanding and code generation + +**critical skills for various applications** + +* **healthcare**: accuracy and contextual knowledge +* **law**: contextual knowledge and precision +* **science**: contextual knowledge and precision +ned in the era of LLMs, the benefits of knowledge distillation in the era of LLMs are multifaceted and transformative. Through a suite of distillation techniques, the gap between proprietary and open-source models narrows and is filled. This process streamlines computational requirements and enhances environmental sustainability of AI operations, as open-source models become more proficient with lower overhead. +ch domains. The escalating need for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI and the increasing complexity of these models. The ability to efficiently and effectively distill knowledge from proprietary LLMs to open-source ones becomes a practical necessity. This is driven by the need to bridge the knowledge gap between the proprietary and open-source LLMs. + +This need is driven by the 3 models mentioned, including Student, Vicuna, Opt, GPT, and others. These models are being used in various sectors such as law, healthcare, finance, and science, and the ability to distill knowledge from them is becoming increasingly important. +synthesizefeedbackFeedback input outputSelf-Knowledge outputinputinput YlabelLabelingExpansion X,Y demonstrationsexpandFeature featureinput,outputextractSec.4Sec.5 Sec.3.1Sec.3.2 Fig. 2: An overview of this survey on knowledge distillation of large language models +es emerging, but there is still much to be learned from the era of Large Language Models (LLMs). In this section, we provide a foundational overview of knowledge distillation, highlighting the role of data augmentation (DA) in this context. + +Traditional techniques, such as supervised fine-tuning, have shown promise in distilling knowledge from LLMs. However, the increasing complexity of these models requires careful consideration of the trade-offs between accuracy and computational resources. 
To further explore the possibilities of knowledge distillation, we examine methods involving supervised fine-tuning, such as incremental learning and transfer learning. + +Supervised fine-tuning involves training a model on a smaller dataset with the goal of adapting to a specific task or domain. This approach has shown significant improvement in various NLP tasks, but may not be scalable to large-scale applications. In contrast, transfer learning offers a more flexible approach, where a model is trained on a smaller dataset and then fine-tuned on a larger dataset. This can lead to improved performance on a variety of tasks, but requires careful selection of the target dataset. + +Another approach is divergence and similarity, which involve exploring the differences and similarities between the knowledge distillation process and traditional machine learning. Reinforcement learning and ranking optimization are also gaining attention, particularly in the context of knowledge distillation, where the goal is to optimize the distillation process itself. These methods can improve the efficiency and effectiveness of knowledge distillation, but require careful consideration of the trade-offs between exploration and exploitation. + +Skill distillation focuses on enhancing student models to improve their understanding of the task and their ability to perform well on NLP tasks. This can be achieved through various methods, including data augmentation, feature learning, and attention mechanisms. By incorporating these techniques, student models can better understand the context and intentions of the user, leading to improved performance across a variety of tasks. + +We propose several strategies for skill distillation, including: +mmendation systems, and the evaluation of text generation. In §5, we delve into domain-specific vertical distillation, demonstrating how knowledge distillation techniques are applied in specialized fields such as law, healthcare, finance, and science, highlighting their practical implications and transformative impact. The survey reveals open problems in §6, highlighting current challenges and gaps in knowledge distillation research that present opportunities for future work. +large, complex model to a smaller, more efficient model, mitigating the challenges of computational demands and resource constraints in deploying large-scale models in practical applications. This process, prior to the era of Large Language Models (LLMs), focused on compacting complex neural networks for deployment in resource-constrained environments, such as mobile devices or edge computing platforms, where computational efficiency was paramount. 
+al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023) Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a), WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b), CurationUltraChat (Ding et al., 2023b), Phi-1 (Gunasekar et al., 2023), Phi-1.5 (Li et al., 2023a), Phi-2 (Mar, 2023), Magicoder (Wei et al., 2023), WaveCoder (Yu et al., 2024), ZeroGen (Ye et al., 2022), InPars (Bonifacio et al., 2022) +Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024a), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Self-Instruct (Wang et al., 2022a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022), Divergence and SimilarityDistilGPT (Sanh et al., 2019), f-Distill (Wen et al., 2023), MiniLLM (Gu et al., 2024) TED (Liang et al., 2023a), GKD (Agarwal et al., 2024), BabyLlama (Timiryasov and Tastet, 2023) Reinforcement LearningCAI (Bai et al., 2022a), UltraFeedback (Cui et al., 2023a), WizardMath (Luo et al., 2023b), MiniLLM (Gu et al., 2024), GKD (Agarwal et al., 2024), GPT3 Reward (Kwon et al., 2023) Rank Optimization +ollowingInstruction FollowingSelf-Instruct Wang et al., 2022a, Alpaca Taori et al., 2023, Vicuna Chiang et al., 2023, WizardLM Xu et al., 2023a, Orca Mukherjee et al., 2023, Orca2 Mitra et al., 2023, WizardMath Luo et al., 2023b, Llama-GPT4 Peng et al., 2023a, Multi-turn Dialogue Chiang et al., 2023, Baize Xu et al., 2023b, UltraLLaMA Ding et al., 2023b, CAMEL Li et al., 2023b, OpenChat Wang et al., 2023c, Zephyr Tunstall et al., 2023, RAG Kang et al., 2023a, SAIL Luo et al., 2023c, Self-RAG Asai et al., 2023, AlignmentThinking PatternYe et al., 2023, Orca Mukherjee et al., 2023, Orca2 Wang et al., 2023d, AFT Cheng et al., 2023, KnowPAT Zhang et al., 2023a, PreferenceCAI Bai et al., 2022a, GPT-3 Reward Kwon et al., 2023, ILF Scheurer et al., 2023, ALMoST Kim et al., 2023a, RLEF Roit et al., 2023 +i et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024a), AgentToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023), ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a), Confucius (Gao et al., 2023b), MLLM-Tool (Wang et al., 2024), α-UMi (Shen et al., 2024), PlanningFireAct (Chen et al., 2023b), AgentTuning (Zeng et al., 2023a), Lumos (Yin et al., 2023a), AUTOACT (Qiao et al., 2024), TPTU-v2 (Kong et al., 2023), NLP Task SpecializationNLUAugGPT (Dai et al., 2023a), GPT Annotation (Gilardi et al., 2023), (Ding et al., 2023a), TDG (He et al., 2023b), SunGen (Gao et al., 2023a), Mix Distill (Chenglin et al., 2023), Annollm (He et al., 2023a), UDG (Wang et al., 2021a), ZeroGen (Ye et al., 2024) +al., 2023 GPT-3 Labeling Wang et al., 2021b BioGPT Guo et al., 2023a ChatGPT NMT Yang and Nicolai, 2023 Information RetrievalQUILL Srinivasan et al., 2022 Promptgator Dai et al., 2023b InPars Bonifacio et al., 2022 AugTriever Meng et al., 2023 Sun et al., 2023a RankVicuna Pradeep et al., 2023a RankZephyr Pradeep et al., 2023b ExaRanker Ferraretto et al., 2023 Recommendation NDR Mysore et al., 2023 InstrcutRec Zhang et al., 2023b ONCE Liu et al., 2023c Text Generation 
Evaluation PandaLM Wang et al., 2023b Prometheus Kim et al., 2024 InstructScore Xu et al., 2023d TigerScore Jiang et al., 2023c Auto-J Li et al., 2024a CodeCodeAlpaca Chaudhary, 2023 CodeLlama Rozi `ere et al., 2023 Magicoder Wei et al., 2023 Phi-1 Gunasekar et al., 2023 PERsD Chen et al., 2023 MFTCoder Liu et al., 2023d WaveCoder Yu et al., 2023 +et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e), Verticalization DistillationLaw (Huang et al., 2023b; Cui et al., 2023b); Medical & Healthcare (Zhang et al., 2023c; Chen et al., 2023d); Finance (Zhang and Yang, 2023); Science (Xie et al., 2023a; Zhang et al., 2024) and Misc. (Dan et al., 2023; Guo et al., 2023b) Fig. 3: Taxonomy of Knowledge Distillation of Large Language Models" +r network, often through techniques like soft target training, where the student learns from the softened softmax output of the teacher. + +The distillation of knowledge from larger models to smaller ones is a technique used to improve the performance of AI models. In this context, distillation refers to the process of distilling the knowledge from a larger model into a smaller model, allowing it to learn from the teacher model's output. + +The current era of knowledge distillation in large language models (LLMs) has shifted the focus from mere architecture compression to a more nuanced process of knowledge elicitation and transfer. This paradigm change is largely due to the immense knowledge that LLMs like GPT-4 and Gemini possess. The parameters of LLMs make it challenging to compress them using pruning or quantization techniques. +size, the current focus in llm-based knowledge distillation is to extract and transfer the rich, nuanced understanding that these models have developed the key to this modern approach lies in carefully designed prompts that elicit specific knowledge or capabilities from the llms, tapping into their understanding and capabilities in various domains ranging from natural language understanding to more complex cognitive tasks like reasoning and problem-solving +explicit training objectives. This era of knowledge distillation also emphasizes the transfer of abstract qualities such as reasoning patterns and preference alignment. This is in stark contrast to the earlier focus on output replication, indicating a shift towards a more holistic and comprehensive transfer of cognitive capabilities. The current techniques involve not just the replication of outputs, but also the emulation of thought processes and decision-making patterns of the teacher model. This involves complex strategies like chain-of-thought prompting, where the student model learns the reasoning process of the teacher, enhancing its problem-solving and decision-making capabilities. 2.2 Relation to Data Augmentation (DA) +llation, Unlike traditional techniques such as paraphrasing, or back-translation, which primarily aim at expanding the training dataset in a somewhat mechanical manner. DA within the context of LLMs focuses on the generation of novel, context-rich training data tailored to specific domains and skills. 
This innovation is driven by the unique capabilities of LLMs to generate coherent, diverse, and intricate data samples that closely mimic the nuanced understanding and cognitive abilities of human experts in various fields. +ource models, through Deep Learning Models (LLMs) are prompted to create targeted, high-quality datasets that are not merely larger in volume but also rich in diversity and specificity. This approach enables the distillation process to be more effective, ensuring that the distilled models replicate the teacher model's output behavior and embody its deep-seated understanding and cognitive strategies. The significance and necessity of Data Augmentation (DA) for achieving Knowledge Domains (KD) in the LLM era cannot be overstated. DA acts as a force multiplier, enabling the distilled models to acquire and refine capabilities that would otherwise require exponentially larger datasets and computational resources. It facilitates a more nuanced and effective transfer of knowledge, focusing on the qualitative aspects of learning rather than quantitative expansion. +er of LLMs empowers open-source models with the ability to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts thereby democratizing access to advanced AI capabilities and fostering innovation across a broader spectrum of applications and users 2 3 Survey Scope Building on the discussions introduced earlier this survey aims to comprehensively explore the landscape of knowledge distillation within the context of LLMs following a meticulously structured taxonomy as in Figure 3 the survey’s scope is delineated through three primary facets each encapsulating a range of subtopics and methodologies +undations and methodologies of knowledge distillation. It includes an in-depth exploration of processes involved in constructing knowledge from teacher models (e.g., proprietary LLMs) and integrating this knowledge into student models (e.g., open-source LLMs). Under the umbrella of 'knowledge', we delve into strategies such as labeling, expansion, curation, feature understanding, and feedback mechanisms. The exploration seeks to uncover the various ways in which knowledge can be identified, expanded, and curated for effective distillation. This subsection examines learning approaches like supervised fine-tuning, divergence minimization, and reinforcement learning techniques. +ow algorithms enable knowledge transfer, allowing open-source models to replicate and sometimes surpass proprietary capabilities. Skill Distillation examines specific competencies and capabilities enhanced through Knowledge Distillation. Contextual discussions follow (Taori et al., 2023; Luo et al., 2023c), including instruction following and retrieval-augmented generation (RAG) capabilities. Alignment research investigates thinking patterns, persona/preference modeling, and value alignment. The 'agent' category focuses on skills like tool usage and planning. NLP task specialization (Dai et al., 2023a; Jung et al., 2023; Chaudhary, 2023) is examined through lenses like natural language understanding (NLU), natural language processing (NLP). +tion, and Code Generation** + +Finally, the survey explores how Knowledge Distillation (KD) enhances Large Language Models (LLMs) in interpreting and integrating multiple forms of input, enriching their utility and applicability across various contexts. 
Verticalization Distillation +This section examines the application of KD across diverse domains, providing insights into how distilled LLMs can be tailored for specialized fields such as Law, Medical & Healthcare (Wang et al., 2023a), Finance (Zhang and Yang, 2023), Science (Zhang et al., 2024), among others. This exploration showcases the practical implications of KD techniques and highlights their transformative impact on domain-specific AI solutions. Through detailed analysis and examples, this part aims to demonstrate the versatility and efficacy of KD in adapting LLMs to diverse domains. +stem. by navigating through these facets, this survey endeavors to provide an extensive and nuanced analysis of knowledge distillation in the era of LLMs. it serves as a guide for researchers, practitioners, and enthusiasts in the field, shedding light on current methodologies, challenges, and opportunities for innovation in this rapidly evolving domain. +across a range of applications. + +Distillation Pipeline in LLM Era diff --git a/recipes/quickstart/NotebookLlama/resources/data.pkl b/recipes/quickstart/NotebookLlama/resources/data.pkl new file mode 100644 index 000000000..03b2674a7 Binary files /dev/null and b/recipes/quickstart/NotebookLlama/resources/data.pkl differ diff --git a/recipes/quickstart/NotebookLlama/resources/podcast_ready_data.pkl b/recipes/quickstart/NotebookLlama/resources/podcast_ready_data.pkl new file mode 100644 index 000000000..086162b95 Binary files /dev/null and b/recipes/quickstart/NotebookLlama/resources/podcast_ready_data.pkl differ