Commit: Contrib5 (#15)

natolambert authored Sep 22, 2024
1 parent 1c42612 commit 079d873
Showing 12 changed files with 137 additions and 32 deletions.
6 changes: 5 additions & 1 deletion chapters/02-preferences.md
@@ -1,2 +1,6 @@

This is a test of citing [@lambert2023entangled].
# [Incomplete] Human Preferences for RLHF

## Questioning the Ability of Preferences

TODO [@lambert2023entangled].
6 changes: 6 additions & 0 deletions chapters/03-optimization.md
@@ -0,0 +1,6 @@

# [Incomplete] Problem Formulation

## Maximizing Expected Reward

## Example: Mitigating Safety
2 changes: 1 addition & 1 deletion chapters/04-related-works.md
@@ -1,4 +1,4 @@
# Key Related Works
# [Incomplete] Key Related Works

In this chapter we detail the key papers and projects that got the RLHF field to where it is today.
This is not intended to be a comprehensive review on RLHF and the related fields, but rather a starting point and retelling of how we got to today.
45 changes: 44 additions & 1 deletion chapters/06-preference-data.md
@@ -1 +1,44 @@
# Preference Data
# [In progress] Preference Data

## Collecting Preference Data

Getting the most out of human data involves iteratively training models, evolving and highly detailed data instructions, translating requirements through data foundry businesses, and other challenges that add up.
The process is difficult for new organizations trying to add human data to their pipelines.
Given the sensitivity of the process, the setups that work and improve the models are reused until the performance gains run out.

### Sourcing and Contracts

The first step is sourcing the vendor to provide data (or one’s own annotators).
Much like acquiring access to cutting-edge Nvidia GPUs, getting access to data providers is also a who-you-know game. If you have credibility in the AI ecosystem, the best data companies will want you on their books for public image and long-term growth options. Discounts are often also given on the first batches of data to get training teams hooked.

If you’re a new entrant in the space, you may have a hard time getting the data you need quickly. Picking up the tail of interested buyers that Scale AI had to turn away is an option for the newer data startups; it’s likely their primary playbook for bootstrapping revenue.

On multiple occasions, I’ve heard of data companies not delivering the data they were contracted to provide until the buyer threatened legal or financial action. Others have listed companies I work with as customers for PR purposes even though we never worked with them, saying they “didn’t know how that happened” when we reached out. There are plenty of potential bureaucratic or administrative snags along the way. For example, the fine print of default contract terms often prohibits the open sourcing of artifacts after acquisition.

Once a contract is settled, the data buyer and data provider agree upon instructions for the task(s) purchased. These are intricate documents with extensive details, corner cases, and priorities for the data. A popular example of data instructions is the one that [OpenAI released for InstructGPT](https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit#heading=h.21o5xkowgmpj).

Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simply delaying data collection is not always an option; Scale AI et al. manage their workforces much like AI research labs manage the compute-intensive jobs on their clusters.

Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place.

The data is delivered in weekly batches, with more data coming later in the contract. For example, when we bought preference data for on-policy models we were training at HuggingFace, we had a six-week delivery period. The first weeks were for further calibration and the later weeks were when we hoped to most improve our model.

![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences}

The goal is that by week 4 or 5 we can see the data improving our model. This is something some frontier model reports have mentioned, such as the 14 stages in the Llama 2 data collection [@touvron2023llama], but it doesn’t always go well. At HuggingFace, trying to do this for the first time with human preferences, we didn’t have the RLHF preparedness to get meaningful bumps on our evaluations. The last weeks came and we were forced to continue collecting preference data generated from endpoints we weren’t confident in.

After the data is all in, there is plenty of time for learning and improving the model. Data acquisition through these vendors works best when viewed as an ongoing process of achieving a set goal. It requires iterative experimentation, high effort, and focus. It’s likely that millions of dollars spent on these datasets are “wasted” and not used in the final models, but that is just the cost of doing business. Not many organizations have the bandwidth and expertise to make full use of human data of this style.

This experience, especially relative to the simplicity of synthetic data, makes me wonder how well these companies will be doing in the next decade.

Note that this section *does not* mirror the experience for buying human-written instruction data, where the process is less of a time crunch.

## Synthetic Preferences and LLM-as-a-judge

TODO

### Example Prompts

TODO Cite MT Bench [@zheng2023judging], [@huang2024empirical], including specialized models for LLM-as-a-judge [@kim2023prometheus]

> Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.
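
To make the judging setup concrete, here is a minimal sketch of how a pairwise judge prompt like the one above might be wrapped in code. It assumes the OpenAI Python client (v1+) and a placeholder judge model name, and the template string abbreviates the full prompt quoted above; in practice each pair is usually judged twice with the response order swapped to control for position bias.

```python
# A minimal sketch of pairwise LLM-as-a-judge, assuming the OpenAI Python client
# (v1+) and a placeholder judge model name. The template abbreviates the full
# MT-Bench-style prompt quoted above.
from openai import OpenAI

JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of the \
responses provided by two AI assistants to the user question displayed below. [...] \
After providing your explanation, output your final verdict by strictly following \
this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.

[User Question]
{question}

[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o") -> str:
    """Run one pairwise comparison and return "A", "B", or "unparsed"."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    response = client.chat.completions.create(
        model=model,  # placeholder; swap in whatever judge model is in use
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # judging is typically run greedily for reproducibility
    )
    verdict = response.choices[0].message.content
    if "[[A]]" in verdict:
        return "A"
    if "[[B]]" in verdict:
        return "B"
    return "unparsed"
```
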
File renamed without changes.
4 changes: 2 additions & 2 deletions chapters/10-rejection-sampling.md
@@ -15,8 +15,8 @@ WebGPT [@nakano2021webgpt], Anthropic's Helpful and Harmless agent[@bai2022train

## Training Process

A visual overview of the rejection sampling process is included below.
![Rejection sampling overview.](images/rejection-sampling.png)
A visual overview of the rejection sampling process is included below in @fig:rs-overview.
![Rejection sampling overview.](images/rejection-sampling.png){#fig:rs-overview}
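
Alongside the figure, a minimal sketch of the core loop is given below, written under the assumption of placeholder `generate` and `reward` callables: sample several completions per prompt, score them with the reward model, and keep the best completion for further finetuning.

```python
# A minimal sketch of rejection sampling (best-of-N) for RLHF. The generate and
# reward callables are placeholders for the actual policy and reward model.
from typing import Callable, List


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # (prompt, n) -> n completions
    reward: Callable[[str, str], float],        # (prompt, completion) -> scalar score
    n_samples: int = 8,
) -> List[dict]:
    """For each prompt, keep the highest-reward completion out of n_samples."""
    selected = []
    for prompt in prompts:
        completions = generate(prompt, n_samples)
        scored = [(reward(prompt, completion), completion) for completion in completions]
        best_score, best_completion = max(scored, key=lambda pair: pair[0])
        selected.append(
            {"prompt": prompt, "completion": best_completion, "reward": best_score}
        )
    # The selected pairs are then used for another round of instruction finetuning.
    return selected
```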


### Generating Completions
46 changes: 40 additions & 6 deletions chapters/bib.bib
@@ -1,17 +1,25 @@
# Preferences General ############################################################
@article{lambert2023entangled,
title={Entangled preferences: The history and risks of reinforcement learning and human feedback},
author={Lambert, Nathan and Gilbert, Thomas Krendl and Zick, Tom},
journal={arXiv preprint arXiv:2310.13595},
year={2023}
}

################################################################################################
# AI General ####################################################################
@book{russell2016artificial,
title={Artificial intelligence: a modern approach},
author={Russell, Stuart J and Norvig, Peter},
year={2016},
publisher={Pearson}
}

################################################################################################
# RLHF Methods ####################################################################
@article{gilks1992adaptive,
title={Adaptive rejection sampling for Gibbs sampling},
author={Gilks, Walter R and Wild, Pascal},
@@ -23,12 +31,9 @@ @article{gilks1992adaptive
publisher={Wiley Online Library}
}

@article{nakano2021webgpt,
title={Webgpt: Browser-assisted question-answering with human feedback},
author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others},
journal={arXiv preprint arXiv:2112.09332},
year={2021}
}
################################################################################################
# RLHF Core ####################################################################
@article{christiano2017deep,
title={Deep reinforcement learning from human preferences},
author={Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario},
@@ -44,6 +49,13 @@ @article{stiennon2020learning
pages={3008--3021},
year={2020}
}

@article{nakano2021webgpt,
title={Webgpt: Browser-assisted question-answering with human feedback},
author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others},
journal={arXiv preprint arXiv:2112.09332},
year={2021}
}
@article{ouyang2022training,
title={Training language models to follow instructions with human feedback},
author={Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others},
@@ -72,4 +84,26 @@ @article{touvron2023llama
author={Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others},
journal={arXiv preprint arXiv:2307.09288},
year={2023}
}

# LLM as a Judge ####################################################################
@article{zheng2023judging,
title={Judging llm-as-a-judge with mt-bench and chatbot arena},
author={Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric and others},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={46595--46623},
year={2023}
}
@inproceedings{kim2023prometheus,
title={Prometheus: Inducing fine-grained evaluation capability in language models},
author={Kim, Seungone and Shin, Jamin and Cho, Yejin and Jang, Joel and Longpre, Shayne and Lee, Hwaran and Yun, Sangdoo and Shin, Seongjin and Kim, Sungdong and Thorne, James and others},
booktitle={The Twelfth International Conference on Learning Representations},
year={2023}
}
@article{huang2024empirical,
title={An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers},
author={Huang, Hui and Qu, Yingqi and Liu, Jing and Yang, Muyun and Zhao, Tiejun},
journal={arXiv preprint arXiv:2403.02839},
year={2024}
}
Binary file added images/pref-data-timeline.png
Binary file added images/rlhf-book-share.png
4 changes: 2 additions & 2 deletions metadata.yml
@@ -1,5 +1,5 @@
---
title: Reinforcement Learning from Human Feedback Basics
title: The Basics of Reinforcement Learning from Human Feedback
biblio-title: Bibliography
reference-section-title: Bibliography
author: Nathan Lambert
@@ -8,7 +8,7 @@ lang: en-US
mainlang: english
otherlang: english
tags: [rlhf, ebook, ai, ml]
date: 13 August 2024
date: 21 September 2024
abstract: |
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool in the deployment of the latest machine learning systems.
In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background.
11 changes: 10 additions & 1 deletion templates/chapter.html
@@ -5,7 +5,16 @@
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico">
$for(author-meta)$

<!-- Add Open Graph meta tags for share image -->
<meta property="og:image" content="https://github.com/natolambert/rlhf-book/blob/main/images/rlhf-book-share" />
<meta property="og:image:width" content="1920" />
<meta property="og:image:height" content="1080" />
<meta property="og:title" content="$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$" />
<meta property="og:description" content="The Basics of Reinforcement Learning from Human Feedback" />
<meta property="og:url" content="https://rlhfbook.com" />

$for(author-meta)$
<meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
45 changes: 27 additions & 18 deletions templates/html.html
@@ -5,7 +5,16 @@
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico">
$for(author-meta)$

<!-- Add Open Graph meta tags for share image -->
<meta property="og:image" content="https://github.com/natolambert/rlhf-book/blob/main/images/rlhf-book-share" />
<meta property="og:image:width" content="1920" />
<meta property="og:image:height" content="1080" />
<meta property="og:title" content="$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$" />
<meta property="og:description" content="The Basics of Reinforcement Learning from Human Feedback" />
<meta property="og:url" content="https://rlhfbook.com" />

$for(author-meta)$
<meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
@@ -56,46 +65,46 @@ <h1 class="title">$title$</h1>
<div class="section">
<p><strong>Introductions</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/introduction.html">Introduction</a></li>
<li><a href="https://rlhfbook.com/c/preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/related-works.html">Seminal (Recent) Works</a></li>
<li><a href="https://rlhfbook.com/c/01-introduction.html">Introduction</a></li>
<li><a href="https://rlhfbook.com/c/02-preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/03-optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/04-related-works.html">Seminal (Recent) Works</a></li>
</ol>
</div>

<div class="section">
<p><strong>Problem Setup</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/setup.html">Definitions</a></li>
<li><a href="https://rlhfbook.com/c/preference-data.html">Preference Data</a></li>
<li><a href="https://rlhfbook.com/c/reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/regularization.html">Regularization</a></li>
<li><a href="https://rlhfbook.com/c/05-setup.html">Definitions</a></li>
<li><a href="https://rlhfbook.com/c/06-preference-data.html">Preference Data</a></li>
<li><a href="https://rlhfbook.com/c/07-reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/08-regularization.html">Regularization</a></li>
</ol>
</div>

<div class="section">
<p><strong>Optimization</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/instructions.html">Instruction Tuning</a></li>
<li><a href="https://rlhfbook.com/c/rejection-sampling.html">Rejection Sampling</a></li>
<li><a href="https://rlhfbook.com/c/policy-gradients.html">Policy Gradients</a></li>
<li><a href="https://rlhfbook.com/c/direct-alignment.html">Direct Alignment Algorithms</a></li>
<li><a href="https://rlhfbook.com/c/09-instruction-tuning.html">Instruction Tuning</a></li>
<li><a href="https://rlhfbook.com/c/10-rejection-sampling.html">Rejection Sampling</a></li>
<li><a href="https://rlhfbook.com/c/11-policy-gradients.html">Policy Gradients</a></li>
<li><a href="https://rlhfbook.com/c/12-direct-alignment.html">Direct Alignment Algorithms</a></li>
</ol>
</div>

<div class="section">
<p><strong>Advanced (TBD)</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/cai.html">Constitutional AI</a></li>
<li><a href="https://rlhfbook.com/c/synthetic.html">Synthetic Data</a></li>
<li><a href="https://rlhfbook.com/c/evaluation.html">Evaluation</a></li>
<li><a href="https://rlhfbook.com/c/13-cai.html">Constitutional AI</a></li>
<li><a href="https://rlhfbook.com/c/14-synthetic.html">Synthetic Data</a></li>
<li><a href="https://rlhfbook.com/c/15-evaluation.html">Evaluation</a></li>
</ol>
</div>

<div class="section">
<p><strong>Open Questions (TBD)</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/over-optimization.html">Over-optimization</a></li>
<li><a href="https://rlhfbook.com/c/16-over-optimization.html">Over-optimization</a></li>
<li>Style</li>
</ol>
</div>
@@ -119,7 +128,7 @@ <h2>Abstract</h2>
<section id="acknowledgements" style="padding: 20px; text-align: center;">
<h2>Acknowledgements</h2>
<p>I would like to thank the following people who helped me with this project: Costa Huang, </p>
<p>Additionally, thank you to GitHub contributors for bug fixes. <a href="https://github.com/natolambert/rlhf-book/graphs/contributors">contributors on GitHub</a> who helped improve this project.</p>
<p>Additionally, thank you to the <a href="https://github.com/natolambert/rlhf-book/graphs/contributors">contributors on GitHub</a> who helped improve this project.</p>
</section>
<footer style="padding: 20px; text-align: center;">
<hr>
