Commit: Contrib5 (#15)

natolambert authored Sep 22, 2024
1 parent 1c42612 commit 079d873
Showing 12 changed files with 137 additions and 32 deletions.
6 changes: 5 additions & 1 deletion chapters/02-preferences.md
@@ -1,2 +1,6 @@

This is a test of citing [@lambert2023entangled].
# [Incomplete] Human Preferences for RLHF

## Questioning the Ability of Preferences

TODO [@lambert2023entangled].
6 changes: 6 additions & 0 deletions chapters/03-optimization.md
@@ -0,0 +1,6 @@

# [Incomplete] Problem Formulation

## Maximizing Expected Reward

## Example: Mitigating Safety
2 changes: 1 addition & 1 deletion chapters/04-related-works.md
@@ -1,4 +1,4 @@
# Key Related Works
# [Incomplete] Key Related Works

In this chapter we detail the key papers and projects that got the RLHF field to where it is today.
This is not intended to be a comprehensive review on RLHF and the related fields, but rather a starting point and retelling of how we got to today.
45 changes: 44 additions & 1 deletion chapters/06-preference-data.md
@@ -1 +1,44 @@
# Preference Data
# [In progress] Preference Data

## Collecting Preference Data

Getting the most out of human data involves iteratively training models, evolving and highly detailed data instructions, translating requirements through data foundry businesses, and other challenges that add up.
The process is difficult for new organizations trying to add human data to their pipelines.
Given the sensitivity of the process, the setups that work and improve the models are reused until the performance gains run out.

### Sourcing and Contracts

The first step is sourcing the vendor to provide data (or one’s own annotators).
Much like acquiring access to cutting-edge Nvidia GPUs, getting access to data providers is also a who-you-know game. If you have credibility in the AI ecosystem, the best data companies will want you on their books for public image and long-term growth options. Discounts are often also given on the first batches of data to get training teams hooked.

If you’re a new entrant in the space, you may have a hard time getting the data you need quickly. Picking up the tail of interested buyers that Scale AI had to turn away is an option for the newer data startups; it’s likely their primary playbook for bootstrapping revenue.

On multiple occasions, I’ve heard of data companies not delivering the data they were contracted to provide until the buyer threatened legal or financial action. Others have listed companies I work with as customers for PR purposes even though we never worked with them, saying they “didn’t know how that happened” when we reached out. There are plenty of potential bureaucratic or administrative snags along the way. For example, the fine print of default contract terms often prohibits the open sourcing of artifacts after acquisition.

Once a contract is settled, the data buyer and data provider agree upon instructions for the task(s) purchased. These are intricate documents with extensive details, corner cases, and priorities for the data. A popular example of data instructions is the one that [OpenAI released for InstructGPT](https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit#heading=h.21o5xkowgmpj).

Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simply delaying data collection is not always an option; Scale AI et al. manage their workforces much like AI research labs manage the compute-intensive jobs on their clusters.

Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place.

The data is delivered in weekly batches, with more data coming later in the contract. For example, when we bought preference data for on-policy models we were training at HuggingFace, we had a six-week delivery period. The first weeks were for further calibration and the later weeks were when we hoped to most improve our model.

![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences}

The goal is that by week 4 or 5 we can see the data improving our model. This is something some frontier model reports have mentioned, such as the 14 stages in the Llama 2 data collection [@touvron2023llama], but it doesn’t always go well. At HuggingFace, trying to do this for the first time with human preferences, we didn’t have the RLHF preparedness to get meaningful bumps on our evaluations. The last weeks came and we were forced to continue collecting preference data generated from endpoints we weren’t confident in.

After the data is all in, there is plenty of time for learning and improving the model. Data acquisition through these vendors works best when viewed as an ongoing process of achieving a set goal. It requires iterative experimentation, high effort, and focus. It’s likely that millions of dollars spent on these datasets are “wasted” and not used in the final models, but that is just the cost of doing business. Not many organizations have the bandwidth and expertise to make full use of human data of this style.

This experience, especially relative to the simplicity of synthetic data, makes me wonder how well these companies will be doing in the next decade.

Note that this section *does not* mirror the experience for buying human-written instruction data, where the process is less of a time crunch.

## Synthetic Preferences and LLM-as-a-judge

TODO

### Example Prompts

TODO Cite MT Bench [@zheng2023judging], [@huang2024empirical], including specialized models for LLM-as-a-judge [@kim2023prometheus]

> Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.
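
To make the judging setup concrete, here is a minimal sketch of how a pairwise judge prompt like the one above might be wrapped in code. It assumes the OpenAI Python client (v1+) and a placeholder judge model name, and the template string abbreviates the full prompt quoted above; in practice each pair is usually judged twice with the response order swapped to control for position bias.

```python
# A minimal sketch of pairwise LLM-as-a-judge, assuming the OpenAI Python client
# (v1+) and a placeholder judge model name. The template abbreviates the full
# MT-Bench-style prompt quoted above.
from openai import OpenAI

JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of the \
responses provided by two AI assistants to the user question displayed below. [...] \
After providing your explanation, output your final verdict by strictly following \
this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.

[User Question]
{question}

[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o") -> str:
    """Run one pairwise comparison and return "A", "B", or "unparsed"."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    response = client.chat.completions.create(
        model=model,  # placeholder; swap in whatever judge model is in use
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # judging is typically run greedily for reproducibility
    )
    verdict = response.choices[0].message.content
    if "[[A]]" in verdict:
        return "A"
    if "[[B]]" in verdict:
        return "B"
    return "unparsed"
```
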
File renamed without changes.
4 changes: 2 additions & 2 deletions chapters/10-rejection-sampling.md
@@ -15,8 +15,8 @@ WebGPT [@nakano2021webgpt], Anthropic's Helpful and Harmless agent[@bai2022train

## Training Process

A visual overview of the rejection sampling process is included below.
![Rejection sampling overview.](images/rejection-sampling.png)
A visual overview of the rejection sampling process is included below in @fig:rs-overview.
![Rejection sampling overview.](images/rejection-sampling.png){#fig:rs-overview}
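
Alongside the figure, a minimal sketch of the core loop is given below, written under the assumption of placeholder `generate` and `reward` callables: sample several completions per prompt, score them with the reward model, and keep the best completion for further finetuning.

```python
# A minimal sketch of rejection sampling (best-of-N) for RLHF. The generate and
# reward callables are placeholders for the actual policy and reward model.
from typing import Callable, List


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # (prompt, n) -> n completions
    reward: Callable[[str, str], float],        # (prompt, completion) -> scalar score
    n_samples: int = 8,
) -> List[dict]:
    """For each prompt, keep the highest-reward completion out of n_samples."""
    selected = []
    for prompt in prompts:
        completions = generate(prompt, n_samples)
        scored = [(reward(prompt, completion), completion) for completion in completions]
        best_score, best_completion = max(scored, key=lambda pair: pair[0])
        selected.append(
            {"prompt": prompt, "completion": best_completion, "reward": best_score}
        )
    # The selected pairs are then used for another round of instruction finetuning.
    return selected
```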


### Generating Completions
46 changes: 40 additions & 6 deletions chapters/bib.bib
@@ -1,17 +1,25 @@
# Preferences General ############################################################
@article{lambert2023entangled,
title={Entangled preferences: The history and risks of reinforcement learning and human feedback},
author={Lambert, Nathan and Gilbert, Thomas Krendl and Zick, Tom},
journal={arXiv preprint arXiv:2310.13595},
year={2023}
}

################################################################################################
# AI General ####################################################################
@book{russell2016artificial,
title={Artificial intelligence: a modern approach},
author={Russell, Stuart J and Norvig, Peter},
year={2016},
publisher={Pearson}
}

################################################################################################
# RLHF Methods ####################################################################
@article{gilks1992adaptive,
title={Adaptive rejection sampling for Gibbs sampling},
author={Gilks, Walter R and Wild, Pascal},
@@ -23,12 +31,9 @@ @article{gilks1992adaptive
publisher={Wiley Online Library}
}

@article{nakano2021webgpt,
title={Webgpt: Browser-assisted question-answering with human feedback},
author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others},
journal={arXiv preprint arXiv:2112.09332},
year={2021}
}
################################################################################################
# RLHF Core ####################################################################
@article{christiano2017deep,
title={Deep reinforcement learning from human preferences},
author={Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario},
@@ -44,6 +49,13 @@ @article{stiennon2020learning
pages={3008--3021},
year={2020}
}

@article{nakano2021webgpt,
title={Webgpt: Browser-assisted question-answering with human feedback},
author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others},
journal={arXiv preprint arXiv:2112.09332},
year={2021}
}
@article{ouyang2022training,
title={Training language models to follow instructions with human feedback},
author={Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others},
@@ -72,4 +84,26 @@ @article{touvron2023llama
author={Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others},
journal={arXiv preprint arXiv:2307.09288},
year={2023}
}

# LLM as a Judge ####################################################################
@article{zheng2023judging,
title={Judging llm-as-a-judge with mt-bench and chatbot arena},
author={Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric and others},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={46595--46623},
year={2023}
}
@inproceedings{kim2023prometheus,
title={Prometheus: Inducing fine-grained evaluation capability in language models},
author={Kim, Seungone and Shin, Jamin and Cho, Yejin and Jang, Joel and Longpre, Shayne and Lee, Hwaran and Yun, Sangdoo and Shin, Seongjin and Kim, Sungdong and Thorne, James and others},
booktitle={The Twelfth International Conference on Learning Representations},
year={2023}
}
@article{huang2024empirical,
title={An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers},
author={Huang, Hui and Qu, Yingqi and Liu, Jing and Yang, Muyun and Zhao, Tiejun},
journal={arXiv preprint arXiv:2403.02839},
year={2024}
}
Binary file added images/pref-data-timeline.png
Binary file added images/rlhf-book-share.png
4 changes: 2 additions & 2 deletions metadata.yml
@@ -1,5 +1,5 @@
---
title: Reinforcement Learning from Human Feedback Basics
title: The Basics of Reinforcement Learning from Human Feedback
biblio-title: Bibliography
reference-section-title: Bibliography
author: Nathan Lambert
@@ -8,7 +8,7 @@ lang: en-US
mainlang: english
otherlang: english
tags: [rlhf, ebook, ai, ml]
date: 13 August 2024
date: 21 September 2024
abstract: |
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool in the deployment of the latest machine learning systems.
In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background.
11 changes: 10 additions & 1 deletion templates/chapter.html
@@ -5,7 +5,16 @@
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico">
$for(author-meta)$

<!-- Add Open Graph meta tags for share image -->
<meta property="og:image" content="https://github.com/natolambert/rlhf-book/blob/main/images/rlhf-book-share" />
<meta property="og:image:width" content="1920" />
<meta property="og:image:height" content="1080" />
<meta property="og:title" content="$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$" />
<meta property="og:description" content="The Basics of Reinforcement Learning from Human Feedback" />
<meta property="og:url" content="https://rlhfbook.com" />

$for(author-meta)$
<meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
45 changes: 27 additions & 18 deletions templates/html.html
@@ -5,7 +5,16 @@
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico">
$for(author-meta)$

<!-- Add Open Graph meta tags for share image -->
<meta property="og:image" content="https://github.com/natolambert/rlhf-book/blob/main/images/rlhf-book-share" />
<meta property="og:image:width" content="1920" />
<meta property="og:image:height" content="1080" />
<meta property="og:title" content="$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$" />
<meta property="og:description" content="The Basics of Reinforcement Learning from Human Feedback" />
<meta property="og:url" content="https://rlhfbook.com" />

$for(author-meta)$
<meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
@@ -56,46 +65,46 @@ <h1 class="title">$title$</h1>
<div class="section">
<p><strong>Introductions</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/introduction.html">Introduction</a></li>
<li><a href="https://rlhfbook.com/c/preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/related-works.html">Seminal (Recent) Works</a></li>
<li><a href="https://rlhfbook.com/c/01-introduction.html">Introduction</a></li>
<li><a href="https://rlhfbook.com/c/02-preferences.html">What are preferences?</a></li>
<li><a href="https://rlhfbook.com/c/03-optimization.html">Optimization and RL</a></li>
<li><a href="https://rlhfbook.com/c/04-related-works.html">Seminal (Recent) Works</a></li>
</ol>
</div>

<div class="section">
<p><strong>Problem Setup</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/setup.html">Definitions</a></li>
<li><a href="https://rlhfbook.com/c/preference-data.html">Preference Data</a></li>
<li><a href="https://rlhfbook.com/c/reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/regularization.html">Regularization</a></li>
<li><a href="https://rlhfbook.com/c/05-setup.html">Definitions</a></li>
<li><a href="https://rlhfbook.com/c/06-preference-data.html">Preference Data</a></li>
<li><a href="https://rlhfbook.com/c/07-reward-models.html">Reward Modeling</a></li>
<li><a href="https://rlhfbook.com/c/08-regularization.html">Regularization</a></li>
</ol>
</div>

<div class="section">
<p><strong>Optimization</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/instructions.html">Instruction Tuning</a></li>
<li><a href="https://rlhfbook.com/c/rejection-sampling.html">Rejection Sampling</a></li>
<li><a href="https://rlhfbook.com/c/policy-gradients.html">Policy Gradients</a></li>
<li><a href="https://rlhfbook.com/c/direct-alignment.html">Direct Alignment Algorithms</a></li>
<li><a href="https://rlhfbook.com/c/09-instruction-tuning.html">Instruction Tuning</a></li>
<li><a href="https://rlhfbook.com/c/10-rejection-sampling.html">Rejection Sampling</a></li>
<li><a href="https://rlhfbook.com/c/11-policy-gradients.html">Policy Gradients</a></li>
<li><a href="https://rlhfbook.com/c/12-direct-alignment.html">Direct Alignment Algorithms</a></li>
</ol>
</div>

<div class="section">
<p><strong>Advanced (TBD)</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/cai.html">Constitutional AI</a></li>
<li><a href="https://rlhfbook.com/c/synthetic.html">Synthetic Data</a></li>
<li><a href="https://rlhfbook.com/c/evaluation.html">Evaluation</a></li>
<li><a href="https://rlhfbook.com/c/13-cai.html">Constitutional AI</a></li>
<li><a href="https://rlhfbook.com/c/14-synthetic.html">Synthetic Data</a></li>
<li><a href="https://rlhfbook.com/c/15-evaluation.html">Evaluation</a></li>
</ol>
</div>

<div class="section">
<p><strong>Open Questions (TBD)</strong></p>
<ol>
<li><a href="https://rlhfbook.com/c/over-optimization.html">Over-optimization</a></li>
<li><a href="https://rlhfbook.com/c/16-over-optimization.html">Over-optimization</a></li>
<li>Style</li>
</ol>
</div>
@@ -119,7 +128,7 @@ <h2>Abstract</h2>
<section id="acknowledgements" style="padding: 20px; text-align: center;">
<h2>Acknowledgements</h2>
<p>I would like to thank the following people who helped me with this project: Costa Huang, </p>
<p>Additionally, thank you to GitHub contributors for bug fixes. <a href="https://github.com/natolambert/rlhf-book/graphs/contributors">contributors on GitHub</a> who helped improve this project.</p>
<p>Additionally, thank you to the <a href="https://github.com/natolambert/rlhf-book/graphs/contributors">contributors on GitHub</a> who helped improve this project.</p>
</section>
<footer style="padding: 20px; text-align: center;">
<hr>
