diff --git a/chapters/02-preferences.md b/chapters/02-preferences.md index 08c01cc..a923ef4 100644 --- a/chapters/02-preferences.md +++ b/chapters/02-preferences.md @@ -1,2 +1,6 @@ -This is a test of citing [@lambert2023entangled]. \ No newline at end of file +# [Incomplete] Human Preferences for RLHF + +## Questioning the Ability of Preferences + +TODO [@lambert2023entangled]. \ No newline at end of file diff --git a/chapters/03-optimization.md b/chapters/03-optimization.md index e69de29..d34b281 100644 --- a/chapters/03-optimization.md +++ b/chapters/03-optimization.md @@ -0,0 +1,6 @@ + +# [Incomplete] Problem Formulation + +## Maximizing Expected Reward + +## Example: Mitigating Safety \ No newline at end of file diff --git a/chapters/04-related-works.md b/chapters/04-related-works.md index 890148d..2890232 100644 --- a/chapters/04-related-works.md +++ b/chapters/04-related-works.md @@ -1,4 +1,4 @@ -# Key Related Works +# [Incomplete] Key Related Works In this chapter we detail the key papers and projects that got the RLHF field to where it is today. This is not intended to be a comprehensive review on RLHF and the related fields, but rather a starting point and retelling of how we got to today. diff --git a/chapters/06-preference-data.md b/chapters/06-preference-data.md index 7561250..5d169f2 100644 --- a/chapters/06-preference-data.md +++ b/chapters/06-preference-data.md @@ -1 +1,44 @@ -# Preference Data \ No newline at end of file +# [In progress] Preference Data + +## Collecting Preference Data + +Getting the most out of human data involves iterative training of models, evolving and highly detailed data instructions, translating through data foundry businesses, and other challenges that add up. +The process is difficult for new organizations trying to add human data to their pipelines. +Given the sensitivity, processes that work and improve the models are exploited until the performance gains run out. + +### Sourcing and Contracts + +The first step is sourcing a vendor to provide data (or one’s own annotators). +Much like acquiring access to cutting-edge Nvidia GPUs, getting access to data providers is a who-you-know game. If you have credibility in the AI ecosystem, the best data companies will want you on their books for public image and long-term growth options. Discounts are often also given on the first batches of data to get training teams hooked. + +If you’re a new entrant in the space, you may have a hard time getting the data you need quickly. Picking up the tail of interested buyers that Scale AI had to turn away is an option for the new data startups; it’s likely their primary playbook for bootstrapping revenue. + +On multiple occasions, I’ve heard of data companies not delivering the data contracted to them unless threatened with legal or financial action. Others have listed companies I work with as customers for PR even though we never worked with them, saying they “didn’t know how that happened” when we reached out. There are plenty of potential bureaucratic or administrative snags throughout the process. For example, the default contract terms often include fine print prohibiting the open sourcing of artifacts after acquisition. + +Once a contract is settled, the data buyer and data provider agree upon instructions for the task(s) purchased. These are intricate documents with extensive details, corner cases, and priorities for the data.
A popular example of data instructions is the one that [OpenAI released for InstructGPT](https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit#heading=h.21o5xkowgmpj). + +Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simply delaying data collection doesn’t always work: Scale AI et al. manage their workforces much like AI research labs manage the compute-intensive jobs on their clusters. + +Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place. + +The data is delivered in weekly batches, with more data coming later in the contract. For example, when we bought preference data for on-policy models we were training at HuggingFace, we had a 6-week delivery period. The first weeks were for further calibration, and the later weeks were when we hoped to most improve our model. + +![Overview of the multi-batch cycle for obtaining human preference data from a vendor.](images/pref-data-timeline.png){#fig:preferences} + +The goal is that by week 4 or 5 we can see the data improving our model. This is something some frontier labs have mentioned, such as the 14 stages in the Llama 2 data collection [@touvron2023llama], but it doesn’t always go well. At HuggingFace, trying to do this for the first time with human preferences, we didn’t have the RLHF preparedness to get meaningful bumps on our evaluations. The last weeks came and we were forced to continue collecting preference data generated from endpoints we weren’t confident in. + +After the data is all in, there is plenty of time for learning and improving the model. Data acquisition through these vendors works best when viewed as an ongoing process of achieving a set goal. It requires iterative experimentation, high effort, and focus. It’s likely that millions of the dollars spent on these datasets are “wasted” and not used in the final models, but that is just the cost of doing business. Not many organizations have the bandwidth and expertise to make full use of human data of this style. + +This experience, especially relative to the simplicity of synthetic data, makes me wonder how well these companies will fare in the next decade. + +Note that this section *does not* mirror the experience for buying human-written instruction data, where the process is less of a time crunch. + +## Synthetic Preferences and LLM-as-a-judge + +TODO + +### Example Prompts + +TODO Cite MT-Bench [@zheng2023judging], [@huang2024empirical], including specialized models for LLM-as-a-judge [@kim2023prometheus] + +> Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation.
Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better. \ No newline at end of file diff --git a/chapters/09-instructions.md b/chapters/09-instruction-tuning.md similarity index 100% rename from chapters/09-instructions.md rename to chapters/09-instruction-tuning.md diff --git a/chapters/10-rejection-sampling.md b/chapters/10-rejection-sampling.md index 47125d9..8b4b267 100644 --- a/chapters/10-rejection-sampling.md +++ b/chapters/10-rejection-sampling.md @@ -15,8 +15,8 @@ WebGPT [@nakano2021webgpt], Anthropic's Helpful and Harmless agent[@bai2022train ## Training Process -A visual overview of the rejection sampling process is included below. -![Rejection sampling overview.](images/rejection-sampling.png) +A visual overview of the rejection sampling process is included below in @fig:rs-overview. +![Rejection sampling overview.](images/rejection-sampling.png){#fig:rs-overview} ### Generating Completions diff --git a/chapters/bib.bib b/chapters/bib.bib index 244e281..676581f 100644 --- a/chapters/bib.bib +++ b/chapters/bib.bib @@ -1,3 +1,4 @@ +# Preferences General ############################################################ @article{lambert2023entangled, title={Entangled preferences: The history and risks of reinforcement learning and human feedback}, author={Lambert, Nathan and Gilbert, Thomas Krendl and Zick, Tom}, @@ -5,6 +6,9 @@ @article{lambert2023entangled year={2023} } +################################################################################################ + +# AI General #################################################################### @book{russell2016artificial, title={Artificial intelligence: a modern approach}, author={Russell, Stuart J and Norvig, Peter}, @@ -12,6 +16,10 @@ @book{russell2016artificial publisher={Pearson} } +################################################################################################ + + +# RLHF Methods #################################################################### @article{gilks1992adaptive, title={Adaptive rejection sampling for Gibbs sampling}, author={Gilks, Walter R and Wild, Pascal}, @@ -23,12 +31,9 @@ @article{gilks1992adaptive publisher={Wiley Online Library} } -@article{nakano2021webgpt, - title={Webgpt: Browser-assisted question-answering with human feedback}, - author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others}, - journal={arXiv preprint arXiv:2112.09332}, - year={2021} -} +################################################################################################ + +# RLHF Core #################################################################### @article{christiano2017deep, title={Deep reinforcement learning from human preferences}, author={Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario}, @@ -44,6 +49,13 @@ @article{stiennon2020learning pages={3008--3021}, year={2020} } + +@article{nakano2021webgpt, + title={Webgpt: Browser-assisted question-answering with human feedback}, + author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others}, + journal={arXiv preprint 
arXiv:2112.09332}, + year={2021} +} @article{ouyang2022training, title={Training language models to follow instructions with human feedback}, author={Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others}, @@ -72,4 +84,26 @@ @article{touvron2023llama author={Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others}, journal={arXiv preprint arXiv:2307.09288}, year={2023} +} + +# LLM as a Judge #################################################################### +@article{zheng2023judging, + title={Judging llm-as-a-judge with mt-bench and chatbot arena}, + author={Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric and others}, + journal={Advances in Neural Information Processing Systems}, + volume={36}, + pages={46595--46623}, + year={2023} +} +@inproceedings{kim2023prometheus, + title={Prometheus: Inducing fine-grained evaluation capability in language models}, + author={Kim, Seungone and Shin, Jamin and Cho, Yejin and Jang, Joel and Longpre, Shayne and Lee, Hwaran and Yun, Sangdoo and Shin, Seongjin and Kim, Sungdong and Thorne, James and others}, + booktitle={The Twelfth International Conference on Learning Representations}, + year={2023} +} +@article{huang2024empirical, + title={An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers}, + author={Huang, Hui and Qu, Yingqi and Liu, Jing and Yang, Muyun and Zhao, Tiejun}, + journal={arXiv preprint arXiv:2403.02839}, + year={2024} } \ No newline at end of file diff --git a/images/pref-data-timeline.png b/images/pref-data-timeline.png new file mode 100644 index 0000000..5a57de4 Binary files /dev/null and b/images/pref-data-timeline.png differ diff --git a/images/rlhf-book-share.png b/images/rlhf-book-share.png new file mode 100644 index 0000000..6c7dd33 Binary files /dev/null and b/images/rlhf-book-share.png differ diff --git a/metadata.yml b/metadata.yml index 1a46ee5..3e118a6 100644 --- a/metadata.yml +++ b/metadata.yml @@ -1,5 +1,5 @@ --- -title: Reinforcement Learning from Human Feedback Basics +title: The Basics of Reinforcement Learning from Human Feedback biblio-title: Bibliography reference-section-title: Bibliography author: Nathan Lambert @@ -8,7 +8,7 @@ lang: en-US mainlang: english otherlang: english tags: [rlhf, ebook, ai, ml] -date: 13 August 2024 +date: 21 September 2024 abstract: | Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to the deploy of the lastest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. diff --git a/templates/chapter.html b/templates/chapter.html index c3ba023..8ac04ff 100644 --- a/templates/chapter.html +++ b/templates/chapter.html @@ -5,7 +5,16 @@ -$for(author-meta)$ + + + + + + + + + + $for(author-meta)$ $endfor$ $if(date-meta)$ diff --git a/templates/html.html b/templates/html.html index 01cea7f..f40a288 100644 --- a/templates/html.html +++ b/templates/html.html @@ -5,7 +5,16 @@ -$for(author-meta)$ + + + + + + + + + + $for(author-meta)$ $endfor$ $if(date-meta)$ @@ -56,46 +65,46 @@
Introductions
Problem Setup
Optimization
Advanced (TBD)
Open Questions (TBD)
I would like to thank the following people who helped me with this project: Costa Huang,
-Additionally, thank you to GitHub contributors for bug fixes.
+Additionally, thank you to the contributors on GitHub who helped improve this project.