diff --git a/chapters/04-related-works.md b/chapters/04-related-works.md
index 6fc6010..232f1b2 100644
--- a/chapters/04-related-works.md
+++ b/chapters/04-related-works.md
@@ -14,6 +14,7 @@ Still, many of the techniques used today are deeply related to core techniques f
*TAMER: Training an Agent Manually via Evaluative Reinforcement,* proposed a learned agent where humans provided scores on the actions taken iteratively to learn a reward model [@knox2008tamer]. Other concurrent or subsequent work proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function [@macglashan2017interactive].

The primary reference, Christiano et al. 2017, is an application of RLHF to preferences between Atari trajectories [@christiano2017deep]. The work shows that humans choosing between trajectories can be more effective in some domains than directly interacting with the environment. The setup relies on some clever experimental conditions, but is impressive nonetheless.
+This method was expanded upon with more direct reward modeling [@ibarz2018reward].
TAMER was adapted to deep learning with Deep TAMER just one year later [@warnell2018deep].

This era began to transition as reward models as a general notion were proposed as a method for studying alignment, rather than just a tool for solving RL problems [@leike2018scalable].
diff --git a/chapters/06-preference-data.md b/chapters/06-preference-data.md
index 5d169f2..4a8a194 100644
--- a/chapters/06-preference-data.md
+++ b/chapters/06-preference-data.md
@@ -6,6 +6,10 @@ Getting the most out of human data involves iterative training of models, evolvi
The process is difficult for new organizations trying to add human data to their pipelines. Given the sensitivity, processes that work and improve the models are extracted until the performance runs out.

+## Rankings vs. Ratings
+
+Preference data can be collected either as rankings, relative orderings over two or more completions, or as ratings, absolute scores assigned to individual completions, often on a Likert scale [@likert1932technique].
+
### Sourcing and Contracts

The first step is sourcing the vendor to provide data (or one's own annotators).
@@ -17,6 +21,10 @@ On multiple occasions, I’ve heard of data companies not delivering their data
Once a contract is settled, the data buyer and data provider agree upon instructions for the task(s) purchased. There are intricate documents with extensive details, corner cases, and priorities for the data. A popular example of data instructions is the one that [OpenAI released for InstructGPT](https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit#heading=h.21o5xkowgmpj).

+An example interface, from [@bai2022training], is shown below:
+
+![Example preference data collection interface.](images/anthropic-interface.pdf){#fig:preference-interface}
+
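+A single collected preference datapoint then typically reduces to a prompt paired with a chosen and a rejected completion, sometimes with rating metadata attached. A minimal sketch of such a record follows; the field names are illustrative rather than any standard schema:
+
+```python
+preference_example = {
+    "prompt": "Explain the difference between a list and a tuple in Python.",
+    "chosen": "A list is mutable, so items can be added or removed...",
+    "rejected": "Lists and tuples are the same thing.",
+    # Optional metadata when annotators also rate completions on a Likert scale.
+    "chosen_rating": 7,
+    "rejected_rating": 2,
+}
+```
+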
Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simple delays of data collection don’t always work — Scale AI et al. are managing their workforces like AI research labs manage the compute-intensive jobs on their clusters.

Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place.
diff --git a/chapters/07-reward-models.md b/chapters/07-reward-models.md
index a8bebb9..1983a4f 100644
--- a/chapters/07-reward-models.md
+++ b/chapters/07-reward-models.md
@@ -1,25 +1,39 @@
# Reward Modeling
-TODO: Have both the InstructGPT and Anthropic loss formulations, which are slightly different
+Reward models are core to the modern approach to RLHF.
+Reward models have been used extensively in reinforcement learning research as a proxy for environment rewards [@sutton2018reinforcement].
+The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent's reward function given trajectories of behavior [@ng2000algorithms], and to other areas of deep reinforcement learning.
+Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [@leike2018scalable].

## Training Reward Models

-There are two popular expressions for how to train a reward model -- they are numerically equivalent.
+There are two popular expressions for how to train a reward model -- they are numerically equivalent.
+The canonical implementation is derived from the Bradley-Terry model of preference [@BradleyTerry].
+A Bradley-Terry model gives the probability that, in a pairwise comparison between two events drawn from the same distribution, say $i$ and $j$, event $i$ is preferred over event $j$ ($i > j$):
-$$
-\mathcal{L}(\theta) = - \left[ \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right) \right]
-$$
-[@ouyang2022training]
+$$P(i > j) = \frac{p_i}{p_i + p_j}$$ {#eq:bradterry}
+
+To train a reward model, we must formulate a loss function that satisfies the above relation.
+The first step is to convert a language model into a model that outputs a scalar value, often in the form of a single classification logit.
+The events $i$ and $j$ above become two completions, $y_1$ and $y_2$, to the same prompt, $x$, and each is scored by the reward model $r_\theta$.
+
+The probability that the reward model prefers $y_1$ over $y_2$ then becomes:
+
+$$P(y_1 > y_2) = \frac{\exp(r_\theta(x, y_1))}{\exp(r_\theta(x, y_1)) + \exp(r_\theta(x, y_2))}$$ {#eq:bradterryrm}
+
+Then, by minimizing the negative log-likelihood of this probability, we arrive at the loss function used to train a reward model.
+The first form, as in [@ouyang2022training] and other works:
+$$\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)$$ {#eq:rewardmodeling1}
+
+The second, as in [@askell2021general] and other works:
+$$\mathcal{L}(\theta) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)$$ {#eq:rewardmodeling2}
-$$
-\mathcal{L}(\theta) = \log \left( 1 + e^{r_{\theta}(x, y_l)} - e^{r_{\theta}(x, y_w)} \right)
-$$
-[@askell2021general]

## Implementation Example

Implementing the reward modeling loss is quite simple. More of the implementation challenge is on setting up a separate data loader and inference pipeline.
+Given the correct data loader, the loss is implemented as:
```python
import torch.nn as nn
rewards_chosen = model(**inputs_chosen)
@@ -28,17 +28,47 @@ rewards_rejected = model(**inputs_rejected)
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```

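+The two loss formulations above are algebraically identical, since $-\log \sigma(z) = \log \left( 1 + e^{-z} \right)$, so they can be checked against each other numerically.
+A minimal sketch with dummy scalar rewards, outside of any training pipeline:
+
+```python
+import torch
+
+# Dummy scalar rewards for a batch of chosen and rejected completions.
+r_chosen = torch.randn(8)
+r_rejected = torch.randn(8)
+
+# Form 1: negative log-sigmoid of the reward margin.
+loss_v1 = -torch.nn.functional.logsigmoid(r_chosen - r_rejected)
+
+# Form 2: log(1 + exp(rejected reward minus chosen reward)).
+loss_v2 = torch.log(1 + torch.exp(r_rejected - r_chosen))
+
+print(torch.allclose(loss_v1, loss_v2))  # True, up to floating point error
+```
+
+In practice, the log-sigmoid form is preferred because it is more numerically stable when the reward difference is large.
+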
-### Further Reading
+## Variants
+
+Reward modeling is a relatively under-explored area of RLHF.
+The traditional reward modeling loss has been modified in many popular works, but the modifications have not solidified into a single best practice.
+
+### Preference Margin Loss
+
+In the case where annotators provide scores or rankings on a Likert scale, the magnitude of the preference between the two completions can be used in training.
+The most common practice is to binarize the direction of preference, implicitly assigning scores of 1 and 0, but the additional information has been used to improve model training.
+Llama 2 proposes using the margin between the two completions, $m(r)$, a discrete function of the preference rating, to distinguish the magnitude of preference [@touvron2023llama]:
+
+$$\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) - m(r) \right) \right)$$ {#eq:rewardmodelingmargin}
+
+### Balancing Multiple Comparisons Per Prompt
+
+InstructGPT studies the impact of using a variable number of completions per prompt, while balancing them in the reward model training loss [@ouyang2022training].
+To do this, they weight the loss updates per comparison per prompt.
+At an implementation level, this can be done automatically by including all examples with the same prompt in the same training batch, naturally weighting the different pairs -- not doing this caused overfitting to the prompts.
+The loss function becomes:
+
+$$\mathcal{L}(\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l)\sim D} \left[ \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right) \right]$$ {#eq:rewardmodelinginstructgpt}
+
+where $K$ is the number of completions ranked for a given prompt, giving $\binom{K}{2}$ pairwise comparisons.
+
+### K-wise Loss Function
+
+The Starling reward models are trained with a K-wise loss function, derived from a Plackett-Luce model, that learns from a ranking over $K$ completions per prompt rather than from independent pairwise comparisons [@zhu2023principled] (https://arxiv.org/abs/2301.11270).
+
+## Generative Reward Modeling
+
+Generative reward models use the text-generation abilities of the underlying language model to produce their judgments, rather than only a scalar score [@mahan2024generative] [@zhang2024generative] [@lambert2023entangled], and some combine generative critiques with a classifier output [@ankner2024critique].
+This approach is closely related to LLM-as-a-judge and other evaluator models, which are very popular for evaluating model outputs.
+
+## Further Reading
reward modeling reading list imo
-RewardBench (biased, but gives a good overview): https://arxiv.org/abs/2403.13787
-ArmorRM: https://arxiv.org/abs/2406.12845
-HelpSteer2: https://arxiv.org/html/2406.08673v1
-HelpSteer2-Preference: https://arxiv.org/abs/2410.01257
-Nemotron 340: https://arxiv.org/abs/2406.11704
-Llama 2: https://arxiv.org/abs/2307.09288
-Interconnects 1: https://www.interconnects.ai/p/why-reward-models-matter
-Interconnects 2: https://www.interconnects.ai/p/open-rlhf-reward-models
-The o.g. paper: https://arxiv.org/abs/1811.07871
-Critique out loud RMs: https://arxiv.org/abs/2408.11791
\ No newline at end of file
+Benchmarks such as RewardBench give a good overview of current reward model performance (the author is biased here): [@lambert2023entangled] [@zhou2024rmb]
+
+New reward model training methods include aspect-conditioned models [@wang2024interpretable], high-quality human preference datasets [@wang2024helpsteer2] [@wang2024helpsteer2p], scaling [@adler2024nemotron], extensive experimentation [@touvron2023llama], and debiasing of training data [@park2024offsetbias].
+
+## Recommendations
+
+There is a strong tendency in the literature to train reward models for only one epoch, as training longer tends to overfit the preference data.
\ No newline at end of file
diff --git a/chapters/11-policy-gradients.md b/chapters/11-policy-gradients.md
index 5b6fd6c..806be8c 100644
--- a/chapters/11-policy-gradients.md
+++ b/chapters/11-policy-gradients.md
@@ -23,7 +23,7 @@ $$\nabla_\theta J(\pi_\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^T \nabla_\thet
### Reinforce

Reinforce is a specific implementation of vanilla policy gradient that uses a Monte Carlo estimator of the gradient.
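+To make the estimator concrete, below is a minimal sketch of a REINFORCE-style loss, assuming the per-token log-probabilities of the sampled completions and a scalar reward per completion have already been computed (variable names are illustrative):
+
+```python
+import torch
+
+# Hypothetical per-token log-probs for a batch of sampled completions
+# (batch_size x num_tokens) and one scalar reward per completion.
+logprobs = torch.randn(4, 16, requires_grad=True)
+rewards = torch.tensor([0.3, -0.1, 0.9, 0.2])
+
+# Optional baseline (here, the batch mean reward) to reduce variance.
+advantages = rewards - rewards.mean()
+
+# Monte Carlo policy gradient: maximize E[R * log pi(y|x)], i.e. minimize its negative.
+reinforce_loss = -(advantages * logprobs.sum(dim=-1)).mean()
+reinforce_loss.backward()
+```
+
+In practice, the log-probabilities come from the policy model's forward pass over its own sampled completions and the rewards come from a reward model.
+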
-
+REINFORCE has recently been revisited as a simple and effective optimization method for RLHF with language models [@ahmadian2024back].

### Proximal Policy Optimization

## Computing Policy Gradients with a Language Model

diff --git a/chapters/bib.bib b/chapters/bib.bib
index ebe1ebe..35b30e6 100644
--- a/chapters/bib.bib
+++ b/chapters/bib.bib
@@ -58,7 +58,43 @@ @article{kaufmann2023survey
  journal={arXiv preprint arXiv:2312.14925},
  year={2023}
}
+@article{sutton2018reinforcement,
+  title={Reinforcement learning: An introduction},
+  author={Sutton, Richard S and Barto, Andrew G},
+  journal={A Bradford Book},
+  year={2018}
+}
+@inproceedings{ng2000algorithms,
+  title={Algorithms for inverse reinforcement learning},
+  author={Ng, Andrew Y and Russell, Stuart and others},
+  booktitle={ICML},
+  volume={1},
+  number={2},
+  pages={2},
+  year={2000}
+}
# RLHF Methods ####################################################################
+@article{BradleyTerry,
+  ISSN = {00063444},
+  URL = {http://www.jstor.org/stable/2334029},
+  author = {Ralph Allan Bradley and Milton E. Terry},
+  journal = {Biometrika},
+  number = {3/4},
+  pages = {324--345},
+  publisher = {[Oxford University Press, Biometrika Trust]},
+  title = {Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons},
+  urldate = {2023-02-13},
+  volume = {39},
+  year = {1952}
+}
+
+@article{likert1932technique,
+  title={A technique for the measurement of attitudes.},
+  author={Likert, Rensis},
+  journal={Archives of psychology},
+  year={1932}
+}
+
@article{gilks1992adaptive,
  title={Adaptive rejection sampling for Gibbs sampling},
  author={Gilks, Walter R and Wild, Pascal},
@@ -69,7 +105,77 @@ @article{gilks1992adaptive
  year={1992},
  publisher={Wiley Online Library}
}
+@article{ahmadian2024back,
+  title={Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms},
+  author={Ahmadian, Arash and Cremer, Chris and Gall{\'e}, Matthias and Fadaee, Marzieh and Kreutzer, Julia and {\"U}st{\"u}n, Ahmet and Hooker, Sara},
+  journal={arXiv preprint arXiv:2402.14740},
+  year={2024}
+}
+################################################################################################
+# Reward Modeling More ####################################################################
+@article{zhou2024rmb,
+  title={RMB: Comprehensively Benchmarking Reward Models in LLM Alignment},
+  author={Zhou, Enyu and Zheng, Guodong and Wang, Binghai and Xi, Zhiheng and Dou, Shihan and Bao, Rong and Shen, Wei and Xiong, Limao and Fan, Jessica and Mou, Yurong and others},
+  journal={arXiv preprint arXiv:2410.09893},
+  year={2024}
+}
+@inproceedings{zhu2023principled,
+  title={Principled reinforcement learning with human feedback from pairwise or k-wise comparisons},
+  author={Zhu, Banghua and Jordan, Michael and Jiao, Jiantao},
+  booktitle={International Conference on Machine Learning},
+  pages={43037--43067},
+  year={2023},
+  organization={PMLR}
+}
+@article{wang2024interpretable,
+  title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
+  author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
+  journal={arXiv preprint arXiv:2406.12845},
+  year={2024}
+}
+@article{zhang2024generative,
+  title={Generative verifiers: Reward modeling as next-token prediction},
+  author={Zhang, Lunjun and Hosseini, Arian and Bansal, Hritik and Kazemi, Mehran and Kumar, Aviral and Agarwal, Rishabh},
+  journal={arXiv preprint arXiv:2408.15240},
+  year={2024}
+}
+@article{mahan2024generative,
+  title={Generative Reward Models},
+  author={Mahan, Dakota and Phung, Duy Van and Rafailov, Rafael and Blagden,
+          Chase and Lile, Nathan and Castricato, Louis and Franken, Jan-Philipp and Finn, Chelsea and Albalak, Alon},
+  year={2024},
+  url={https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf}
+}
+@article{wang2024helpsteer2,
+  title={HelpSteer2: Open-source dataset for training top-performing reward models},
+  author={Wang, Zhilin and Dong, Yi and Delalleau, Olivier and Zeng, Jiaqi and Shen, Gerald and Egert, Daniel and Zhang, Jimmy J and Sreedhar, Makesh Narsimhan and Kuchaiev, Oleksii},
+  journal={arXiv preprint arXiv:2406.08673},
+  year={2024}
+}
+@article{wang2024helpsteer2p,
+  title={HelpSteer2-Preference: Complementing Ratings with Preferences},
+  author={Wang, Zhilin and Bukharin, Alexander and Delalleau, Olivier and Egert, Daniel and Shen, Gerald and Zeng, Jiaqi and Kuchaiev, Oleksii and Dong, Yi},
+  journal={arXiv preprint arXiv:2410.01257},
+  year={2024}
+}
+@article{adler2024nemotron,
+  title={Nemotron-4 340B Technical Report},
+  author={Adler, Bo and Agarwal, Niket and Aithal, Ashwath and Anh, Dong H and Bhattacharya, Pallab and Brundyn, Annika and Casper, Jared and Catanzaro, Bryan and Clay, Sharon and Cohen, Jonathan and others},
+  journal={arXiv preprint arXiv:2406.11704},
+  year={2024}
+}
+@article{ankner2024critique,
+  title={Critique-out-loud reward models},
+  author={Ankner, Zachary and Paul, Mansheej and Cui, Brandon and Chang, Jonathan D and Ammanabrolu, Prithviraj},
+  journal={arXiv preprint arXiv:2408.11791},
+  year={2024}
+}
+@article{park2024offsetbias,
+  title={Offsetbias: Leveraging debiased data for tuning evaluators},
+  author={Park, Junsoo and Jwa, Seungyeon and Ren, Meiying and Kim, Daeyoung and Choi, Sanghyuk},
+  journal={arXiv preprint arXiv:2407.06551},
+  year={2024}
+}
################################################################################################
# KL Refs ####################################################################
@@ -123,6 +229,13 @@ @article{christiano2017deep
  volume={30},
  year={2017}
}
+@article{ibarz2018reward,
+  title={Reward learning from human preferences and demonstrations in Atari},
+  author={Ibarz, Borja and Leike, Jan and Pohlen, Tobias and Irving, Geoffrey and Legg, Shane and Amodei, Dario},
+  journal={Advances in neural information processing systems},
+  volume={31},
+  year={2018}
+}
@article{leike2018scalable,
  title={Scalable agent alignment via reward modeling: a research direction},
  author={Leike, Jan and Krueger, David and Everitt, Tom and Martic, Miljan and Maini, Vishal and Legg, Shane},
diff --git a/images/anthropic-interface.pdf b/images/anthropic-interface.pdf
new file mode 100644
index 0000000..39d4a93
Binary files /dev/null and b/images/anthropic-interface.pdf differ