
Reward Modeling (and other tweaks) (#20)
natolambert authored Oct 19, 2024
1 parent e12e698 commit aecf8ed
Showing 6 changed files with 188 additions and 22 deletions.
1 change: 1 addition & 0 deletions chapters/04-related-works.md
@@ -14,6 +14,7 @@ Still, many of the techniques used today are deeply related to core techniques f
*TAMER: Training an Agent Manually via Evaluative Reinforcement* proposed a learned agent where humans iteratively provided scores on the actions taken in order to learn a reward model [@knox2008tamer]. Other work, concurrent or soon after, proposed COACH, an actor-critic algorithm where human feedback (both positive and negative) is used to tune the advantage function [@macglashan2017interactive].

The primary reference, Christiano et al. 2017, is an application of RLHF to preferences between Atari trajectories [@christiano2017deep]. The work shows that humans choosing between trajectories can be more effective in some domains than directly interacting with the environment. It relies on some clever experimental conditions, but is impressive nonetheless.
This method was expanded upon with more direct reward modeling [@ibarz2018reward].
TAMER was adapted to deep learning with Deep TAMER just one year later [@warnell2018deep].

This era began to transition as reward models as a general notion were proposed as a method for studying alignment, rather than just a tool for solving RL problems [@leike2018scalable].
8 changes: 8 additions & 0 deletions chapters/06-preference-data.md
@@ -6,6 +6,10 @@ Getting the most out of human data involves iterative training of models, evolvi
The process is difficult for new organizations trying to add human data to their pipelines.
Given the sensitivity, processes that are found to work and improve the models are exploited until their performance gains run out.

## Rankings vs. Ratings

Preference data is collected either as rankings, which order two or more completions to the same prompt relative to each other, or as ratings, which assign an absolute score to each completion, often on a Likert scale [@likert1932technique].
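
As a small illustration (the field names here are hypothetical, not a specific vendor's schema), the same prompt can be labeled either way:

```python
# Ranking: a relative ordering between completions for one prompt.
ranking_label = {
    "prompt": "Summarize the article in one sentence.",
    "completion_a": "...",
    "completion_b": "...",
    "preferred": "a",  # only the direction of preference is recorded
}

# Rating: an absolute score per completion, e.g. on a 1-5 Likert scale.
rating_label = {
    "prompt": "Summarize the article in one sentence.",
    "completion": "...",
    "score": 4,
}
```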

### Sourcing and Contracts

The first step is sourcing the vendor to provide data (or one's own annotators).
@@ -17,6 +21,10 @@ On multiple occasions, I’ve heard of data companies not delivering their data

Once a contract is settled, the data buyer and data provider agree upon instructions for the task(s) purchased. These are intricate documents with extensive details, corner cases, and priorities for the data. A popular example of data instructions is the one that [OpenAI released for InstructGPT](https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit#heading=h.21o5xkowgmpj).

An example interface is shown below from [@bai2022training]:

![Example preference data collection interface.](images/anthropic-interface.pdf){#fig:preference-interface}

Depending on the domains of interest in the data, timelines for when the data can be labeled or curated vary. High-demand areas like mathematical reasoning or coding must be locked into a schedule weeks out. Simply delaying data collection is not always an option -- vendors such as Scale AI manage their workforces the way AI research labs manage the compute-intensive jobs on their clusters.

Once everything is agreed upon, the actual collection process is a high-stakes time for post-training teams. All the infrastructure, evaluation tools, and plans for how to use the data and make downstream decisions must be in place.
86 changes: 65 additions & 21 deletions chapters/07-reward-models.md
@@ -1,25 +1,39 @@
# Reward Modeling

Reward models are core to the modern approach to RLHF.
Reward models have been used extensively in reinforcement learning research as a proxy for environment rewards [@sutton2018reinforcement].
The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent's reward function given trajectories of behavior [@ng2000algorithms], as well as to other areas of deep reinforcement learning.
Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [@leike2018scalable].

## Training Reward Models

There are two popular expressions for how to train a reward model -- they are numerically equivalent.
The canonical implementation is derived from the Bradley-Terry model of preference [@BradleyTerry].
A Bradley-Terry model of preferences measures the probability that, for two events drawn from the same distribution, say $i$ and $j$, the pairwise comparison satisfies the following relation, $i > j$:

$$P(i > j) = \frac{p_i}{p_i + p_j}$$ {#eq:bradterry}

To train a reward model, we must formulate a loss function that satisfies the above relation.
The first step is to convert a language model into a model that outputs a scalar value, often in the form of a single classification logit.
Thus, we can take two samples -- the $i$ and $j$ above become two completions, $y_1$ and $y_2$, to the same prompt, $x$ -- and score both of them with the model, $r_\theta$.

The probability of success for a given reward model in a pairwise comparison becomes:

$$P(y_1 > y_2) = \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))}$$ {#eq:bradterryrm}

Then, by taking the negative log-likelihood of this probability, we can arrive at the loss function used to train a reward model.
The first form, as in [@ouyang2022training] and other works:
$$\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)$$ {#eq:rewardmodeling1}

Second, as in [@askell2021general] and other works:
$$\mathcal{L}(\theta) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)$$ {#eq:rewardmodeling2}
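
To see that the two are equivalent, note that with $\sigma(z) = \frac{1}{1+e^{-z}}$:

$$- \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right) = \log \left( 1 + e^{-\left(r_{\theta}(x, y_w) - r_{\theta}(x, y_l)\right)} \right) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)$$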


## Implementation Example

Implementing the reward modeling loss is quite simple.
More of the implementation challenge lies in setting up a separate data loader and inference pipeline.
Given the correct dataloader, the loss is implemented as:
```python
import torch.nn as nn

# Score the chosen and rejected completions for the same prompt with the reward model
rewards_chosen = model(**inputs_chosen)
rewards_rejected = model(**inputs_rejected)

# Bradley-Terry pairwise loss: push the chosen reward above the rejected reward
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```

## Variants

Reward modeling is a relatively under-explored area of RLHF.
The traditional reward modeling loss has been modified in many popular works, but the modifications have not solidified into a single best practice.

### Preference Margin Loss

In the case where annotators provide either scores or rankings on a Likert scale, the magnitude of the preference between the two completions can be used in training.
The most common practice is to binarize the preference direction, implicitly assigning scores of 1 and 0, but the additional information has been used to improve model training.
Llama 2 proposes using the margin between the two datapoints, $m(r)$, to distinguish the magnitude of preference [@touvron2023llama]:

$$\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) - m(r) \right) \right)$$ {#eq:rewardmodelingmargin}
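
A minimal sketch of this variant, assuming a `margin` tensor computed from the annotators' labels (an implementation detail not specified here), only changes the argument of the sigmoid relative to the example above:

```python
import torch.nn as nn

# rewards_chosen, rewards_rejected: scalar reward model outputs for the pair
# margin: per-example preference magnitude m(r), e.g. derived from Likert labels
loss = -nn.functional.logsigmoid(
    rewards_chosen - rewards_rejected - margin
).mean()
```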

### Balancing Multiple Comparisons Per Prompt

InstructGPT studies the impact of using a variable number of completions per prompt, while still balancing them in the reward model training [@ouyang2022training].
To do this, they weight the loss updates per comparison per prompt.
At an implementation level, this can be done automatically by including all comparisons with the same prompt in the same training batch, naturally weighting the different pairs -- not doing this caused overfitting to the prompts.
The loss function becomes:

$$\mathcal{L}(\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l)\sim D} \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)$$ {#eq:rewardmodelinginstructgpt}
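
A minimal sketch of this batching strategy, under the assumption that the scores for all $K$ completions of a single prompt are available in one tensor (ordered from most to least preferred):

```python
import itertools
import torch
import torch.nn as nn

def per_prompt_loss(scores: torch.Tensor) -> torch.Tensor:
    # scores: (K,) reward model outputs for the K completions of one prompt,
    # ordered so that index i is preferred over index j whenever i < j.
    pairs = itertools.combinations(range(scores.shape[0]), 2)
    losses = [
        -nn.functional.logsigmoid(scores[i] - scores[j]) for i, j in pairs
    ]
    # Average over all K-choose-2 comparisons so each prompt contributes equally.
    return torch.stack(losses).mean()
```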


### K-wise Loss Function

The reward model behind Starling was trained with a K-wise loss function derived from the Plackett-Luce model, following [@zhu2023principled] (https://arxiv.org/abs/2301.11270).
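
As a sketch (the tensor layout and function name here are illustrative assumptions, not the Starling implementation), the K-wise maximum likelihood loss under the Plackett-Luce model can be written as:

```python
import torch

def kwise_plackett_luce_loss(rewards_ranked: torch.Tensor) -> torch.Tensor:
    # rewards_ranked: (batch, K) reward scores ordered best-to-worst by annotators.
    batch_size, K = rewards_ranked.shape
    nll = torch.zeros(batch_size, device=rewards_ranked.device)
    for k in range(K - 1):
        # Plackett-Luce factor: probability the k-th ranked completion beats
        # everything ranked below it, i.e. a softmax over the remaining suffix.
        log_norm = torch.logsumexp(rewards_ranked[:, k:], dim=1)
        nll = nll - (rewards_ranked[:, k] - log_norm)
    return nll.mean()
```

For $K=2$ this reduces to the pairwise Bradley-Terry loss above.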

## Generative Reward Modeling

Generative reward models produce their judgments as generated text rather than only a scalar score [@mahan2024generative], [@zhang2024generative], [@lambert2023entangled], and some approaches combine generative and classifier heads [@ankner2024critique].

They are closely related to LLM-as-a-judge and other evaluator models, which are very popular.

## Further Reading

A short reward modeling reading list:

RewardBench (biased, but gives a good overview): [@lambert2023entangled] [@zhou2024rmb]

New reward model training methods cover aspect-conditioned models [@wang2024interpretable], high-quality human datasets [@wang2024helpsteer2] [@wang2024helpsteer2p], scaling [@adler2024nemotron], extensive experimentation [@touvron2023llama], and debiasing data [@park2024offsetbias].

## Recommendations

There is a strong tendency in the literature to train reward models for only one epoch, as they overfit with further training.
2 changes: 1 addition & 1 deletion chapters/11-policy-gradients.md
Original file line number Diff line number Diff line change
@@ -23,7 +23,7 @@ $$\nabla_\theta J(\pi_\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^T \nabla_\thet
### Reinforce

Reinforce is a specific implementation of vanilla policy gradient that uses a Monte Carlo estimator of the gradient [@ahmadian2024back].
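
A minimal sketch (not from any specific paper) of a REINFORCE-style loss for a language model, where the per-completion scalar reward weights the log-likelihood of the sampled tokens:

```python
import torch

# logprobs: (batch, seq_len) per-token log-probabilities of sampled completions
# mask: (batch, seq_len) with 1 for completion tokens, 0 for prompt/padding
# rewards: (batch,) scalar reward for each completion (e.g., from a reward model)
def reinforce_loss(logprobs, mask, rewards):
    seq_logprob = (logprobs * mask).sum(dim=-1)       # log pi_theta(y | x)
    # Monte Carlo policy gradient: maximize reward-weighted log-likelihood
    return -(rewards * seq_logprob).mean()
```
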
### Proximal Policy Optimization

## Computing Policy Gradients with a Language Model
Expand Down
113 changes: 113 additions & 0 deletions chapters/bib.bib
@@ -58,7 +58,43 @@ @article{kaufmann2023survey
journal={arXiv preprint arXiv:2312.14925},
year={2023}
}
@article{sutton2018reinforcement,
title={Reinforcement learning: An introduction},
author={Sutton, Richard S},
journal={A Bradford Book},
year={2018}
}
@inproceedings{ng2000algorithms,
title={Algorithms for inverse reinforcement learning.},
author={Ng, Andrew Y and Russell, Stuart and others},
booktitle={Icml},
volume={1},
number={2},
pages={2},
year={2000}
}
# RLHF Methods ####################################################################
@article{BradleyTerry,
ISSN = {00063444},
URL = {http://www.jstor.org/stable/2334029},
author = {Ralph Allan Bradley and Milton E. Terry},
journal = {Biometrika},
number = {3/4},
pages = {324--345},
publisher = {[Oxford University Press, Biometrika Trust]},
title = {Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons},
urldate = {2023-02-13},
volume = {39},
year = {1952}
}

@article{likert1932technique,
title={A technique for the measurement of attitudes.},
author={Likert, Rensis},
journal={Archives of psychology},
year={1932}
}

@article{gilks1992adaptive,
title={Adaptive rejection sampling for Gibbs sampling},
author={Gilks, Walter R and Wild, Pascal},
@@ -69,7 +105,77 @@ @article{gilks1992adaptive
year={1992},
publisher={Wiley Online Library}
}
@article{ahmadian2024back,
title={Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms},
author={Ahmadian, Arash and Cremer, Chris and Gall{\'e}, Matthias and Fadaee, Marzieh and Kreutzer, Julia and {\"U}st{\"u}n, Ahmet and Hooker, Sara},
journal={arXiv preprint arXiv:2402.14740},
year={2024}
}
################################################################################################
# Reward Modeling More ####################################################################
@article{zhou2024rmb,
title={RMB: Comprehensively Benchmarking Reward Models in LLM Alignment},
author={Zhou, Enyu and Zheng, Guodong and Wang, Binghai and Xi, Zhiheng and Dou, Shihan and Bao, Rong and Shen, Wei and Xiong, Limao and Fan, Jessica and Mou, Yurong and others},
journal={arXiv preprint arXiv:2410.09893},
year={2024}
}
@inproceedings{zhu2023principled,
title={Principled reinforcement learning with human feedback from pairwise or k-wise comparisons},
author={Zhu, Banghua and Jordan, Michael and Jiao, Jiantao},
booktitle={International Conference on Machine Learning},
pages={43037--43067},
year={2023},
organization={PMLR}
}
@article{wang2024interpretable,
title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
journal={arXiv preprint arXiv:2406.12845},
year={2024}
}
@article{zhang2024generative,
title={Generative verifiers: Reward modeling as next-token prediction},
author={Zhang, Lunjun and Hosseini, Arian and Bansal, Hritik and Kazemi, Mehran and Kumar, Aviral and Agarwal, Rishabh},
journal={arXiv preprint arXiv:2408.15240},
year={2024}
}
@article{mahan2024generative,
title={Generative Reward Models},
author={Mahan, Dakota and Phung, Duy Van and Rafailov, Rafael and Blagden, Chase and Lile, Nathan and Castricato, Louis and Franken, Jan-Philipp and Finn, Chelsea and Albalak, Alon},
year={2024},
url={https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf}
}
@article{wang2024helpsteer2,
title={HelpSteer2: Open-source dataset for training top-performing reward models},
author={Wang, Zhilin and Dong, Yi and Delalleau, Olivier and Zeng, Jiaqi and Shen, Gerald and Egert, Daniel and Zhang, Jimmy J and Sreedhar, Makesh Narsimhan and Kuchaiev, Oleksii},
journal={arXiv preprint arXiv:2406.08673},
year={2024}
}
@article{wang2024helpsteer2p,
title={HelpSteer2-Preference: Complementing Ratings with Preferences},
author={Wang, Zhilin and Bukharin, Alexander and Delalleau, Olivier and Egert, Daniel and Shen, Gerald and Zeng, Jiaqi and Kuchaiev, Oleksii and Dong, Yi},
journal={arXiv preprint arXiv:2410.01257},
year={2024}
}
@article{adler2024nemotron,
title={Nemotron-4 340B Technical Report},
author={Adler, Bo and Agarwal, Niket and Aithal, Ashwath and Anh, Dong H and Bhattacharya, Pallab and Brundyn, Annika and Casper, Jared and Catanzaro, Bryan and Clay, Sharon and Cohen, Jonathan and others},
journal={arXiv preprint arXiv:2406.11704},
year={2024}
}
@article{ankner2024critique,
title={Critique-out-loud reward models},
author={Ankner, Zachary and Paul, Mansheej and Cui, Brandon and Chang, Jonathan D and Ammanabrolu, Prithviraj},
journal={arXiv preprint arXiv:2408.11791},
year={2024}
}
@article{park2024offsetbias,
title={Offsetbias: Leveraging debiased data for tuning evaluators},
author={Park, Junsoo and Jwa, Seungyeon and Ren, Meiying and Kim, Daeyoung and Choi, Sanghyuk},
journal={arXiv preprint arXiv:2407.06551},
year={2024}
}
################################################################################################
# KL Refs ####################################################################
@@ -123,6 +229,13 @@ @article{christiano2017deep
volume={30},
year={2017}
}
@article{ibarz2018reward,
title={Reward learning from human preferences and demonstrations in atari},
author={Ibarz, Borja and Leike, Jan and Pohlen, Tobias and Irving, Geoffrey and Legg, Shane and Amodei, Dario},
journal={Advances in neural information processing systems},
volume={31},
year={2018}
}
@article{leike2018scalable,
title={Scalable agent alignment via reward modeling: a research direction},
author={Leike, Jan and Krueger, David and Everitt, Tom and Martic, Miljan and Maini, Vishal and Legg, Shane},
Binary file added images/anthropic-interface.pdf
