Rejection Sampling + Run Again (#7)
natolambert authored Aug 11, 2024
1 parent e385796 commit 18d33da
Showing 8 changed files with 70 additions and 9 deletions.
4 changes: 0 additions & 4 deletions .github/workflows/static.yml
Original file line number Diff line number Diff line change
@@ -38,10 +38,6 @@ jobs:
run: |
echo "/Library/TeX/texbin" >> $GITHUB_PATH
echo "PATH=$PATH:/Library/TeX/texbin" >> $GITHUB_ENV
xelatex --version # Verify xelatex is accessible
if ! command -v xelatex &> /dev/null; then
sudo ln -s /Library/TeX/texbin/xelatex /usr/local/bin/xelatex
fi
2 changes: 1 addition & 1 deletion Makefile
@@ -42,7 +42,7 @@ PANDOC_COMMAND = pandoc
DOCX_ARGS = --standalone --reference-doc templates/docx.docx
EPUB_ARGS = --template templates/epub.html --epub-cover-image $(COVER_IMAGE)
HTML_ARGS = --template templates/html.html --standalone --to html5
PDF_ARGS = --template templates/pdf.latex --pdf-engine xelatex
PDF_ARGS = --template templates/pdf.tex --pdf-engine xelatex
NESTED_HTML_TEMPLATE = templates/chapter.html

# Per-format file dependencies
1 change: 1 addition & 0 deletions README.md
@@ -59,6 +59,7 @@ sudo apt-get install texlive-fonts-recommended texlive-xetex
brew install pandoc
brew install make
```
(See below for `pandoc-crossref`)

### Folder structure

@@ -1,4 +1,4 @@
# Installation
# Optimization - Overview

This is the installation chapter.
We love the book [@russell2016artificial].
14 changes: 14 additions & 0 deletions chapters/03-opt-rejection-sampling.md
@@ -0,0 +1,14 @@
# Rejection Sampling

Rejection Sampling (RS) is a popular and simple baseline for performing preference fine-tuning.
Rejection sampling operates by generating candidate completions to a set of instructions, filtering them with a trained reward model, and then fine-tuning the original model only on the top completions.
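The loop above can be sketched in a few lines. This is a minimal sketch, not a canonical implementation: `toy_generate` and `toy_reward` are hypothetical stand-ins for a real policy model and reward model, and the best-of-n filtering step is the only part the text specifies.

```python
import random

def best_of_n_filter(prompts, generate, reward, n=4, top_k=1):
    """For each prompt, sample n candidate completions, score each with
    the reward model, and keep the top_k as fine-tuning data."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        ranked = sorted(candidates, key=reward, reverse=True)
        for completion in ranked[:top_k]:
            dataset.append((prompt, completion))
    return dataset

# Toy stand-ins for the policy model and the reward model (hypothetical).
random.seed(0)

def toy_generate(prompt):
    return prompt + " answer " + "!" * random.randint(1, 5)

def toy_reward(completion):
    return len(completion)  # placeholder scalar score
```

The resulting `(prompt, completion)` pairs would then be used for ordinary supervised fine-tuning of the original model.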

The name originates from computational statistics [@gilks1992adaptive], where one wishes to sample from a complex distribution, but does not have a direct method to do so.
To alleviate this, one samples from a simpler-to-model distribution and uses a heuristic to check whether the sample is permissible.
With language models, the target distribution is high-quality answers to instructions, the filter is a reward model, and the sampling distribution is the current model.
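The classical procedure from computational statistics can be illustrated with a short sketch. The triangular target density and uniform proposal below are illustrative choices, not from the source; the acceptance rule `p(x) / (m * q(x))` is the standard one.

```python
import random

def rejection_sample(target_pdf, proposal_sample, proposal_pdf, m, num_samples):
    """Sample from target_pdf by drawing from a simpler proposal and
    accepting each draw x with probability target_pdf(x) / (m * proposal_pdf(x))."""
    samples = []
    while len(samples) < num_samples:
        x = proposal_sample()
        if random.random() < target_pdf(x) / (m * proposal_pdf(x)):
            samples.append(x)
    return samples

# Toy target: triangular density p(x) = 2x on [0, 1]; proposal: uniform on [0, 1].
# The envelope constant m = 2 guarantees p(x) <= m * q(x) everywhere.
random.seed(0)
draws = rejection_sample(lambda x: 2 * x, random.random, lambda x: 1.0, 2.0, 2000)
```

In the language-model analogy, the proposal draws are the current model's generations and the acceptance check is replaced by the reward-model filter.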

## Related works

Many prominent RLHF and preference fine-tuning papers have used rejection sampling as a baseline, but a canonical implementation and documentation do not exist.

WebGPT [@nakano2021webgpt], Anthropic's Helpful and Harmless agent [@bai2022training], OpenAI's popular paper on process reward models [@lightman2023let], Llama 2 Chat models [@touvron2023llama], and other seminal works all use this baseline.
3 changes: 0 additions & 3 deletions chapters/03-usage.md

This file was deleted.

39 changes: 39 additions & 0 deletions chapters/bib.bib
@@ -10,4 +10,43 @@ @book{russell2016artificial
author={Russell, Stuart J and Norvig, Peter},
year={2016},
publisher={Pearson}
}

@article{gilks1992adaptive,
title={Adaptive rejection sampling for Gibbs sampling},
author={Gilks, Walter R and Wild, Pascal},
journal={Journal of the Royal Statistical Society: Series C (Applied Statistics)},
volume={41},
number={2},
pages={337--348},
year={1992},
publisher={Wiley Online Library}
}

@article{nakano2021webgpt,
  title={{WebGPT}: Browser-assisted question-answering with human feedback},
author={Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others},
journal={arXiv preprint arXiv:2112.09332},
year={2021}
}

@article{bai2022training,
title={Training a helpful and harmless assistant with reinforcement learning from human feedback},
author={Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and others},
journal={arXiv preprint arXiv:2204.05862},
year={2022}
}

@article{lightman2023let,
title={Let's verify step by step},
author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
journal={arXiv preprint arXiv:2305.20050},
year={2023}
}

@article{touvron2023llama,
title={Llama 2: Open foundation and fine-tuned chat models},
author={Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others},
journal={arXiv preprint arXiv:2307.09288},
year={2023}
}
14 changes: 14 additions & 0 deletions templates/pdf.latex → templates/pdf.tex
@@ -185,6 +185,18 @@
$if(indent)$
$else$
\makeatletter
% new code here
\newsavebox\pandoc@box
\newcommand*\pandocbounded[1]{%
\sbox\pandoc@box{#1}%
\Gscale@div\@tempa{\textheight}{\dimexpr\ht\pandoc@box+\dp\pandoc@box\relax}%
\Gscale@div\@tempb{\linewidth}{\wd\pandoc@box}%
\ifdim\@tempb\p@<\@tempa\p@\let\@tempa\@tempb\fi%
\ifdim\@tempa\p@<\p@\scalebox{\@tempa}{\usebox\pandoc@box}%
\else\usebox{\pandoc@box}%
\fi%
}

\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
@@ -427,6 +439,8 @@
$endif$
$endif$



\begin{document}
$if(has-frontmatter)$
\frontmatter
