RLHF experiments on a single A100 40G GPU.
2025.02.02: A simple DeepSeek R1-Zero reproduction. To enable efficient training on sequences up to 8000 tokens long, we introduce "sparse GRPO" and implement dynamic mini-batching.
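For context, dynamic mini-batching here means packing variable-length samples under a token budget instead of using a fixed batch size. The sketch below is a minimal illustration of the idea only; the function name, the `length` field and the budget value are hypothetical and not nanoRLHF's actual API.

```python
# Hypothetical sketch of token-budget mini-batching for long (up to 8000-token)
# responses: pack samples until the total token count would exceed the budget.
def dynamic_minibatches(samples, max_tokens_per_batch=8192):
    """samples: list of dicts, each with a precomputed token 'length'."""
    batch, batch_tokens = [], 0
    # Sorting by length keeps similarly sized samples together and reduces padding.
    for sample in sorted(samples, key=lambda s: s["length"]):
        if batch and batch_tokens + sample["length"] > max_tokens_per_batch:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sample)
        batch_tokens += sample["length"]
    if batch:
        yield batch
```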
Compared to trl, nanoRLHF
- is much more efficient on a single GPU and supports larger models. This is achieved by using vllm to generate samples, alternating model offloading, LoRA, and some minor modifications to the trl code.
- provides GRPO, ReMax and RAFT implementations, plus a slightly different RLOO that discards some samples to save time.
- selectively enables advantage whitening depending on the algorithm you choose (see the whitening sketch after this list). To my understanding, advantage whitening provides a simple dynamic baseline for the advantage function. Algorithms like GRPO, PPO, RLOO and ReMax already come with their own baseline, so whitening is disabled for them by default; for REINFORCE, it is enabled.
- uses a changing random seed in vllm generation to avoid overfitting.
- provides a more flexible reward function design that is easier to customize for rule-based rewards (see the rule-based example in the GRPO walkthrough below).
- provides value model initialization for PPO (see the initialization sketch after this list). This is extremely crucial for successful PPO training, especially with rule-based rewards, where there is no reward model to initialize the value model from; without it, training can be completely stagnant right from the start.
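For reference, advantage whitening is just batch-level normalization of the advantages. A minimal sketch of the operation that gets toggled per algorithm (not the repo's exact code):

```python
import torch

def whiten(advantages: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    # Normalize advantages to zero mean and unit variance over the batch,
    # which acts as a simple dynamic baseline for plain REINFORCE.
    mean, var = advantages.mean(), advantages.var()
    whitened = (advantages - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened = whitened + mean
    return whitened
```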
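On value model initialization: when a reward model exists, the common trick is to start the value model from its weights so that value estimates are sensible from step 0. The snippet below is only a rough illustration of that idea with transformers, not necessarily how nanoRLHF wires it up for the rule-based case:

```python
from transformers import AutoModelForSequenceClassification

# Illustrative only: initialize the PPO value model from a reward model
# checkpoint (here the default reward model listed below). With a purely
# rule-based reward there is no such checkpoint, which is why nanoRLHF
# provides its own value model initialization for that case.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "OpenAssistant/reward-model-deberta-v3-large-v2", num_labels=1
)
```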
Take GRPO as an example:
```bash
cd nanoRLHF/GRPO
python grpo.py
```
Default settings:
- policy model: Qwen/Qwen2.5-1.5B-Instruct
- reward model/function: OpenAssistant/reward-model-deberta-v3-large-v2
- max_response_length: 1500
- dataset: Anthropic/hh-rlhf
- ...
All settings are in the file you run.
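If you want a rule-based reward instead of the reward model above, you replace the reward computation in that file with your own function. A hypothetical example (the exact signature nanoRLHF expects may differ):

```python
import re

def rule_based_reward(prompt: str, response: str) -> float:
    # Hypothetical rules: +1.0 if the response contains an answer wrapped in
    # \boxed{...}, minus a small penalty for very long responses.
    reward = 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0
    reward -= 0.001 * max(0, len(response.split()) - 512)
    return reward
```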
Training throughput is approximately 1 s/episode with the default settings. Reward results are as follows (the full run has not finished yet; the report is here):
The code is adapted from trl but is considerably more efficient and offers a more flexible reward function design. It is aimed at researchers who want to run small RLHF experiments quickly on a single GPU.