RLHF experiments on a single A100 40G GPU.
2025.02.02: A simple DeepSeek R1-Zero reproduction. To enable efficient training on sequences up to 8000 tokens long, we introduce "sparse GRPO" and implement dynamic mini-batching.
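For context, dynamic mini-batching here means packing variable-length samples under a token budget instead of using a fixed batch size. The sketch below is a minimal illustration of the idea only; the function name, the `length` field and the budget value are hypothetical and not nanoRLHF's actual API.

```python
# Hypothetical sketch of token-budget mini-batching for long (up to 8000-token)
# responses: pack samples until the total token count would exceed the budget.
def dynamic_minibatches(samples, max_tokens_per_batch=8192):
    """samples: list of dicts, each with a precomputed token 'length'."""
    batch, batch_tokens = [], 0
    # Sorting by length keeps similarly sized samples together and reduces padding.
    for sample in sorted(samples, key=lambda s: s["length"]):
        if batch and batch_tokens + sample["length"] > max_tokens_per_batch:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sample)
        batch_tokens += sample["length"]
    if batch:
        yield batch
```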
Compared to trl, nanoRLHF
- is much more efficient on a single GPU and supports larger models. This is achieved by using vllm to generate samples, alternating model offloading, LoRA, and some minor modifications to the trl code.
- provides GRPO, ReMax and RAFT implementations, plus a slightly different RLOO that discards some samples to save time.
- selectively enables advantage whitening depending on the algorithm you choose (see the whitening sketch after this list). To my understanding, advantage whitening provides a simple dynamic baseline for the advantage function. Algorithms like GRPO, PPO, RLOO and ReMax already come with their own baseline, so whitening is disabled for them by default; for REINFORCE, it is enabled.
- uses a changing random seed in vllm generation to avoid overfitting.
- provides a more flexible reward function design that is easier to customize for rule-based rewards (see the rule-based example in the GRPO walkthrough below).
- provides value model initialization for PPO (see the initialization sketch after this list). This is extremely crucial for successful PPO training, especially with rule-based rewards, where there is no reward model to initialize the value model from; without it, training can be completely stagnant right from the start.
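For reference, advantage whitening is just batch-level normalization of the advantages. A minimal sketch of the operation that gets toggled per algorithm (not the repo's exact code):

```python
import torch

def whiten(advantages: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    # Normalize advantages to zero mean and unit variance over the batch,
    # which acts as a simple dynamic baseline for plain REINFORCE.
    mean, var = advantages.mean(), advantages.var()
    whitened = (advantages - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened = whitened + mean
    return whitened
```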
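On value model initialization: when a reward model exists, the common trick is to start the value model from its weights so that value estimates are sensible from step 0. The snippet below is only a rough illustration of that idea with transformers, not necessarily how nanoRLHF wires it up for the rule-based case:

```python
from transformers import AutoModelForSequenceClassification

# Illustrative only: initialize the PPO value model from a reward model
# checkpoint (here the default reward model listed below). With a purely
# rule-based reward there is no such checkpoint, which is why nanoRLHF
# provides its own value model initialization for that case.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "OpenAssistant/reward-model-deberta-v3-large-v2", num_labels=1
)
```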
Take GRPO as an example:
```bash
cd nanoRLHF/GRPO
python grpo.py
```
Default settings:
- policy model: Qwen/Qwen2.5-1.5B-Instruct
- reward model/function: OpenAssistant/reward-model-deberta-v3-large-v2
- max_response_length: 1500
- dataset: Anthropic/hh-rlhf
- ...
All settings are in the file you run.
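If you want a rule-based reward instead of the reward model above, you replace the reward computation in that file with your own function. A hypothetical example (the exact signature nanoRLHF expects may differ):

```python
import re

def rule_based_reward(prompt: str, response: str) -> float:
    # Hypothetical rules: +1.0 if the response contains an answer wrapped in
    # \boxed{...}, minus a small penalty for very long responses.
    reward = 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0
    reward -= 0.001 * max(0, len(response.split()) - 512)
    return reward
```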
Training throughput is approximately 1 s/episode with the default settings. Reward results are as follows (the full run has not finished yet; the report is here):
The code is adapted from trl but is considerably more efficient and offers a more flexible reward function design. It is aimed at researchers who want to run small RLHF experiments quickly on a single GPU.