In this folder, we try to reproduce the test-time-scaling figures from the DeepSeek-R1-Zero paper (their Figures 2 and 3) on a small model. The "aha moment" of self-checking does not matter at this point, for two reasons:
- Self-checks or self-reflections already exist in the base model, without any fine-tuning or post-training. This can easily be verified by sampling enough times and searching the model's responses for words like 'mistake' or 'However'. As we previously observed with Qwen2.5-Math, about 0.5% of responses contain self-reflection.
- However, the large-scale, ubiquitous self-reflection seen in o1 and R1 has not emerged in our post-training of small models.
We use Qwen2.5-1.5B as the base model and meta-math/MetaMathQA as the training dataset. We then train the base model with GRPO (actually a sparse variant of it, explained below); the reward is 1 if the model produces the correct answer and 0 otherwise (a minimal sketch of this reward is given after the run command below). Our results are as follows:
Accuracy on MATH-500 (unlike the original paper, we did not see a large improvement on AIME):
Response length (tokens):
Reward:
This code runs on a single A100 (40 GB). To reproduce, run:

```bash
cd nanoRLHF/examples/r1-v0
python grpo_r1.py
```
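The reward used here is the binary correctness reward described above. Below is a minimal sketch of how such a reward can be implemented; the answer-extraction rule (taking the content of the last `\boxed{...}` span) and the function name are our own assumptions and may differ from what `grpo_r1.py` actually does.

```python
import re

def correctness_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the ground truth, else 0.0.

    Sketch only: the extraction rule below (last \\boxed{...} span) is an assumption
    and may differ from the actual implementation in grpo_r1.py.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0

# Example:
# correctness_reward(r"... so the answer is \boxed{42}.", "42")  -> 1.0
```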
- Sparse GRPO: we discard the samples whose advantage is 0 (e.g., when every response in a group receives the same reward). This saves a lot of time without losing much performance.
- To save more time, we implement dynamic batching for both sample generation and the minibatch forward/backward passes.
These features enable efficient training of Qwen2.5-1.5B with a response length of 8000 tokens; a minimal sketch of both tricks is given below.
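The sketch below illustrates, under our own assumptions, what the two tricks look like: GRPO-style group-relative advantages, the "sparse" filter that drops zero-advantage samples, and greedy dynamic batching under a token budget. Function names and the token budget are placeholders, not the actual API of `grpo_r1.py`.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards within each group of responses
    sampled from the same prompt. `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def keep_nonzero_advantage(samples, advantages, tol: float = 1e-8):
    """'Sparse GRPO' filter: drop samples whose advantage is numerically zero,
    e.g. when every response in a group received the same reward."""
    return [(s, float(a)) for s, a in zip(samples, advantages) if abs(float(a)) > tol]

def dynamic_minibatches(samples, token_budget: int = 8192):
    """Greedy dynamic batching: pack variable-length samples into minibatches whose
    total token count stays under a budget, instead of using a fixed batch size.
    `samples` is an iterable of (token_count, payload) pairs."""
    batch, used = [], 0
    for n_tokens, payload in samples:
        if batch and used + n_tokens > token_budget:
            yield batch
            batch, used = [], 0
        batch.append(payload)
        used += n_tokens
    if batch:
        yield batch

# Example: 2 prompts x 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # mixed rewards -> nonzero advantages
                        [0.0, 0.0, 0.0, 0.0]])  # all identical -> advantages are all zero
adv = group_relative_advantages(rewards).flatten()
kept = keep_nonzero_advantage(list(range(8)), adv)  # the whole second group is dropped
```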
We have been trying to reproduce large-scale, ubiquitous self-reflection in small models; this experiment is just one small piece of that effort, shown here.
If this work is helpful, please cite it as:
```bibtex
@misc{nanoRLHF/r1-v0,
  title={nanoRLHF/r1-v0: DeepSeek-R1-Zero reproduction attempts},
  author={Yannan Luo},
  year={2025},
  url={https://github.com/jackfsuia/nanoRLHF/tree/main/examples/r1-v0}
}
```