Llama 3 family of models does not seem to work with RewardTrainer #2758

JohnGiorgi opened this issue Feb 4, 2025 · 1 comment
Labels: ⚡ accelerate · ⚡ PEFT · 🏋 Reward

JohnGiorgi commented Feb 4, 2025

Reproduction

I noticed, when trying to train Llama 3.1 8B with the RewardTrainer on my own task, that no matter what I tried I couldn't get it to converge. Simply swapping Llama 3.1 for Qwen 2.5 (I tried sizes from 0.5B to 7B) made the model converge without issue. I kept all hyperparameters the same and used the same data (unfortunately I cannot share it).

To reproduce, I ran the official reward_modeling.py example script with Qwen2-0.5B vs. Llama-3.2-1B (both Instruct):

# Qwen training job
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    reward_modeling.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Qwen2-0.5B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

# Llama training job
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    reward_modeling.py \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Llama-3.2-1B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

Both were trained on 1 node of 8xA6000s with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Note

I did have to add the line tokenizer.pad_token = tokenizer.eos_token in reward_modeling.py right after the tokenizer and model initialization, similar to how it's done in sft.py, because Llama does not have a pad token. (Auxiliary point: maybe the reward_modeling.py script should do this itself when there is no pad token? Happy to open a PR if so.)
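
For concreteness, here is a minimal standalone sketch of that workaround (the loading code approximates what reward_modeling.py does rather than being a diff against it; the model.config.pad_token_id line is something I add here for completeness when running outside the script):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Llama checkpoints ship without a pad token, so fall back to EOS
# (similar to what sft.py does) and propagate it to the model config.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id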

Sure enough, I see the same issues with convergence:

[Screenshots: training/eval curves for the Qwen and Llama runs]

I don't know if this is a known issue, but I wanted to flag it in case it is (and someone knows the fix), or in case it isn't and I'm doing something silly that someone would be kind enough to point out!

System Info

  • Platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.5.1
  • CUDA device(s): not available
  • Transformers version: 4.48.2
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.25.2
  • TRL version: 0.14.0
  • bitsandbytes version: 0.45.1
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.60.2
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
JohnGiorgi (author) commented:

Huh, it looks like it comes down to what you use as the pad token itself. If I use one of Llama's unused special tokens, <|reserved_special_token_0|> (pad token id 128002), it works! If I use the EOS token as padding, it doesn't...

[Screenshot: training curves with <|reserved_special_token_0|> as the pad token]
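
In case it helps anyone hitting the same thing, this is roughly the change as a standalone sketch (again approximating the script's loading code; the token string and id 128002 come from the Llama 3 tokenizer):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Pad with one of Llama 3's unused reserved tokens instead of EOS,
# so padding never collides with the end-of-sequence/end-of-turn signal.
tokenizer.pad_token = "<|reserved_special_token_0|>"
model.config.pad_token_id = tokenizer.pad_token_id  # 128002

My hunch (unverified) is that when EOS doubles as the pad token, the sequence-classification head's last-non-pad-token pooling treats real <|eot_id|> tokens as padding and reads the reward from the wrong position, which would explain why only the EOS-as-pad runs fail to converge.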

Another callout: in either case, it looks like we somehow end up with two BOS tokens in the chosen/rejected pairs:

[Screenshot: tokenized chosen/rejected example showing two BOS tokens]
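
My guess (not verified against the trainer internals) is that the chat template already prepends <|begin_of_text|> and the tokenizer then adds a second BOS when the formatted string is tokenized. A quick standalone check that reproduces what I'm seeing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# The chat template already prepends <|begin_of_text|> ...
text = tokenizer.apply_chat_template([{"role": "user", "content": "hi"}], tokenize=False)

# ... and tokenizing the resulting string adds another BOS by default.
ids = tokenizer(text).input_ids
print(ids[:2], tokenizer.bos_token_id)  # first two ids are both 128000 (BOS)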
