Llama 3 family of models does not seem to work with RewardTrainer #2758

JohnGiorgi opened this issue Feb 4, 2025 · 1 comment
Labels: ⚡ accelerate · ⚡ PEFT · 🏋 Reward

JohnGiorgi commented Feb 4, 2025

Reproduction

I noticed, when trying to train Llama 3.1 8B with the RewardTrainer on my own task, that no matter what I tried I couldn't get it to converge. Simply swapping Llama 3.1 for Qwen 2.5 (I tried sizes from 0.5B to 7B) made the model converge without issue. I kept all hyperparameters the same and used the same data (unfortunately I cannot share it).

To reproduce, I ran the official reward_modeling.py example script with Qwen2-0.5B vs. Llama-3.2-1B (both Instruct):

# Qwen training job
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    reward_modeling.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Qwen2-0.5B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

# Llama training job
accelerate launch \
    --config_file="./conf/accelerate_configs/multi_gpu.yaml" \
    --num_processes 8 \
    reward_modeling.py \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --output_dir output/Llama-3.2-1B-Reward \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --max_length 2048

Both were trained on 1 node of 8xA6000s with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Note

I did have to add the line tokenizer.pad_token = tokenizer.eos_token in reward_modeling.py right after the tokenizer and model initialization, similar to how it's done in sft.py, because Llama does not have a pad token. (Auxiliary point: maybe the reward_modeling.py script should do this itself when there is no pad token? Happy to open a PR if so.)
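
For concreteness, here is a minimal standalone sketch of that workaround (the loading code approximates what reward_modeling.py does rather than being a diff against it; the model.config.pad_token_id line is something I add here for completeness when running outside the script):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Llama checkpoints ship without a pad token, so fall back to EOS
# (similar to what sft.py does) and propagate it to the model config.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id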

Sure enough, I see the same issues with convergence:

[Screenshots: training/eval curves for the Qwen and Llama runs]

I don't know if this is a known issue, but I wanted to flag it in case it is (and someone knows the fix), or in case it isn't and I'm doing something silly that someone would be kind enough to point out!

System Info

  • Platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.5.1
  • CUDA device(s): not available
  • Transformers version: 4.48.2
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.25.2
  • TRL version: 0.14.0
  • bitsandbytes version: 0.45.1
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.60.2
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
JohnGiorgi (author) commented:

Huh, it looks like it comes down to what you use as the pad token itself. If I use one of Llama's unused special tokens, <|reserved_special_token_0|> (pad token id 128002), it works! If I use the EOS token as padding, it doesn't...

[Screenshot: training curves with <|reserved_special_token_0|> as the pad token]
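
In case it helps anyone hitting the same thing, this is roughly the change as a standalone sketch (again approximating the script's loading code; the token string and id 128002 come from the Llama 3 tokenizer):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Pad with one of Llama 3's unused reserved tokens instead of EOS,
# so padding never collides with the end-of-sequence/end-of-turn signal.
tokenizer.pad_token = "<|reserved_special_token_0|>"
model.config.pad_token_id = tokenizer.pad_token_id  # 128002

My hunch (unverified) is that when EOS doubles as the pad token, the sequence-classification head's last-non-pad-token pooling treats real <|eot_id|> tokens as padding and reads the reward from the wrong position, which would explain why only the EOS-as-pad runs fail to converge.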

Another callout: in either case, it looks like we somehow end up with two BOS tokens in the chosen/rejected pairs:

[Screenshot: tokenized chosen/rejected example showing two BOS tokens]
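
My guess (not verified against the trainer internals) is that the chat template already prepends <|begin_of_text|> and the tokenizer then adds a second BOS when the formatted string is tokenized. A quick standalone check that reproduces what I'm seeing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# The chat template already prepends <|begin_of_text|> ...
text = tokenizer.apply_chat_template([{"role": "user", "content": "hi"}], tokenize=False)

# ... and tokenizing the resulting string adds another BOS by default.
ids = tokenizer(text).input_ids
print(ids[:2], tokenizer.bos_token_id)  # first two ids are both 128000 (BOS)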
