
Adding grpo training #1233

Draft
Goekdeniz-Guelmez wants to merge 50 commits into main

Conversation

Goekdeniz-Guelmez
Contributor

No description provided.

@mark-lord

mark-lord commented Feb 2, 2025

Absolute HERO! Been trying to figure this out myself the past week but made pretty much no progress whatsoever, other than to make a script that fills up all the RAM on my Mac 🤣

Is there any way to run this yet? I assume not, since at the moment it's still marked as a draft and there isn't a lora_config.yaml like in the DPO example yet (not sure if one is needed).

@Goekdeniz-Guelmez
Contributor Author

No, not yet. I still have to implement the dataset wrapper and some other things; I'll let you know when it's done.

@Guo-astro left a comment

Possible need to use expanded_prompts, expanded_answers in both reward and loss

llms/mlx_lm/tuner/grpo_trainer.py (outdated review thread)
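
To illustrate what the review comment is getting at, here is a minimal sketch; the names expanded_prompts, expanded_answers, and reward_fns are illustrative, not necessarily the PR's actual identifiers. In GRPO each prompt is sampled group_size times, so the prompts and reference answers have to be repeated to the same length as the completion list, and the same repeated lists should feed both the reward functions and the loss.

# Hedged sketch of the expansion the review comment refers to; the names
# expanded_prompts / expanded_answers / reward_fns are illustrative only.
def expand_for_group(prompts, answers, group_size):
    # Repeat each prompt and answer group_size times so that entry i of the
    # expanded lists lines up with completion i of the sampled group.
    expanded_prompts = [p for p in prompts for _ in range(group_size)]
    expanded_answers = [a for a in answers for _ in range(group_size)]
    return expanded_prompts, expanded_answers

# The same expanded lists would then be passed to both the reward computation
# and the loss, so rewards and log-probabilities are matched against the same
# prompt/answer pair, e.g.:
#   expanded_prompts, expanded_answers = expand_for_group(prompts, answers, group_size)
#   rewards = [fn(prompts=expanded_prompts, completions=completions,
#                 answer=expanded_answers) for fn in reward_fns]
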
@Goekdeniz-Guelmez
Contributor Author

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-0.5B \
    --train \
    --data /Users/gokdenizgulmez/Desktop/test_grpo \
    --iters 5 \
    --batch-size 1 \
    --num-layers 4 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Desktop/test-grpo-full \
    --max-seq-length 128 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2

Output

Loading pretrained model
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 124936.71it/s]
Loading datasets
Training
Trainable parameters: 0.109% (0.541M/494.033M)
Starting GRPO training with 5 reward functions..., iters: 5
[WARNING] Some prompts are longer than 128 tokens. Long prompts will be truncated.
Iter 1: Val loss 0.00000140, Val total_rewards_mean -0.359, Val total_rewards_std 0.010, Val grouped_rewards_mean -0.359, Val grouped_rewards_std 0.010, Val kl 0.000, Val reward_func_0_mean 0.000, Val reward_func_0_std 0.000, Val reward_func_1_mean 0.000, Val reward_func_1_std 0.000, Val reward_func_2_mean 0.000, Val reward_func_2_std 0.000, Val reward_func_3_mean 0.000, Val reward_func_3_std 0.000, Val reward_func_4_mean -1.794, Val reward_func_4_std 0.051, Val took 8.385s

But after that, my 32 GB of RAM gets fully used. I tried adding some memory optimisations, but the memory usage is still too high.
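
If it helps, one thing worth trying is capping MLX's Metal buffer cache before the training loop. A rough sketch, assuming the mx.metal memory-management helpers (these are not part of this PR, and the limit value is arbitrary):

import mlx.core as mx

# Hedged sketch: limit how much freed GPU memory MLX keeps cached for reuse.
# The 4 GB figure is an arbitrary example, not a recommended value.
mx.metal.set_cache_limit(4 * 1024**3)

# Optionally also release the cache between iterations, e.g. right after the
# optimizer update in the training loop:
# mx.metal.clear_cache()
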

@Goekdeniz-Guelmez
Contributor Author

Iter 1: Val loss -0.00000057, Val total_rewards_mean -0.387, Val total_rewards_std 0.026, Val grouped_rewards_mean -0.387, Val grouped_rewards_std 0.026, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean -1.937, Val r1_count_xml_std 0.128, Val took 8.314s

Still uses too much memory.

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Feb 3, 2025

So I tried the same setup with TRL and it used the same amount of RAM, so the issue isn't on my side.

@mark-lord

🚀

Would you be able to share the datasets you used for the training? Will give it a go on my machine as soon as I can 🙌

@Goekdeniz-Guelmez
Contributor Author

Will do that tomorrow 🤝

@Guo-astro

> Would you be able to share the datasets you used for the training? Will give it a go on my machine as soon as I can 🙌

I created a quick one only for testing the code

https://huggingface.co/datasets/Goastro/mlx-grpo-dataset
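
For anyone who wants a quick look before training, a small hedged sketch with the Hugging Face datasets library; the split and column names are whatever the dataset actually ships with, nothing here is specific to this PR:

from datasets import load_dataset

# Hedged sketch: download the test dataset linked above and inspect one row.
ds = load_dataset("Goastro/mlx-grpo-dataset")
print(ds)              # lists the available splits and columns
print(ds["train"][0])  # assumes a "train" split; adjust if the split differs
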

@Goekdeniz-Guelmez
Contributor Author

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-0.5B \
    --train \
    --data /Users/gokdenizgulmez/Desktop/test_grpo \
    --iters 5 \
    --batch-size 1 \
    --num-layers 8 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Desktop/test-grpo-full \
    --max-seq-length 255 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2 \
    --max-completion-length 6

Output:

Loading pretrained model
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 72853.92it/s]
Loading datasets
Training
Trainable parameters: 0.109% (0.541M/494.033M)
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10955.27it/s]
Starting GRPO training with 5 reward functions..., iters: 5
Iter 1: Val loss 0.00000000, Val total_rewards_mean -0.354, Val total_rewards_std 0.012, Val grouped_rewards_mean -0.354, Val grouped_rewards_std 0.012, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean -1.769, Val r1_count_xml_std 0.060, Val took 26.298s
Iter 1: Train loss -0.00001353, Total rewards mean -0.306, Total rewards std 0.001, Grouped rewards mean -0.306, Grouped rewards std 0.001, KL 0.000, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -1.532, r1_count_xml std 0.005, Learning Rate 1.000e-05, It/sec 0.079, Tokens/sec 25.072, Peak mem 7.254 GB
Iter 2: Train loss 0.00055540, Total rewards mean -0.572, Total rewards std 0.001, Grouped rewards mean -0.572, Grouped rewards std 0.001, KL 0.006, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -2.861, r1_count_xml std 0.005, Learning Rate 1.000e-05, It/sec 0.121, Tokens/sec 36.164, Peak mem 7.254 GB
Iter 3: Train loss 0.00070858, Total rewards mean -0.842, Total rewards std 0.003, Grouped rewards mean -0.842, Grouped rewards std 0.003, KL 0.013, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -4.210, r1_count_xml std 0.013, Learning Rate 1.000e-05, It/sec 0.110, Tokens/sec 31.790, Peak mem 7.254 GB
Iter 4: Train loss 0.00070563, Total rewards mean -1.161, Total rewards std 0.005, Grouped rewards mean -1.161, Grouped rewards std 0.005, KL 0.020, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -5.806, r1_count_xml std 0.024, Learning Rate 1.000e-05, It/sec 0.105, Tokens/sec 36.961, Peak mem 7.899 GB
Iter 5: Val loss 0.00057772, Val total_rewards_mean -0.345, Val total_rewards_std 0.005, Val grouped_rewards_mean -0.345, Val grouped_rewards_std 0.005, Val kl 0.006, Val r1_accuracy_reward_func_mean 0.000, Val r1_accuracy_reward_func_std 0.000, Val r1_int_reward_func_mean 0.000, Val r1_int_reward_func_std 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_strict_format_reward_func_std 0.000, Val r1_soft_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_std 0.000, Val r1_count_xml_mean -1.726, Val r1_count_xml_std 0.025, Val took 22.624s
Iter 5: Train loss 0.00059050, Total rewards mean -1.399, Total rewards std 0.006, Grouped rewards mean -1.399, Grouped rewards std 0.006, KL 0.026, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean -6.994, r1_count_xml std 0.029, Learning Rate 1.000e-05, It/sec 0.156, Tokens/sec 39.539, Peak mem 7.899 GB
Saved final weights to /Users/gokdenizgulmez/Desktop/test-grpo-full/adapters.safetensors.
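
To sanity-check the saved adapters, something along these lines should work (a hedged sketch using the mlx_lm Python API; the prompt is arbitrary and the paths mirror the run above):

from mlx_lm import load, generate

# Hedged sketch: load the base model together with the freshly trained LoRA
# adapters, then generate from a throwaway prompt to eyeball the output format.
model, tokenizer = load(
    "Qwen/Qwen2.5-0.5B",
    adapter_path="/Users/gokdenizgulmez/Desktop/test-grpo-full",
)
print(generate(model, tokenizer, prompt="What is 2 + 2?", max_tokens=64))
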

@mark-lord

🥳🥳🥳

Working on my machine too! Not to mention it's plug-and-play with QLoRA as well, which I don't think TRL even has 😁 And already used it to get an 'aha' moment out of Phi-14b and do some knowledge injection 🚀
[Screenshot: 2025-02-04 at 02:10:40]
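
For the QLoRA route, the usual recipe is to quantize the base model first and then point the same training command at the local quantized copy. A hedged sketch with the mlx_lm convert helper (the paths are placeholders, and the 0.5B model is just the one used in the runs above):

from mlx_lm import convert

# Hedged sketch: write a 4-bit quantized copy of the base model; the GRPO/LoRA
# command shown earlier can then use this local path instead of the HF repo.
convert(
    hf_path="Qwen/Qwen2.5-0.5B",
    mlx_path="./qwen2.5-0.5b-4bit",
    quantize=True,
)
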

@Goekdeniz-Guelmez
Contributor Author

First successful training run (I started it yesterday evening with Qwen2.5-1.5B-Instruct):

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --train \
    --data /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test_grpo \
    --iters 100 \
    --batch-size 1 \
    --num-layers 4 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path /Users/gokdenizgulmez/Library/Mobile\ Documents/com\~apple\~CloudDocs/Datastes/MLX/test-grpo-full \
    --max-seq-length 512 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 4 \
    --max-completion-length 512 \
    --use-chat-template \
    --save-every 10

Output:

Loading pretrained model
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 84368.18it/s]
Loading datasets
Training
Trainable parameters: 0.071% (1.090M/1543.714M)
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10802.11it/s]
Starting GRPO training with 5 reward functions..., iters: 100
Iter 1: Val loss 0.00000000, Val total_rewards_mean 0.100, Val total_rewards_std 0.000, Val grouped_rewards_mean 0.100, Val grouped_rewards_std 0.000, Val kl 0.000, Val r1_accuracy_reward_func_mean 0.000, Val r1_int_reward_func_mean 0.000, Val r1_strict_format_reward_func_mean 0.000, Val r1_soft_format_reward_func_mean 0.500, Val r1_count_xml_mean 0.000, Val took 43.982s
Iter 1: Train loss 0.00000000, Total rewards mean 0.100, Total rewards std 0.000, Grouped rewards mean 0.100, Grouped rewards std 0.000, KL 0.000, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 0.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.306, Peak mem 14.462 GB
Iter 2: Train loss -0.00000001, Total rewards mean 0.225, Total rewards std 0.043, Grouped rewards mean 0.225, Grouped rewards std 0.043, KL 0.000, r1_accuracy_reward_func mean 0.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 1.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.018, Tokens/sec 14.613, Peak mem 14.462 GB
Iter 3: Train loss 0.00092913, Total rewards mean 0.450, Total rewards std 0.260, Grouped rewards mean 0.450, Grouped rewards std 0.260, KL 0.006, r1_accuracy_reward_func mean 0.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 1.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.004, Tokens/sec 5.458, Peak mem 19.599 GB
Iter 4: Train loss -0.00720918, Total rewards mean 0.625, Total rewards std 0.303, Grouped rewards mean 0.625, Grouped rewards std 0.303, KL 0.030, r1_accuracy_reward_func mean 0.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 2.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.543, Peak mem 22.808 GB
Iter 5: Train loss 0.03525679, Total rewards mean 0.725, Total rewards std 0.303, Grouped rewards mean 0.725, Grouped rewards std 0.303, KL 0.382, r1_accuracy_reward_func mean 0.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 2.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.012, Tokens/sec 10.223, Peak mem 22.808 GB
Iter 6: Train loss 0.02103449, Total rewards mean 0.950, Total rewards std 0.520, Grouped rewards mean 0.950, Grouped rewards std 0.520, KL 0.660, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.750, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 3.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.604, Peak mem 22.808 GB
Iter 7: Train loss 0.99099368, Total rewards mean 1.075, Total rewards std 0.563, Grouped rewards mean 1.075, Grouped rewards std 0.563, KL 2.145, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 3.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.136, Peak mem 22.808 GB
Iter 8: Train loss 0.00695075, Total rewards mean 1.175, Total rewards std 0.563, Grouped rewards mean 1.175, Grouped rewards std 0.563, KL 2.214, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 4.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 5.068, Peak mem 22.808 GB
Iter 9: Train loss 0.00384880, Total rewards mean 1.275, Total rewards std 0.563, Grouped rewards mean 1.275, Grouped rewards std 0.563, KL 2.253, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 0.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 4.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.002, Tokens/sec 3.862, Peak mem 24.761 GB
Iter 10: Train loss -0.07108963, Total rewards mean 1.400, Total rewards std 0.606, Grouped rewards mean 1.400, Grouped rewards std 0.606, KL 2.777, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 5.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 7.776, Peak mem 24.761 GB
Iter 10: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000010_adapters.safetensors.
Iter 11: Train loss 0.01276297, Total rewards mean 1.500, Total rewards std 0.606, Grouped rewards mean 1.500, Grouped rewards std 0.606, KL 2.904, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 5.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.923, Peak mem 24.761 GB
Iter 12: Train loss 26.29608917, Total rewards mean 1.600, Total rewards std 0.606, Grouped rewards mean 1.600, Grouped rewards std 0.606, KL 265.865, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 6.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 7.829, Peak mem 24.761 GB
Iter 13: Train loss 0.12213670, Total rewards mean 1.700, Total rewards std 0.606, Grouped rewards mean 1.700, Grouped rewards std 0.606, KL 267.087, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 6.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.002, Tokens/sec 4.310, Peak mem 24.761 GB
Iter 14: Train loss 0.49749294, Total rewards mean 1.800, Total rewards std 0.606, Grouped rewards mean 1.800, Grouped rewards std 0.606, KL 272.062, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 7.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.254, Peak mem 26.339 GB
Iter 15: Train loss 0.42647439, Total rewards mean 1.900, Total rewards std 0.606, Grouped rewards mean 1.900, Grouped rewards std 0.606, KL 276.326, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 7.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.041, Peak mem 26.339 GB
Iter 16: Train loss 0.20111339, Total rewards mean 2.000, Total rewards std 0.606, Grouped rewards mean 2.000, Grouped rewards std 0.606, KL 278.337, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 8.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.359, Peak mem 26.339 GB
Iter 17: Train loss 0.32308382, Total rewards mean 2.100, Total rewards std 0.606, Grouped rewards mean 2.100, Grouped rewards std 0.606, KL 281.568, r1_accuracy_reward_func mean 1.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 8.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.011, Tokens/sec 10.312, Peak mem 26.339 GB
Iter 18: Train loss -395.80285645, Total rewards mean 2.350, Total rewards std 0.812, Grouped rewards mean 2.350, Grouped rewards std 0.812, KL 933.039, r1_accuracy_reward_func mean 1.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 9.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.002, Tokens/sec 4.542, Peak mem 26.339 GB
Iter 19: Train loss 72.81225586, Total rewards mean 2.450, Total rewards std 0.812, Grouped rewards mean 2.450, Grouped rewards std 0.812, KL 1661.161, r1_accuracy_reward_func mean 1.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 9.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 5.185, Peak mem 26.339 GB
Iter 20: Train loss -1297.10375977, Total rewards mean 2.575, Total rewards std 0.856, Grouped rewards mean 2.575, Grouped rewards std 0.856, KL 20457.059, r1_accuracy_reward_func mean 1.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 10.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.373, Peak mem 26.339 GB
Iter 20: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000020_adapters.safetensors.
Iter 21: Train loss -2145.12768555, Total rewards mean 2.800, Total rewards std 1.072, Grouped rewards mean 2.800, Grouped rewards std 1.072, KL 42046.301, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 10.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.004, Tokens/sec 5.671, Peak mem 26.339 GB
Iter 22: Train loss 1376.96398926, Total rewards mean 2.900, Total rewards std 1.072, Grouped rewards mean 2.900, Grouped rewards std 1.072, KL 55815.941, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 11.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.817, Peak mem 26.339 GB
Iter 23: Train loss 625.93188477, Total rewards mean 3.000, Total rewards std 1.072, Grouped rewards mean 3.000, Grouped rewards std 1.072, KL 62075.258, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 11.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.004, Tokens/sec 5.947, Peak mem 26.339 GB
Iter 24: Train loss 5238.08300781, Total rewards mean 3.100, Total rewards std 1.072, Grouped rewards mean 3.100, Grouped rewards std 1.072, KL 114456.086, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 12.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.006, Tokens/sec 7.185, Peak mem 26.339 GB
Iter 25: Train loss 27.81487846, Total rewards mean 3.200, Total rewards std 1.072, Grouped rewards mean 3.200, Grouped rewards std 1.072, KL 114734.234, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 12.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.021, Tokens/sec 18.334, Peak mem 26.339 GB
Iter 26: Train loss 166.46582031, Total rewards mean 3.300, Total rewards std 1.072, Grouped rewards mean 3.300, Grouped rewards std 1.072, KL 116398.891, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 13.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.011, Tokens/sec 10.694, Peak mem 26.339 GB
Iter 27: Train loss 30.82451057, Total rewards mean 3.400, Total rewards std 1.072, Grouped rewards mean 3.400, Grouped rewards std 1.072, KL 116707.133, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 13.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.484, Peak mem 26.339 GB
Iter 28: Train loss 88.30443573, Total rewards mean 3.500, Total rewards std 1.072, Grouped rewards mean 3.500, Grouped rewards std 1.072, KL 117590.180, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 14.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.011, Tokens/sec 10.435, Peak mem 26.339 GB
Iter 29: Train loss -27.37774658, Total rewards mean 3.650, Total rewards std 1.122, Grouped rewards mean 3.650, Grouped rewards std 1.122, KL 118514.703, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.750, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 14.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.453, Peak mem 26.339 GB
Iter 30: Train loss -46.69438171, Total rewards mean 3.775, Total rewards std 1.165, Grouped rewards mean 3.775, Grouped rewards std 1.165, KL 118919.430, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 1.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 15.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.492, Peak mem 26.339 GB
Iter 30: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000030_adapters.safetensors.
Iter 31: Train loss 98.57348633, Total rewards mean 3.900, Total rewards std 1.209, Grouped rewards mean 3.900, Grouped rewards std 1.209, KL 119324.195, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 15.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 8.711, Peak mem 26.339 GB
Iter 32: Train loss -5.05148315, Total rewards mean 4.050, Total rewards std 1.259, Grouped rewards mean 4.050, Grouped rewards std 1.259, KL 119702.398, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 16.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.013, Tokens/sec 11.589, Peak mem 26.339 GB
Iter 33: Train loss 55.24956131, Total rewards mean 4.150, Total rewards std 1.259, Grouped rewards mean 4.150, Grouped rewards std 1.259, KL 120254.891, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 16.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.367, Peak mem 26.339 GB
Iter 34: Train loss 800.51782227, Total rewards mean 4.250, Total rewards std 1.259, Grouped rewards mean 4.250, Grouped rewards std 1.259, KL 128260.070, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 17.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.006, Tokens/sec 7.864, Peak mem 26.339 GB
Iter 35: Train loss 102.13351440, Total rewards mean 4.350, Total rewards std 1.259, Grouped rewards mean 4.350, Grouped rewards std 1.259, KL 129281.406, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 17.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 8.750, Peak mem 26.339 GB
Iter 36: Train loss 113.06057739, Total rewards mean 4.475, Total rewards std 1.302, Grouped rewards mean 4.475, Grouped rewards std 1.302, KL 129749.656, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 18.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 9.937, Peak mem 26.339 GB
Iter 37: Train loss -45.16821289, Total rewards mean 4.625, Total rewards std 1.352, Grouped rewards mean 4.625, Grouped rewards std 1.352, KL 129960.039, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 18.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.298, Peak mem 26.339 GB
Iter 38: Train loss 174.49714661, Total rewards mean 4.725, Total rewards std 1.352, Grouped rewards mean 4.725, Grouped rewards std 1.352, KL 131705.016, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 19.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 7.977, Peak mem 26.339 GB
Iter 39: Train loss 87.66883087, Total rewards mean 4.825, Total rewards std 1.352, Grouped rewards mean 4.825, Grouped rewards std 1.352, KL 132581.703, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 19.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.022, Tokens/sec 16.837, Peak mem 26.339 GB
Iter 40: Train loss 16397.69531250, Total rewards mean 4.975, Total rewards std 1.402, Grouped rewards mean 4.975, Grouped rewards std 1.402, KL 148128.891, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 2.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 20.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.536, Peak mem 26.339 GB
Iter 40: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000040_adapters.safetensors.
Iter 41: Train loss 36.13301086, Total rewards mean 5.100, Total rewards std 1.445, Grouped rewards mean 5.100, Grouped rewards std 1.445, KL 148542.891, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 20.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.897, Peak mem 26.339 GB
Iter 42: Train loss 88.15919495, Total rewards mean 5.225, Total rewards std 1.489, Grouped rewards mean 5.225, Grouped rewards std 1.489, KL 149004.797, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 21.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.868, Peak mem 26.339 GB
Iter 43: Train loss -18.44103241, Total rewards mean 5.375, Total rewards std 1.539, Grouped rewards mean 5.375, Grouped rewards std 1.539, KL 149403.938, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 21.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.013, Tokens/sec 11.963, Peak mem 26.339 GB
Iter 44: Train loss 11.96199799, Total rewards mean 5.500, Total rewards std 1.582, Grouped rewards mean 5.500, Grouped rewards std 1.582, KL 150059.156, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 22.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.211, Peak mem 26.339 GB
Iter 45: Train loss 6395.98144531, Total rewards mean 5.600, Total rewards std 1.582, Grouped rewards mean 5.600, Grouped rewards std 1.582, KL 214018.969, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 22.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.011, Tokens/sec 10.528, Peak mem 26.339 GB
Iter 46: Train loss 208.30914307, Total rewards mean 5.775, Total rewards std 1.625, Grouped rewards mean 5.775, Grouped rewards std 1.625, KL 215086.188, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 23.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 9.215, Peak mem 26.339 GB
Iter 47: Train loss 109.05796051, Total rewards mean 5.875, Total rewards std 1.625, Grouped rewards mean 5.875, Grouped rewards std 1.625, KL 216176.766, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 3.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 23.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.014, Tokens/sec 11.700, Peak mem 26.339 GB
Iter 48: Train loss -2.79139709, Total rewards mean 6.000, Total rewards std 1.669, Grouped rewards mean 6.000, Grouped rewards std 1.669, KL 216898.953, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 24.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.014, Tokens/sec 11.285, Peak mem 26.339 GB
Iter 49: Train loss 74.90339661, Total rewards mean 6.100, Total rewards std 1.669, Grouped rewards mean 6.100, Grouped rewards std 1.669, KL 217647.984, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 24.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.482, Peak mem 27.511 GB
Iter 50: Train loss 77.14464569, Total rewards mean 6.200, Total rewards std 1.669, Grouped rewards mean 6.200, Grouped rewards std 1.669, KL 218419.438, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 25.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.629, Peak mem 27.511 GB
Iter 50: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000050_adapters.safetensors.
Iter 51: Train loss 86.46633911, Total rewards mean 6.300, Total rewards std 1.669, Grouped rewards mean 6.300, Grouped rewards std 1.669, KL 219284.094, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 25.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.030, Tokens/sec 19.767, Peak mem 27.511 GB
Iter 52: Train loss 35.70820618, Total rewards mean 6.400, Total rewards std 1.669, Grouped rewards mean 6.400, Grouped rewards std 1.669, KL 219641.172, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 26.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.858, Peak mem 27.511 GB
Iter 53: Train loss -43.56660461, Total rewards mean 6.575, Total rewards std 1.712, Grouped rewards mean 6.575, Grouped rewards std 1.712, KL 220068.359, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 26.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.002, Tokens/sec 3.659, Peak mem 27.511 GB
Iter 54: Train loss 50.60503387, Total rewards mean 6.700, Total rewards std 1.755, Grouped rewards mean 6.700, Grouped rewards std 1.755, KL 220556.141, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 27.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.723, Peak mem 27.511 GB
Iter 55: Train loss 98.06455994, Total rewards mean 6.800, Total rewards std 1.755, Grouped rewards mean 6.800, Grouped rewards std 1.755, KL 221536.781, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 27.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.012, Tokens/sec 11.459, Peak mem 27.511 GB
Iter 56: Train loss 67.69333649, Total rewards mean 6.900, Total rewards std 1.755, Grouped rewards mean 6.900, Grouped rewards std 1.755, KL 222213.719, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 28.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.016, Tokens/sec 14.259, Peak mem 27.511 GB
Iter 57: Train loss -108.98173523, Total rewards mean 7.050, Total rewards std 1.805, Grouped rewards mean 7.050, Grouped rewards std 1.805, KL 222991.188, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 4.750, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 28.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.012, Tokens/sec 10.946, Peak mem 27.511 GB
Iter 58: Train loss 110.74307251, Total rewards mean 7.200, Total rewards std 1.855, Grouped rewards mean 7.200, Grouped rewards std 1.855, KL 223916.969, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 29.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 10.190, Peak mem 27.511 GB
Iter 59: Train loss 31.76986313, Total rewards mean 7.300, Total rewards std 1.855, Grouped rewards mean 7.300, Grouped rewards std 1.855, KL 224234.672, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 29.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 10.455, Peak mem 27.511 GB
Iter 60: Train loss 61.54369354, Total rewards mean 7.425, Total rewards std 1.898, Grouped rewards mean 7.425, Grouped rewards std 1.898, KL 224487.328, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 30.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 10.312, Peak mem 27.511 GB
Iter 60: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000060_adapters.safetensors.
Iter 61: Train loss 84.28865051, Total rewards mean 7.525, Total rewards std 1.898, Grouped rewards mean 7.525, Grouped rewards std 1.898, KL 225330.219, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 30.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.004, Tokens/sec 5.559, Peak mem 27.511 GB
Iter 62: Train loss 13.45783997, Total rewards mean 7.675, Total rewards std 1.948, Grouped rewards mean 7.675, Grouped rewards std 1.948, KL 225996.781, r1_accuracy_reward_func mean 2.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 31.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.723, Peak mem 27.511 GB
Iter 63: Train loss 38.53865051, Total rewards mean 7.925, Total rewards std 2.155, Grouped rewards mean 7.925, Grouped rewards std 2.155, KL 226373.031, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 31.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.011, Tokens/sec 11.129, Peak mem 27.511 GB
Iter 64: Train loss 44.18389511, Total rewards mean 8.025, Total rewards std 2.155, Grouped rewards mean 8.025, Grouped rewards std 2.155, KL 226814.875, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 32.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.020, Tokens/sec 16.129, Peak mem 27.511 GB
Iter 65: Train loss -26.00822067, Total rewards mean 8.150, Total rewards std 2.198, Grouped rewards mean 8.150, Grouped rewards std 2.198, KL 227133.406, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.750, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 32.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.378, Peak mem 27.511 GB
Iter 66: Train loss 4.27447510, Total rewards mean 8.275, Total rewards std 2.241, Grouped rewards mean 8.275, Grouped rewards std 2.241, KL 227553.109, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 5.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 33.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.249, Peak mem 27.511 GB
Iter 67: Train loss -4.97953796, Total rewards mean 8.400, Total rewards std 2.285, Grouped rewards mean 8.400, Grouped rewards std 2.285, KL 228406.172, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 33.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.466, Peak mem 27.511 GB
Iter 68: Train loss 56.15468597, Total rewards mean 8.500, Total rewards std 2.285, Grouped rewards mean 8.500, Grouped rewards std 2.285, KL 228967.719, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 34.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.396, Peak mem 27.511 GB
Iter 69: Train loss -66.75144958, Total rewards mean 8.675, Total rewards std 2.328, Grouped rewards mean 8.675, Grouped rewards std 2.328, KL 229663.234, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 34.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.546, Peak mem 27.511 GB
Iter 70: Train loss 185.10906982, Total rewards mean 8.825, Total rewards std 2.378, Grouped rewards mean 8.825, Grouped rewards std 2.378, KL 230606.062, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 35.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.474, Peak mem 27.511 GB
Iter 70: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000070_adapters.safetensors.
Iter 71: Train loss 15.38090515, Total rewards mean 8.925, Total rewards std 2.378, Grouped rewards mean 8.925, Grouped rewards std 2.378, KL 230759.875, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 35.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.139, Peak mem 27.511 GB
Iter 72: Train loss 27.67171860, Total rewards mean 9.025, Total rewards std 2.378, Grouped rewards mean 9.025, Grouped rewards std 2.378, KL 231036.594, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 36.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.013, Tokens/sec 13.199, Peak mem 27.511 GB
Iter 73: Train loss 155.57977295, Total rewards mean 9.125, Total rewards std 2.378, Grouped rewards mean 9.125, Grouped rewards std 2.378, KL 232592.391, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 36.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.853, Peak mem 27.511 GB
Iter 74: Train loss 66.21957397, Total rewards mean 9.225, Total rewards std 2.378, Grouped rewards mean 9.225, Grouped rewards std 2.378, KL 233254.594, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 37.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.255, Peak mem 27.511 GB
Iter 75: Train loss -0.07450867, Total rewards mean 9.375, Total rewards std 2.428, Grouped rewards mean 9.375, Grouped rewards std 2.428, KL 233577.062, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 6.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 37.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.006, Tokens/sec 7.221, Peak mem 27.511 GB
Iter 76: Train loss 125.70507812, Total rewards mean 9.500, Total rewards std 2.471, Grouped rewards mean 9.500, Grouped rewards std 2.471, KL 234476.688, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 38.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 11.526, Peak mem 27.511 GB
Iter 77: Train loss 21.18598938, Total rewards mean 9.650, Total rewards std 2.521, Grouped rewards mean 9.650, Grouped rewards std 2.521, KL 235087.250, r1_accuracy_reward_func mean 2.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.250, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 38.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.014, Tokens/sec 12.868, Peak mem 27.511 GB
Iter 78: Train loss -2.86486816, Total rewards mean 9.875, Total rewards std 2.738, Grouped rewards mean 9.875, Grouped rewards std 2.738, KL 236098.234, r1_accuracy_reward_func mean 3.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 39.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 8.552, Peak mem 27.511 GB
Iter 79: Train loss 45.17611313, Total rewards mean 9.975, Total rewards std 2.738, Grouped rewards mean 9.975, Grouped rewards std 2.738, KL 236550.000, r1_accuracy_reward_func mean 3.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 39.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.013, Tokens/sec 12.224, Peak mem 27.511 GB
Iter 80: Train loss 39.73171997, Total rewards mean 10.075, Total rewards std 2.738, Grouped rewards mean 10.075, Grouped rewards std 2.738, KL 236947.312, r1_accuracy_reward_func mean 3.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 40.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.003, Tokens/sec 4.677, Peak mem 27.511 GB
Iter 80: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000080_adapters.safetensors.
Iter 81: Train loss 5.47789764, Total rewards mean 10.200, Total rewards std 2.781, Grouped rewards mean 10.200, Grouped rewards std 2.781, KL 237397.906, r1_accuracy_reward_func mean 3.000, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 40.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.006, Tokens/sec 7.306, Peak mem 27.511 GB
Iter 82: Train loss 44.68563843, Total rewards mean 10.425, Total rewards std 2.997, Grouped rewards mean 10.425, Grouped rewards std 2.997, KL 238397.422, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 41.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.012, Tokens/sec 10.631, Peak mem 27.511 GB
Iter 83: Train loss 48.27146912, Total rewards mean 10.525, Total rewards std 2.997, Grouped rewards mean 10.525, Grouped rewards std 2.997, KL 238880.141, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 41.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.004, Tokens/sec 5.144, Peak mem 27.511 GB
Iter 84: Train loss 49.22350311, Total rewards mean 10.625, Total rewards std 2.997, Grouped rewards mean 10.625, Grouped rewards std 2.997, KL 239372.375, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 7.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 42.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.936, Peak mem 27.511 GB
Iter 85: Train loss 77.75291443, Total rewards mean 10.800, Total rewards std 3.041, Grouped rewards mean 10.800, Grouped rewards std 3.041, KL 240025.516, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 42.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.008, Tokens/sec 9.181, Peak mem 27.511 GB
Iter 86: Train loss 62.07151794, Total rewards mean 10.900, Total rewards std 3.041, Grouped rewards mean 10.900, Grouped rewards std 3.041, KL 240646.234, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 43.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.011, Tokens/sec 10.294, Peak mem 27.511 GB
Iter 87: Train loss 57.66293335, Total rewards mean 11.025, Total rewards std 3.084, Grouped rewards mean 11.025, Grouped rewards std 3.084, KL 241455.469, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 43.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.306, Peak mem 27.511 GB
Iter 88: Train loss 66.10350800, Total rewards mean 11.125, Total rewards std 3.084, Grouped rewards mean 11.125, Grouped rewards std 3.084, KL 242116.500, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 44.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 10.568, Peak mem 27.511 GB
Iter 89: Train loss 8987.12011719, Total rewards mean 11.275, Total rewards std 3.134, Grouped rewards mean 11.275, Grouped rewards std 3.134, KL 295931.406, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 44.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 10.504, Peak mem 27.511 GB
Iter 90: Train loss 97.58233643, Total rewards mean 11.375, Total rewards std 3.134, Grouped rewards mean 11.375, Grouped rewards std 3.134, KL 296907.219, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 45.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.019, Tokens/sec 13.601, Peak mem 27.511 GB
Iter 90: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000090_adapters.safetensors.
Iter 91: Train loss 54.97767639, Total rewards mean 11.475, Total rewards std 3.134, Grouped rewards mean 11.475, Grouped rewards std 3.134, KL 297457.000, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.375, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 45.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.350, Peak mem 27.511 GB
Iter 92: Train loss -188.48178101, Total rewards mean 11.600, Total rewards std 3.177, Grouped rewards mean 11.600, Grouped rewards std 3.177, KL 298459.406, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 46.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.005, Tokens/sec 6.132, Peak mem 27.511 GB
Iter 93: Train loss 68.02745056, Total rewards mean 11.700, Total rewards std 3.177, Grouped rewards mean 11.700, Grouped rewards std 3.177, KL 299139.688, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.500, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 46.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.009, Tokens/sec 9.167, Peak mem 27.511 GB
Iter 94: Train loss -22.29962540, Total rewards mean 11.825, Total rewards std 3.221, Grouped rewards mean 11.825, Grouped rewards std 3.221, KL 299426.719, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.625, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 47.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.006, Tokens/sec 7.086, Peak mem 27.511 GB
Iter 95: Train loss 102.39406586, Total rewards mean 11.950, Total rewards std 3.264, Grouped rewards mean 11.950, Grouped rewards std 3.264, KL 299919.156, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.750, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 47.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.068, Peak mem 27.511 GB
Iter 96: Train loss 1.27645874, Total rewards mean 12.075, Total rewards std 3.307, Grouped rewards mean 12.075, Grouped rewards std 3.307, KL 300521.344, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 48.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.007, Tokens/sec 8.341, Peak mem 27.511 GB
Iter 97: Train loss 41.48004532, Total rewards mean 12.175, Total rewards std 3.307, Grouped rewards mean 12.175, Grouped rewards std 3.307, KL 300936.156, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 8.875, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 48.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 9.634, Peak mem 27.511 GB
Iter 98: Train loss 65.34966278, Total rewards mean 12.300, Total rewards std 3.351, Grouped rewards mean 12.300, Grouped rewards std 3.351, KL 301450.594, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 9.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 49.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.010, Tokens/sec 10.540, Peak mem 27.511 GB
Iter 99: Train loss 741.04760742, Total rewards mean 12.400, Total rewards std 3.351, Grouped rewards mean 12.400, Grouped rewards std 3.351, KL 308861.062, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 9.000, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 49.500, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.006, Tokens/sec 8.029, Peak mem 27.511 GB
Iter 100: Val loss 45.16326141, Val total_rewards_mean 0.100, Val total_rewards_std 0.000, Val grouped_rewards_mean 0.100, Val grouped_rewards_std 0.000, Val kl 451.633, Val r1_accuracy_reward_func_mean 0.000, , Val r1_int_reward_func_mean 0.000, , Val r1_strict_format_reward_func_mean 0.000, , Val r1_soft_format_reward_func_mean 0.500, , Val r1_count_xml_mean 0.000, , Val took 48.726s
Iter 100: Train loss 24.07788086, Total rewards mean 12.525, Total rewards std 3.394, Grouped rewards mean 12.525, Grouped rewards std 3.394, KL 309567.125, r1_accuracy_reward_func mean 3.500, r1_accuracy_reward_func std 0.000, r1_int_reward_func mean 9.125, r1_int_reward_func std 0.000, r1_strict_format_reward_func mean 0.000, r1_strict_format_reward_func std 0.000, r1_soft_format_reward_func mean 50.000, r1_soft_format_reward_func std 0.000, r1_count_xml mean 0.000, r1_count_xml std 0.000, Learning Rate 1.000e-05, It/sec 0.015, Tokens/sec 12.025, Peak mem 27.511 GB
Iter 100: Saved adapter weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors and /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/0000100_adapters.safetensors.
Saved final weights to /Users/gokdenizgulmez/Library/Mobile Documents/com~apple~CloudDocs/Datastes/MLX/test-grpo-full/adapters.safetensors.
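For readers parsing the columns in these logs: the "Grouped rewards" are the per-prompt completion groups that GRPO normalizes advantages over, and KL is the penalty against the reference model weighted by --beta. A simplified sketch of that objective, not the trainer's exact code (the function name and the naive KL estimate are assumptions):

import mlx.core as mx

def grpo_loss_sketch(logprobs: mx.array, ref_logprobs: mx.array,
                     rewards: mx.array, group_size: int, beta: float = 0.1):
    """Simplified per-sequence GRPO objective. logprobs, ref_logprobs and
    rewards all have shape (num_prompts * group_size,)."""
    grouped = rewards.reshape(-1, group_size)
    # Advantage: normalize each reward within its group of sampled completions.
    advantages = (grouped - grouped.mean(axis=1, keepdims=True)) / (
        grouped.std(axis=1, keepdims=True) + 1e-6
    )
    advantages = advantages.reshape(-1)
    # Naive k1 estimate of KL(policy || reference) from sequence log-probs.
    kl = logprobs - ref_logprobs
    # Policy-gradient term plus the beta-weighted KL penalty.
    return -(logprobs * advantages).mean() + beta * kl.mean()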

After:

--prompt "give me a cool math proof"

<think> A classic cool math proof is the proof that the square root of 2 is irrational. This proof was given by the ancient Greek mathematician Pythagoras and his followers. The proof is based on the assumption that the square root of 2 is rational, which means it can be expressed as a fraction of two integers. If the square root of 2 is rational, then there exist integers an and b such that a/b = √2. Squaring both sides of the equation gives us a^2 = 2b^2, which implies that a^2 is even. Since a^2 is even, a must also be even. Let a = 2k for some integer k. Substituting this into the equation gives us (2k)^2 = 2b^2, which simplifies to 4k^2 = 2b^2, or 2k^2 = b^2. This implies that b^2 is even, and therefore b must also be even. But this contradicts our assumption that an and b are both odd. Therefore, the square root of 2 cannot be expressed as a fraction of two integers, and is therefore irrational. </think><answer> The proof that the square root of 2 is irrational is based on the assumption that it can be expressed as a fraction of two integers. If the square root of 2 is rational, then there exist integers an and b such that a/b = √2. Squaring both sides of the equation gives us a^2 = 2b^2, which implies that a^2 is even. Since a^2 is even, a must also be even. Let a = 2k for some integer k. Substituting this into the equation gives us (2k)^2 = 2b^2, which simplifies to 4k^2 = 2b^2, or 2k^2 = b^2. This implies that b^2 is even, and therefore b must also be even. But this contradicts our assumption that an and b are both odd. Therefore, the square root of 2 cannot be expressed as a fraction of two integers, and is therefore irrational. </answer>

--prompt "what is 150 * 8"

<think> 150 * 8 = 150 * (4 * 2) = (150 * 4) * 2 = 600 * 2 = 1200 </think><answer> 1200 </answer>
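The reward functions named in the logs above (r1_soft_format_reward_func, r1_count_xml, ...) score how well a completion follows the <think>/<answer> template. Purely as an illustration, a soft format reward could look roughly like this (the signature and the 0.5 weight are assumptions, not the PR's actual code):

import re
from typing import List

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def soft_format_reward(prompts: List[str], completions: List[str],
                       answers: List[str]) -> List[float]:
    """Give 0.5 whenever the completion contains a <think> block followed by
    an <answer> block, regardless of their contents."""
    return [0.5 if THINK_ANSWER.search(c) else 0.0 for c in completions]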

@Goekdeniz-Guelmez
Copy link
Contributor Author

I'm starting a new training run with the base model and the new commits.

@anson1788
Copy link

Congratulations! By the way, what machine specs are you using?

@Goekdeniz-Guelmez
Copy link
Contributor Author

M4 Mac mini 32 GB

@anson1788
Copy link

By the way, any chance you can share your dataset? I'd like to have a look.

Copy link

@Guo-astro Guo-astro left a comment


Using the GPU could possibly save some main memory.

llms/mlx_lm/tuner/grpo_trainer.py
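For context, a rough illustration of the kind of change being suggested here, assuming a recent mlx version (the exact calls used in the PR may differ):

import mlx.core as mx

# Run the heavy generation/forward passes on the GPU (Metal).
mx.set_default_device(mx.gpu)

# Inside the training loop, forcing evaluation of intermediates keeps the lazy
# compute graph from holding the whole generation history alive, and cached
# Metal buffers can be dropped between iterations:
# mx.eval(loss, grads)        # hypothetical names from the training step
# mx.metal.clear_cache()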
@Goekdeniz-Guelmez
Copy link
Contributor Author

Goekdeniz-Guelmez commented Feb 11, 2025

Model: Qwen/Qwen2.5-1.5B
Dataset: Goastro/mlx-grpo-dataset
Prompt: Ann is cutting fabric to make curtains. She cuts a 4 foot by 6 foot rectangle for the living room, and a 2 foot by 4 foot rectangle for the bedroom. If the bolt of fabric is 16 feet by 12 feet, how much fabric is left in square feet?.

Before:

Assistant: <think>Let's break down the problem step by step.</think>
<answer>First, we need to calculate the total area of the fabric used for the curtains.</answer>
<think>Area of the living room curtain = 4 feet x 6 feet = 24 square feet</think>
<think>Area of the bedroom curtain = 2 feet x 4 feet = 8 square feet</think>
<answer>Total area used = 24 square feet + 8 square feet = 32 square feet</answer>
<think>Next, we need to calculate the total area of the bolt of fabric.</answer>
<think>Area of the bolt of fabric = 16 feet x 12 feet = 192 square feet</think>
<answer>Total area used = 32 square feet</answer>
<think>Finally, we can calculate the remaining fabric by subtracting the total area used from the total area of the bolt of fabric.</answer>
<answer>Remaining fabric = 192 square feet - 32 square feet = 160 square feet</answer>

After:

Assistant: <think>First, we need to calculate the total area of the fabric used for the curtains. The living room curtain is 4 feet by 6 feet, so its area is 4 * 6 = 24 square feet. The bedroom curtain is 2 feet by 4 feet, so its area is 2 * 4 = 8 square feet. The total area of the fabric used is 24 + 8 = 32 square feet. The bolt of fabric is 16 feet by 12 feet, so its total area is 16 * 12 = 192 square feet. Therefore, the amount of fabric left is 192 - 32 = 160 square feet.</think><answer>160</answer>

Copy link

@Guo-astro Guo-astro left a comment


Easier to read code 1/n

llms/mlx_lm/tuner/grpo_trainer.py (outdated)
Copy link

@Guo-astro Guo-astro left a comment


Mainly adding 3 types:

RewardFunction, GRPOExample, GRPOBatch
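A rough idea of what those three types could look like as lightweight containers; the field names below are guesses for illustration, not the reviewer's actual proposal:

from dataclasses import dataclass
from typing import Callable, List

import mlx.core as mx

# A reward function scores each (prompt, completion, reference answer) triple.
RewardFunction = Callable[[List[str], List[str], List[str]], List[float]]

@dataclass
class GRPOExample:
    prompt_tokens: List[int]   # tokenized prompt
    answer: str                # reference answer used by the reward functions

@dataclass
class GRPOBatch:
    prompts: mx.array          # (batch, prompt_len) token ids
    completions: mx.array      # (batch * group_size, completion_len) token ids
    rewards: mx.array          # (batch * group_size,) total reward per completion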

llms/mlx_lm/tuner/datasets.py (2 comments)
llms/mlx_lm/tuner/grpo_trainer.py (8 comments, 3 on outdated code)
@Goekdeniz-Guelmez
Copy link
Contributor Author

Thanks @Guo-astro, however this made the computation time skyrocket: validation went from 70s-80s to 130s-150s, and training slowed down by about the same amount, probably due to copying data multiple times through class instantiation.

@Guo-astro
Copy link

Guo-astro commented Feb 12, 2025

True. Then I think we need to use them as sparingly as possible. Python manages all the class instance memory, so it could be really slow 😅
So you can just ignore those comments and go forward 🔥

@Goekdeniz-Guelmez
Copy link
Contributor Author

Goekdeniz-Guelmez commented Feb 12, 2025

I think I'll use a hybrid approach with your suggestions, since they make the code more stable, maintainable, and easier to debug and test. Thanks for your help!
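One way such a hybrid could look (purely a sketch, assuming Python 3.10+): keep the hot path tuple-based and only wrap things in a slotted dataclass at the batch boundary, so the readability win costs almost nothing:

from dataclasses import dataclass

import mlx.core as mx

@dataclass(slots=True)          # slots=True avoids a per-instance __dict__
class GRPOBatch:
    prompts: mx.array
    completions: mx.array
    rewards: mx.array

def iterate_batches(raw_batches):
    # The generation/reward code keeps passing plain tuples; the dataclass is
    # only constructed once per batch for the loss computation.
    for prompts, completions, rewards in raw_batches:
        yield GRPOBatch(prompts, completions, rewards)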

@madroidmaq
Copy link
Contributor

I cannot load an existing dataset from HF; the following error occurs when testing with Goastro/mlx-grpo-dataset.

input:

python -m mlx_lm.lora \
    --model Qwen/Qwen2.5-0.5B \
    --train \
    --data Goastro/mlx-grpo-dataset \
    --iters 5 \
    --batch-size 1 \
    --num-layers 8 \
    --val-batches 1 \
    --steps-per-report 1 \
    --adapter-path ~/Desktop/test-grpo-full \
    --max-seq-length 255 \
    --grad-checkpoint \
    --training-mode grpo \
    --fine-tune-type lora \
    --beta 0.1 \
    --steps-per-eval 500 \
    --group-size 2 \
    --max-completion-length 6

output:

Loading pretrained model
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 59313.39it/s]
Loading datasets
Loading Hugging Face dataset Goastro/mlx-grpo-dataset.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/madroid/Desktop/mlx-examples/llms/mlx_lm/lora.py", line 432, in <module>
    main()
  File "/Users/madroid/Desktop/mlx-examples/llms/mlx_lm/lora.py", line 428, in main
    run(types.SimpleNamespace(**args))
  File "/Users/madroid/Desktop/mlx-examples/llms/mlx_lm/lora.py", line 391, in run
    train_set, valid_set, test_set = load_dataset(args, tokenizer)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/madroid/Desktop/mlx-examples/llms/mlx_lm/tuner/datasets.py", line 313, in load_dataset
    train, valid, test = load_hf_dataset(args.data, tokenizer, args)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: load_hf_dataset() missing 1 required positional argument: 'config'

@Goekdeniz-Guelmez
Copy link
Contributor Author

Goekdeniz-Guelmez commented Feb 12, 2025

Should be fixed now. I also suggest using the --use-prompt argument when you have a base model and a dataset containing only prompts and answers, or use my Goekdeniz-Guelmez/GRPO-MLX-Dataset: it's gsm8k, but prompted correctly as in the DeepSeek R1 paper, so you don't need --use-prompt.
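For anyone who hit the same traceback: the error came from load_hf_dataset being called with two arguments while it requires a third config argument, so the shape of the fix is simply to forward one. A stand-in sketch, not the actual commit (the stub signature mirrors the traceback, everything else is assumed):

def load_hf_dataset(data_id, tokenizer, config):
    """Stub with the 3-argument signature from the traceback."""
    ...

def load_dataset(args, tokenizer):
    # Forward the CLI args (or a dedicated config object) as the third
    # positional argument instead of calling with only two.
    return load_hf_dataset(args.data, tokenizer, args)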

@Goekdeniz-Guelmez
Copy link
Contributor Author

@mark-lord you should be able to run it now! If you want to use a base model you can use Goastro/mlx-grpo-dataset, but you need to include the --use-prompt argument in the call, or you can use my Goekdeniz-Guelmez/GRPO-MLX-Dataset, which already has the prompting baked in.
