Issues reproducing Llama3.2-1B results on MMLU #2528

Closed
VoiceBeer opened this issue Dec 1, 2024 · 2 comments

VoiceBeer commented Dec 1, 2024

Results Log

hf (pretrained=/data/models/meta-llama/Llama-3.2-1B,dtype=auto,), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto (4)

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.3107 | ± | 0.0039 |
| - humanities | 2 | none | | acc | 0.2912 | ± | 0.0066 |
| - formal_logic | 1 | none | 5 | acc | 0.1984 | ± | 0.0357 |
| - high_school_european_history | 1 | none | 5 | acc | 0.3758 | ± | 0.0378 |
| - high_school_us_history | 1 | none | 5 | acc | 0.3284 | ± | 0.0330 |
| - high_school_world_history | 1 | none | 5 | acc | 0.3291 | ± | 0.0306 |
| - international_law | 1 | none | 5 | acc | 0.4215 | ± | 0.0451 |
| - jurisprudence | 1 | none | 5 | acc | 0.3889 | ± | 0.0471 |
| - logical_fallacies | 1 | none | 5 | acc | 0.2761 | ± | 0.0351 |
| - moral_disputes | 1 | none | 5 | acc | 0.2775 | ± | 0.0241 |
| - moral_scenarios | 1 | none | 5 | acc | 0.2380 | ± | 0.0142 |
| - philosophy | 1 | none | 5 | acc | 0.3183 | ± | 0.0265 |
| - prehistory | 1 | none | 5 | acc | 0.3673 | ± | 0.0268 |
| - professional_law | 1 | none | 5 | acc | 0.2595 | ± | 0.0112 |
| - world_religions | 1 | none | 5 | acc | 0.4386 | ± | 0.0381 |
| - other | 2 | none | | acc | 0.3602 | ± | 0.0086 |
| - business_ethics | 1 | none | 5 | acc | 0.3600 | ± | 0.0482 |
| - clinical_knowledge | 1 | none | 5 | acc | 0.3358 | ± | 0.0291 |
| - college_medicine | 1 | none | 5 | acc | 0.2890 | ± | 0.0346 |
| - global_facts | 1 | none | 5 | acc | 0.2000 | ± | 0.0402 |
| - human_aging | 1 | none | 5 | acc | 0.3812 | ± | 0.0326 |
| - management | 1 | none | 5 | acc | 0.3301 | ± | 0.0466 |
| - marketing | 1 | none | 5 | acc | 0.4103 | ± | 0.0322 |
| - medical_genetics | 1 | none | 5 | acc | 0.3900 | ± | 0.0490 |
| - miscellaneous | 1 | none | 5 | acc | 0.4266 | ± | 0.0177 |
| - nutrition | 1 | none | 5 | acc | 0.3824 | ± | 0.0278 |
| - professional_accounting | 1 | none | 5 | acc | 0.2624 | ± | 0.0262 |
| - professional_medicine | 1 | none | 5 | acc | 0.2868 | ± | 0.0275 |
| - virology | 1 | none | 5 | acc | 0.4036 | ± | 0.0382 |
| - social sciences | 2 | none | | acc | 0.3191 | ± | 0.0084 |
| - econometrics | 1 | none | 5 | acc | 0.2807 | ± | 0.0423 |
| - high_school_geography | 1 | none | 5 | acc | 0.3687 | ± | 0.0344 |
| - high_school_government_and_politics | 1 | none | 5 | acc | 0.3472 | ± | 0.0344 |
| - high_school_macroeconomics | 1 | none | 5 | acc | 0.2462 | ± | 0.0218 |
| - high_school_microeconomics | 1 | none | 5 | acc | 0.2605 | ± | 0.0285 |
| - high_school_psychology | 1 | none | 5 | acc | 0.3376 | ± | 0.0203 |
| - human_sexuality | 1 | none | 5 | acc | 0.3511 | ± | 0.0419 |
| - professional_psychology | 1 | none | 5 | acc | 0.3023 | ± | 0.0186 |
| - public_relations | 1 | none | 5 | acc | 0.3000 | ± | 0.0439 |
| - security_studies | 1 | none | 5 | acc | 0.3510 | ± | 0.0306 |
| - sociology | 1 | none | 5 | acc | 0.3284 | ± | 0.0332 |
| - us_foreign_policy | 1 | none | 5 | acc | 0.5200 | ± | 0.0502 |
| - stem | 2 | none | | acc | 0.2829 | ± | 0.0080 |
| - abstract_algebra | 1 | none | 5 | acc | 0.2600 | ± | 0.0441 |
| - anatomy | 1 | none | 5 | acc | 0.3704 | ± | 0.0417 |
| - astronomy | 1 | none | 5 | acc | 0.2303 | ± | 0.0343 |
| - college_biology | 1 | none | 5 | acc | 0.2500 | ± | 0.0362 |
| - college_chemistry | 1 | none | 5 | acc | 0.2200 | ± | 0.0416 |
| - college_computer_science | 1 | none | 5 | acc | 0.3000 | ± | 0.0461 |
| - college_mathematics | 1 | none | 5 | acc | 0.2500 | ± | 0.0435 |
| - college_physics | 1 | none | 5 | acc | 0.2255 | ± | 0.0416 |
| - computer_security | 1 | none | 5 | acc | 0.5000 | ± | 0.0503 |
| - conceptual_physics | 1 | none | 5 | acc | 0.3447 | ± | 0.0311 |
| - electrical_engineering | 1 | none | 5 | acc | 0.2690 | ± | 0.0370 |
| - elementary_mathematics | 1 | none | 5 | acc | 0.2354 | ± | 0.0219 |
| - high_school_biology | 1 | none | 5 | acc | 0.3032 | ± | 0.0261 |
| - high_school_chemistry | 1 | none | 5 | acc | 0.2611 | ± | 0.0309 |
| - high_school_computer_science | 1 | none | 5 | acc | 0.3200 | ± | 0.0469 |
| - high_school_mathematics | 1 | none | 5 | acc | 0.2667 | ± | 0.0270 |
| - high_school_physics | 1 | none | 5 | acc | 0.2318 | ± | 0.0345 |
| - high_school_statistics | 1 | none | 5 | acc | 0.2778 | ± | 0.0305 |
| - machine_learning | 1 | none | 5 | acc | 0.3571 | ± | 0.0455 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|---|---:|
| mmlu | 2 | none | | acc | 0.3107 | ± | 0.0039 |
| - humanities | 2 | none | | acc | 0.2912 | ± | 0.0066 |
| - other | 2 | none | | acc | 0.3602 | ± | 0.0086 |
| - social sciences | 2 | none | | acc | 0.3191 | ± | 0.0084 |
| - stem | 2 | none | | acc | 0.2829 | ± | 0.0080 |

Hi, thanks for the work. I'm trying to reproduce the reported results for the Llama-3.2-1B model on MMLU. The result I got is 0.3107, which is well below the 0.493 reported by Meta.

Could you please let me know if there are any specific settings I might have missed? Thanks in advance!
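
For reference, the settings in the log header above correspond to roughly the following invocation. This is a minimal sketch using lm-evaluation-harness's Python API (equivalent to the `lm_eval` CLI); the local model path is the one from the log, and it runs the harness's default `mmlu` task:

```python
# Sketch of the run that produced the log above, via the
# lm-evaluation-harness Python API (equivalent to the lm_eval CLI).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/data/models/meta-llama/Llama-3.2-1B,dtype=auto",
    tasks=["mmlu"],        # the harness's default MMLU task
    num_fewshot=5,         # matches "num_fewshot: 5" in the log header
    batch_size="auto",     # matches "batch_size: auto (4)" in the log header
)
print(results["results"]["mmlu"])  # aggregate metrics, e.g. the acc above
```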


wukaixingxp commented Dec 2, 2024

Hi @VoiceBeer! Meta's MMLU eval uses a different prompt style; please follow this readme and check out this PR to reproduce our MMLU number for 1B.
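
(Context on why the prompt style matters: the harness's default `mmlu` task is likelihood-based, i.e. it picks the answer letter whose tokens the model assigns the highest log-probability under the harness's own few-shot template, while Meta's eval uses its own template and scoring, so the two numbers are not directly comparable. Below is a minimal sketch of that likelihood-based scoring, assuming local access to the weights; the question shown is illustrative, not from the dataset:)

```python
# Minimal sketch of likelihood-based multiple-choice scoring, the mechanism
# behind the harness's default mmlu task. The prompt below is illustrative
# only; the real harness builds a 5-shot prompt per MMLU subject.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # assumes access to the weights
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

prompt = (
    "The following are multiple choice questions (with answers) about astronomy.\n\n"
    "Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)

def continuation_logprob(choice: str) -> float:
    """Log-likelihood of `choice` as a continuation of `prompt`.

    Simplified: assumes tokenizing prompt + choice does not merge tokens
    across the boundary, which the real harness handles more carefully.
    """
    ctx_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    cont = full[0, ctx_len:]                        # continuation token ids
    lp = logits[0, ctx_len - 1:-1].log_softmax(-1)  # predictions for them
    return lp[torch.arange(len(cont)), cont].sum().item()

# The predicted answer is the choice with the highest log-likelihood; a
# different prompt template shifts these likelihoods and hence the accuracy.
print(max([" A", " B", " C", " D"], key=continuation_logprob))
```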

VoiceBeer (Author) commented

Thanks @wukaixingxp! Appreciate it!
