Add llama 3.2 mmlu, math, gpqa evals to meta_eval harness #801
Hi @aidando73! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
tools/benchmarks/llm_eval_harness/meta_eval/meta_template/mmlu/utils.py
- **Tasks for 3.1 pretrained models**: BBH and MMLU-Pro
  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
- **Tasks for 3.2 pretrained models**: MMLU
  - Chosen because MMLU is a common eval, and is the first one shown on [llama.com](https://llama.com)
No problem, MMLU will be a good one to have.
]:
    raise ValueError(
        "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 Evals collection"
        "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 or 3.2 Evals collection. Note that 3.2-Instruct evals are not yet supported."
I'd like to add a few more evals as well, including instruct evals.
If that SGTY, will submit follow-up PRs
Sure, instruct version will be great, I think MMLU, MATH and GPQA are all great tasks to have.
7ca3132 to 4f9050f
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Regression Tests

To test regressions ran:

meta-llama/Llama-3.1-8B-evals:
model_name: "meta-llama/Llama-3.1-8B"
evals_dataset: "meta-llama/Llama-3.1-8B-evals"

meta-llama/Llama-3.1-8B-Instruct-evals:
model_name: "meta-llama/Llama-3.1-8B-Instruct"
evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals"

All working as normal
Hi! @aidando73 Thanks so much for your PR. I am still waiting for your instruction tasks (mmlu,math,gpqa) to be ready before running the final checks. One thing I want to point out is that this eval calculates the micro_avg of mmlu, but we report macro_avg of mmlu in the model card, see more explanation here. Therefore, as shown in our dataset metric for 1b micro_avg: 0.317 and 3b micro_avg: 0.565, your result is very close to the reported number.
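To make the micro_avg vs. macro_avg distinction concrete, here is a minimal sketch. It assumes macro_avg means the unweighted mean of per-subject accuracies (the usual definition; the exact grouping used for the model card is not spelled out in this thread). The function names and the toy records are hypothetical and are not taken from the meta_eval harness.

```python
from collections import defaultdict


def micro_avg(records):
    """Accuracy over all questions pooled together (each question weighs the same)."""
    return sum(correct for _, correct in records) / len(records)


def macro_avg(records):
    """Unweighted mean of per-subject accuracies (each subject weighs the same)."""
    per_subject = defaultdict(list)
    for subject, correct in records:
        per_subject[subject].append(correct)
    accuracies = [sum(v) / len(v) for v in per_subject.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical toy data: (subject, 1 if answered correctly else 0).
# "algebra" has 8 questions (6 correct); "law" has 2 questions (1 correct).
records = (
    [("algebra", 1)] * 6
    + [("algebra", 0)] * 2
    + [("law", 1)] * 1
    + [("law", 0)] * 1
)

print(micro_avg(records))  # 7 correct out of 10 questions
print(macro_avg(records))  # mean of 0.75 and 0.5
```

The two numbers only coincide when every subject has the same number of questions, which is why a micro-averaged result can sit slightly off a macro-averaged reported number.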
Oh I see. So macro_avg is based off of the average across
Ok sounds good, TODO:
- Instruct MMLU, MATH and GPQA evals
tools/benchmarks/llm_eval_harness/meta_eval/prepare_meta_eval.py
filter:
  - function: "regex"
    group_select: -1
    regex_pattern: ' ([A-Z])'
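As a rough illustration of what this filter is meant to do, the sketch below applies the same pattern with plain `re` and picks the last match, mirroring `group_select: -1`. This is an approximation written for this thread, not the harness's own filter implementation; `extract_choice` and the sample response are hypothetical.

```python
import re

# Same pattern as in the YAML above: a space followed by one capital letter.
PATTERN = re.compile(r" ([A-Z])")


def extract_choice(response, group_select=-1):
    """Return the selected captured letter, or None if nothing matches."""
    matches = PATTERN.findall(response)
    if not matches:
        return None
    return matches[group_select]


response = "The options are A and B. The best answer is C"
print(PATTERN.findall(response))  # every capital letter that follows a space
print(extract_choice(response))   # last match, as with group_select: -1
```

Note that the pattern also matches the first letter of any capitalized word that follows a space (the "T" of the second "The" here), so taking the last match only works when the answer letter comes at the end of the response.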
tools/benchmarks/llm_eval_harness/meta_eval/meta_template/gpqa/utils.py
33a1f54 to ab1b145
Regression Tests

To test regressions ran:

meta-llama/Llama-3.1-8B-evals:
model_name: "meta-llama/Llama-3.1-8B"
evals_dataset: "meta-llama/Llama-3.1-8B-evals"

meta-llama/Llama-3.1-8B-Instruct-evals:
model_name: "meta-llama/Llama-3.1-8B-Instruct"
evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals"

meta-llama/Llama-3.2-3B-evals:
model_name: "meta-llama/Llama-3.2-3B"
evals_dataset: "meta-llama/Llama-3.2-3B-evals"

All working as normal
@wukaixingxp this PR is ready for re-review now - thank you for the feedback so far. Lmk if you want any changes. I've run the 3.1 evaluations and everything is still working.

Running 3.1 8B-instruct eval log

Running 3B-instruct eval log
- **Tasks for pretrained models**: BBH and MMLU-Pro
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
These tasks are common evalutions, many of which overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
small typo: evalutions -> evaluations
Thank you so much for this great PR. I have tested it with 8B and 3B instruct eval and the result looks good. I will merge it now.
What does this PR do?
Resolves #732
Adds in `mmlu`, `math` and `gpqa` evaluation for llama 3.2-1B and 3.2-3B eval datasets.

Pretrain models:
Llama 3.2-1B:
Llama 3.2-3B:
Pretty close to the meta reported numbers:
Instruct Models
Feature/Issue validation/testing
Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
[x] Test A
Update `llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml` to:

Then run:
Then run the generated command
Logs for Test A:
[x] Test B
Update `llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml` to:

Then run:
Then run the generated command
Logs for Test B:
Before submitting
[N/A] Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
[N/A] Did you write any new necessary tests? Seems like we're mainly testing the meta_eval harness manually at this stage