
Add llama 3.2 mmlu, math, gpqa evals to meta_eval harness #801

Merged: 4 commits merged into meta-llama:main from aidand-732-add-3-2-evals_3 on Nov 26, 2024

Conversation

@aidando73 (Contributor) commented Nov 23, 2024

What does this PR do?

Resolves #732

Adds MMLU, MATH, and GPQA evaluations for the Llama 3.2-1B and 3.2-3B evals datasets.

Pretrained models:

Llama 3.2-1B:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
| - meta_mmlu |      1|none  |     0|acc     |↑  |0.3146|±  |0.0039|
|             |       |none  |     0|acc_norm|↑  |0.3146|±  |0.0039|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.3146|±  |0.0039|
|             |       |none  |     0|acc_norm|↑  |0.3146|±  |0.0039|

|   Groups    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.3146|±  |0.0039|
|             |       |none  |     0|acc_norm|↑  |0.3146|±  |0.0039|

Llama 3.2-3B:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
| - meta_mmlu |      1|none  |     0|acc     |↑  |0.5643|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5643|±  |0.0042|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5643|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5643|±  |0.0042|

|   Groups    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5643|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5643|±  |0.0042|

Pretty close to the meta reported numbers:

| Model | This MMLU eval | Meta MMLU eval |
|--------------|---------------:|---------------:|
| Llama 3.2-1B | 0.315 | 0.317 [1] |
| Llama 3.2-3B | 0.56 | 0.565 [1] |

Instruct models:

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-1B-Instruct MMLU | 0.462 | 0.485 [1] |
| 3.2-1B-Instruct MATH | 0.287 | 0.304 [1] |
| 3.2-1B-Instruct GPQA | 0.257 | 0.272 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
| - meta_math |      1|none        |     0|exact_match|↑  |0.2872|±  |0.0064|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-3B-Instruct MMLU | 0.607 | 0.637 [1] |
| 3.2-3B-Instruct MATH | 0.451 | 0.475 [1] |
| 3.2-3B-Instruct GPQA | 0.333 | 0.328 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
| - meta_math |      1|none        |     0|exact_match|↑  |0.4514|±  |0.0070|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

[x] Test A

Update llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml to:

model_name: "meta-llama/Llama-3.2-1B"
evals_dataset: "meta-llama/Llama-3.2-1B-evals"
...

Then run:

python prepare_meta_eval.py --config_path ./eval_config.yaml
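
Under the hood, the prepare step stages everything the harness needs and prints the command for the next step. A hypothetical sketch of that flow (inferred from the log output later in this thread, not the actual prepare_meta_eval.py source; names and paths are illustrative):

```python
# Hypothetical sketch, NOT the real prepare_meta_eval.py: read eval_config.yaml,
# stage the task templates into work_dir, and print the lm_eval command to run next.
import shutil
from pathlib import Path

import yaml

with open("eval_config.yaml") as f:
    cfg = yaml.safe_load(f)

work_dir = Path("./work_dir")
for template in Path("./meta_template").rglob("*.yaml"):
    dest = work_dir / template.relative_to("./meta_template")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(template, dest)  # the real script also rewrites each task's output paths

# Pretrained checkpoints map to the meta_pretrain group, Instruct ones to meta_instruct.
tasks = "meta_instruct" if cfg["model_name"].endswith("Instruct") else "meta_pretrain"
print(
    f"lm_eval --model vllm --model_args pretrained={cfg['model_name']},add_bos_token=True "
    f"--tasks {tasks} --batch_size auto --include_path {work_dir.resolve()} --log_samples"
)
```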

Then run the generated command.

Logs for Test A:

[x] Test B

Update llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml to:

model_name: "meta-llama/Llama-3.2-3B"
evals_dataset: "meta-llama/Llama-3.2-3B-evals"

Then run:

python prepare_meta_eval.py --config_path ./eval_config.yaml

Then run the generated command.

Logs for Test B:

Before submitting

[N/A] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.

  • Did you make sure to update the documentation with your changes?

[N/A] Did you write any new necessary tests? Seems like we're mainly testing the meta_eval harness manually at this stage.

@facebook-github-bot commented:

Hi @aidando73!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@aidando73 marked this pull request as draft on November 23, 2024 04:51
@aidando73 changed the title from "Draft - adding 3.2 evals to meta_eval" to "Add 3.2 mmlu eval to meta_eval harness" on Nov 23, 2024
@aidando73 changed the title from "Add 3.2 mmlu eval to meta_eval harness" to "Add llama 3.2 mmlu eval to meta_eval harness" on Nov 23, 2024
- **Tasks for 3.1 pretrained models**: BBH and MMLU-Pro
  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
- **Tasks for 3.2 pretrained models**: MMLU
  - Chosen because MMLU is a common eval, and is the first one shown on [llama.com](https://llama.com)
@aidando73 (Contributor Author) commented Nov 23, 2024

There aren't any tasks for the Llama 3.2 pretrained models that overlap with the Hugging Face leaderboard:

(screenshots)

So I just chose MMLU because it's common and it's the first one that appears on llama.com:

(screenshot)

Let me know if you have any other preferences here and I can accommodate.

Contributor replied:

No problem, MMLU will be a good one to have.

]:
    raise ValueError(
-       "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 Evals collection"
+       "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 or 3.2 Evals collection. Note that 3.2-Instruct evals are not yet supported."
@aidando73 (Contributor Author) replied:

I'd like to add a few more evals as well, including instruct evals.

If that sounds good to you, I'll submit follow-up PRs.

Contributor replied:

Sure, instruct versions would be great; I think MMLU, MATH, and GPQA are all great tasks to have.

@aidando73 force-pushed the aidand-732-add-3-2-evals_3 branch from 7ca3132 to 4f9050f on November 23, 2024 07:34
@aidando73 marked this pull request as ready for review on November 23, 2024 07:59
@facebook-github-bot commented:

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@aidando73 (Contributor Author) commented Nov 23, 2024

Regression Tests

To test for regressions, I ran:

meta-llama/Llama-3.1-8B-evals:

model_name: "meta-llama/Llama-3.1-8B"
evals_dataset: "meta-llama/Llama-3.1-8B-evals" 
|          Tasks          |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_bbh              |      1|strict-match|     0|exact_match|↑  |0.6478|±  |0.0059|
| - meta_mmlu_pro_pretrain|      1|strict-match|     0|exact_match|↑  |0.3561|±  |0.0044|
|meta_pretrain            |N/A    |strict-match|     0|exact_match|↑  |0.4585|±  |0.0035|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_pretrain|N/A    |strict-match|     0|exact_match|↑  |0.4585|±  |0.0035|

meta-llama/Llama-3.1-8B-Instruct-evals:

model_name: "meta-llama/Llama-3.1-8B-Instruct"
evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals"
|          Tasks          |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
| - meta_gpqa             |      1|strict-match|     0|exact_match            |↑  |0.3281|±  |0.0222|
| - meta_ifeval           |      2|none        |     0|inst_level_loose_acc   |↑  |0.8573|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8118|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|
|meta_instruct            |N/A    |none        |     0|exact_match            |↑  |0.2372|±  |0.0117|
|                         |       |strict-match|     0|exact_match            |↑  |0.4630|±  |0.0045|
|                         |       |none        |     0|inst_level_loose_acc   |↑  |0.8573|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8118|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|
| - meta_math_hard        |      1|none        |     0|exact_match            |↑  |0.2372|±  |0.0117|
| - meta_mmlu_pro_instruct|      1|strict-match|     0|exact_match            |↑  |0.4680|±  |0.0045|

|   Groups    |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
|meta_instruct|N/A    |none        |     0|exact_match            |↑  |0.2372|±  |0.0117|
|             |       |strict-match|     0|exact_match            |↑  |0.4630|±  |0.0045|
|             |       |none        |     0|inst_level_loose_acc   |↑  |0.8573|±  |N/A   |
|             |       |none        |     0|inst_level_strict_acc  |↑  |0.8118|±  |N/A   |
|             |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|             |       |none        |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|

All working as normal

@wukaixingxp (Contributor) commented:

Hi @aidando73! Thanks so much for your PR. I am still waiting for your instruct tasks (MMLU, MATH, GPQA) to be ready before running the final checks. One thing I want to point out: this eval calculates the micro_avg of MMLU, but we report the macro_avg of MMLU in the model card (see more explanation here). Therefore, as shown in our dataset metrics (1B micro_avg: 0.317 and 3B micro_avg: 0.565), your results are very close to the reported numbers.

| Model | This eval | Reported |
|--------------|----------:|---------:|
| Llama 3.2-1B | 0.315 | 0.317 |
| Llama 3.2-3B | 0.5643 | 0.565 |

@aidando73 (Contributor Author) commented Nov 25, 2024

> One thing I want to point out: this eval calculates the micro_avg of MMLU, but we report the macro_avg of MMLU in the model card (see more explanation here).

Oh I see. So macro_avg is the average of the per-category accuracies (mmlu_humanities, mmlu_stem, ..., mmlu_social_sciences, etc.), whereas micro_avg is the accuracy over all questions pooled together. I'm going to assume that micro_avg is good enough for our case; let me know if you want me to implement macro_avg.
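
To make the difference concrete, a small illustrative sketch with made-up counts (not the harness's actual implementation):

```python
# Made-up per-category MMLU results, just to show how the two averages differ.
categories = {
    "mmlu_humanities":      {"correct": 300, "total": 1000},
    "mmlu_stem":            {"correct": 200, "total": 500},
    "mmlu_social_sciences": {"correct": 450, "total": 900},
}

# micro_avg: pool every question, then divide (what this eval reports).
micro_avg = sum(c["correct"] for c in categories.values()) / sum(c["total"] for c in categories.values())

# macro_avg: average the per-category accuracies (what the model card reports).
macro_avg = sum(c["correct"] / c["total"] for c in categories.values()) / len(categories)

print(f"micro_avg={micro_avg:.4f}, macro_avg={macro_avg:.4f}")  # they diverge when categories differ in size
```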

> I am still waiting for your instruct tasks (MMLU, MATH, GPQA) to be ready before running the final checks.

Ok sounds good, TODO:

  • Instruct Eval MMLU
  • Instruct Eval Math
  • Instruct Eval GPQA
  • Update documentation for all tasks

@aidando73 (Contributor Author) commented Nov 25, 2024

Instruct MMLU, MATH and GPQA evals

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-1B-Instruct MMLU | 0.462 | 0.485 [1] |
| 3.2-1B-Instruct MATH | 0.287 | 0.304 [1] |
| 3.2-1B-Instruct GPQA | 0.257 | 0.272 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
| - meta_math |      1|none        |     0|exact_match|↑  |0.2872|±  |0.0064|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-3B-Instruct MMLU | 0.607 | 0.637 [1] |
| 3.2-3B-Instruct MATH | 0.451 | 0.475 [1] |
| 3.2-3B-Instruct GPQA | 0.333 | 0.328 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
| - meta_math |      1|none        |     0|exact_match|↑  |0.4514|±  |0.0070|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|

@aidando73 changed the title from "Add llama 3.2 mmlu eval to meta_eval harness" to "Add llama 3.2 mmlu and gpqa evals to meta_eval harness" on Nov 25, 2024
filter:
  - function: "regex"
    group_select: -1
    regex_pattern: ' ([A-Z])'
@aidando73 (Contributor Author) commented:

(histogram image)
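
For readers without the image, a minimal Python sketch of what this filter does, assuming lm-evaluation-harness regex-filter semantics (apply regex_pattern with findall, then group_select: -1 keeps the last match):

```python
import re

def extract_choice(generation: str) -> str | None:
    """Mimic filter: function=regex, regex_pattern=' ([A-Z])', group_select=-1.

    Collects every single capital letter preceded by a space and keeps the last
    one, so the final answer letter wins even if earlier chain-of-thought text
    mentions other option letters.
    """
    matches = re.findall(r" ([A-Z])", generation)
    return matches[-1] if matches else None

print(extract_choice("Options A and C look plausible, but the best answer is B"))  # -> "B"
```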

@aidando73 changed the title from "Add llama 3.2 mmlu and gpqa evals to meta_eval harness" to "Add llama 3.2 mmlu, math, gpqa evals to meta_eval harness" on Nov 25, 2024
@aidando73 force-pushed the aidand-732-add-3-2-evals_3 branch from 33a1f54 to ab1b145 on November 26, 2024 02:47
@aidando73 (Contributor Author) commented Nov 26, 2024

Regression Tests

To test for regressions, I ran:

meta-llama/Llama-3.1-8B-evals:

model_name: "meta-llama/Llama-3.1-8B"
evals_dataset: "meta-llama/Llama-3.1-8B-evals" 
|          Tasks          |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_bbh              |      1|strict-match|     0|exact_match|↑  |0.6483|±  |0.0059|
| - meta_mmlu_pro_pretrain|      1|strict-match|     0|exact_match|↑  |0.3582|±  |0.0044|
|meta_pretrain            |N/A    |strict-match|     0|exact_match|↑  |0.4601|±  |0.0035|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_pretrain|N/A    |strict-match|     0|exact_match|↑  |0.4601|±  |0.0035|

meta-llama/Llama-3.1-8B-Instruct-evals:

model_name: "meta-llama/Llama-3.1-8B-Instruct"
evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals"
|          Tasks          |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
| - meta_gpqa_cot         |      1|strict-match|     0|exact_match            |↑  |0.3304|±  |0.0222|
| - meta_ifeval           |      2|none        |     0|inst_level_loose_acc   |↑  |0.8561|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7394|±  |0.0189|
|meta_instruct            |N/A    |none        |     0|exact_match            |↑  |0.2387|±  |0.0117|
|                         |       |strict-match|     0|exact_match            |↑  |0.4628|±  |0.0045|
|                         |       |none        |     0|inst_level_loose_acc   |↑  |0.8561|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7394|±  |0.0189|
| - meta_math_hard        |      1|none        |     0|exact_match            |↑  |0.2387|±  |0.0117|
| - meta_mmlu_pro_instruct|      1|strict-match|     0|exact_match            |↑  |0.4678|±  |0.0045|

|   Groups    |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
|meta_instruct|N/A    |none        |     0|exact_match            |↑  |0.2387|±  |0.0117|
|             |       |strict-match|     0|exact_match            |↑  |0.4628|±  |0.0045|
|             |       |none        |     0|inst_level_loose_acc   |↑  |0.8561|±  |N/A   |
|             |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|             |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|             |       |none        |     0|prompt_level_strict_acc|↑  |0.7394|±  |0.0189|

meta-llama/Llama-3.2-3B-evals:

model_name: "meta-llama/Llama-3.2-3B"
evals_dataset: "meta-llama/Llama-3.2-3B-evals"
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
| - meta_mmlu |      1|none  |     0|acc     |↑  |0.5662|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5662|±  |0.0042|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5662|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5662|±  |0.0042|

|   Groups    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5662|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5662|±  |0.0042|

All working as normal

@aidando73 (Contributor Author) commented Nov 26, 2024

@wukaixingxp this PR is ready for re-review now; thank you for the feedback so far. Let me know if you want any changes.

I've run the 3.1 evaluations and everything is still working.

@wukaixingxp (Contributor) commented:

Running 3.1 8B-instruct eval log

(test_12) [[email protected] ~/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval (aidand-732-add-3-2-evals_3)]$ python prepare_meta_eval.py 
[nltk_data] Downloading package punkt_tab to /home/kaiwu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
changing ./meta_template/bbh/bbh_3shot_cot.yaml to output_path: ./work_dir/bbh/bbh_3shot_cot.yaml
changing ./meta_template/gpqa_cot/gpqa_0shot_cot.yaml to output_path: ./work_dir/gpqa_cot/gpqa_0shot_cot.yaml
changing ./meta_template/ifeval/ifeval.yaml to output_path: ./work_dir/ifeval/ifeval.yaml
changing ./meta_template/math_hard/math_4shot_cot.yaml to output_path: ./work_dir/math_hard/math_4shot_cot.yaml
changing ./meta_template/math_hard/math_hard_0shot_cot.yaml to output_path: ./work_dir/math_hard/math_hard_0shot_cot.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml
changing ./meta_template/gpqa/gpqa_0shot.yaml to output_path: ./work_dir/gpqa/gpqa_0shot.yaml
changing ./meta_template/mmlu/mmlu.yaml to output_path: ./work_dir/mmlu/mmlu.yaml
preparing the math data using Llama-3.2-3B-Instruct's evals dataset
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████| 5/5 [00:00<00:00, 98.29ba/s]
prepration for the meta-llama/Llama-3.2-3B-Instruct using meta-llama/Llama-3.2-3B-Instruct-evals is done, all saved the work_dir: ./work_dir
please use the following command to run the meta reproduce evals:
lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples 
(test_12) [[email protected] ~/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval (aidand-732-add-3-2-evals_3)]$ python prepare_meta_eval.py 
[nltk_data] Downloading package punkt_tab to /home/kaiwu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
changing ./meta_template/bbh/bbh_3shot_cot.yaml to output_path: ./work_dir/bbh/bbh_3shot_cot.yaml
changing ./meta_template/gpqa_cot/gpqa_0shot_cot.yaml to output_path: ./work_dir/gpqa_cot/gpqa_0shot_cot.yaml
changing ./meta_template/ifeval/ifeval.yaml to output_path: ./work_dir/ifeval/ifeval.yaml
changing ./meta_template/math_hard/math_4shot_cot.yaml to output_path: ./work_dir/math_hard/math_4shot_cot.yaml
changing ./meta_template/math_hard/math_hard_0shot_cot.yaml to output_path: ./work_dir/math_hard/math_hard_0shot_cot.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml
changing ./meta_template/gpqa/gpqa_0shot.yaml to output_path: ./work_dir/gpqa/gpqa_0shot.yaml
changing ./meta_template/mmlu/mmlu.yaml to output_path: ./work_dir/mmlu/mmlu.yaml
preparing the ifeval data using Llama-3.1-8B-Instruct's evals dataset
README.md: 100%|███████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 295kB/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00, 208.68ba/s]
preparing the math hard data using Llama-3.1-8B-Instruct's evals dataset
README.md: 100%|████████████████████████████████████████████████████████████████████| 4.27k/4.27k [00:00<00:00, 33.8MB/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████| 2/2 [00:00<00:00, 151.08ba/s]
prepration for the meta-llama/Llama-3.1-8B-Instruct using meta-llama/Llama-3.1-8B-Instruct-evals is done, all saved the work_dir: ./work_dir
please use the following command to run the meta reproduce evals:
lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples 
(test_12) [[email protected] ~/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval (aidand-732-add-3-2-evals_3)]$ lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples 
2024-11-26:14:42:13,188 INFO     [__main__.py:272] Verbosity set to INFO
2024-11-26:14:42:13,188 INFO     [__main__.py:303] Including path: /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir
2024-11-26:14:42:16,739 INFO     [__main__.py:369] Selected Tasks: ['meta_instruct']
2024-11-26:14:42:16,740 INFO     [evaluator.py:152] Setting random seed to 42 | Setting numpy seed to 42 | Setting torch manual seed to 42
2024-11-26:14:42:16,740 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Llama-3.1-8B-Instruct', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8, 'data_parallel_size': 1, 'max_model_len': 8192, 'add_bos_token': True, 'seed': 42}
INFO 11-26 14:42:21 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-26 14:42:21 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 14:42:22 selector.py:135] Using Flash Attention backend.
INFO 11-26 14:42:23 model_runner.py:1072] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 11-26 14:42:23 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.27s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.08s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:01,  1.96s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.65s/it]

INFO 11-26 14:42:30 model_runner.py:1077] Loading model weights took 14.9888 GB
INFO 11-26 14:42:31 worker.py:232] Memory profiling results: total_gpu_memory=95.00GiB initial_memory_usage=16.50GiB peak_torch_memory=16.24GiB memory_usage_post_profile=16.60GiB non_torch_memory=1.58GiB kv_cache_size=58.18GiB gpu_memory_utilization=0.80
INFO 11-26 14:42:31 gpu_executor.py:113] # GPU blocks: 29786, # CPU blocks: 2048
INFO 11-26 14:42:31 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 58.18x
INFO 11-26 14:42:32 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-26 14:42:32 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-26 14:42:51 model_runner.py:1518] Graph capturing finished in 19 secs, took 0.90 GiB
Generating train split: 541 examples [00:00, 73500.86 examples/s]
2024-11-26:14:42:53,312 WARNING  [task.py:325] [Task: meta_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:53,312 WARNING  [task.py:325] [Task: meta_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
Generating train split: 1324 examples [00:00, 100999.55 examples/s]
2024-11-26:14:42:53,635 WARNING  [task.py:325] [Task: meta_math_hard] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
Map: 100%|█████████████████████████████████████████████████████████████████| 1324/1324 [00:00<00:00, 13178.65 examples/s]
2024-11-26:14:42:53,747 WARNING  [task.py:325] [Task: meta_math_hard] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,144 WARNING  [task.py:325] [Task: meta_gpqa_cot] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,150 WARNING  [task.py:325] [Task: meta_gpqa_cot] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,900 WARNING  [task.py:325] [Task: meta_mmlu_pro_instruct] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,920 WARNING  [task.py:325] [Task: meta_mmlu_pro_instruct] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,242 INFO     [task.py:411] Building contexts for meta_mmlu_pro_instruct on rank 0...
100%|███████████████████████████████████████████████████████████████████████████| 12032/12032 [00:00<00:00, 95751.93it/s]
2024-11-26:14:42:58,395 INFO     [task.py:411] Building contexts for meta_gpqa_cot on rank 0...
100%|██████████████████████████████████████████████████████████████████████████████| 448/448 [00:00<00:00, 121818.36it/s]
2024-11-26:14:42:58,438 INFO     [task.py:411] Building contexts for meta_math_hard on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████| 1324/1324 [00:00<00:00, 93005.38it/s]
2024-11-26:14:42:58,536 INFO     [task.py:411] Building contexts for meta_ifeval on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████| 541/541 [00:00<00:00, 85170.73it/s]
2024-11-26:14:42:58,616 INFO     [evaluator.py:438] Running generate_until requests
Running generate_until requests:   0%|                                                         | 0/14345 [00:00<?, ?it/sWARNING 11-26 14:43:40 scheduler.py:1481] Sequence group 182 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 100%|█| 12032/12032 [20:54<00:00,  9.59it/s, est. speed input: 15218.54 toks/s, output: 2338.43 toks/s
Processed prompts: 100%|█████| 448/448 [01:02<00:00,  7.18it/s, est. speed input: 2388.75 toks/s, output: 4377.03 toks/s]
Running generate_until requests:  87%|███████████████████████████████████████▏     | 12480/14345 [22:15<02:26, 12.77it/sWARNING 11-26 15:08:03 scheduler.py:1481] Sequence group 13384 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
                                                                                                                        WARNING 11-26 15:08:34 scheduler.py:1481] Sequence group 13143 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101
                                                                                                                        WARNING 11-26 15:09:34 scheduler.py:1481] Sequence group 13509 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=151
                                                                                                                        WARNING 11-26 15:09:42 scheduler.py:1481] Sequence group 13500 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=201
Processed prompts: 100%|████| 1324/1324 [07:14<00:00,  3.05it/s, est. speed input: 676.72 toks/s, output: 3722.89 toks/s]
Processed prompts: 100%|██████| 541/541 [00:33<00:00, 16.06it/s, est. speed input: 898.17 toks/s, output: 5502.61 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████| 14345/14345 [29:47<00:00,  8.02it/s]
2024-11-26:15:13:29,083 INFO     [evaluation_tracker.py:182] Saving results aggregated
2024-11-26:15:13:29,092 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_gpqa_cot
2024-11-26:15:13:29,127 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_ifeval
2024-11-26:15:13:29,166 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_math_hard
2024-11-26:15:13:29,299 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_mmlu_pro_instruct
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|          Tasks          |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
| - meta_gpqa_cot         |      1|strict-match|     0|exact_match            |↑  |0.3103|±  |0.0219|
| - meta_ifeval           |      2|none        |     0|inst_level_loose_acc   |↑  |0.8537|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7948|±  |0.0174|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7431|±  |0.0188|
|meta_instruct            |N/A    |none        |     0|exact_match            |↑  |0.2455|±  |0.0118|
|                         |       |strict-match|     0|exact_match            |↑  |0.4648|±  |0.0045|
|                         |       |none        |     0|inst_level_loose_acc   |↑  |0.8537|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7948|±  |0.0174|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7431|±  |0.0188|
| - meta_math_hard        |      1|none        |     0|exact_match            |↑  |0.2455|±  |0.0118|
| - meta_mmlu_pro_instruct|      1|strict-match|     0|exact_match            |↑  |0.4706|±  |0.0046|

|   Groups    |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
|meta_instruct|N/A    |none        |     0|exact_match            |↑  |0.2455|±  |0.0118|
|             |       |strict-match|     0|exact_match            |↑  |0.4648|±  |0.0045|
|             |       |none        |     0|inst_level_loose_acc   |↑  |0.8537|±  |N/A   |
|             |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|             |       |none        |     0|prompt_level_loose_acc |↑  |0.7948|±  |0.0174|
|             |       |none        |     0|prompt_level_strict_acc|↑  |0.7431|±  |0.0188|

[rank0]:[W1126 15:13:32.278660927 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

@wukaixingxp (Contributor) commented:

Running 3B-instruct eval log

(aidand-732-add-3-2-evals_3)]$ CUDA_VISIBLE_DEVICES=4 lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.3,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples
2024-11-26:13:58:22,401 INFO     [__main__.py:272] Verbosity set to INFO
2024-11-26:13:58:22,401 INFO     [__main__.py:303] Including path: /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir
2024-11-26:13:58:25,811 INFO     [__main__.py:369] Selected Tasks: ['meta_instruct']
2024-11-26:13:58:25,812 INFO     [evaluator.py:152] Setting random seed to 42 | Setting numpy seed to 42 | Setting torch manual seed to 42
2024-11-26:13:58:25,812 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-3B-Instruct', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.3, 'data_parallel_size': 1, 'max_model_len': 8192, 'add_bos_token': True, 'seed': 42}
INFO 11-26 13:58:30 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-26 13:58:30 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=meta-llama/Llama-3.2-3B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 13:58:31 selector.py:135] Using Flash Attention backend.
INFO 11-26 13:58:32 model_runner.py:1072] Starting to load model meta-llama/Llama-3.2-3B-Instruct...
INFO 11-26 13:58:32 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.77it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.71it/s]

INFO 11-26 13:58:34 model_runner.py:1077] Loading model weights took 6.0160 GB
INFO 11-26 13:58:34 worker.py:232] Memory profiling results: total_gpu_memory=95.00GiB initial_memory_usage=6.65GiB peak_torch_memory=7.26GiB memory_usage_post_profile=6.75GiB non_torch_memory=0.70GiB kv_cache_size=20.55GiB gpu_memory_utilization=0.30
INFO 11-26 13:58:34 gpu_executor.py:113] # GPU blocks: 12022, # CPU blocks: 2340
INFO 11-26 13:58:34 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 23.48x
INFO 11-26 13:58:36 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-26 13:58:36 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-26 13:58:56 model_runner.py:1518] Graph capturing finished in 20 secs, took 0.79 GiB
2024-11-26:13:58:58,381 WARNING  [task.py:325] [Task: meta_mmlu] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:58:58,388 WARNING  [task.py:325] [Task: meta_mmlu] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:58:59,409 WARNING  [task.py:325] [Task: meta_math] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:58:59,414 WARNING  [task.py:325] [Task: meta_math] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:59:01,229 WARNING  [task.py:325] [Task: meta_gpqa] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:59:01,237 WARNING  [task.py:325] [Task: meta_gpqa] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:59:01,289 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:13:59:01,289 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:13:59:01,289 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:13:59:01,293 INFO     [task.py:411] Building contexts for meta_gpqa on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████| 448/448 [00:00<00:00, 72060.45it/s]
2024-11-26:13:59:01,348 INFO     [task.py:411] Building contexts for meta_math on rank 0...
100%|████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 140966.05it/s]
2024-11-26:13:59:02,005 INFO     [task.py:411] Building contexts for meta_mmlu on rank 0...
100%|███████████████████████████████████████████████████████████████████████████| 14042/14042 [00:00<00:00, 27156.49it/s]
2024-11-26:13:59:03,316 INFO     [evaluator.py:438] Running generate_until requests
Processed prompts: 100%|████| 448/448 [00:01<00:00, 262.04it/s, est. speed input: 70966.06 toks/s, output: 786.13 toks/s]
Running generate_until requests:   0%|                                                | 1/5448 [00:01<2:37:53,  1.74s/itWARNING 11-26 13:59:21 scheduler.py:1481] Sequence group 529 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 100%|██| 5000/5000 [08:34<00:00,  9.73it/s, est. speed input: 18569.26 toks/s, output: 3308.49 toks/s]
Running generate_until requests: 100%|███████████████████████████████████████████████| 5448/5448 [08:37<00:00, 10.53it/s]
2024-11-26:14:07:52,924 INFO     [evaluator.py:438] Running loglikelihood requests
Processed prompts: 100%|██| 56168/56168 [18:08<00:00, 51.62it/s, est. speed input: 35382.79 toks/s, output: 51.62 toks/s]
Running loglikelihood requests: 100%|██████████████████████████████████████████████| 56168/56168 [18:35<00:00, 50.37it/s]
2024-11-26:14:29:08,593 INFO     [evaluation_tracker.py:182] Saving results aggregated
2024-11-26:14:29:08,607 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_gpqa
2024-11-26:14:29:08,630 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_math
2024-11-26:14:29:09,225 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_mmlu
vllm (pretrained=meta-llama/Llama-3.2-3B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.3,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
| - meta_math |      1|none        |     0|exact_match|↑  |0.4514|±  |0.0070|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|

[rank0]:[W1126 14:29:13.514981629 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())


- **Tasks for pretrained models**: BBH and MMLU-Pro
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
These tasks are common evalutions, many of which overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Contributor commented:

small typo: evalutions -> evaluations

@wukaixingxp (Contributor) left a review:

Thank you so much for this great PR. I have tested it with the 8B and 3B instruct evals and the results look good. I will merge it now.

@wukaixingxp merged commit 7ebd68c into meta-llama:main on Nov 26, 2024 (2 of 4 checks passed)
@wukaixingxp self-assigned this on Dec 4, 2024