
Add llama 3.2 mmlu, math, gpqa evals to meta_eval harness #801

Merged: 4 commits merged into meta-llama:main from aidand-732-add-3-2-evals_3 on Nov 26, 2024

Conversation

@aidando73 (Contributor) commented Nov 23, 2024

What does this PR do?

Resolves #732

Adds MMLU, MATH, and GPQA evaluations for the Llama 3.2-1B and 3.2-3B evals datasets.

Pretrained models:

Llama 3.2-1B:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
| - meta_mmlu |      1|none  |     0|acc     |↑  |0.3146|±  |0.0039|
|             |       |none  |     0|acc_norm|↑  |0.3146|±  |0.0039|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.3146|±  |0.0039|
|             |       |none  |     0|acc_norm|↑  |0.3146|±  |0.0039|

|   Groups    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.3146|±  |0.0039|
|             |       |none  |     0|acc_norm|↑  |0.3146|±  |0.0039|

Llama 3.2-3B:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
| - meta_mmlu |      1|none  |     0|acc     |↑  |0.5643|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5643|±  |0.0042|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5643|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5643|±  |0.0042|

|   Groups    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5643|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5643|±  |0.0042|

Pretty close to the meta reported numbers:

| Model | This MMLU eval | Meta MMLU eval |
|--------------|---------------:|---------------:|
| Llama 3.2-1B | 0.315 | 0.317 [1] |
| Llama 3.2-3B | 0.56 | 0.565 [1] |

Instruct models:

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-1B-Instruct MMLU | 0.462 | 0.485 [1] |
| 3.2-1B-Instruct MATH | 0.287 | 0.304 [1] |
| 3.2-1B-Instruct GPQA | 0.257 | 0.272 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
| - meta_math |      1|none        |     0|exact_match|↑  |0.2872|±  |0.0064|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-3B-Instruct MMLU | 0.607 | 0.637 [1] |
| 3.2-3B-Instruct MATH | 0.451 | 0.475 [1] |
| 3.2-3B-Instruct GPQA | 0.333 | 0.328 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
| - meta_math |      1|none        |     0|exact_match|↑  |0.4514|±  |0.0070|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

[x] Test A

Update llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml to:

model_name: "meta-llama/Llama-3.2-1B"
evals_dataset: "meta-llama/Llama-3.2-1B-evals"
...

Then run:

python prepare_meta_eval.py --config_path ./eval_config.yaml
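
Under the hood, the prepare step stages everything the harness needs and prints the command for the next step. A hypothetical sketch of that flow (inferred from the log output later in this thread, not the actual prepare_meta_eval.py source; names and paths are illustrative):

```python
# Hypothetical sketch, NOT the real prepare_meta_eval.py: read eval_config.yaml,
# stage the task templates into work_dir, and print the lm_eval command to run next.
import shutil
from pathlib import Path

import yaml

with open("eval_config.yaml") as f:
    cfg = yaml.safe_load(f)

work_dir = Path("./work_dir")
for template in Path("./meta_template").rglob("*.yaml"):
    dest = work_dir / template.relative_to("./meta_template")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(template, dest)  # the real script also rewrites each task's output paths

# Pretrained checkpoints map to the meta_pretrain group, Instruct ones to meta_instruct.
tasks = "meta_instruct" if cfg["model_name"].endswith("Instruct") else "meta_pretrain"
print(
    f"lm_eval --model vllm --model_args pretrained={cfg['model_name']},add_bos_token=True "
    f"--tasks {tasks} --batch_size auto --include_path {work_dir.resolve()} --log_samples"
)
```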

Then run the generated command.

Logs for Test A:

[x] Test B

Update llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/eval_config.yaml to:

model_name: "meta-llama/Llama-3.2-3B"
evals_dataset: "meta-llama/Llama-3.2-3B-evals"

Then run:

python prepare_meta_eval.py --config_path ./eval_config.yaml

Then run the generated command.

Logs for Test B:

Before submitting

[N/A] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.

  • Did you make sure to update the documentation with your changes?

[N/A] Did you write any new necessary tests? Seems like we're mainly testing the meta_eval harness manually at this stage.

@facebook-github-bot commented:

Hi @aidando73!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@aidando73 marked this pull request as draft on November 23, 2024 04:51
@aidando73 changed the title from "Draft - adding 3.2 evals to meta_eval" to "Add 3.2 mmlu eval to meta_eval harness" on Nov 23, 2024
@aidando73 changed the title from "Add 3.2 mmlu eval to meta_eval harness" to "Add llama 3.2 mmlu eval to meta_eval harness" on Nov 23, 2024
- **Tasks for 3.1 pretrained models**: BBH and MMLU-Pro
  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
- **Tasks for 3.2 pretrained models**: MMLU
  - Chosen because MMLU is a common eval, and is the first one shown on [llama.com](https://llama.com)
@aidando73 (Contributor Author) commented Nov 23, 2024

There aren't any tasks for the Llama 3.2 pretrained models that overlap with the Hugging Face leaderboard:

(screenshots)

So I just chose MMLU because it's common and it's the first one that appears on llama.com:

(screenshot)

Let me know if you have any other preferences here and I can accommodate.

Contributor replied:

No problem, MMLU will be a good one to have.

]:
    raise ValueError(
-       "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 Evals collection"
+       "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 or 3.2 Evals collection. Note that 3.2-Instruct evals are not yet supported."
@aidando73 (Contributor Author) replied:

I'd like to add a few more evals as well, including instruct evals.

If that sounds good to you, I'll submit follow-up PRs.

Contributor replied:

Sure, instruct versions would be great; I think MMLU, MATH, and GPQA are all great tasks to have.

@aidando73 force-pushed the aidand-732-add-3-2-evals_3 branch from 7ca3132 to 4f9050f on November 23, 2024 07:34
@aidando73 marked this pull request as ready for review on November 23, 2024 07:59
@facebook-github-bot commented:

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@aidando73 (Contributor Author) commented Nov 23, 2024

Regression Tests

To test for regressions, I ran:

meta-llama/Llama-3.1-8B-evals:

model_name: "meta-llama/Llama-3.1-8B"
evals_dataset: "meta-llama/Llama-3.1-8B-evals" 
|          Tasks          |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_bbh              |      1|strict-match|     0|exact_match|↑  |0.6478|±  |0.0059|
| - meta_mmlu_pro_pretrain|      1|strict-match|     0|exact_match|↑  |0.3561|±  |0.0044|
|meta_pretrain            |N/A    |strict-match|     0|exact_match|↑  |0.4585|±  |0.0035|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_pretrain|N/A    |strict-match|     0|exact_match|↑  |0.4585|±  |0.0035|

meta-llama/Llama-3.1-8B-Instruct-evals:

model_name: "meta-llama/Llama-3.1-8B-Instruct"
evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals"
|          Tasks          |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
| - meta_gpqa             |      1|strict-match|     0|exact_match            |↑  |0.3281|±  |0.0222|
| - meta_ifeval           |      2|none        |     0|inst_level_loose_acc   |↑  |0.8573|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8118|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|
|meta_instruct            |N/A    |none        |     0|exact_match            |↑  |0.2372|±  |0.0117|
|                         |       |strict-match|     0|exact_match            |↑  |0.4630|±  |0.0045|
|                         |       |none        |     0|inst_level_loose_acc   |↑  |0.8573|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8118|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|
| - meta_math_hard        |      1|none        |     0|exact_match            |↑  |0.2372|±  |0.0117|
| - meta_mmlu_pro_instruct|      1|strict-match|     0|exact_match            |↑  |0.4680|±  |0.0045|

|   Groups    |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
|meta_instruct|N/A    |none        |     0|exact_match            |↑  |0.2372|±  |0.0117|
|             |       |strict-match|     0|exact_match            |↑  |0.4630|±  |0.0045|
|             |       |none        |     0|inst_level_loose_acc   |↑  |0.8573|±  |N/A   |
|             |       |none        |     0|inst_level_strict_acc  |↑  |0.8118|±  |N/A   |
|             |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|             |       |none        |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|

All working as normal

@wukaixingxp (Contributor) commented:

Hi @aidando73! Thanks so much for your PR. I am still waiting for your instruct tasks (MMLU, MATH, GPQA) to be ready before running the final checks. One thing I want to point out: this eval calculates the micro_avg of MMLU, but we report the macro_avg of MMLU in the model card (see more explanation here). Therefore, as shown in our dataset metrics (1B micro_avg: 0.317 and 3B micro_avg: 0.565), your results are very close to the reported numbers.

| Model | This eval | Reported |
|--------------|----------:|---------:|
| Llama 3.2-1B | 0.315 | 0.317 |
| Llama 3.2-3B | 0.5643 | 0.565 |

@aidando73 (Contributor Author) commented Nov 25, 2024

> One thing I want to point out: this eval calculates the micro_avg of MMLU, but we report the macro_avg of MMLU in the model card (see more explanation here).

Oh I see. So macro_avg is the average of the per-category accuracies (mmlu_humanities, mmlu_stem, ..., mmlu_social_sciences, etc.), whereas micro_avg is the accuracy over all questions pooled together. I'm going to assume that micro_avg is good enough for our case; let me know if you want me to implement macro_avg.
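
To make the difference concrete, a small illustrative sketch with made-up counts (not the harness's actual implementation):

```python
# Made-up per-category MMLU results, just to show how the two averages differ.
categories = {
    "mmlu_humanities":      {"correct": 300, "total": 1000},
    "mmlu_stem":            {"correct": 200, "total": 500},
    "mmlu_social_sciences": {"correct": 450, "total": 900},
}

# micro_avg: pool every question, then divide (what this eval reports).
micro_avg = sum(c["correct"] for c in categories.values()) / sum(c["total"] for c in categories.values())

# macro_avg: average the per-category accuracies (what the model card reports).
macro_avg = sum(c["correct"] / c["total"] for c in categories.values()) / len(categories)

print(f"micro_avg={micro_avg:.4f}, macro_avg={macro_avg:.4f}")  # they diverge when categories differ in size
```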

> I am still waiting for your instruct tasks (MMLU, MATH, GPQA) to be ready before running the final checks.

Ok sounds good, TODO:

  • Instruct Eval MMLU
  • Instruct Eval Math
  • Instruct Eval GPQA
  • Update documentation for all tasks

@aidando73 (Contributor Author) commented Nov 25, 2024

Instruct MMLU, MATH and GPQA evals

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-1B-Instruct MMLU | 0.462 | 0.485 [1] |
| 3.2-1B-Instruct MATH | 0.287 | 0.304 [1] |
| 3.2-1B-Instruct GPQA | 0.257 | 0.272 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|
| - meta_math |      1|none        |     0|exact_match|↑  |0.2872|±  |0.0064|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|acc_norm   |↑  |0.4618|±  |0.0042|
|             |       |none        |     0|exact_match|↑  |0.2872|±  |0.0064|
|             |       |strict-match|     0|exact_match|↑  |0.2567|±  |0.0207|

| Model / Task | This eval | Reported |
|----------------------|----------:|---------:|
| 3.2-3B-Instruct MMLU | 0.607 | 0.637 [1] |
| 3.2-3B-Instruct MATH | 0.451 | 0.475 [1] |
| 3.2-3B-Instruct GPQA | 0.333 | 0.328 [1] |

|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
| - meta_math |      1|none        |     0|exact_match|↑  |0.4514|±  |0.0070|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|

@aidando73 changed the title from "Add llama 3.2 mmlu eval to meta_eval harness" to "Add llama 3.2 mmlu and gpqa evals to meta_eval harness" on Nov 25, 2024
filter:
  - function: "regex"
    group_select: -1
    regex_pattern: ' ([A-Z])'
@aidando73 (Contributor Author) commented:

(histogram image)
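
For readers without the image, a minimal Python sketch of what this filter does, assuming lm-evaluation-harness regex-filter semantics (apply regex_pattern with findall, then group_select: -1 keeps the last match):

```python
import re

def extract_choice(generation: str) -> str | None:
    """Mimic filter: function=regex, regex_pattern=' ([A-Z])', group_select=-1.

    Collects every single capital letter preceded by a space and keeps the last
    one, so the final answer letter wins even if earlier chain-of-thought text
    mentions other option letters.
    """
    matches = re.findall(r" ([A-Z])", generation)
    return matches[-1] if matches else None

print(extract_choice("Options A and C look plausible, but the best answer is B"))  # -> "B"
```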

@aidando73 changed the title from "Add llama 3.2 mmlu and gpqa evals to meta_eval harness" to "Add llama 3.2 mmlu, math, gpqa evals to meta_eval harness" on Nov 25, 2024
@aidando73 force-pushed the aidand-732-add-3-2-evals_3 branch from 33a1f54 to ab1b145 on November 26, 2024 02:47
@aidando73 (Contributor Author) commented Nov 26, 2024

Regression Tests

To test for regressions, I ran:

meta-llama/Llama-3.1-8B-evals:

model_name: "meta-llama/Llama-3.1-8B"
evals_dataset: "meta-llama/Llama-3.1-8B-evals" 
|          Tasks          |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_bbh              |      1|strict-match|     0|exact_match|↑  |0.6483|±  |0.0059|
| - meta_mmlu_pro_pretrain|      1|strict-match|     0|exact_match|↑  |0.3582|±  |0.0044|
|meta_pretrain            |N/A    |strict-match|     0|exact_match|↑  |0.4601|±  |0.0035|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_pretrain|N/A    |strict-match|     0|exact_match|↑  |0.4601|±  |0.0035|

meta-llama/Llama-3.1-8B-Instruct-evals:

model_name: "meta-llama/Llama-3.1-8B-Instruct"
evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals"
|          Tasks          |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
| - meta_gpqa_cot         |      1|strict-match|     0|exact_match            |↑  |0.3304|±  |0.0222|
| - meta_ifeval           |      2|none        |     0|inst_level_loose_acc   |↑  |0.8561|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7394|±  |0.0189|
|meta_instruct            |N/A    |none        |     0|exact_match            |↑  |0.2387|±  |0.0117|
|                         |       |strict-match|     0|exact_match            |↑  |0.4628|±  |0.0045|
|                         |       |none        |     0|inst_level_loose_acc   |↑  |0.8561|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7394|±  |0.0189|
| - meta_math_hard        |      1|none        |     0|exact_match            |↑  |0.2387|±  |0.0117|
| - meta_mmlu_pro_instruct|      1|strict-match|     0|exact_match            |↑  |0.4678|±  |0.0045|

|   Groups    |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
|meta_instruct|N/A    |none        |     0|exact_match            |↑  |0.2387|±  |0.0117|
|             |       |strict-match|     0|exact_match            |↑  |0.4628|±  |0.0045|
|             |       |none        |     0|inst_level_loose_acc   |↑  |0.8561|±  |N/A   |
|             |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|             |       |none        |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|             |       |none        |     0|prompt_level_strict_acc|↑  |0.7394|±  |0.0189|

meta-llama/Llama-3.2-3B-evals:

model_name: "meta-llama/Llama-3.2-3B"
evals_dataset: "meta-llama/Llama-3.2-3B-evals"
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
| - meta_mmlu |      1|none  |     0|acc     |↑  |0.5662|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5662|±  |0.0042|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5662|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5662|±  |0.0042|

|   Groups    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|-------|------|-----:|--------|---|-----:|---|-----:|
|meta_pretrain|N/A    |none  |     0|acc     |↑  |0.5662|±  |0.0042|
|             |       |none  |     0|acc_norm|↑  |0.5662|±  |0.0042|

All working as normal

@aidando73 (Contributor Author) commented Nov 26, 2024

@wukaixingxp this PR is ready for re-review now; thank you for the feedback so far. Let me know if you want any changes.

I've run the 3.1 evaluations and everything is still working.

@wukaixingxp (Contributor) commented:

Running 3.1 8B-instruct eval log

(test_12) [[email protected] ~/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval (aidand-732-add-3-2-evals_3)]$ python prepare_meta_eval.py 
[nltk_data] Downloading package punkt_tab to /home/kaiwu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
changing ./meta_template/bbh/bbh_3shot_cot.yaml to output_path: ./work_dir/bbh/bbh_3shot_cot.yaml
changing ./meta_template/gpqa_cot/gpqa_0shot_cot.yaml to output_path: ./work_dir/gpqa_cot/gpqa_0shot_cot.yaml
changing ./meta_template/ifeval/ifeval.yaml to output_path: ./work_dir/ifeval/ifeval.yaml
changing ./meta_template/math_hard/math_4shot_cot.yaml to output_path: ./work_dir/math_hard/math_4shot_cot.yaml
changing ./meta_template/math_hard/math_hard_0shot_cot.yaml to output_path: ./work_dir/math_hard/math_hard_0shot_cot.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml
changing ./meta_template/gpqa/gpqa_0shot.yaml to output_path: ./work_dir/gpqa/gpqa_0shot.yaml
changing ./meta_template/mmlu/mmlu.yaml to output_path: ./work_dir/mmlu/mmlu.yaml
preparing the math data using Llama-3.2-3B-Instruct's evals dataset
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████| 5/5 [00:00<00:00, 98.29ba/s]
prepration for the meta-llama/Llama-3.2-3B-Instruct using meta-llama/Llama-3.2-3B-Instruct-evals is done, all saved the work_dir: ./work_dir
please use the following command to run the meta reproduce evals:
lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples 
(test_12) [[email protected] ~/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval (aidand-732-add-3-2-evals_3)]$ python prepare_meta_eval.py 
[nltk_data] Downloading package punkt_tab to /home/kaiwu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
changing ./meta_template/bbh/bbh_3shot_cot.yaml to output_path: ./work_dir/bbh/bbh_3shot_cot.yaml
changing ./meta_template/gpqa_cot/gpqa_0shot_cot.yaml to output_path: ./work_dir/gpqa_cot/gpqa_0shot_cot.yaml
changing ./meta_template/ifeval/ifeval.yaml to output_path: ./work_dir/ifeval/ifeval.yaml
changing ./meta_template/math_hard/math_4shot_cot.yaml to output_path: ./work_dir/math_hard/math_4shot_cot.yaml
changing ./meta_template/math_hard/math_hard_0shot_cot.yaml to output_path: ./work_dir/math_hard/math_hard_0shot_cot.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml
changing ./meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml to output_path: ./work_dir/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml
changing ./meta_template/gpqa/gpqa_0shot.yaml to output_path: ./work_dir/gpqa/gpqa_0shot.yaml
changing ./meta_template/mmlu/mmlu.yaml to output_path: ./work_dir/mmlu/mmlu.yaml
preparing the ifeval data using Llama-3.1-8B-Instruct's evals dataset
README.md: 100%|███████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 295kB/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00, 208.68ba/s]
preparing the math hard data using Llama-3.1-8B-Instruct's evals dataset
README.md: 100%|████████████████████████████████████████████████████████████████████| 4.27k/4.27k [00:00<00:00, 33.8MB/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████| 2/2 [00:00<00:00, 151.08ba/s]
prepration for the meta-llama/Llama-3.1-8B-Instruct using meta-llama/Llama-3.1-8B-Instruct-evals is done, all saved the work_dir: ./work_dir
please use the following command to run the meta reproduce evals:
lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples 
(test_12) [[email protected] ~/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval (aidand-732-add-3-2-evals_3)]$ lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples 
2024-11-26:14:42:13,188 INFO     [__main__.py:272] Verbosity set to INFO
2024-11-26:14:42:13,188 INFO     [__main__.py:303] Including path: /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir
2024-11-26:14:42:16,739 INFO     [__main__.py:369] Selected Tasks: ['meta_instruct']
2024-11-26:14:42:16,740 INFO     [evaluator.py:152] Setting random seed to 42 | Setting numpy seed to 42 | Setting torch manual seed to 42
2024-11-26:14:42:16,740 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Llama-3.1-8B-Instruct', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8, 'data_parallel_size': 1, 'max_model_len': 8192, 'add_bos_token': True, 'seed': 42}
INFO 11-26 14:42:21 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-26 14:42:21 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 14:42:22 selector.py:135] Using Flash Attention backend.
INFO 11-26 14:42:23 model_runner.py:1072] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 11-26 14:42:23 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.27s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.08s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:01,  1.96s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.65s/it]

INFO 11-26 14:42:30 model_runner.py:1077] Loading model weights took 14.9888 GB
INFO 11-26 14:42:31 worker.py:232] Memory profiling results: total_gpu_memory=95.00GiB initial_memory_usage=16.50GiB peak_torch_memory=16.24GiB memory_usage_post_profile=16.60GiB non_torch_memory=1.58GiB kv_cache_size=58.18GiB gpu_memory_utilization=0.80
INFO 11-26 14:42:31 gpu_executor.py:113] # GPU blocks: 29786, # CPU blocks: 2048
INFO 11-26 14:42:31 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 58.18x
INFO 11-26 14:42:32 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-26 14:42:32 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-26 14:42:51 model_runner.py:1518] Graph capturing finished in 19 secs, took 0.90 GiB
Generating train split: 541 examples [00:00, 73500.86 examples/s]
2024-11-26:14:42:53,312 WARNING  [task.py:325] [Task: meta_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:53,312 WARNING  [task.py:325] [Task: meta_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
Generating train split: 1324 examples [00:00, 100999.55 examples/s]
2024-11-26:14:42:53,635 WARNING  [task.py:325] [Task: meta_math_hard] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
Map: 100%|█████████████████████████████████████████████████████████████████| 1324/1324 [00:00<00:00, 13178.65 examples/s]
2024-11-26:14:42:53,747 WARNING  [task.py:325] [Task: meta_math_hard] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,144 WARNING  [task.py:325] [Task: meta_gpqa_cot] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,150 WARNING  [task.py:325] [Task: meta_gpqa_cot] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,900 WARNING  [task.py:325] [Task: meta_mmlu_pro_instruct] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:55,920 WARNING  [task.py:325] [Task: meta_mmlu_pro_instruct] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,237 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:14:42:57,242 INFO     [task.py:411] Building contexts for meta_mmlu_pro_instruct on rank 0...
100%|███████████████████████████████████████████████████████████████████████████| 12032/12032 [00:00<00:00, 95751.93it/s]
2024-11-26:14:42:58,395 INFO     [task.py:411] Building contexts for meta_gpqa_cot on rank 0...
100%|██████████████████████████████████████████████████████████████████████████████| 448/448 [00:00<00:00, 121818.36it/s]
2024-11-26:14:42:58,438 INFO     [task.py:411] Building contexts for meta_math_hard on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████| 1324/1324 [00:00<00:00, 93005.38it/s]
2024-11-26:14:42:58,536 INFO     [task.py:411] Building contexts for meta_ifeval on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████| 541/541 [00:00<00:00, 85170.73it/s]
2024-11-26:14:42:58,616 INFO     [evaluator.py:438] Running generate_until requests
Running generate_until requests:   0%|                                                         | 0/14345 [00:00<?, ?it/sWARNING 11-26 14:43:40 scheduler.py:1481] Sequence group 182 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 100%|█| 12032/12032 [20:54<00:00,  9.59it/s, est. speed input: 15218.54 toks/s, output: 2338.43 toks/s
Processed prompts: 100%|█████| 448/448 [01:02<00:00,  7.18it/s, est. speed input: 2388.75 toks/s, output: 4377.03 toks/s]
Running generate_until requests:  87%|███████████████████████████████████████▏     | 12480/14345 [22:15<02:26, 12.77it/sWARNING 11-26 15:08:03 scheduler.py:1481] Sequence group 13384 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
                                                                                                                        WARNING 11-26 15:08:34 scheduler.py:1481] Sequence group 13143 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101
                                                                                                                        WARNING 11-26 15:09:34 scheduler.py:1481] Sequence group 13509 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=151
                                                                                                                        WARNING 11-26 15:09:42 scheduler.py:1481] Sequence group 13500 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=201
Processed prompts: 100%|████| 1324/1324 [07:14<00:00,  3.05it/s, est. speed input: 676.72 toks/s, output: 3722.89 toks/s]
Processed prompts: 100%|██████| 541/541 [00:33<00:00, 16.06it/s, est. speed input: 898.17 toks/s, output: 5502.61 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████| 14345/14345 [29:47<00:00,  8.02it/s]
2024-11-26:15:13:29,083 INFO     [evaluation_tracker.py:182] Saving results aggregated
2024-11-26:15:13:29,092 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_gpqa_cot
2024-11-26:15:13:29,127 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_ifeval
2024-11-26:15:13:29,166 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_math_hard
2024-11-26:15:13:29,299 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_mmlu_pro_instruct
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|          Tasks          |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
| - meta_gpqa_cot         |      1|strict-match|     0|exact_match            |↑  |0.3103|±  |0.0219|
| - meta_ifeval           |      2|none        |     0|inst_level_loose_acc   |↑  |0.8537|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7948|±  |0.0174|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7431|±  |0.0188|
|meta_instruct            |N/A    |none        |     0|exact_match            |↑  |0.2455|±  |0.0118|
|                         |       |strict-match|     0|exact_match            |↑  |0.4648|±  |0.0045|
|                         |       |none        |     0|inst_level_loose_acc   |↑  |0.8537|±  |N/A   |
|                         |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|                         |       |none        |     0|prompt_level_loose_acc |↑  |0.7948|±  |0.0174|
|                         |       |none        |     0|prompt_level_strict_acc|↑  |0.7431|±  |0.0188|
| - meta_math_hard        |      1|none        |     0|exact_match            |↑  |0.2455|±  |0.0118|
| - meta_mmlu_pro_instruct|      1|strict-match|     0|exact_match            |↑  |0.4706|±  |0.0046|

|   Groups    |Version|   Filter   |n-shot|        Metric         |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------------------|---|-----:|---|------|
|meta_instruct|N/A    |none        |     0|exact_match            |↑  |0.2455|±  |0.0118|
|             |       |strict-match|     0|exact_match            |↑  |0.4648|±  |0.0045|
|             |       |none        |     0|inst_level_loose_acc   |↑  |0.8537|±  |N/A   |
|             |       |none        |     0|inst_level_strict_acc  |↑  |0.8165|±  |N/A   |
|             |       |none        |     0|prompt_level_loose_acc |↑  |0.7948|±  |0.0174|
|             |       |none        |     0|prompt_level_strict_acc|↑  |0.7431|±  |0.0188|

[rank0]:[W1126 15:13:32.278660927 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

@wukaixingxp (Contributor) commented:

Running 3B-instruct eval log

(aidand-732-add-3-2-evals_3)]$ CUDA_VISIBLE_DEVICES=4 lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.3,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir --seed 42  --log_samples
2024-11-26:13:58:22,401 INFO     [__main__.py:272] Verbosity set to INFO
2024-11-26:13:58:22,401 INFO     [__main__.py:303] Including path: /home/kaiwu/work/to_merge/llama-recipes/tools/benchmarks/llm_eval_harness/meta_eval/work_dir
2024-11-26:13:58:25,811 INFO     [__main__.py:369] Selected Tasks: ['meta_instruct']
2024-11-26:13:58:25,812 INFO     [evaluator.py:152] Setting random seed to 42 | Setting numpy seed to 42 | Setting torch manual seed to 42
2024-11-26:13:58:25,812 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-3B-Instruct', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.3, 'data_parallel_size': 1, 'max_model_len': 8192, 'add_bos_token': True, 'seed': 42}
INFO 11-26 13:58:30 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-26 13:58:30 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=meta-llama/Llama-3.2-3B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 13:58:31 selector.py:135] Using Flash Attention backend.
INFO 11-26 13:58:32 model_runner.py:1072] Starting to load model meta-llama/Llama-3.2-3B-Instruct...
INFO 11-26 13:58:32 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.77it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.71it/s]

INFO 11-26 13:58:34 model_runner.py:1077] Loading model weights took 6.0160 GB
INFO 11-26 13:58:34 worker.py:232] Memory profiling results: total_gpu_memory=95.00GiB initial_memory_usage=6.65GiB peak_torch_memory=7.26GiB memory_usage_post_profile=6.75GiB non_torch_memory=0.70GiB kv_cache_size=20.55GiB gpu_memory_utilization=0.30
INFO 11-26 13:58:34 gpu_executor.py:113] # GPU blocks: 12022, # CPU blocks: 2340
INFO 11-26 13:58:34 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 23.48x
INFO 11-26 13:58:36 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-26 13:58:36 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-26 13:58:56 model_runner.py:1518] Graph capturing finished in 20 secs, took 0.79 GiB
2024-11-26:13:58:58,381 WARNING  [task.py:325] [Task: meta_mmlu] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:58:58,388 WARNING  [task.py:325] [Task: meta_mmlu] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:58:59,409 WARNING  [task.py:325] [Task: meta_math] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:58:59,414 WARNING  [task.py:325] [Task: meta_math] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:59:01,229 WARNING  [task.py:325] [Task: meta_gpqa] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:59:01,237 WARNING  [task.py:325] [Task: meta_gpqa] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-11-26:13:59:01,289 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:13:59:01,289 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:13:59:01,289 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-11-26:13:59:01,293 INFO     [task.py:411] Building contexts for meta_gpqa on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████| 448/448 [00:00<00:00, 72060.45it/s]
2024-11-26:13:59:01,348 INFO     [task.py:411] Building contexts for meta_math on rank 0...
100%|████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 140966.05it/s]
2024-11-26:13:59:02,005 INFO     [task.py:411] Building contexts for meta_mmlu on rank 0...
100%|███████████████████████████████████████████████████████████████████████████| 14042/14042 [00:00<00:00, 27156.49it/s]
2024-11-26:13:59:03,316 INFO     [evaluator.py:438] Running generate_until requests
Processed prompts: 100%|████| 448/448 [00:01<00:00, 262.04it/s, est. speed input: 70966.06 toks/s, output: 786.13 toks/s]
Running generate_until requests:   0%|                                                | 1/5448 [00:01<2:37:53,  1.74s/itWARNING 11-26 13:59:21 scheduler.py:1481] Sequence group 529 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 100%|██| 5000/5000 [08:34<00:00,  9.73it/s, est. speed input: 18569.26 toks/s, output: 3308.49 toks/s]
Running generate_until requests: 100%|███████████████████████████████████████████████| 5448/5448 [08:37<00:00, 10.53it/s]
2024-11-26:14:07:52,924 INFO     [evaluator.py:438] Running loglikelihood requests
Processed prompts: 100%|██| 56168/56168 [18:08<00:00, 51.62it/s, est. speed input: 35382.79 toks/s, output: 51.62 toks/s]
Running loglikelihood requests: 100%|██████████████████████████████████████████████| 56168/56168 [18:35<00:00, 50.37it/s]
2024-11-26:14:29:08,593 INFO     [evaluation_tracker.py:182] Saving results aggregated
2024-11-26:14:29:08,607 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_gpqa
2024-11-26:14:29:08,630 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_math
2024-11-26:14:29:09,225 INFO     [evaluation_tracker.py:258] Saving per-sample results for: meta_mmlu
vllm (pretrained=meta-llama/Llama-3.2-3B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.3,data_parallel_size=1,max_model_len=8192,add_bos_token=True,seed=42), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto
|    Tasks    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
| - meta_gpqa |      1|strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|
| - meta_math |      1|none        |     0|exact_match|↑  |0.4514|±  |0.0070|
| - meta_mmlu |      1|none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|

|   Groups    |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|-------------|-------|------------|-----:|-----------|---|-----:|---|-----:|
|meta_instruct|N/A    |none        |     0|acc        |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|acc_norm   |↑  |0.6065|±  |0.0041|
|             |       |none        |     0|exact_match|↑  |0.4514|±  |0.0070|
|             |       |strict-match|     0|exact_match|↑  |0.3326|±  |0.0223|

[rank0]:[W1126 14:29:13.514981629 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())


- **Tasks for pretrained models**: BBH and MMLU-Pro
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
These tasks are common evalutions, many of which overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Contributor commented:

small typo: evalutions -> evaluations

@wukaixingxp (Contributor) left a review:

Thank you so much for this great PR. I have tested it with the 8B and 3B instruct evals and the results look good. I will merge it now.

@wukaixingxp merged commit 7ebd68c into meta-llama:main on Nov 26, 2024 (2 of 4 checks passed)
@wukaixingxp self-assigned this on Dec 4, 2024