
meta-llama · Nov 23, 2024 · b6a7161
1 parent 625a846
Showing 1 changed file with 8 additions and 4 deletions.
diff --git a/tools/benchmarks/llm_eval_harness/meta_eval/README.md b/tools/benchmarks/llm_eval_harness/meta_eval/README.md
@@ -43,10 +43,14 @@ It is recommended to read the dataset card to understand the meaning of each col
 
 ### Task Selection
 
-Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), here we will focus on tasks that overlap with the popular Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) as shown in the following:
-
-- **Tasks for pretrained models**: BBH and MMLU-Pro
-- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
+Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), a subset of tasks is chosen:
+
+- **Tasks for 3.1 pretrained models**: BBH and MMLU-Pro
+  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+- **Tasks for 3.2 pretrained models**: MMLU
+  - MMLU is a common eval, and is the first one shown on the [Llama website](https://llama.com)
+- **Tasks for 3.1 instruct models**: Math-Hard, IFEval, GPQA, and MMLU-Pro
+  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
 
 Here, we aim to get the benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and get more eval metrics.