diff --git a/tools/benchmarks/llm_eval_harness/meta_eval/README.md b/tools/benchmarks/llm_eval_harness/meta_eval/README.md
index 3388085b2..08a7973e6 100644
--- a/tools/benchmarks/llm_eval_harness/meta_eval/README.md
+++ b/tools/benchmarks/llm_eval_harness/meta_eval/README.md
@@ -43,10 +43,14 @@ It is recommended to read the dataset card to understand the meaning of each col
 
 ### Task Selection
 
-Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), here we will focus on tasks that overlap with the popular Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) as shown in the following:
-
-- **Tasks for pretrained models**: BBH and MMLU-Pro
-- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
+Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), a subset of tasks is chosen:
+
+- **Tasks for 3.1 pretrained models**: BBH and MMLU-Pro
+  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+- **Tasks for 3.2 pretrained models**: MMLU
+  - MMLU is a common eval, and is the first one shown on the [llama website](https://llama.com)
+- **Tasks for 3.1 instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
+  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
 
 Here, we aim to get the benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and get more eval metrics.
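
For context, a minimal sketch of what running the stock leaderboard tasks for an instruct model looks like with the lm-evaluation-harness CLI, before the Meta eval-prompt modifications the README goes on to describe are applied. The model name and output path below are placeholders, not values taken from the patch:

```bash
# Sketch only: unmodified Open LLM Leaderboard v2 task groups via lm_eval.
# The README's instructions further customize prompts and metrics on top of this.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks leaderboard_ifeval,leaderboard_math_hard,leaderboard_gpqa,leaderboard_mmlu_pro \
  --batch_size auto \
  --output_path ./eval_results
```

For a pretrained model, the task list would instead be `leaderboard_bbh,leaderboard_mmlu_pro` (3.1) or `mmlu` (3.2).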