aidando73 committed Nov 23, 2024
1 parent 625a846 commit b6a7161
Showing 1 changed file with 8 additions and 4 deletions.
12 changes: 8 additions & 4 deletions tools/benchmarks/llm_eval_harness/meta_eval/README.md
@@ -43,10 +43,14 @@ It is recommended to read the dataset card to understand the meaning of each col

### Task Selection

Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), here we will focus on tasks that overlap with the popular Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) as shown in the following:

- **Tasks for pretrained models**: BBH and MMLU-Pro
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
Given the extensive number of tasks available (12 for pretrained models and 30 for instruct models), a subset of tasks is chosen:

- **Tasks for 3.1 pretrained models**: BBH and MMLU-Pro
  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
- **Tasks for 3.2 pretrained models**: MMLU
  - MMLU is a common eval, and is the first one shown on the [llama website](https://llama.com)
- **Tasks for 3.1 instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro
  - Chosen as they overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Here, we aim to get the benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and get more eval metrics.
