
Can you provide some details of the evaluation code for the reported results? #1

Open
nomadlx opened this issue May 24, 2023 · 1 comment

nomadlx commented May 24, 2023

This benchmark also appears to consist of multiple-choice tasks similar to MMLU, but there are many small differences in how MMLU is evaluated in practice, and those differences have a significant impact on the reported accuracy. MMLU's own accuracy computation normalizes the probabilities of the four options and selects the highest-probability option as the prediction. Many others instead have the model generate free-form text and extract the A/B/C/D option from the generated answer; in that case, both the prompt design and the answer-extraction method affect the final result.

So what are the evaluation-code details behind the results table reported in your repository?
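For reference, the first method mentioned above (MMLU-style probability normalization) can be sketched as follows. This is a minimal illustration with made-up scores standing in for real model log-probabilities; `normalized_choice` is a hypothetical helper, not code from this repository:

```python
import math

def normalized_choice(option_logprobs):
    """Pick an answer by softmax-normalizing per-option log-probabilities.

    option_logprobs: dict mapping each label ("A".."D") to the model's
    log-probability of that label as the next token (dummy values here).
    """
    # Softmax over the four options (subtract the max for numerical stability).
    m = max(option_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    total = sum(exps.values())
    probs = {k: exps[k] / total for k in option_logprobs}
    # The prediction is the option with the highest normalized probability.
    return max(probs, key=probs.get), probs

pred, probs = normalized_choice({"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.8})
# pred == "B"
```

With real models, each log-probability would come from scoring the option letter as the next token after the question prompt.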

@cordercorder (Collaborator) commented:

Thank you for your interest in the M3KE dataset. As you mentioned, different evaluation methods can lead to significant differences in experimental results. We first attempted to select the label with the highest probability among the four choices as the final answer. However, we found that LLMs such as BLOOM-7b1 typically choose only one label, even across different questions, in both zero- and few-shot scenarios. As a result, we decided to extract A/B/C/D from the model generation and keep the maximum generation length as short as possible. If more than one label is present in the generation, we consider the model's answer incorrect.

We plan to make the questions of the M3KE dataset publicly available before the end of June. We would greatly appreciate it if you used the M3KE dataset.
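The extraction rule described above (a single label must appear in the generation, otherwise the answer counts as incorrect) can be sketched like this. The regex and function name are assumptions for illustration only, not the repository's actual evaluation code:

```python
import re

def grade_generation(generation, gold):
    """Grade a short model generation by extracting option labels A-D.

    Assumed rule: if the generation contains exactly one distinct label,
    compare it to the gold answer; if it contains zero labels or more
    than one distinct label, count the answer as incorrect.
    """
    labels = set(re.findall(r"\b([ABCD])\b", generation))
    if len(labels) != 1:
        return False  # missing or ambiguous label -> incorrect
    return labels.pop() == gold

grade_generation("The answer is B.", "B")  # True
grade_generation("A or B", "B")            # False: two labels present
```

Keeping the maximum generation length short, as the authors describe, reduces the chance of the model emitting multiple labels and tripping the ambiguity rule.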
