fix: Allow aggregated tasks within benchmarks #1771
base: main
Conversation
- Updated task filtering, adding `exclusive_language_filter` and `hf_subset`
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# was eq. to:
task = mteb.get_task("STS22", languages=["eng"])
task.hf_subsets
# current filtering to English datasets:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# However, it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# eq. to:
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using the exclusive_language_filter
# (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```
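To make the intended semantics concrete, here is a minimal sketch of an exclusive language filter (the helper below is illustrative, not mteb's internal implementation):

```py
# Illustrative sketch, not mteb internals: keep only hf_subsets whose
# languages are ALL contained in the requested language set.
def exclusive_language_filter(
    subset_langs: dict[str, list[str]], languages: list[str]
) -> list[str]:
    return [
        subset
        for subset, langs in subset_langs.items()
        if all(lang.split("-")[0] in languages for lang in langs)
    ]

subsets = {
    "en": ["eng-Latn"],
    "de-en": ["deu-Latn", "eng-Latn"],  # cross-lingual split: contains German
}
print(exclusive_language_filter(subsets, ["eng"]))  # ['en']
```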
```diff
@@ -334,6 +336,15 @@ def _check_language_code(code):
             f"Invalid script code: {script}, you can find valid ISO 15924 codes in {path_to_lang_scripts}"
         )

+    @property
+    def bcp47_codes(self) -> list[ISO_LANGUAGE_SCRIPT]:
```
Why did you introduce a new method for filtering languages?
It is not a new method; it is a method for fetching languages in the BCP-47 format (eng-Latn as opposed to eng). It is used to compute the eval languages for the aggregated task (using just the language code breaks the tests).
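For context, a minimal sketch of the distinction (the helper names here are illustrative, not mteb's actual API):

```py
# Illustrative helpers, not mteb's actual API.
def bcp47_codes(eval_langs: list[str]) -> list[str]:
    """Keep the full language-script codes, e.g. 'eng-Latn'."""
    return sorted(set(eval_langs))

def language_codes(eval_langs: list[str]) -> list[str]:
    """Keep only the ISO 639-3 part, e.g. 'eng'; the script is lost."""
    return sorted({code.split("-")[0] for code in eval_langs})

langs = ["eng-Latn", "deu-Latn", "eng-Latn"]
print(bcp47_codes(langs))     # ['deu-Latn', 'eng-Latn']
print(language_codes(langs))  # ['deu', 'eng']
```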
Maybe we need to standardize how we specify languages (#1822), as the current approach is a bit problematic (#1821).
"CQADupstackUnixRetrieval", | ||
"CQADupstackWebmastersRetrieval", | ||
"CQADupstackWordpressRetrieval", | ||
"CQADupstackRetrieval", |
Maybe BEIR can be added now?
Isn't BEIR just a subset of MTEB(eng, classic)? Any reason not to simply use the retrieval score for MTEB(eng, classic)?
Yes, but some research still evaluates on BEIR, such as ModernBert. To simplify things, we could add it, as it only requires the benchmark object and could be helpful for (re)evaluating results.
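A rough sketch of what that could look like (the import path, field values, and task list are assumptions for illustration, not a final definition):

```py
import mteb
from mteb.benchmarks import Benchmark  # import path is an assumption

# Task names below are a small illustrative subset, not the full BEIR suite.
BEIR = Benchmark(
    name="BEIR",
    tasks=mteb.get_tasks(tasks=["ArguAna", "SciFact", "NFCorpus"]),
    description="Original BEIR retrieval benchmark for (re)evaluating results.",
)
```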
Yeah, I think @Samoed has a point here. It would also make researchers' work easier if they had it available out of the box.
Thanks for the good work! I left a couple of comments on some things that cause or might cause errors.
```py
tasks: list[AbsTask]
main_score: str
type: Literal["aggregate-task"] = "aggregate-task"
```
This breaks task types on the leaderboard, and also the `TASK_TYPES` type definition. Can't we either make it a property or force people to specify a task type when they introduce an aggregate task?
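A sketch of the two alternatives being suggested (class and field names are illustrative, not the PR's actual code):

```py
from dataclasses import dataclass, field

# Option 1: force callers to state the underlying task type explicitly.
@dataclass
class AggregateTaskMetadata:
    main_score: str
    type: str  # e.g. "Retrieval"; no "aggregate-task" default

# Option 2: derive the type from the aggregated sub-tasks via a property.
@dataclass
class AggregateTask:
    tasks: list = field(default_factory=list)  # underlying AbsTask instances

    @property
    def type(self) -> str:
        # Report the sub-tasks' type so leaderboard TASK_TYPES filtering
        # keeps working (assumes all sub-tasks share one type).
        return self.tasks[0].metadata.type
```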
Fixes #1231
Addresses #1763 (though we need a fix for the results repo)

Did quite a few refactors here. I'm not at all settled that this is the right representation, but it is at least much better than where we started.

We will still need to combine the CQADupstack scores on embedding-benchmark/results. @x-tabdeveloping let me know if this works on the leaderboard end (I believe it should).
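For illustration, combining the sub-task scores could be as simple as averaging their main scores (the numbers and layout below are made up, not the results repo's actual schema):

```py
# Made-up scores for illustration; the results repo schema may differ.
cqadupstack_scores = {
    "CQADupstackUnixRetrieval": 0.42,
    "CQADupstackWebmastersRetrieval": 0.39,
    "CQADupstackWordpressRetrieval": 0.35,
}
aggregate = sum(cqadupstack_scores.values()) / len(cqadupstack_scores)
print(f"CQADupstackRetrieval: {aggregate:.4f}")  # mean of sub-task main scores
```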
Checklist
- Run tests with `make test`.
- Run lint with `make lint`.