
fix: Allow aggregated tasks within benchmarks #1771

Open · wants to merge 24 commits into main
Conversation

@KennethEnevoldsen (Contributor) commented Jan 11, 2025

Fixes #1231
Addresses #1763 (though we need a fix for the results repo)

- added AbsTaskAggregated
- added CQADupstackRetrieval
- updated mteb(eng, classic) to use CQADupstackRetrieval instead of its subtasks
- refactor

I did quite a few refactors here. I'm not at all settled that this is the right representation, but it is at least much better than where we started.
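
For reference, here is a minimal sketch of how the aggregated CQADupstackRetrieval task would be run once this is merged. The model name is only an example, and the snippet assumes the existing public API (mteb.get_task, mteb.get_model, mteb.MTEB) works unchanged for aggregated tasks:

```py
import mteb

# Sketch only: "CQADupstackRetrieval" is the aggregated task added in this PR;
# once merged it should be fetchable like any other task.
task = mteb.get_task("CQADupstackRetrieval")

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # example model
evaluation = mteb.MTEB(tasks=[task])
results = evaluation.run(model)  # the 12 CQADupstack subtasks are scored as one task
```

The point of the aggregation is that the twelve CQADupstack subtasks are scored as a single task, which is what MTEB(eng, classic) now references.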

We will still need to combine the CQADupstack scores on embedding-benchmark/results.

@x-tabdeveloping let me know if this works on the leaderboard end (I believe it should)

Checklist

- Run tests locally to make sure nothing is broken using `make test`.
- Run the formatter to format the code using `make lint`.

- Updated task filtering, adding exclusive_language_filter and hf_subset
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# was equivalent to:
task = mteb.get_task("STS22", languages=["eng"])
task.hf_subsets
# the filtering keeps every subset that contains English:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# however, for MTEB(eng, classic) it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# equivalent to:
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using the exclusive_language_filter (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```
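
Presumably the same flag also applies when selecting several tasks at once through mteb.get_tasks; a small sketch, assuming the flag introduced in this PR sits alongside the existing `tasks` and `languages` parameters:

```py
import mteb

# Sketch: select several tasks and keep only the monolingual English subsets.
# `exclusive_language_filter` is the flag added in this PR; `tasks` and
# `languages` are the existing get_tasks parameters.
tasks = mteb.get_tasks(
    tasks=["STS22", "STS17"],
    languages=["eng"],
    exclusive_language_filter=True,
)
```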
Resolved review threads: mteb/abstasks/AbsTask.py, mteb/abstasks/aggregate_task_metadata.py, mteb/evaluation/MTEB.py (2 threads)
```diff
@@ -334,6 +336,15 @@ def _check_language_code(code):
     f"Invalid script code: {script}, you can find valid ISO 15924 codes in {path_to_lang_scripts}"
 )

+    @property
+    def bcp47_codes(self) -> list[ISO_LANGUAGE_SCRIPT]:
```
Collaborator:

Why did you introduce a new method for filtering languages?

Contributor Author (@KennethEnevoldsen):

It is not a new method; it is a method for fetching languages in the BCP 47 format (eng-Latn as opposed to eng). It is used to compute the eval languages for the aggregated task (using just the language code breaks the tests).
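
To make the distinction concrete, here is a rough, self-contained sketch of what such a property could compute. It is not the implementation from this PR, and it assumes eval_langs is either a list of BCP 47 codes or a dict mapping hf_subset names to such lists, as in TaskMetadata:

```py
from __future__ import annotations


class TaskMetadataSketch:
    """Illustrative stand-in; the real class is mteb's TaskMetadata."""

    def __init__(self, eval_langs: list[str] | dict[str, list[str]]) -> None:
        self.eval_langs = eval_langs

    @property
    def bcp47_codes(self) -> list[str]:
        # Languages as BCP 47 codes (e.g. "eng-Latn"), not bare ISO 639-3 ("eng").
        if isinstance(self.eval_langs, dict):
            # multilingual/cross-lingual tasks map hf_subset -> list of codes
            codes = {c for subset in self.eval_langs.values() for c in subset}
        else:
            codes = set(self.eval_langs)
        return sorted(codes)


# e.g. a cross-lingual task such as STS22:
meta = TaskMetadataSketch({"en": ["eng-Latn"], "de-en": ["deu-Latn", "eng-Latn"]})
print(meta.bcp47_codes)  # ['deu-Latn', 'eng-Latn']
```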

Collaborator:

Maybe we need to standardize how we specify languages #1822, as the current approach is a bit problematic #1821 (comment)

"CQADupstackUnixRetrieval",
"CQADupstackWebmastersRetrieval",
"CQADupstackWordpressRetrieval",
"CQADupstackRetrieval",
Collaborator:

Maybe BEIR can be added now?

Contributor Author (@KennethEnevoldsen):

Isn't BEIR just a subset of MTEB(eng, classic)? Any reason not to simply use the retrieval score for MTEB(eng, classic)?

Collaborator:

Yes, but some research still evaluates on BEIR, such as ModernBERT. To simplify things, we could add it, as it only requires the benchmark object and could be helpful for (re)evaluating results.

Collaborator:

Yeah, I think @Samoed has a point here. It would also make researchers' work easier if they had it available out of the box.
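
If BEIR were added, it would presumably follow the pattern of the existing Benchmark objects in mteb.benchmarks; below is a sketch with only a handful of the BEIR retrieval tasks listed (the full task list and metadata would still need to be filled in):

```py
import mteb
from mteb.benchmarks import Benchmark

# Sketch only: a partial task list, to show the shape of the definition.
BEIR = Benchmark(
    name="BEIR",
    tasks=mteb.get_tasks(
        tasks=["ArguAna", "ClimateFEVER", "FEVER", "NFCorpus", "SciFact"],
    ),
    description="Retrieval tasks corresponding to the original BEIR benchmark.",
)
```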

Collaborator @x-tabdeveloping left a review:

Thanks for the good work! I left a couple of comments on some things that cause or might cause errors.


```py
tasks: list[AbsTask]
main_score: str
type: Literal["aggregate-task"] = "aggregate-task"
```
Collaborator:

This breaks task types on the leaderboard, and also the TASK_TYPES type definition. Can't we either make it a property or force people to specify a task type when they introduce an aggregate task?
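
One way to read the second suggestion: instead of hard-coding type to "aggregate-task", the aggregate metadata could require an explicit, existing task type. A sketch only; the class and field names below are illustrative, not the PR's actual definitions:

```py
from pydantic import BaseModel


# Illustrative stand-in for the aggregate task metadata discussed above;
# the real class lives in mteb/abstasks/aggregate_task_metadata.py and differs.
class AggregateTaskMetadataSketch(BaseModel):
    name: str
    main_score: str
    # Requiring an explicit, existing task type (e.g. "Retrieval") instead of a
    # new "aggregate-task" literal keeps TASK_TYPES and the leaderboard intact.
    type: str


cqad = AggregateTaskMetadataSketch(
    name="CQADupstackRetrieval",
    main_score="ndcg_at_10",
    type="Retrieval",
)
```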

Resolved review threads: mteb/abstasks/aggregated_task.py (outdated), mteb/evaluation/MTEB.py