fix: Allow aggregated tasks within benchmarks #1771
base: main
Conversation
- Updated task filtering, adding `exclusive_language_filter` and `hf_subset`
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# was eq. to:
task = mteb.get_task("STS22", languages=["eng"])
task.hf_subsets
# current filtering to English datasets:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# However, it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# eq. to:
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using the exclusive_language_filter
# (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```
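To make the intended semantics concrete, here is a minimal sketch of an exclusive language filter (the helper below is illustrative, not mteb's internal implementation):

```py
# Illustrative sketch, not mteb internals: keep only hf_subsets whose
# languages are ALL contained in the requested language set.
def exclusive_language_filter(
    subset_langs: dict[str, list[str]], languages: list[str]
) -> list[str]:
    return [
        subset
        for subset, langs in subset_langs.items()
        if all(lang.split("-")[0] in languages for lang in langs)
    ]

subsets = {
    "en": ["eng-Latn"],
    "de-en": ["deu-Latn", "eng-Latn"],  # cross-lingual split: contains German
}
print(exclusive_language_filter(subsets, ["eng"]))  # ['en']
```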
```diff
@@ -334,6 +336,15 @@ def _check_language_code(code):
             f"Invalid script code: {script}, you can find valid ISO 15924 codes in {path_to_lang_scripts}"
         )

+    @property
+    def bcp47_codes(self) -> list[ISO_LANGUAGE_SCRIPT]:
```
Why did you introduce a new method for filtering languages?
It is not a new method; it is a method for fetching languages in the BCP-47 format (eng-Latn as opposed to eng). It is used to compute the eval languages for the aggregated task (using just the language code breaks the tests).
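For context, a minimal sketch of the distinction (the helper names here are illustrative, not mteb's actual API):

```py
# Illustrative helpers, not mteb's actual API.
def bcp47_codes(eval_langs: list[str]) -> list[str]:
    """Keep the full language-script codes, e.g. 'eng-Latn'."""
    return sorted(set(eval_langs))

def language_codes(eval_langs: list[str]) -> list[str]:
    """Keep only the ISO 639-3 part, e.g. 'eng'; the script is lost."""
    return sorted({code.split("-")[0] for code in eval_langs})

langs = ["eng-Latn", "deu-Latn", "eng-Latn"]
print(bcp47_codes(langs))     # ['deu-Latn', 'eng-Latn']
print(language_codes(langs))  # ['deu', 'eng']
```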
Maybe we need to standardize how we specify languages (#1822), as the current approach is a bit problematic (#1821).
"CQADupstackUnixRetrieval", | ||
"CQADupstackWebmastersRetrieval", | ||
"CQADupstackWordpressRetrieval", | ||
"CQADupstackRetrieval", |
Maybe BEIR can be added now?
Isn't BEIR just a subset of MTEB(eng, classic)? Any reason not to simply use the retrieval score for MTEB(eng, classic)?
Yes, but some research still evaluates on BEIR, such as ModernBert. To simplify things, we could add it, as it only requires the benchmark object and could be helpful for (re)evaluating results.
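A rough sketch of what that could look like (the import path, field values, and task list are assumptions for illustration, not a final definition):

```py
import mteb
from mteb.benchmarks import Benchmark  # import path is an assumption

# Task names below are a small illustrative subset, not the full BEIR suite.
BEIR = Benchmark(
    name="BEIR",
    tasks=mteb.get_tasks(tasks=["ArguAna", "SciFact", "NFCorpus"]),
    description="Original BEIR retrieval benchmark for (re)evaluating results.",
)
```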
Yeah, I think @Samoed has a point here. It would also make researchers' work easier if they had it available out of the box.
Thanks for the good work! I left a couple of comments on some things that cause or might cause errors.
```py
tasks: list[AbsTask]
main_score: str
type: Literal["aggregate-task"] = "aggregate-task"
```
This breaks task types on the leaderboard, and also the `TASK_TYPES` type definition. Can't we either make it a property or force people to specify a task type when they introduce an aggregate task?
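A sketch of the two alternatives being suggested (class and field names are illustrative, not the PR's actual code):

```py
from dataclasses import dataclass, field

# Option 1: force callers to state the underlying task type explicitly.
@dataclass
class AggregateTaskMetadata:
    main_score: str
    type: str  # e.g. "Retrieval"; no "aggregate-task" default

# Option 2: derive the type from the aggregated sub-tasks via a property.
@dataclass
class AggregateTask:
    tasks: list = field(default_factory=list)  # underlying AbsTask instances

    @property
    def type(self) -> str:
        # Report the sub-tasks' type so leaderboard TASK_TYPES filtering
        # keeps working (assumes all sub-tasks share one type).
        return self.tasks[0].metadata.type
```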
Fixes #1231
Addresses #1763 (though we need a fix for the results repo)

Did quite a few refactors here. I'm not at all settled that this is the right representation, but it is at least much better than where we started.

We will still need to combine the CQADupstack scores on embedding-benchmark/results. @x-tabdeveloping let me know if this works on the leaderboard end (I believe it should).
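For illustration, combining the sub-task scores could be as simple as averaging their main scores (the numbers and layout below are made up, not the results repo's actual schema):

```py
# Made-up scores for illustration; the results repo schema may differ.
cqadupstack_scores = {
    "CQADupstackUnixRetrieval": 0.42,
    "CQADupstackWebmastersRetrieval": 0.39,
    "CQADupstackWordpressRetrieval": 0.35,
}
aggregate = sum(cqadupstack_scores.values()) / len(cqadupstack_scores)
print(f"CQADupstackRetrieval: {aggregate:.4f}")  # mean of sub-task main scores
```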
Checklist
- Run tests with `make test`.
- Run lint with `make lint`.