
Add FaMTEB (Farsi/Persian Text Embedding Benchmark) #1843

Open · mehran-sarmadi wants to merge 18 commits into main
Conversation


@mehran-sarmadi mehran-sarmadi commented Jan 20, 2025

We are a research team from Sharif University of Technology and MCINext Company developing a text embedding benchmark for the Persian language based on MTEB. So far, we have gathered around 63 datasets spanning 7 tasks (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, and Summary Retrieval), including a mix of existing, translated, and newly generated datasets. Notably, we are introducing the Summary Retrieval task for the first time; it focuses on identifying the correct summary of a paragraph from a set of candidates. We have also evaluated several Persian language models, as well as multilingual text embedding models that support Persian, on this benchmark.
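To make the Summary Retrieval protocol concrete, here is a minimal, self-contained sketch of the scoring step, assuming precomputed embeddings. The helper name `summary_retrieval_accuracy` and the toy vectors are illustrative stand-ins, not part of FaMTEB or mteb.

```python
import numpy as np

def summary_retrieval_accuracy(text_emb: np.ndarray, summary_emb: np.ndarray) -> float:
    """For each paragraph embedding, pick the candidate summary with the
    highest cosine similarity; row i's gold summary is summary i."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    sim = t @ s.T                      # pairwise cosine similarities
    pred = sim.argmax(axis=1)          # closest summary per paragraph
    return float((pred == np.arange(len(t))).mean())

# Toy embeddings: paragraph i is closest to summary i, so accuracy is perfect.
texts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
summaries = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.2]])
print(summary_retrieval_accuracy(texts, summaries))  # prints 1.0
```

In practice the embeddings would come from the model under evaluation rather than hand-written vectors.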

We have also opened a related PR for the results and leaderboard tab, and we are finalizing a paper on this work, which will be published in the near future.

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
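The subsampling step mentioned in the checklist can be illustrated with a small sketch. This mirrors the idea behind mteb's stratified_subsampling() (label-aware, proportional capping of a split, with 2048 matching the threshold above), but it is a hypothetical re-implementation, not mteb's actual code.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, n_samples=2048, seed=42):
    """Illustrative only: keep each label's share of the data while
    capping the split at roughly n_samples examples."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    rng = random.Random(seed)
    out = []
    for label, group in by_label.items():
        # Sample each label group proportionally to its share of the data.
        k = max(1, round(n_samples * len(group) / len(examples)))
        out.extend(rng.sample(group, min(k, len(group))))
    return out

data = [{"label": i % 2, "text": f"doc {i}"} for i in range(5000)]
sub = stratified_subsample(data)
print(len(sub))  # prints 2048
```

Because the two toy labels are balanced, each contributes half of the capped split; skewed label distributions keep their proportions.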

@mehran-sarmadi mehran-sarmadi marked this pull request as ready for review January 25, 2025 14:41
@Samoed Samoed (Collaborator) left a comment:
Great addition! Can you add a mock task for AbsTaskSummaryRetrieval to https://github.com/embeddings-benchmark/mteb/blob/main/tests/test_benchmark/mock_tasks.py?

mteb/abstasks/AbsTaskSummaryRetrieval.py (outdated, resolved)
unique_summary: int


class AbsTaskSummaryRetrieval(AbsTask):
@Samoed Samoed (Collaborator) commented Jan 25, 2025:
Maybe you could inherit from AbsTaskBitextMining to reuse the evaluate function?

mehran-sarmadi (Author) replied:
This task is different from the summarization task: it involves identifying the exact summary of a given paragraph from a set of potential summaries. A model embeds the sentences and determines the closest pairs using cosine similarity.

Samoed (Collaborator) replied:
Yes, I know. It seems that you don't have parallel datasets in your tasks, so maybe you can use evaluate from AbsTask?

mehran-sarmadi (Author) replied:
This task is very similar to Bitextmining, so I think I can inherit the evaluate method from it. I can also do the same for the _similarity_search method in the SummaryRetrievalEvaluator.
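The reuse idea above can be sketched as a simple column mapping: reshape a summary-retrieval split into the two parallel columns that bitext-mining evaluation compares. The sentence1/sentence2 names follow the usual bitext-mining convention, the "text"/"summary" column names come from the snippet discussed below, and the helper itself is hypothetical.

```python
# Hypothetical helper: present a summary-retrieval split as the two
# parallel sentence columns that a bitext-mining evaluator matches up,
# so an inherited evaluate()/similarity-search step can be reused as-is.
def to_bitext_pairs(split: dict) -> dict:
    return {"sentence1": split["text"], "sentence2": split["summary"]}

split = {"text": ["paragraph A", "paragraph B"],
         "summary": ["summary of A", "summary of B"]}
pairs = to_bitext_pairs(split)
print(pairs["sentence1"])  # prints ['paragraph A', 'paragraph B']
```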

scores[hf_subet] = self._evaluate_subset(
    model,
    data_split,  # type: ignore
    subsets=["text", "summary"],
Samoed (Collaborator) commented:

You don't use subsets anywhere.

Suggested change: remove the line subsets=["text", "summary"],

Comment on lines 320 to 322
domains=[],
task_subtypes=[],
license="not specified",
Samoed (Collaborator) commented:
Can you fill in more information about the tasks?

mehran-sarmadi (Author) replied:
Yes, it's ongoing.

mteb/tasks/SummaryRetrieval/fas/FaMTEBSummaryRetrieval.py (three outdated, resolved review threads)
@Samoed Samoed (Collaborator) commented Jan 25, 2025:
Maybe we should move this PR to v2 branch?

@mehran-sarmadi mehran-sarmadi (Author) replied:

> Maybe we should move this PR to v2 branch?
I haven’t checked the next version yet, so I’m not sure if any changes are needed. If needed, I’ll make the updates.

Comment on lines +1300 to +1310
"CQADupstackAndroidRetrieval-Fa",
"CQADupstackEnglishRetrieval-Fa",
"CQADupstackGamingRetrieval-Fa",
"CQADupstackGisRetrieval-Fa",
"CQADupstackMathematicaRetrieval-Fa",
"CQADupstackPhysicsRetrieval-Fa",
"CQADupstackProgrammersRetrieval-Fa",
"CQADupstackStatsRetrieval-Fa",
"CQADupstackTexRetrieval-Fa",
"CQADupstackUnixRetrieval-Fa",
"CQADupstackWebmastersRetrieval-Fa",
Samoed (Collaborator) commented:
Will you add these datasets? Also FYI #1771

Labels: none yet · Projects: none yet · 4 participants