Add FaMTEB (Farsi/Persian Text Embedding Benchmark) #1843
base: main
Conversation
Force-pushed 37b50d7 to c57293f
Force-pushed 946ee59 to 5fe3730
Great addition! Can you add a mock task for AbsTaskSummaryRetrieval to https://github.com/embeddings-benchmark/mteb/blob/main/tests/test_benchmark/mock_tasks.py?
unique_summary: int

class AbsTaskSummaryRetrieval(AbsTask):
Maybe you could inherit from AbsTaskBitextmining to reuse the evaluate function?
This task is different from a summarization task: it involves identifying the exact summary of a given paragraph from a set of candidate summaries. A model embeds the sentences and determines the closest pairs using cosine similarity.
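The matching step described above can be sketched as follows. This is a minimal illustration with toy vectors and a hypothetical helper name, not the PR's actual evaluator:

```python
import numpy as np

def match_summaries(text_embs: np.ndarray, summary_embs: np.ndarray) -> np.ndarray:
    """For each text embedding, return the index of the closest summary
    by cosine similarity (illustrative helper, not from the PR)."""
    # Normalize rows so a plain dot product equals cosine similarity.
    texts = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    summaries = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    sims = texts @ summaries.T   # pairwise cosine similarities
    return sims.argmax(axis=1)   # closest summary per paragraph

# Toy embeddings: paragraph i should match summary i.
texts = np.array([[1.0, 0.1], [0.1, 1.0]])
summaries = np.array([[0.9, 0.2], [0.2, 0.8]])
print(match_summaries(texts, summaries))  # [0 1]
```

In the real task the embeddings would come from the model under evaluation, and the fraction of correctly matched pairs gives the score.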
Yes, I know. It seems that you don't have parallel datasets in your tasks; maybe you can use evaluate from AbsTask?
This task is very similar to Bitextmining, so I think I can inherit the evaluate method from it. I can also do the same for the _similarity_search method in the SummaryRetrievalEvaluator.
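The inheritance being discussed could look roughly like the sketch below. The base class here is a minimal stand-in for mteb's real AbsTaskBitextMining (whose evaluate embeds both sides of each pair and scores nearest-neighbour matches); the class bodies and the parallel_subsets attribute are illustrative assumptions, not the PR's code:

```python
# Stand-in base class; in mteb the real one lives under mteb.abstasks.
class AbsTaskBitextMining:
    def evaluate(self, model, split="test"):
        # The real implementation embeds both columns and matches pairs
        # by similarity; the subclass reuses it unchanged.
        return {"accuracy": 1.0}  # placeholder result for the sketch

class AbsTaskSummaryRetrieval(AbsTaskBitextMining):
    # The text/summary columns play the role of the two "parallel" sides,
    # so the parent's evaluate() needs no override.
    parallel_subsets = ["text", "summary"]

task = AbsTaskSummaryRetrieval()
print(task.evaluate(model=None))  # {'accuracy': 1.0}
```

The design point is that summary retrieval is structurally a pair-matching problem, so only the data columns differ from bitext mining.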
scores[hf_subet] = self._evaluate_subset(
    model,
    data_split,  # type: ignore
    subsets=["text", "summary"],
You don't use it anywhere:

subsets=["text", "summary"],
domains=[],
task_subtypes=[],
license="not specified",
Can you fill in more information about the tasks?
Yes, it's ongoing.
Maybe we should move this PR to
I haven’t checked the next version yet, so I’m not sure if any changes are needed. If needed, I’ll make the updates.
add data domain and subtask description
"CQADupstackAndroidRetrieval-Fa",
"CQADupstackEnglishRetrieval-Fa",
"CQADupstackGamingRetrieval-Fa",
"CQADupstackGisRetrieval-Fa",
"CQADupstackMathematicaRetrieval-Fa",
"CQADupstackPhysicsRetrieval-Fa",
"CQADupstackProgrammersRetrieval-Fa",
"CQADupstackStatsRetrieval-Fa",
"CQADupstackTexRetrieval-Fa",
"CQADupstackUnixRetrieval-Fa",
"CQADupstackWebmastersRetrieval-Fa",
Will you add these datasets? Also FYI #1771
We are a research team from Sharif University of Technology and MCINext Company developing a text embedding benchmark for the Persian language based on MTEB. So far, we have gathered around 63 datasets spanning 7 tasks (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, and Summary Retrieval), including a mix of existing, translated, and newly generated datasets. Notably, we are introducing the Summary Retrieval task for the first time, which focuses on identifying the correct summary of a paragraph from a set of candidates. We have also evaluated several Persian language models and text embeddings that support Persian for this benchmark.
We have also opened a related PR for the results and leaderboard tab, and we are finalizing a paper on this work, which will be published in the near future.
Checklist
- Run tests with make test.
- Run the linter with make lint.

Adding datasets checklist
- Reason for dataset addition: ...
- Run the following models using the mteb -m {model_name} -t {task_name} command:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- self.stratified_subsampling() under dataset_transform()
- Run tests with make test.
- Run the linter with make lint.