BREAKING: v2.0.0 #1433

Draft · wants to merge 50 commits into base: main

Changes from all commits · 50 commits
e2520df
fix: Ensure seed is based on RNG State (#1193)
KennethEnevoldsen Nov 11, 2024
9c58518
Consolidate Retrieval/Reranking/Instruction Variants (#1359)
orionw Nov 13, 2024
2a8a370
fix: Unsure TaskResults can handle runtime and version being unspecified
KennethEnevoldsen Nov 14, 2024
dea2b77
Merge branch 'v2.0.0' of https://github.com/embeddings-benchmark/mteb…
KennethEnevoldsen Nov 14, 2024
23d6cb2
fix: remove NaN handling for retrieval
KennethEnevoldsen Nov 14, 2024
8868cd4
Merge branch 'main' into v2.0.0
KennethEnevoldsen Nov 14, 2024
5470c88
fix: Merge main into v2 (#1454)
Samoed Nov 14, 2024
70a3ff2
feat: enable codecarbon by default (#1428)
Samoed Nov 15, 2024
0e9b6fd
Add decriptive stat almost to all datasets (#1466)
Samoed Nov 18, 2024
0a5bedb
fix: Fix test for empty descriptive tasks (#1413)
Samoed Nov 19, 2024
6da2a1a
fix: pin datasets version <3.0.0 (#1471)
Napuh Nov 19, 2024
a27de33
feat: Multilingual retrieval loader (#1473)
Samoed Nov 19, 2024
0df0210
fix: add citations to ModelMeta (#1477)
Samoed Nov 21, 2024
0abe1a0
Add descriptive stats to mising tasks and add number of qrels (#1476)
imenelydiaker Nov 21, 2024
a7a5214
1475 add descriptive stats to all tasks v2 (#1482)
dokato Nov 23, 2024
99247b2
fix: Fix `BrightRetrieval` calculate stats (#1484)
Samoed Nov 23, 2024
022d355
Merge main v2 (#1504)
Samoed Nov 27, 2024
6383950
Fix: retrieval stats (#1496)
Samoed Nov 27, 2024
d54fb75
fix: hatespeech filipino (#1522)
Samoed Nov 28, 2024
dec5d6a
feat: Forbid task metadata and add upload functions (#1362)
Samoed Dec 4, 2024
d0aa3a7
fix: remove `*` imports (#1569)
Samoed Dec 9, 2024
f16deb6
Merge branch 'refs/heads/main' into v2.0.0
Samoed Dec 10, 2024
06fc13f
fix: Add documentation (#1567)
KennethEnevoldsen Dec 16, 2024
6a8e188
fix: reorder argument for mteb.get_tasks (#1597)
KennethEnevoldsen Dec 18, 2024
d6130ad
fix: Make deduplication in PairClassificationEvaluator stable (#1315)
tsirif Dec 19, 2024
c9b00ac
[V2] Update v2 (#1618)
Samoed Dec 22, 2024
3f4a0da
Merge branch 'refs/heads/main' into v2.0.0
Samoed Dec 22, 2024
71c46ea
fix: [V2] Update datasets wich can't be loaded with `datasets>=3.0` (…
Samoed Dec 22, 2024
b3693fb
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 4, 2025
2519c7a
update nanobenchmark stat
Samoed Jan 4, 2025
9bc4a1a
[v2] Remove metadata dict (#1719)
Samoed Jan 8, 2025
38b9dad
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 10, 2025
cc829e5
lint
Samoed Jan 10, 2025
4247e22
[v2] Remove memory usage (#1751)
Samoed Jan 11, 2025
2b41cb4
[v2] fix contriever (add similarity_fn_name to ST wrapper) (#1749)
Samoed Jan 11, 2025
91871fe
[v2] Refactor evaluators and Abstasks (#1707)
Samoed Jan 12, 2025
997a135
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 12, 2025
f73e7ac
openai remove memory usage
Samoed Jan 12, 2025
d946ad4
fix: [v2] _run_eval() for case: co2_tracker False & add test (#1774)
sam-hey Jan 12, 2025
81a272e
Fix RepLLaMA-based models and Instructions for Cross-Encoders (#1733)
orionw Jan 13, 2025
8cf6178
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 13, 2025
296b9ea
lint
Samoed Jan 13, 2025
54018c7
[v2] Remove deprecated parameters from `MTEB` and cli (#1773)
Samoed Jan 15, 2025
3a5aa0c
[v2] remove metadata_dict (#1820)
Samoed Jan 15, 2025
ce5cb3e
[v2] add similarity_fn in ModelMeta (#1759)
sam-hey Jan 17, 2025
5d738bc
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 20, 2025
77f7c83
fix merge
Samoed Jan 20, 2025
6da8a13
[v2] ci: run bm25 and ColBERT test in ci (#1829)
sam-hey Jan 21, 2025
f1d418c
[v2] Update v2 again (#1864)
Samoed Jan 24, 2025
c26adee
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 24, 2025
35 changes: 35 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,35 @@
# builds the documentation and pushes it to the gh-pages branch
name: Documentation

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]


permissions:
  contents: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[docs]


      - name: Build and Deploy
        if: github.event_name == 'push'
        run: mkdocs gh-deploy --force

      - name: Build
        if: github.event_name == 'pull_request'
        run: make build-docs
25 changes: 18 additions & 7 deletions CONTRIBUTING.md
@@ -1,44 +1,55 @@
## Contributing to MTEB
We welcome contributions such as new datasets to MTEB! Please see detailed see the related [issue](https://github.com/embeddings-benchmark/mteb/issues/360) for more information.
## Contributing to mteb

We welcome contributions to `mteb` such as new tasks, code optimization or benchmarks.

Once you have decided on your contribution, this document describes how to set up the repository for development.


### Development Installation
If you want to submit a dataset or on other ways contribute to MTEB, you can install the package in development mode:

If you want to submit a task or in other ways contribute to `mteb`, you will need to install the package in development mode:

```bash
# download the git repository
git clone https://github.com/embeddings-benchmark/mteb
cd mteb

# create your virtual environment and activate it
make install
```

This uses [make](https://www.gnu.org/software/make/) to define the install command. You can see what each command does in the [makefile](https://github.com/embeddings-benchmark/mteb/blob/main/Makefile).

### Running Tests

To run the tests, you can use the following command:

```bash
make test
```

This is also run by the CI pipeline, so you can be sure that your changes do not break the package. We recommend running the tests in the lowest version of python supported by the package (see the pyproject.toml) to ensure compatibility.
This is also run by the CI pipeline, so if it passes locally, you can be reasonably sure that your changes will not cause a failed test once you create a pull request. We recommend running the tests in the lowest version of Python supported by the package (see the [pyproject.toml](https://github.com/embeddings-benchmark/mteb/blob/main/pyproject.toml)) to ensure compatibility.


### Running linting
To run the linting before a PR you can use the following command:

To run the linting before submitting a pull request, use:

```bash
make lint
```

This command is equivalent to the command run during CI. It will check for code style and formatting issues.


## Semantic Versioning and Releases
MTEB follows [semantic versioning](https://semver.org/). This means that the version number of the package is composed of three numbers: `MAJOR.MINOR.PATCH`. This allow us to use existing tools to automatically manage the versioning of the package. For maintainers (and contributors), this means that commits with the following prefixes will automatically trigger a version bump:

`mteb` follows [semantic versioning](https://semver.org/). This means that the version number of the package is composed of three numbers: `MAJOR.MINOR.PATCH`. This allows us to use existing tools to automatically manage the versioning of the package. For maintainers (and contributors), this means that commits with the following prefixes will automatically trigger a version bump:

- `fix:` for patches
- `feat:` for minor versions
- `breaking:` for major versions

Any commit with one of these prefixes will trigger a version bump upon merging to the main branch as long as tests pass. A version bump will then trigger a new release on PyPI as well as a new release on GitHub.

Other prefixes will not trigger a version bump. For example, `docs:`, `chore:`, `refactor:`, etc., however they will structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump you're not required to follow this convention when contributing to MTEB.
Other prefixes, for example `docs:`, `chore:`, or `refactor:`, will not trigger a version bump; however, they will still structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump, you are not required to follow this convention when contributing to `mteb`.
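
For instance, commits along the following lines would bump the version as indicated (these commit messages are illustrative, not taken from the repository's history):

```bash
git commit -m "fix: handle an edge case in the retrieval evaluator"   # patch bump
git commit -m "feat: add a new multilingual classification task"      # minor bump
git commit -m "breaking: remove a deprecated public argument"         # major bump
git commit -m "docs: clarify the contributing guide"                  # no version bump
```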
8 changes: 6 additions & 2 deletions Makefile
@@ -1,11 +1,11 @@
install:
	@echo "--- 🚀 Installing project dependencies ---"
	pip install -e ".[dev]"
	pip install -e ".[dev,docs]"

install-for-tests:
	@echo "--- 🚀 Installing project dependencies for test ---"
	@echo "This ensures that the project is not installed in editable mode"
	pip install ".[dev,speedtask]"
	pip install ".[dev,speedtask,bm25s,pylate]"

lint:
	@echo "--- 🧹 Running linters ---"
@@ -37,6 +37,10 @@ build-docs:
	# since we do not have a documentation site, this just builds tables for the .md files
	python docs/create_tasks_table.py

serve-docs:
	@echo "--- 📚 Serving documentation ---"
	python -m mkdocs serve


model-load-test:
	@echo "--- 🚀 Running model load test ---"
29 changes: 19 additions & 10 deletions README.md
@@ -26,7 +26,7 @@
</h4>

<h3 align="center">
<a href="https://huggingface.co/spaces/mteb/leaderboard"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="./docs/images/hf_logo.png" /></a>
<a href="https://huggingface.co/spaces/mteb/leaderboard"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="./docs/images/logos/hf_logo.png" /></a>
</h3>


@@ -79,6 +79,7 @@ In prompts the key can be:
8. `STS`
9. `Summarization`
10. `InstructionRetrieval`
11. `InstructionReranking`
3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all retrieval tasks for queries
4. Task name - these prompts will be used in the specific task
5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task
@@ -496,17 +497,25 @@ evaluation.run(model, ...)

## Citing

MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)", feel free to cite:
MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://aclanthology.org/2023.eacl-main.148/)", feel free to cite:

```bibtex
@article{muennighoff2022mteb,
doi = {10.48550/ARXIV.2210.07316},
url = {https://arxiv.org/abs/2210.07316},
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
@inproceedings{muennighoff-etal-2023-mteb,
title = "{MTEB}: Massive Text Embedding Benchmark",
author = "Muennighoff, Niklas and
Tazi, Nouamane and
Magne, Loic and
Reimers, Nils",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.148",
doi = "10.18653/v1/2023.eacl-main.148",
pages = "2014--2037",
}
```

40 changes: 40 additions & 0 deletions docs/adding_a_benchmark.md
@@ -0,0 +1,40 @@
## Adding a new Benchmark

MTEB covers a wide variety of benchmarks that are all presented in the public [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). However, many languages or domains are still missing, and we welcome contributions.

To add a new benchmark, you will need to:

1) [Implement the tasks](adding_a_dataset.md) that you want to include in the benchmark, or find them in the existing list of tasks.
2) Implement the benchmark in the [`benchmarks.py`](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks/benchmarks.py) file and submit your changes as a single PR.

This is easy to do:
```python
import mteb

# fetch the tasks you want to include in your benchmark
tasks = mteb.get_tasks(tasks=["<task name 1>", "<task name 2>"])

MY_BENCHMARK = mteb.Benchmark(
    name="Name of your benchmark",
    tasks=tasks,
    description="This benchmark tests y, which is important because of X",
    reference="https://relevant_link_eg_to_paper.com",
    citation="A bibtex citation if relevant",
)
```

3) Run a representative set of models on the benchmark. To submit the results:
<!-- TODO: we should probably create a separate page for how to submit results -->
   1. Open a PR on the result [repository](https://github.com/embeddings-benchmark/results) with:
      - All results added in existing model folders or new folders
      - Updated paths.json (see snippet results.py)
      <!-- TODO: ^check if this is still required. If so, we should probably update it. If not, we should remove it once the new leaderboard is live -->
      - If any new models are added, add their names to `results.py`
      - If you have access to all models you are adding, you can also [add results via the metadata](https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md) for all of them / some of them
   2. Open a PR at https://huggingface.co/spaces/mteb/leaderboard modifying app.py to add your tab:
      - Add any new models & their specs to the global lists
      - Add your tab, credits etc. to where the other tabs are defined
      - If you're adding new results to existing models, remove those models from `EXTERNAL_MODEL_RESULTS.json` such that they can be reloaded with the new results and are not cached.
      - You may also have to uncomment `, download_mode='force_redownload', verification_mode="no_checks")` where the datasets are loaded to experiment locally without caching of results
      - Test that it runs & works locally as you desire with `python app.py`, **please add screenshots to the PR**

4) Wait for the automatic update

Once the review from step 3 is done, the benchmark should appear on the leaderboard after it automatically updates (this might take a day).
26 changes: 26 additions & 0 deletions docs/api/benchmark.md
@@ -0,0 +1,26 @@
# Benchmark

A benchmark within `mteb` is essentially just a list of tasks along with some metadata about the benchmark.


<figure markdown="span">
![](../images/visualizations/benchmark_explainer.png){ width="80%" }
<figcaption>An overview of the benchmark within `mteb`</figcaption>
</figure>

This metadata includes a short description of the benchmark's intention, the reference, and the citation. If you use a benchmark from `mteb`, we recommend that you cite it along with `mteb`.


## Utilities

:::mteb.get_benchmarks

:::mteb.get_benchmark
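
A small sketch of how these utilities are typically used (the benchmark name below is only an example and may differ between versions):

```python
import mteb

# list all registered benchmarks and how many tasks each contains
for benchmark in mteb.get_benchmarks():
    print(benchmark.name, len(benchmark.tasks))

# fetch a single benchmark by name
benchmark = mteb.get_benchmark("MTEB(eng)")
```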


## The Benchmark Object

<!-- :::mteb.Benchmark -->



26 changes: 26 additions & 0 deletions docs/api/model.md
@@ -0,0 +1,26 @@
# Models

<!-- TODO: Encoder or model? Encoder is consistent with the code, but might be less used WDYT? We also use ModelMeta ... -->

A model in `mteb` covers two concepts: metadata and implementation.

- Metadata contains information about the model such as maximum input length, valid frameworks, license, and degree of openness.
- Implementation is a reproducible workflow, which allows others to run the same model again, using the same prompts, hyperparameters, aggregation strategies, etc.

<figure markdown="span">
![](../images/visualizations/modelmeta_explainer.png){ width="80%" }
<figcaption>An overview of the model and its metadata within `mteb`</figcaption>
</figure>



## Metadata

:::mteb.models.ModelMeta

## The Encoder Interface

:::mteb.Encoder
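
A minimal sketch of what an implementation of this interface can look like, wrapping a `sentence-transformers` model (the class name, model name, and forwarded arguments are illustrative; the exact keyword arguments passed by `mteb` may differ between versions):

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class MyEncoder:
    """Minimal wrapper exposing the encode method expected by mteb."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # mteb passes extra arguments (e.g. the task name); only forward what
        # the underlying model understands
        batch_size = kwargs.get("batch_size", 32)
        return self.model.encode(sentences, batch_size=batch_size)
```

An instance of such a class can then typically be passed to the evaluation entry point in place of a built-in model.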



36 changes: 36 additions & 0 deletions docs/api/task.md
@@ -0,0 +1,36 @@
# Tasks

A task is an implementation of a dataset for evaluation. It could, for instance, be the MIRACL dataset consisting of queries, a corpus of documents, and the correct documents to retrieve for a given query. In addition to the dataset, a task includes the specifications for how a model should be run on the dataset and how its output should be evaluated. Each task also comes with extensive metadata including the license, who annotated the data, etc.

<figure markdown="span">
![](../images/visualizations/task_explainer.png){ width="80%" }
<figcaption>An overview of the tasks within `mteb`</figcaption>
</figure>

## Utilities

:::mteb.get_tasks

:::mteb.get_task
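
A small sketch of how these utilities are typically used (the task name, task type, and language code below are only examples):

```python
import mteb

# fetch a filtered selection of tasks
tasks = mteb.get_tasks(task_types=["Classification"], languages=["eng"])

# or fetch a single task by name and inspect its metadata
task = mteb.get_task("Banking77Classification")
print(task.metadata.name, task.metadata.type)
```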

## Metadata

Each task also contains extensive metadata. We annotate this using the following object, which allows us to use [pydantic](https://docs.pydantic.dev/latest/) to validate the metadata.

:::mteb.TaskMetadata
options:
members: true



## The Task Object

All tasks in `mteb` inherit from the following abstract class.


:::mteb.AbsTask
<!--
TODO: we probably need to hide some of the method and potentially add a docstring to the class.
-->

13 changes: 13 additions & 0 deletions docs/cli.md
@@ -0,0 +1,13 @@
# CLI

<!--
We essentially just need to make this cli.py's docstring -- figure out a way to do this automatically

We can then extend it to be more detailed going forward. Ideally adding some documentation on the different arguments
-->



## Using multiple GPUs

Using multiple GPUs in parallel can be done by having a [custom encode function](missing) that distributes the inputs across the GPUs, as done for example [here](https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/e5/mteb_eval.py#L60) or [here](https://github.com/ContextualAI/gritlm/blob/09d8630f0c95ac6a456354bcb6f964d7b9b6a609/gritlm/gritlm.py#L75).
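
A minimal sketch of such a wrapper, assuming `sentence-transformers` and two CUDA devices (the class name, model name, and device list are illustrative and not part of `mteb`):

```python
from sentence_transformers import SentenceTransformer


class MultiGPUEncoder:
    """Illustrative encoder that spreads encoding over several GPUs."""

    def __init__(self, model_name: str, devices: list[str]):
        self.model = SentenceTransformer(model_name)
        self.devices = devices

    def encode(self, sentences: list[str], **kwargs):
        # for simplicity the worker pool is created per call; in practice you
        # would create it once and reuse it across tasks
        pool = self.model.start_multi_process_pool(target_devices=self.devices)
        try:
            # split the input between the workers and gather the embeddings
            return self.model.encode_multi_process(sentences, pool, batch_size=64)
        finally:
            self.model.stop_multi_process_pool(pool)


# e.g. MultiGPUEncoder("intfloat/e5-small-v2", devices=["cuda:0", "cuda:1"])
```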
21 changes: 21 additions & 0 deletions docs/getting_started.md
@@ -0,0 +1,21 @@
# Getting Started

## Installation

You can install `mteb` using [pip](https://pip.pypa.io/en/stable/getting-started/) simply by running:

```bash
pip install mteb
```

??? tip "Model Specific Installations"

    If you want to run certain models implemented within `mteb` you will often need some additional dependencies. These can be installed using:

    ```bash
    pip install mteb[openai]
    ```

    If a specific model requires a dependency, it will raise an error with the recommended installation. To get an overview of the implemented models see [here](missing).

<!-- TODO: Add usage examples -->
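
As a first sanity check, an evaluation run typically looks something like the sketch below (the task and model names are only examples, and the exact entry point may differ between versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

# pick a task and a small model to evaluate
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```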
File renamed without changes
File renamed without changes
Binary file added docs/images/visualizations/task_explainer.png
45 changes: 45 additions & 0 deletions docs/index.md
@@ -0,0 +1,45 @@

# Overview


This is the API documentation for `mteb`, a package for benchmarking and evaluating the quality of embeddings.
The package was initially introduced for evaluating text embeddings for English[@mteb_2023], but it has since been extended to cover multiple languages and multiple modalities.
<!-- TODO add [@mmteb_2025] [@mieb_2025]. -->

# Package Overview
This package generally consists of three main concepts: *benchmarks*, *tasks*, and *model implementations*.

## Benchmarks

A benchmark is a tool to evaluate an embedding model for a given use case. For instance, `mteb(eng)` is intended
to evaluate the quality of text embedding models for a broad range of English use cases such as retrieval, classification, and reranking.
A benchmark consists of a collection of tasks. When a model is run on a benchmark, it is run on each task individually.


<figure markdown="span">
![](images/visualizations/benchmark_explainer.png){ width="80%" }
<figcaption>An overview of the benchmark within `mteb`</figcaption>
</figure>

## Task

A task is an implementation of a dataset for evaluation. It could, for instance, be the MIRACL dataset consisting of queries, a corpus of documents,
and the correct documents to retrieve for a given query. In addition to the dataset, a task includes a specification for how a model should be run on the dataset and how its output should be evaluated. We implement a variety of different tasks, e.g. for evaluating classification, retrieval, etc.; we denote these [task categories](api/task.md#metadata). Each task also comes with extensive [metadata](api/task.md#metadata) including the license, who annotated the data, and so on.

<figure markdown="span">
![](images/visualizations/task_explainer.png){ width="80%" }
<figcaption>An overview of the tasks within `mteb`</figcaption>
</figure>

## Model Implementation

A model implementation is simply an implementation of an embedding model or API to ensure that others can reproduce the *exact* results on a given task.
For instance, when running the OpenAI embedding API on a document longer than the maximum number of tokens, a user has to decide how they want to
deal with this limitation (e.g. by truncating the sequence). Having a shared implementation allows us to examine these implementation assumptions and allows
for reproducible workflows. To ensure consistency, we define a [standard interface](api/model.md#the-encoder-interface) that models should follow to be implemented. These implementations additionally come with [metadata](api/model.md#metadata), which for example includes the license, compatible frameworks, and whether the weights are public or not.

<figure markdown="span">
![](images/visualizations/modelmeta_explainer.png){ width="80%" }
<figcaption>An overview of the model and its metadata within `mteb`</figcaption>
</figure>
