BREAKING: v2.0.0 #1433

Draft · wants to merge 50 commits into base: main

Changes from all commits · 50 commits
e2520df
fix: Ensure seed is based on RNG State (#1193)
KennethEnevoldsen Nov 11, 2024
9c58518
Consolidate Retrieval/Reranking/Instruction Variants (#1359)
orionw Nov 13, 2024
2a8a370
fix: Unsure TaskResults can handle runtime and version being unspecified
KennethEnevoldsen Nov 14, 2024
dea2b77
Merge branch 'v2.0.0' of https://github.com/embeddings-benchmark/mteb…
KennethEnevoldsen Nov 14, 2024
23d6cb2
fix: remove NaN handling for retrieval
KennethEnevoldsen Nov 14, 2024
8868cd4
Merge branch 'main' into v2.0.0
KennethEnevoldsen Nov 14, 2024
5470c88
fix: Merge main into v2 (#1454)
Samoed Nov 14, 2024
70a3ff2
feat: enable codecarbon by default (#1428)
Samoed Nov 15, 2024
0e9b6fd
Add decriptive stat almost to all datasets (#1466)
Samoed Nov 18, 2024
0a5bedb
fix: Fix test for empty descriptive tasks (#1413)
Samoed Nov 19, 2024
6da2a1a
fix: pin datasets version <3.0.0 (#1471)
Napuh Nov 19, 2024
a27de33
feat: Multilingual retrieval loader (#1473)
Samoed Nov 19, 2024
0df0210
fix: add citations to ModelMeta (#1477)
Samoed Nov 21, 2024
0abe1a0
Add descriptive stats to mising tasks and add number of qrels (#1476)
imenelydiaker Nov 21, 2024
a7a5214
1475 add descriptive stats to all tasks v2 (#1482)
dokato Nov 23, 2024
99247b2
fix: Fix `BrightRetrieval` calculate stats (#1484)
Samoed Nov 23, 2024
022d355
Merge main v2 (#1504)
Samoed Nov 27, 2024
6383950
Fix: retrieval stats (#1496)
Samoed Nov 27, 2024
d54fb75
fix: hatespeech filipino (#1522)
Samoed Nov 28, 2024
dec5d6a
feat: Forbid task metadata and add upload functions (#1362)
Samoed Dec 4, 2024
d0aa3a7
fix: remove `*` imports (#1569)
Samoed Dec 9, 2024
f16deb6
Merge branch 'refs/heads/main' into v2.0.0
Samoed Dec 10, 2024
06fc13f
fix: Add documentation (#1567)
KennethEnevoldsen Dec 16, 2024
6a8e188
fix: reorder argument for mteb.get_tasks (#1597)
KennethEnevoldsen Dec 18, 2024
d6130ad
fix: Make deduplication in PairClassificationEvaluator stable (#1315)
tsirif Dec 19, 2024
c9b00ac
[V2] Update v2 (#1618)
Samoed Dec 22, 2024
3f4a0da
Merge branch 'refs/heads/main' into v2.0.0
Samoed Dec 22, 2024
71c46ea
fix: [V2] Update datasets wich can't be loaded with `datasets>=3.0` (…
Samoed Dec 22, 2024
b3693fb
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 4, 2025
2519c7a
update nanobenchmark stat
Samoed Jan 4, 2025
9bc4a1a
[v2] Remove metadata dict (#1719)
Samoed Jan 8, 2025
38b9dad
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 10, 2025
cc829e5
lint
Samoed Jan 10, 2025
4247e22
[v2] Remove memory usage (#1751)
Samoed Jan 11, 2025
2b41cb4
[v2] fix contriever (add similarity_fn_name to ST wrapper) (#1749)
Samoed Jan 11, 2025
91871fe
[v2] Refactor evaluators and Abstasks (#1707)
Samoed Jan 12, 2025
997a135
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 12, 2025
f73e7ac
openai remove memory usage
Samoed Jan 12, 2025
d946ad4
fix: [v2] _run_eval() for case: co2_tracker False & add test (#1774)
sam-hey Jan 12, 2025
81a272e
Fix RepLLaMA-based models and Instructions for Cross-Encoders (#1733)
orionw Jan 13, 2025
8cf6178
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 13, 2025
296b9ea
lint
Samoed Jan 13, 2025
54018c7
[v2] Remove deprecated parameters from `MTEB` and cli (#1773)
Samoed Jan 15, 2025
3a5aa0c
[v2] remove metadata_dict (#1820)
Samoed Jan 15, 2025
ce5cb3e
[v2] add similarity_fn in ModelMeta (#1759)
sam-hey Jan 17, 2025
5d738bc
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 20, 2025
77f7c83
fix merge
Samoed Jan 20, 2025
6da8a13
[v2] ci: run bm25 and ColBERT test in ci (#1829)
sam-hey Jan 21, 2025
f1d418c
[v2] Update v2 again (#1864)
Samoed Jan 24, 2025
c26adee
Merge branch 'refs/heads/main' into v2.0.0
Samoed Jan 24, 2025
35 changes: 35 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,35 @@
# builds the documentation and pushes it to the gh-pages branch
name: Documentation

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]


permissions:
  contents: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[docs]


      - name: Build and Deploy
        if: github.event_name == 'push'
        run: mkdocs gh-deploy --force

      - name: Build
        if: github.event_name == 'pull_request'
        run: make build-docs
25 changes: 18 additions & 7 deletions CONTRIBUTING.md
@@ -1,44 +1,55 @@
## Contributing to MTEB
We welcome contributions such as new datasets to MTEB! Please see detailed see the related [issue](https://github.com/embeddings-benchmark/mteb/issues/360) for more information.
## Contributing to mteb

We welcome contributions to `mteb` such as new tasks, code optimization or benchmarks.

Once you have decided on your contribution, this document describes how to set up the repository for development.


### Development Installation
If you want to submit a dataset or on other ways contribute to MTEB, you can install the package in development mode:

If you want to submit a task or in other ways contribute to `mteb`, you will need to install the package in development mode:

```bash
# download the git repository
git clone https://github.com/embeddings-benchmark/mteb
cd mteb

# create your virtual environment and activate it
make install
```

This uses [make](https://www.gnu.org/software/make/) to define the install command. You can see what each command does in the [makefile](https://github.com/embeddings-benchmark/mteb/blob/main/Makefile).

### Running Tests

To run the tests, you can use the following command:

```bash
make test
```

This is also run by the CI pipeline, so you can be sure that your changes do not break the package. We recommend running the tests in the lowest version of python supported by the package (see the pyproject.toml) to ensure compatibility.
This is also run by the CI pipeline, so if it passes locally, you can be reasonably sure that your changes will not cause a failed test once you create a pull request. We recommend running the tests in the lowest version of Python supported by the package (see the [pyproject.toml](https://github.com/embeddings-benchmark/mteb/blob/main/pyproject.toml)) to ensure compatibility.


### Running linting
To run the linting before a PR you can use the following command:

To run the linting before submitting a pull request, use:

```bash
make lint
```

This command is equivalent to the command run during CI. It will check for code style and formatting issues.


## Semantic Versioning and Releases
MTEB follows [semantic versioning](https://semver.org/). This means that the version number of the package is composed of three numbers: `MAJOR.MINOR.PATCH`. This allow us to use existing tools to automatically manage the versioning of the package. For maintainers (and contributors), this means that commits with the following prefixes will automatically trigger a version bump:

`mteb` follows [semantic versioning](https://semver.org/). This means that the version number of the package is composed of three numbers: `MAJOR.MINOR.PATCH`. This allows us to use existing tools to automatically manage the versioning of the package. For maintainers (and contributors), this means that commits with the following prefixes will automatically trigger a version bump:

- `fix:` for patches
- `feat:` for minor versions
- `breaking:` for major versions

Any commit with one of these prefixes will trigger a version bump upon merging to the main branch as long as tests pass. A version bump will then trigger a new release on PyPI as well as a new release on GitHub.

Other prefixes will not trigger a version bump. For example, `docs:`, `chore:`, `refactor:`, etc., however they will structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump you're not required to follow this convention when contributing to MTEB.
Other prefixes, for example `docs:`, `chore:`, or `refactor:`, will not trigger a version bump; however, they will still structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump, you are not required to follow this convention when contributing to `mteb`.
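
For instance, commits along the following lines would bump the version as indicated (these commit messages are illustrative, not taken from the repository's history):

```bash
git commit -m "fix: handle an edge case in the retrieval evaluator"   # patch bump
git commit -m "feat: add a new multilingual classification task"      # minor bump
git commit -m "breaking: remove a deprecated public argument"         # major bump
git commit -m "docs: clarify the contributing guide"                  # no version bump
```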
8 changes: 6 additions & 2 deletions Makefile
@@ -1,11 +1,11 @@
install:
	@echo "--- 🚀 Installing project dependencies ---"
	pip install -e ".[dev]"
	pip install -e ".[dev,docs]"

install-for-tests:
	@echo "--- 🚀 Installing project dependencies for test ---"
	@echo "This ensures that the project is not installed in editable mode"
	pip install ".[dev,speedtask]"
	pip install ".[dev,speedtask,bm25s,pylate]"

lint:
	@echo "--- 🧹 Running linters ---"
@@ -37,6 +37,10 @@ build-docs:
	# since we do not have a documentation site, this just builds tables for the .md files
	python docs/create_tasks_table.py

serve-docs:
	@echo "--- 📚 Serving documentation ---"
	python -m mkdocs serve


model-load-test:
	@echo "--- 🚀 Running model load test ---"
29 changes: 19 additions & 10 deletions README.md
@@ -26,7 +26,7 @@
</h4>

<h3 align="center">
<a href="https://huggingface.co/spaces/mteb/leaderboard"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="./docs/images/hf_logo.png" /></a>
<a href="https://huggingface.co/spaces/mteb/leaderboard"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="./docs/images/logos/hf_logo.png" /></a>
</h3>


@@ -79,6 +79,7 @@ In prompts the key can be:
8. `STS`
9. `Summarization`
10. `InstructionRetrieval`
11. `InstructionReranking`
3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all retrieval tasks for queries
4. Task name - these prompts will be used in the specific task
5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task
@@ -496,17 +497,25 @@ evaluation.run(model, ...)

## Citing

MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)", feel free to cite:
MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://aclanthology.org/2023.eacl-main.148/)", feel free to cite:

```bibtex
@article{muennighoff2022mteb,
doi = {10.48550/ARXIV.2210.07316},
url = {https://arxiv.org/abs/2210.07316},
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
@inproceedings{muennighoff-etal-2023-mteb,
title = "{MTEB}: Massive Text Embedding Benchmark",
author = "Muennighoff, Niklas and
Tazi, Nouamane and
Magne, Loic and
Reimers, Nils",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.148",
doi = "10.18653/v1/2023.eacl-main.148",
pages = "2014--2037",
}
```

40 changes: 40 additions & 0 deletions docs/adding_a_benchmark.md
@@ -0,0 +1,40 @@
## Adding a new Benchmark

MTEB covers a wide variety of benchmarks that are all presented in the public [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). However, many languages or domains are still missing, and we welcome contributions.

To add a new benchmark, you will need to:

1) [Implement the tasks](adding_a_dataset.md) that you want to include in the benchmark, or find them in the existing list of tasks.
2) Implement the benchmark in the [`benchmarks.py`](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks/benchmarks.py) file and submit your changes as a single PR.

This is easy to do:
```python
import mteb

# fetch the tasks you want to include in your benchmark
tasks = mteb.get_tasks(tasks=["<task name 1>", "<task name 2>"])

MY_BENCHMARK = mteb.Benchmark(
    name="Name of your benchmark",
    tasks=tasks,
    description="This benchmark tests y, which is important because of X",
    reference="https://relevant_link_eg_to_paper.com",
    citation="A bibtex citation if relevant",
)
```

3) Run a representative set of models on the benchmark. To submit the results:
<!-- TODO: we should probably create a separate page for how to submit results -->
   1. Open a PR on the result [repository](https://github.com/embeddings-benchmark/results) with:
      - All results added in existing model folders or new folders
      - Updated paths.json (see snippet results.py)
      <!-- TODO: ^check if this is still required. If so, we should probably update it. If not, we should remove it once the new leaderboard is live -->
      - If any new models are added, add their names to `results.py`
      - If you have access to all models you are adding, you can also [add results via the metadata](https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md) for all of them / some of them
   2. Open a PR at https://huggingface.co/spaces/mteb/leaderboard modifying app.py to add your tab:
      - Add any new models & their specs to the global lists
      - Add your tab, credits etc. to where the other tabs are defined
      - If you're adding new results to existing models, remove those models from `EXTERNAL_MODEL_RESULTS.json` such that they can be reloaded with the new results and are not cached.
      - You may also have to uncomment `, download_mode='force_redownload', verification_mode="no_checks")` where the datasets are loaded to experiment locally without caching of results
      - Test that it runs & works locally as you desire with `python app.py`, **please add screenshots to the PR**

4) Wait for the automatic update

Once the review from step 3 is done, the benchmark should appear on the leaderboard after it automatically updates (this might take a day).
26 changes: 26 additions & 0 deletions docs/api/benchmark.md
@@ -0,0 +1,26 @@
# Benchmark

A benchmark within `mteb` is essentially just a list of tasks along with some metadata about the benchmark.


<figure markdown="span">
![](../images/visualizations/benchmark_explainer.png){ width="80%" }
<figcaption>An overview of the benchmark within `mteb`</figcaption>
</figure>

This metadata includes a short description of the benchmark's intention, the reference, and the citation. If you use a benchmark from `mteb`, we recommend that you cite it along with `mteb`.


## Utilities

:::mteb.get_benchmarks

:::mteb.get_benchmark
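
A small sketch of how these utilities are typically used (the benchmark name below is only an example and may differ between versions):

```python
import mteb

# list all registered benchmarks and how many tasks each contains
for benchmark in mteb.get_benchmarks():
    print(benchmark.name, len(benchmark.tasks))

# fetch a single benchmark by name
benchmark = mteb.get_benchmark("MTEB(eng)")
```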


## The Benchmark Object

<!-- :::mteb.Benchmark -->



26 changes: 26 additions & 0 deletions docs/api/model.md
@@ -0,0 +1,26 @@
# Models

<!-- TODO: Encoder or model? Encoder is consistent with the code, but might be less used WDYT? We also use ModelMeta ... -->

A model in `mteb` covers two concepts: metadata and implementation.

- Metadata contains information about the model such as maximum input length, valid frameworks, license, and degree of openness.
- Implementation is a reproducible workflow, which allows others to run the same model again, using the same prompts, hyperparameters, aggregation strategies, etc.

<figure markdown="span">
![](../images/visualizations/modelmeta_explainer.png){ width="80%" }
<figcaption>An overview of the model and its metadata within `mteb`</figcaption>
</figure>



## Metadata

:::mteb.models.ModelMeta

## The Encoder Interface

:::mteb.Encoder
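
A minimal sketch of what an implementation of this interface can look like, wrapping a `sentence-transformers` model (the class name, model name, and forwarded arguments are illustrative; the exact keyword arguments passed by `mteb` may differ between versions):

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class MyEncoder:
    """Minimal wrapper exposing the encode method expected by mteb."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # mteb passes extra arguments (e.g. the task name); only forward what
        # the underlying model understands
        batch_size = kwargs.get("batch_size", 32)
        return self.model.encode(sentences, batch_size=batch_size)
```

An instance of such a class can then typically be passed to the evaluation entry point in place of a built-in model.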



36 changes: 36 additions & 0 deletions docs/api/task.md
@@ -0,0 +1,36 @@
# Tasks

A task is an implementation of a dataset for evaluation. It could, for instance, be the MIRACL dataset consisting of queries, a corpus of documents, and the correct documents to retrieve for a given query. In addition to the dataset, a task includes the specifications for how a model should be run on the dataset and how its output should be evaluated. Each task also comes with extensive metadata including the license, who annotated the data, etc.

<figure markdown="span">
![](../images/visualizations/task_explainer.png){ width="80%" }
<figcaption>An overview of the tasks within `mteb`</figcaption>
</figure>

## Utilities

:::mteb.get_tasks

:::mteb.get_task
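
A small sketch of how these utilities are typically used (the task name, task type, and language code below are only examples):

```python
import mteb

# fetch a filtered selection of tasks
tasks = mteb.get_tasks(task_types=["Classification"], languages=["eng"])

# or fetch a single task by name and inspect its metadata
task = mteb.get_task("Banking77Classification")
print(task.metadata.name, task.metadata.type)
```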

## Metadata

Each task also contains extensive metadata. We annotate this using the following object, which allows us to use [pydantic](https://docs.pydantic.dev/latest/) to validate the metadata.

:::mteb.TaskMetadata
options:
members: true



## The Task Object

All tasks in `mteb` inherit from the following abstract class.


:::mteb.AbsTask
<!--
TODO: we probably need to hide some of the method and potentially add a docstring to the class.
-->

13 changes: 13 additions & 0 deletions docs/cli.md
@@ -0,0 +1,13 @@
# CLI

<!--
We essentially just need to make this cli.py's docstring -- figure out a way to do this automatically

We can then extend it to be more detailed going forward. Ideally adding some documentation on the different arguments
-->



## Using multiple GPUs

Using multiple GPUs in parallel can be done by having a [custom encode function](missing) that distributes the inputs across the GPUs, as done for example [here](https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/e5/mteb_eval.py#L60) or [here](https://github.com/ContextualAI/gritlm/blob/09d8630f0c95ac6a456354bcb6f964d7b9b6a609/gritlm/gritlm.py#L75).
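
A minimal sketch of such a wrapper, assuming `sentence-transformers` and two CUDA devices (the class name, model name, and device list are illustrative and not part of `mteb`):

```python
from sentence_transformers import SentenceTransformer


class MultiGPUEncoder:
    """Illustrative encoder that spreads encoding over several GPUs."""

    def __init__(self, model_name: str, devices: list[str]):
        self.model = SentenceTransformer(model_name)
        self.devices = devices

    def encode(self, sentences: list[str], **kwargs):
        # for simplicity the worker pool is created per call; in practice you
        # would create it once and reuse it across tasks
        pool = self.model.start_multi_process_pool(target_devices=self.devices)
        try:
            # split the input between the workers and gather the embeddings
            return self.model.encode_multi_process(sentences, pool, batch_size=64)
        finally:
            self.model.stop_multi_process_pool(pool)


# e.g. MultiGPUEncoder("intfloat/e5-small-v2", devices=["cuda:0", "cuda:1"])
```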
21 changes: 21 additions & 0 deletions docs/getting_started.md
@@ -0,0 +1,21 @@
# Getting Started

## Installation

You can install `mteb` using [pip](https://pip.pypa.io/en/stable/getting-started/) simply by running:

```bash
pip install mteb
```

??? tip "Model Specific Installations"

    If you want to run certain models implemented within `mteb` you will often need some additional dependencies. These can be installed using:

    ```bash
    pip install mteb[openai]
    ```

    If a specific model requires a dependency, it will raise an error with the recommended installation. To get an overview of the implemented models see [here](missing).

<!-- TODO: Add usage examples -->
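
As a first sanity check, an evaluation run typically looks something like the sketch below (the task and model names are only examples, and the exact entry point may differ between versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

# pick a task and a small model to evaluate
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```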
File renamed without changes
File renamed without changes
Binary file added docs/images/visualizations/task_explainer.png
45 changes: 45 additions & 0 deletions docs/index.md
@@ -0,0 +1,45 @@

# Overview


This is the API documentation for `mteb`, a package for benchmarking and evaluating the quality of embeddings.
The package was initially introduced for evaluating text embeddings for English[@mteb_2023], but it has since been extended to cover multiple languages and multiple modalities.
<!-- TODO add [@mmteb_2025] [@mieb_2025]. -->

# Package Overview
This package generally consists of three main concepts: *benchmarks*, *tasks*, and *model implementations*.

## Benchmarks

A benchmark is a tool to evaluate an embedding model for a given use case. For instance, `mteb(eng)` is intended
to evaluate the quality of text embedding models for a broad range of English use cases such as retrieval, classification, and reranking.
A benchmark consists of a collection of tasks. When a model is run on a benchmark, it is run on each task individually.


<figure markdown="span">
![](images/visualizations/benchmark_explainer.png){ width="80%" }
<figcaption>An overview of the benchmark within `mteb`</figcaption>
</figure>

## Task

A task is an implementation of a dataset for evaluation. It could, for instance, be the MIRACL dataset consisting of queries, a corpus of documents,
and the correct documents to retrieve for a given query. In addition to the dataset, a task includes a specification for how a model should be run on the dataset and how its output should be evaluated. We implement a variety of different tasks, e.g. for evaluating classification, retrieval, etc.; we denote these [task categories](api/task.md#metadata). Each task also comes with extensive [metadata](api/task.md#metadata) including the license, who annotated the data, and so on.

<figure markdown="span">
![](images/visualizations/task_explainer.png){ width="80%" }
<figcaption>An overview of the tasks within `mteb`</figcaption>
</figure>

## Model Implementation

A model implementation is simply an implementation of an embedding model or API to ensure that others can reproduce the *exact* results on a given task.
For instance, when running the OpenAI embedding API on a document longer than the maximum number of tokens, a user has to decide how they want to
deal with this limitation (e.g. by truncating the sequence). Having a shared implementation allows us to examine these implementation assumptions and allows
for reproducible workflows. To ensure consistency, we define a [standard interface](api/model.md#the-encoder-interface) that models should follow to be implemented. These implementations additionally come with [metadata](api/model.md#metadata), which for example includes the license, compatible frameworks, and whether the weights are public or not.

<figure markdown="span">
![](images/visualizations/modelmeta_explainer.png){ width="80%" }
<figcaption>An overview of the model and its metadata within `mteb`</figcaption>
</figure>
