docs: Add documentation for value matching methods
roquelopez committed Nov 19, 2024
1 parent c66de60 commit 41e086e
Showing 2 changed files with 40 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -29,3 +29,4 @@ You can find the source code in our `GitHub repository <https://github.com/VIDA-

api
schema-matching
value-matching
39 changes: 39 additions & 0 deletions docs/source/value-matching.rst
@@ -0,0 +1,39 @@
Value Matching Methods
======================

This page provides an overview of all value matching methods available in the ``bdikit`` library.
Some methods reuse implementations from other libraries such as `PolyFuzz <https://maartengr.github.io/PolyFuzz/>`_ (e.g., ``embedding`` and ``tfidf``), while others were implemented specifically for ``bdikit`` (e.g., ``gpt``).
To see how to use these methods, please refer to the documentation of :py:func:`~bdikit.api.match_values()` in the :py:mod:`~bdikit.api` module.

.. ``bdikit module <api>`.
.. list-table:: bdikit methods
    :header-rows: 1

    * - Method
      - Class
      - Description
    * - ``gpt``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.GPTValueMatcher`
      - | Leverages a large language model (GPT-4) to identify and select the most accurate value matches.

.. list-table:: Methods from other libraries
    :header-rows: 1

    * - Method
      - Class
      - Description
    * - ``tfidf``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.TFIDFValueMatcher`
      - | Uses TF-IDF (Term Frequency-Inverse Document Frequency) weights over character n-grams to approximate edit distance. Strings are scored by the similarity of their weighted n-gram features, so values that share many n-grams are treated as close matches.
    * - ``edit_distance``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EditDistanceValueMatcher`
      - | Computes the edit distance between strings using a customizable scorer that supports various distance and similarity metrics.
    * - ``embedding``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EmbeddingValueMatcher`
      - | Compares values using the cosine similarity of their embeddings. By default, it uses the ``bert-base-multilingual-cased`` model to generate contextualized embeddings, which enables multilingual matching.
    * - ``fasttext``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.FastTextValueMatcher`
      - | Uses the cosine similarity of FastText embeddings to compare and align values, capturing both semantic and subword-level similarities.
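
To make the difference between the ``tfidf`` and ``edit_distance`` ideas more concrete, the following is a minimal, standard-library-only sketch of both scoring strategies. It is not the code used by ``bdikit`` or PolyFuzz: every function name here (``char_ngrams``, ``tfidf_vectors``, ``make_tfidf_scorer``, ``edit_similarity``, ``best_matches``) is hypothetical, the n-gram size of 3 and the smoothed IDF formula are assumptions, and the sample values are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lowercased, space-padded string into character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one sparse TF-IDF vector (a dict) per input string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
         for g, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_tfidf_scorer(strings, n=3):
    """Fit TF-IDF on all known strings and return a pairwise scorer."""
    vectors = dict(zip(strings, tfidf_vectors(strings, n)))
    return lambda a, b: cosine(vectors[a], vectors[b])


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Case-insensitive edit distance normalized to a similarity in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_matches(source_values, target_values, scorer):
    """Map each source value to its highest-scoring target value."""
    return {s: max(target_values, key=lambda t: scorer(s, t))
            for s in source_values}


# Made-up sample values, for illustration only.
source_values = ["Adenocarcinoma, NOS", "squamous cell carcinma"]
target_values = ["adenocarcinoma", "squamous cell carcinoma", "serous carcinoma"]

tfidf_scorer = make_tfidf_scorer(source_values + target_values)
print(best_matches(source_values, target_values, tfidf_scorer))
print(best_matches(source_values, target_values, edit_similarity))
```

In this sketch both scorers plug into the same ``best_matches`` helper, which mirrors the idea of selecting a matcher by name. The ``embedding`` and ``fasttext`` methods follow the same cosine-similarity pattern, but with vectors produced by a language model rather than by n-gram counts.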
