docs: Add documentation for value matching methods
1 parent
c66de60
commit 41e086e
Showing 2 changed files with 40 additions and 0 deletions.
@@ -0,0 +1,39 @@
Value Matching Methods
======================

This page provides an overview of all value matching methods available in the ``bdikit`` library.
Some methods reuse implementations from other libraries such as `PolyFuzz <https://maartengr.github.io/PolyFuzz/>`_ (e.g., ``embedding`` and ``tfidf``), while others were implemented specifically for ``bdikit`` (e.g., ``gpt``).
To see how to use these methods, refer to the documentation of :py:func:`~bdikit.api.match_values()` in the :py:mod:`~bdikit.api` module.

.. ``bdikit module <api>`.
.. list-table:: bdikit methods
    :header-rows: 1

    * - Method
      - Class
      - Description
    * - ``gpt``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.GPTValueMatcher`
      - | Leverages a large language model (GPT-4) to identify and select the most accurate value matches.

.. list-table:: Methods from other libraries
    :header-rows: 1

    * - Method
      - Class
      - Description
    * - ``tfidf``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.TFIDFValueMatcher`
      - | Employs a character-based n-gram TF-IDF approach to approximate edit distance. Term Frequency-Inverse Document Frequency (TF-IDF) weighting captures the frequency and relative importance of n-gram patterns within strings, and the similarity between two strings is computed from their shared n-gram features.
    * - ``edit_distance``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EditDistanceValueMatcher`
      - | Computes the edit distance between strings using a customizable scorer that supports various distance and similarity metrics.
    * - ``embedding``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.EmbeddingValueMatcher`
      - | Uses the cosine similarity of value embeddings to compare values. By default, it uses the ``bert-base-multilingual-cased`` model to generate contextualized embeddings, which enables multilingual matching.
    * - ``fasttext``
      - :class:`~bdikit.mapping_algorithms.value_mapping.algorithms.FastTextValueMatcher`
      - | Uses the cosine similarity of FastText embeddings to compare and align values, capturing both semantic and subword-level similarities.
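
To make the idea behind ``tfidf`` concrete, the following is a minimal, standard-library sketch of character n-gram TF-IDF matching. It is not the PolyFuzz or ``bdikit`` implementation: the function names (``char_ngrams``, ``tfidf_vectors``, ``cosine``, ``match_by_tfidf``), the trigram size, the space padding, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(values, n=3):
    """Build one sparse TF-IDF vector (a dict of n-gram -> weight) per string."""
    grams = [Counter(char_ngrams(v, n)) for v in values]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(values)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumption of this sketch) keeps weights positive
        # even for n-grams that occur in every string.
        vectors.append({
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
            for g, tf in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_by_tfidf(source_values, target_values, n=3):
    """Pair each source value with its most similar target value."""
    source_values = list(source_values)
    target_values = list(target_values)
    vectors = tfidf_vectors(source_values + target_values, n)
    src = vectors[:len(source_values)]
    tgt = vectors[len(source_values):]
    matches = []
    for value, vec in zip(source_values, src):
        scores = [cosine(vec, t) for t in tgt]
        best = max(range(len(tgt)), key=scores.__getitem__)
        matches.append((value, target_values[best], scores[best]))
    return matches
```

Calling ``match_by_tfidf(["Adenocarcinom"], ["Adenocarcinoma", "Other"])`` pairs the misspelled source value with ``"Adenocarcinoma"``, because the two strings share most of their trigrams while ``"Other"`` shares none.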
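
The ``edit_distance`` row mentions a customizable scorer. As a rough sketch of what that can mean, the code below pairs values using a pluggable scoring function, with a normalized Levenshtein similarity as the default. The names ``levenshtein``, ``similarity``, and ``match_by_edit_distance``, and the ``threshold`` parameter, are hypothetical and do not reflect the ``EditDistanceValueMatcher`` interface.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the edit distance to a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_by_edit_distance(source_values, target_values,
                           scorer=similarity, threshold=0.0):
    """Pair each source value with the target value that scores highest.

    `scorer` is any function (str, str) -> float where higher means more
    similar; pairs scoring below `threshold` are dropped.
    """
    matches = []
    for value in source_values:
        best = max(target_values, key=lambda t: scorer(value, t))
        score = scorer(value, best)
        if score >= threshold:
            matches.append((value, best, score))
    return matches
```

Passing a different ``scorer`` swaps the metric without touching the matching loop, which is the design point the table row alludes to.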