Add Semantic Deduplication #9

sidjha1 · 2024-09-27T23:48:22Z

Add a semantic deduplication operator. Deduplication is based on semantic similarity computed with an embedding model: pairs of elements whose similarity exceeds the threshold are considered duplicates.

Example

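import pandas as pd

# sem_index and sem_dedup are not part of pandas; they come from this
# project's DataFrame extensions, so the package must be imported and an
# embedding model set up before this runs.
data = {
    "Text": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "I don't know what day it is",
        "I don't know what time it is",
        "Harry potter and the Sorcerer's Stone",
    ]
}
df = pd.DataFrame(data)
# Index the "Text" column into "index_dir", then drop rows whose similarity
# to another row exceeds the threshold.
df = df.sem_index("Text", "index_dir").sem_dedup("Text", threshold=0.815)
print(df)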

This prints:

                                    Text
1    Optimization Methods in Engineering
4            I don't know what day it is
6  Harry potter and the Sorcerer's Stone
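
For readers who want to see the thresholding idea from the description in isolation, here is a minimal sketch. It is not the code in this PR: the function name `dedup_by_similarity`, the toy vectors, and the choice to keep the earliest row of each duplicate group are all assumptions made for illustration. It treats "similarity exceeds the threshold" as an edge between two rows and keeps one row per connected group.

```python
import numpy as np


def dedup_by_similarity(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Return the indices of rows to keep, one per group of near-duplicates.

    Hypothetical helper; assumes every embedding has a nonzero norm.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T

    n = len(embeddings)
    # Union-find: parent[i] points toward the representative of i's group.
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller one so the
                    # earliest row stays the group's representative.
                    parent[max(ri, rj)] = min(ri, rj)

    return sorted({find(i) for i in range(n)})


# Toy 2-D vectors standing in for real embeddings (made up for illustration).
vectors = np.array([
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [0.1, 0.99],
    [-1.0, 0.0],
])
keep = dedup_by_similarity(vectors, threshold=0.9)
print(keep)  # [0, 2, 4]
```

Which member of a duplicate group survives is a design choice. This sketch keeps the first occurrence, which need not match the operator added here; the example output above keeps row 1 rather than row 0.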

Add semantic dedup

ef13deb

sidjha1 requested a review from liana313 September 27, 2024 23:48

liana313 approved these changes Sep 29, 2024

liana313 merged commit ff20f9b into main Sep 29, 2024
1 check passed

sidjha1 deleted the sid/add-dedup branch October 1, 2024 00:15
