Add Semantic Deduplication #9

Merged
merged 1 commit into from
Sep 29, 2024

sidjha1 (Collaborator) commented Sep 27, 2024

Add a semantic deduplication operator. Deduplication is based on semantic similarity computed with an embedding model. Pairs of elements whose similarity exceeds the threshold are considered duplicates.

Example

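import pandas as pd

# sem_index and sem_dedup are the DataFrame operators from this project;
# the project import and embedding-model configuration are assumed to be done already.
data = {
    "Text": [
        "Probability and Random Processes",
        "Optimization Methods in Engineering",
        "Digital Design and Integrated Circuits",
        "Computer Security",
        "I don't know what day it is",
        "I don't know what time it is",
        "Harry potter and the Sorcerer's Stone",
    ]
}
df = pd.DataFrame(data)
# Build the semantic index on "Text", then drop rows whose similarity exceeds the threshold.
df = df.sem_index("Text", "index_dir").sem_dedup("Text", threshold=0.815)
print(df)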

This prints:

                                    Text
1    Optimization Methods in Engineering
4            I don't know what day it is
6  Harry potter and the Sorcerer's Stone
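
The thresholding rule described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the function name `dedup_by_similarity`, the toy 2-D vectors standing in for embeddings, the use of cosine similarity, and the choice to keep the lowest index in each duplicate group are all assumptions made for illustration. The operator itself may pick a different representative, as the output above (which keeps row 1) suggests.

```python
import numpy as np


def dedup_by_similarity(embeddings, threshold):
    """Return sorted row indices to keep, one per group of near-duplicates.

    Hypothetical helper for illustration; assumes every row is non-zero.
    """
    vectors = np.asarray(embeddings, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)

    # Union-find: parent[i] leads to the representative of i's group.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # A pair whose similarity exceeds the threshold is a duplicate pair.
            if sims[i, j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Assumption: the lowest index represents the group.
                    parent[max(ri, rj)] = min(ri, rj)

    return sorted({find(i) for i in range(n)})


# Made-up vectors: rows 0 and 1 point in almost the same direction.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept = dedup_by_similarity(toy, threshold=0.9)
print(kept)  # [0, 2]
```

Because duplicates are grouped transitively here, two rows can end up in the same group without being directly similar to each other, as long as a chain of above-threshold pairs connects them. Raising the threshold makes the rule stricter, so fewer rows are dropped.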

sidjha1 requested a review from liana313 on September 27, 2024 23:48
liana313 merged commit ff20f9b into main on Sep 29, 2024
1 check passed
sidjha1 deleted the sid/add-dedup branch on October 1, 2024 00:15