Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed similarity_percentile with sdpm chunker + added test #65

Conversation

pratyushmittal
Copy link
Contributor

similarity_percentile was not working with spdm chunker because the similarity_threshold was not computed. I have added test to reproduce the same.

Fix
I moved the computation similarity_threshold into _prepare_sentences. This decouples the computation of similarity_threshold with other steps.

`similarity_percentile` was not working with spdm chunker because the
`similarity_threshold` was not computed. I have added test to reproduce
the same.

To fix this, I moved the computation `similarity_threshold` into
`_prepare_sentences`. This decouples the computation of
`similarity_threshold` with other steps.
@bhavnicksm
Copy link
Collaborator

Hey @pratyushmittal!

Thanks for submitting a PR! 😊

Sorry, just been involved with some work, so it might take a while, I'll review this PR as soon as possible again.

Thanks!

@pratyushmittal
Copy link
Contributor Author

Hey @bhavnicksm, please let me know if you need any help in this PR.

I am loving Chonkie. I want to try out similarity_percentile in sdpm_chunker. That is currently broken. This PR fixes it.

@bhavnicksm
Copy link
Collaborator

bhavnicksm commented Dec 4, 2024

Hey @pratyushmittal!

Sorry for the delayed response, I've been involved in some other aspects so couldn't look at this earlier. 😅

I see the issue you mentioned here, and I don't think it has anything to do with decoupling than with not storing the similarity_threshold once computed.

I added a patch fix for it by converting similarity_threshold to self.similarity_threshold, and it passes the test.

If you could refactor the PR to have a function in the class called _compute_similarity_threshold and then call it inside the chunk fn then, I could go ahead with merging the PR.

If not, I can add a patch for that in later versions.

Thanks~ 😊

@bhavnicksm
Copy link
Collaborator

Hey @pratyushmittal!

Closing this since I added a fix in #79 — this PR wouldn't have been possible without you pointing out the issue and emphasizing it.

Thank you so much for the PR~ Looking fwd to future contributions from you as well 😊

Thanks!

@bhavnicksm bhavnicksm closed this Dec 6, 2024
@pratyushmittal
Copy link
Contributor Author

This is so good @bhavnicksm. Thanks a lot for reviewing the PR, providing the feedback and pushing the fix.

You are doing a wonderful work 🙌🏽!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants