[FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes #79

Merged: 4 commits merged into main on Dec 6, 2024


bhavnicksm
Collaborator

This pull request reworks the SDPMChunker and SemanticChunker classes. It adds parameters that give finer control over chunking, introduces new methods for computing the similarity threshold, and refactors existing logic for readability.

Major Changes:

Enhancements to SDPMChunker:

  • Added new parameters to the SDPMChunker class, including mode, threshold, similarity_window, min_sentences, min_characters_per_sentence, and threshold_step, to provide more control over the chunking process. (src/chonkie/chunker/sdpm.py)
  • Introduced a new method for calculating the similarity threshold via binary search, improving the accuracy of chunking based on semantic similarity. (src/chonkie/chunker/sdpm.py)
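
As a rough illustration of how a threshold can be found by binary search, here is a minimal sketch. It is not the code in src/chonkie/chunker/sdpm.py: the function names, the `tolerance` argument, and the stopping rule (largest group must fit in `chunk_size` tokens) are all assumptions made for this example.

```python
from typing import List


def split_sizes(similarities: List[float], token_counts: List[int], threshold: float) -> List[int]:
    """Token count of each group when a split is made wherever similarity < threshold.

    similarities[i] is the similarity between sentence i and sentence i + 1,
    so len(similarities) == len(token_counts) - 1.
    """
    sizes = [token_counts[0]]
    for sim, count in zip(similarities, token_counts[1:]):
        if sim < threshold:
            sizes.append(count)  # similarity too low: start a new group
        else:
            sizes[-1] += count  # similar enough: extend the current group
    return sizes


def find_threshold(
    similarities: List[float],
    token_counts: List[int],
    chunk_size: int,
    tolerance: float = 0.01,  # hypothetical precision knob, not a chonkie parameter
) -> float:
    """Binary-search the lowest threshold whose largest group fits in chunk_size.

    Raising the threshold can only add split points, so the largest group
    never grows as the threshold increases; that monotonicity is what makes
    binary search valid here. If a single sentence already exceeds
    chunk_size, no threshold helps and the result converges towards 1.0.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if max(split_sizes(similarities, token_counts, mid)) <= chunk_size:
            hi = mid  # mid is strict enough; try a lower threshold
        else:
            lo = mid  # groups still too large; need a higher threshold
    return hi
```

With similarities `[0.9, 0.2, 0.8]`, four sentences of 10 tokens each, and `chunk_size=20`, the search settles just above 0.2, which splits the text once at the weakest link.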

Enhancements to SemanticChunker:

  • Added new parameters to the SemanticChunker class, matching those added to SDPMChunker, to provide better control over the chunking process. (src/chonkie/chunker/semantic.py)
  • Improved the _prepare_sentences method to compute embeddings based on a similarity window, enhancing the accuracy of semantic grouping. (src/chonkie/chunker/semantic.py)
  • Introduced new methods for calculating similarity thresholds using binary search and percentile, providing more flexibility in determining chunk boundaries. (src/chonkie/chunker/semantic.py)
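
The window-based preparation and the percentile threshold mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in src/chonkie/chunker/semantic.py: `window_groups` and `percentile_threshold` are hypothetical names, the window is assumed to be symmetric around each sentence, and the percentile uses plain linear interpolation.

```python
from typing import List


def window_groups(sentences: List[str], window: int) -> List[str]:
    """Join each sentence with up to `window` neighbours on either side.

    The joined text, rather than the lone sentence, would then be passed to
    the embedding model so each embedding carries some local context.
    """
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - window)
        groups.append(" ".join(sentences[start : i + window + 1]))
    return groups


def percentile_threshold(similarities: List[float], percentile: float) -> float:
    """Return the given percentile (0-100) of the similarities.

    Uses linear interpolation between the two nearest ranks.
    """
    if not similarities:
        raise ValueError("similarities must not be empty")
    ordered = sorted(similarities)
    rank = (len(ordered) - 1) * percentile / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

A percentile-based threshold adapts to each document's own similarity distribution, whereas a fixed similarity value behaves differently across embedding models.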

Code Refactoring:

  • Added return type annotations to the _encode and _encode_batch methods in BaseChunker for better type checking and readability. (src/chonkie/chunker/base.py)
  • Updated docstrings and method descriptions to reflect the new parameters and methods. (src/chonkie/chunker/sdpm.py, src/chonkie/chunker/semantic.py)
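
For readers unfamiliar with the annotation change, the sketch below shows the shape of the two signatures. Only the method names and return types come from this pull request; `ToyTokenizer` and `SketchChunker` are hypothetical stand-ins, and the real BaseChunker wraps an actual tokenizer.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer that assigns ids in order of first appearance."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


class SketchChunker:
    """Stand-in for a chunker base class, showing only the annotated methods."""

    def __init__(self, tokenizer: ToyTokenizer) -> None:
        self.tokenizer = tokenizer

    def _encode(self, text: str) -> List[int]:
        # One text in, one flat list of token ids out.
        return self.tokenizer.encode(text)

    def _encode_batch(self, texts: List[str]) -> List[List[int]]:
        # Many texts in, one list of token ids per text out.
        return [self._encode(text) for text in texts]
```

With the return types spelled out, a type checker can flag callers that treat the batch result as a flat list.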

These changes collectively enhance the functionality, flexibility, and readability of the chunking logic in the chonkie library.

bhavnicksm and others added 4 commits December 6, 2024 02:41
- Introduced a warning message when no similarity threshold is specified, defaulting to the 80th percentile.
- Removed the previous blocking warning and replaced it with a non-blocking warning to improve user experience.
- Adjusted the calculation of similarity threshold to ensure proper handling of percentile values.
- Updated the method signatures of `_encode` and `_encode_batch` in the BaseChunker class to include return type hints, improving code clarity and type safety.
- The `_encode` method now explicitly returns a List[int], while `_encode_batch` returns a List[List[int]].
…ing percentile mode

- Updated SDPMChunker and SemanticChunker to replace similarity_threshold and similarity_percentile with a unified threshold parameter, enhancing clarity and usability.
- Introduced new parameters: mode, min_sentences, min_characters_per_sentence, and threshold_step to provide more control over chunking behavior.
- Refactored chunking logic to support both cumulative and window-based grouping of sentences, improving flexibility in semantic chunking.
- Enhanced docstrings and method signatures for better documentation and understanding of class functionalities.
- Updated tests to reflect changes in parameter names and ensure proper initialization and functionality of chunkers.
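
One way to picture a unified threshold parameter, together with the non-blocking warning from the earlier commit, is the sketch below. It is an assumption, not the merged code: `resolve_threshold` is a hypothetical helper, the rule that floats in [0, 1] mean a similarity value and integers in [1, 100] mean a percentile is a guess at the dispatch, and the `None` branch carries over behavior described in a separate commit that the final API may not retain.

```python
import warnings
from typing import Optional, Tuple, Union


def resolve_threshold(
    threshold: Optional[Union[str, float, int]],
) -> Tuple[str, Optional[float]]:
    """Map a user-supplied threshold to a (kind, value) pair.

    Returns one of ("auto", None), ("similarity", x) or ("percentile", p).
    """
    if threshold is None:
        # warnings.warn reports the fallback without stopping execution.
        warnings.warn(
            "No threshold specified; defaulting to the 80th percentile.",
            UserWarning,
            stacklevel=2,
        )
        return ("percentile", 80.0)
    if isinstance(threshold, str):
        if threshold == "auto":
            return ("auto", None)  # threshold to be computed from the text itself
        raise ValueError(f"Unknown threshold mode: {threshold!r}")
    if isinstance(threshold, bool):
        # bool is a subclass of int, so reject it before the int check.
        raise ValueError("threshold must not be a bool")
    if isinstance(threshold, float) and 0.0 <= threshold <= 1.0:
        return ("similarity", threshold)
    if isinstance(threshold, int) and 1 <= threshold <= 100:
        return ("percentile", float(threshold))
    raise ValueError(f"Unsupported threshold value: {threshold!r}")
```

Folding the two former parameters into one avoids the ambiguous case where a caller sets both a similarity value and a percentile.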
bhavnicksm self-assigned this Dec 6, 2024
bhavnicksm added the enhancement (New feature or request) label Dec 6, 2024
bhavnicksm merged commit 1e784c2 into main Dec 6, 2024
1 check passed