Add TEVL to speed up sentence chunker #71

Merged: 2 commits, Nov 26, 2024


bhavnicksm
Collaborator

This pull request changes src/chonkie/chunker/sentence.py to speed up sentence chunking. The main changes are new imports, a new configuration parameter, methods for estimating token counts, and a refactor of the chunking logic to improve performance and accuracy.

Enhancements to sentence chunking:

  • src/chonkie/chunker/sentence.py: Added imports for bisect_left and accumulate to support new chunking logic.
  • src/chonkie/chunker/sentence.py: Introduced the use_approximate parameter in the __init__ method to allow for approximate token count estimation.
  • src/chonkie/chunker/sentence.py: Added the _estimate_token_counts method to estimate token counts based on character length, and the _get_feedback method to adjust estimates based on actual token counts.
  • src/chonkie/chunker/sentence.py: Refactored the _prepare_sentences method to use either accurate or estimated token counts, and to calculate sentence positions more efficiently.
  • src/chonkie/chunker/sentence.py: Refactored the chunk method to use cumulative token counts and bisect_left for efficient chunk creation, and added feedback adjustment for better accuracy.
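
The approach listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names (estimate_token_counts, update_ratio, chunk_sentences), the chars_per_token ratio, and the feedback rule are all assumptions made for the example. It only shows how a character-length estimate, a running sum from accumulate, and a bisect_left lookup can replace a per-sentence loop when finding chunk boundaries.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List


def estimate_token_counts(sentences: List[str], chars_per_token: float) -> List[int]:
    """Estimate tokens per sentence from character length (hypothetical helper)."""
    # Assumption: tokens scale roughly linearly with characters; at least 1 per sentence.
    return [max(1, round(len(s) / chars_per_token)) for s in sentences]


def update_ratio(chars_per_token: float, estimated: int, actual: int) -> float:
    """Feedback step (hypothetical): rescale the ratio toward the observed token count."""
    if estimated <= 0 or actual <= 0:
        return chars_per_token
    # estimated ~ chars / ratio, so the ratio implied by the actual count
    # is ratio * estimated / actual.
    return chars_per_token * estimated / actual


def chunk_sentences(
    sentences: List[str], token_counts: List[int], chunk_size: int
) -> List[List[str]]:
    """Group sentences so each chunk's token total stays within chunk_size."""
    # cumulative[i] is the total token count of sentences[:i].
    cumulative = [0] + list(accumulate(token_counts))
    chunks: List[List[str]] = []
    start = 0
    while start < len(sentences):
        target = cumulative[start] + chunk_size
        # First index whose running total exceeds the target, minus one,
        # is the last boundary that still fits (counts are integers).
        end = bisect_left(cumulative, target + 1, start + 1) - 1
        # Always take at least one sentence, even if it alone exceeds chunk_size.
        end = max(end, start + 1)
        chunks.append(sentences[start:end])
        start = end
    return chunks
```

For example, with four sentences of 3 tokens each and chunk_size=6, chunk_sentences groups them two per chunk. Each boundary is found by binary search over the running totals, so the cost per chunk is logarithmic in the number of sentences rather than linear.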

@bhavnicksm bhavnicksm merged commit 3f37fe9 into development Nov 26, 2024
1 check failed