Add TEVL to speed up sentence chunker #71

bhavnicksm · 2024-11-26T01:43:53Z

This pull request changes src/chonkie/chunker/sentence.py to speed up sentence chunking. It adds two imports, introduces a use_approximate configuration parameter, adds methods for estimating token counts, and refactors the chunking logic to improve performance and accuracy.

Enhancements to sentence chunking:

src/chonkie/chunker/sentence.py: Added imports for bisect_left and accumulate to support new chunking logic.
src/chonkie/chunker/sentence.py: Introduced the use_approximate parameter in the __init__ method to allow for approximate token count estimation.
src/chonkie/chunker/sentence.py: Added the _estimate_token_counts method to estimate token counts based on character length, and the _get_feedback method to adjust estimates based on actual token counts.
src/chonkie/chunker/sentence.py: Refactored the _prepare_sentences method to use either accurate or estimated token counts, and to calculate sentence positions more efficiently.
src/chonkie/chunker/sentence.py: Refactored the chunk method to use cumulative token counts and bisect_left for efficient chunk creation, and added feedback adjustment for better accuracy.
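
The following is a minimal sketch of the idea described above, not the code in this PR. It assumes a simple characters-per-token heuristic for the estimate, a ratio-based feedback correction, and a greedy grouping over cumulative token counts. The function names (estimate_token_counts, adjust_chars_per_token, chunk_by_cumulative_counts) and the chars_per_token parameter are hypothetical and chosen for illustration only.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List


def estimate_token_counts(sentences: List[str], chars_per_token: float) -> List[int]:
    """Estimate tokens per sentence from character length (assumed heuristic)."""
    # Every sentence counts as at least one token so cumulative sums strictly increase.
    return [max(1, round(len(s) / chars_per_token)) for s in sentences]


def adjust_chars_per_token(chars_per_token: float, estimated: int, actual: int) -> float:
    """Feedback step (hypothetical): rescale the ratio using a real token count.

    If the estimate was too high, chars_per_token was too low, so it grows.
    """
    if estimated <= 0 or actual <= 0:
        return chars_per_token
    return chars_per_token * estimated / actual


def chunk_by_cumulative_counts(token_counts: List[int], chunk_size: int) -> List[List[int]]:
    """Greedily group sentence indices so each chunk stays within chunk_size tokens.

    Assumes every count is >= 1. A single sentence larger than chunk_size
    still forms its own chunk.
    """
    n = len(token_counts)
    # cumulative[i] is the total token count of sentences 0..i-1.
    cumulative = [0] + list(accumulate(token_counts))
    chunks: List[List[int]] = []
    start = 0
    while start < n:
        target = cumulative[start] + chunk_size
        # First index whose cumulative total is >= target.
        i = bisect_left(cumulative, target, lo=start + 1)
        # An exact match still fits; otherwise step back one sentence.
        end = i if i <= n and cumulative[i] == target else i - 1
        # Always consume at least one sentence to guarantee progress.
        end = max(end, start + 1)
        chunks.append(list(range(start, end)))
        start = end
    return chunks
```

Because the cumulative sums are sorted, each chunk boundary is found with a binary search instead of re-summing sentence counts, which is where the speed-up in this kind of approach would come from. The feedback step lets a cheap estimate drift back toward the tokenizer's real counts whenever an actual count is available.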

bhavnicksm added 2 commits November 26, 2024 07:12

Add TEVL to speed up sentence chunker

5250ea9

[chore] run ruff linting

39304b0

bhavnicksm merged commit 3f37fe9 into development Nov 26, 2024
1 check failed
