[FIX] `start_index` incorrect when `chunk_overlap` is not 0 (#116) #132
This pull request includes several changes to improve the chunking process in the `TokenChunker` class and update the associated tests. The most important changes involve modifying the chunk creation logic to handle overlapping tokens and updating the test suite to use a new tokenizer.

Improvements to chunk creation logic:

- `src/chonkie/chunker/base.py`: Updated the `_decode_batch` method to use more efficient batch decoding functions for different tokenizer backends.
- `src/chonkie/chunker/token.py`: Refactored the `_create_chunks` method to calculate overlap lengths and adjust the current index accordingly. [1] [2] [3]
- `src/chonkie/chunker/token.py`: Renamed `_chunk_generator` to `_token_group_generator` for clarity and simplified its implementation.
- `src/chonkie/chunker/token.py`: Updated `_process_text_batch` to use the new chunk creation logic and removed unnecessary decoding steps.

Updates to test suite:

- `tests/chunker/test_token_chunker.py`: Replaced the `tokenizer` parameter with `tiktokenizer` in multiple test functions to use the updated tokenizer. [1] [2] [3] [4] [5] [6]
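
For reviewers who want to see the idea behind the `start_index` fix, here is a minimal sketch of one way to track character offsets when consecutive chunks share tokens. It is not the code from `src/chonkie/chunker/token.py`. The helpers `toy_encode`, `toy_decode`, and `chunk_with_start_indices` are hypothetical names invented for illustration, and the whitespace-based tokenizer is a stand-in for a real backend. The key step is that the next chunk starts at the previous chunk's end minus the decoded length of the overlapping tokens.

```python
import re
from typing import List, Tuple


def toy_encode(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer: each token is a word plus its leading whitespace,
    # so joining the tokens reproduces the text (trailing whitespace aside).
    return re.findall(r"\s*\S+", text)


def toy_decode(tokens: List[str]) -> str:
    return "".join(tokens)


def chunk_with_start_indices(
    text: str, chunk_size: int, chunk_overlap: int
) -> List[Tuple[str, int, int]]:
    """Return (chunk_text, start_index, end_index) triples for overlapping chunks."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap
    chunks: List[Tuple[str, int, int]] = []
    current_index = 0

    for start in range(0, len(tokens), step):
        group = tokens[start:start + chunk_size]
        chunk_text = toy_decode(group)
        end_index = current_index + len(chunk_text)
        chunks.append((chunk_text, current_index, end_index))

        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the last token

        # Decode only the tokens shared with the next chunk and step back by
        # their character length, so the next start_index points at the overlap.
        overlap_text = toy_decode(group[-chunk_overlap:]) if chunk_overlap else ""
        current_index = end_index - len(overlap_text)

    return chunks
```

A quick check of the invariant the fix is meant to restore is that `text[start_index:end_index] == chunk_text` holds for every chunk, including when `chunk_overlap` is non-zero. Advancing `current_index` by the full chunk length instead would break that invariant for every chunk after the first.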