[FIX] `start_index` incorrect when `chunk_overlap` is not 0 (#116) #132

bhavnicksm · 2025-01-04T22:07:32Z

This pull request includes several changes to improve the chunking process in the TokenChunker class and update the associated tests. The most important changes involve modifying the chunk creation logic to handle overlapping tokens and updating the test suite to use a new tokenizer.

Improvements to chunk creation logic:

- `src/chonkie/chunker/base.py`: Updated the `_decode_batch` method to use more efficient batch decoding functions for different tokenizer backends.
- `src/chonkie/chunker/token.py`: Refactored the `_create_chunks` method to calculate overlap lengths and adjust the current index accordingly.
- `src/chonkie/chunker/token.py`: Renamed `_chunk_generator` to `_token_group_generator` for clarity and simplified its implementation.
- `src/chonkie/chunker/token.py`: Updated `_process_text_batch` to use the new chunk creation logic and removed unnecessary decoding steps.
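
For reviewers, here is a minimal sketch of the index bookkeeping described above. It is not the code in `token.py`. It assumes a toy lossless tokenizer (`toy_encode` / `toy_decode`, both hypothetical) in place of a real backend, and `chunk_with_offsets` is an illustrative name. The idea is that each chunk starts at the previous chunk's end minus the character length of the decoded overlap tokens, and the overlap is only decoded when `chunk_overlap > 0`.

```python
import re
from typing import List, Tuple


def toy_encode(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer: a token is a word plus its trailing
    # whitespace, so joining the tokens reproduces the text exactly.
    return re.findall(r"\S+\s*|\s+", text)


def toy_decode(tokens: List[str]) -> str:
    return "".join(tokens)


def chunk_with_offsets(
    text: str, chunk_size: int, chunk_overlap: int
) -> List[Tuple[str, int, int]]:
    """Return (chunk_text, start_index, end_index) for overlapping token chunks."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap
    chunks: List[Tuple[str, int, int]] = []
    current_index = 0  # character offset where the next chunk starts

    for start in range(0, len(tokens), step):
        group = tokens[start : start + chunk_size]
        chunk_text = toy_decode(group)
        start_index = current_index
        end_index = start_index + len(chunk_text)
        chunks.append((chunk_text, start_index, end_index))

        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text

        # Guard the slice: group[-0:] would be the whole group, not an empty one.
        overlap_length = (
            len(toy_decode(group[-chunk_overlap:])) if chunk_overlap > 0 else 0
        )
        # The next chunk begins where this chunk's overlap region begins.
        current_index = end_index - overlap_length

    return chunks
```

With this bookkeeping, `text[start_index:end_index]` equals the chunk text for every chunk, whether or not `chunk_overlap` is 0. This holds here because the toy tokenizer decodes losslessly; a real tokenizer whose decode does not round-trip exactly would need extra care.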

Updates to test suite:

- `tests/chunker/test_token_chunker.py`: Replaced the `tokenizer` parameter with `tiktokenizer` in multiple test functions to use the updated tokenizer.

- removed the unnecessary `join` as there is only one token_group.
- replaced `_decode_batch` with `_decode`

- `start_index` remains 0 when `chunk_overlap` is 0, fixed it.

- applies only when chunk_overlap > 0
- batch decoding for overlap texts

[FIX] #116: Incorrect `start_index` when `chunk_overlap` is not 0

Udayk02 and others added 11 commits January 2, 2025 18:08

bugfix #116

6910b92

update: bugfix #116

83940b9

- removed the unnecessary `join` as there is only one token_group.
- replaced `_decode_batch` with `_decode`

update: bugfix #116

53d532d

- `start_index` remains 0 when `chunk_overlap` is 0, fixed it.

update: bugfix #116

e069fb7

- applies only when chunk_overlap > 0
- batch decoding for overlap texts

Merge branch 'chonkie-ai:development' into development

5d401a0

Merge pull request #126 from Udayk02/development

5b75303

[FIX] #116: Incorrect `start_index` when `chunk_overlap` is not 0

[fix] use proper decode batch functions in _decode_batch

87b6306

[fix] start_index shouldn't use full_text find in batch mode

0d3069e

use tiktoken for most tests

a9d5eaa

[minor] fix the generator type syntax

09e26a3

Merge branch 'main' into development

09f7192

bhavnicksm merged commit 4c09a64 into main Jan 4, 2025
6 checks passed
