[Fix] Allow for functions as token_counters in BaseChunkers #70

Merged: 3 commits, Nov 25, 2024

bhavnicksm
Collaborator

This pull request updates src/chonkie/chunker/base.py to improve handling of different tokenizer backends and to make error messages more informative. The key changes are: importing the inspect module to check whether an object is a function, updating the initialization logic to handle each kind of tokenizer input, and including the unsupported tokenizer backend type in error messages.

Improvements to tokenizer handling:

  • Added import of inspect module to check if an object is a function. (src/chonkie/chunker/base.py)
  • Updated the __init__ method to first check if the tokenizer_or_token_counter is a string, then check if it is a function using inspect.isfunction, and finally assume it is a tokenizer object if neither condition is met. (src/chonkie/chunker/base.py)
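
The dispatch order described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in base.py: the class name `SketchChunker`, the `kind` attribute, and the `count_tokens` method are hypothetical names introduced here; only the string check, the `inspect.isfunction` check, and the fall-through to a tokenizer object come from the description.

```python
import inspect


class SketchChunker:
    """Hypothetical stand-in for a base chunker (not chonkie's real class)."""

    def __init__(self, tokenizer_or_token_counter="gpt2"):
        self.source = tokenizer_or_token_counter
        self.token_counter = None

        if isinstance(tokenizer_or_token_counter, str):
            # 1. A string is treated as a tokenizer name to load later.
            self.kind = "name"
        elif inspect.isfunction(tokenizer_or_token_counter):
            # 2. A plain Python function (or lambda) is used as the token counter.
            self.kind = "callable"
            self.token_counter = tokenizer_or_token_counter
        else:
            # 3. Anything else is assumed to be a tokenizer object.
            self.kind = "tokenizer"

    def count_tokens(self, text):
        # Only the function case is implemented in this sketch.
        if self.token_counter is None:
            raise NotImplementedError(
                f"Counting is not sketched for input kind {self.kind!r}"
            )
        return self.token_counter(text)


# A whitespace word counter passed in as a function.
chunker = SketchChunker(lambda text: len(text.split()))
print(chunker.kind, chunker.count_tokens("a b c"))
```

One consequence of using `inspect.isfunction`: it returns True only for functions defined in Python (including lambdas). Built-ins such as `len`, `functools.partial` objects, and instances with `__call__` fail the check and fall through to the tokenizer-object branch in this sketch.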

Enhanced error messages:

  • Updated the _get_tokenizer_backend method to include the unsupported tokenizer backend type in the error message. (src/chonkie/chunker/base.py)
  • Updated the _encode, _encode_batch, _decode, and _decode_batch methods to include the unsupported tokenizer backend type in the error messages. (src/chonkie/chunker/base.py)
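
As an illustration of the error-message change, here is a small sketch of a backend check that names the offending type when it fails. The function name `detect_backend`, the list of backend names, the module-based detection, and the message wording are all assumptions made for this example; the PR description only states that the unsupported backend type now appears in the message.

```python
def detect_backend(tokenizer):
    """Hypothetical helper: guess a tokenizer backend from the object's module."""
    # Assumed backend names, used only for illustration.
    known_backends = ("tiktoken", "tokenizers", "transformers")

    # Top-level package the object's class was defined in.
    top_level_module = type(tokenizer).__module__.split(".")[0]
    if top_level_module in known_backends:
        return top_level_module

    # Include the concrete type so the caller can see what was passed in.
    raise ValueError(f"Tokenizer backend {type(tokenizer)!r} not supported")


try:
    detect_backend(42)
except ValueError as err:
    print(err)
```

Naming the type in the message lets a user tell at a glance whether they passed the wrong object (for example an int or a dict) or a tokenizer from a library that is simply not handled.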

@bhavnicksm merged commit 75f214b into main on Nov 25, 2024
1 check passed