
[FEAT]Regarding the implementation principles of SemanticChunker and some flexibility requirements #76

Closed
RemixaWorld opened this issue Dec 1, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

@RemixaWorld

First of all, thank you for your contributions. Pulling chunking out of the larger, more complex pipeline into its own library is absolutely significant!
Master Roshi, may I ask about the implementation of "SemanticChunker"? Haha! From what I know, semantic chunking is usually implemented by dividing the original text into semantic units (sentences, paragraphs, combinations of m sentences, etc.), calculating the similarity between each pair of units, and using the low-similarity points as breakpoints.
Is my understanding correct?
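
To make the question concrete, here is a minimal sketch of that understanding. It is not Chonkie's actual implementation. It assumes sentence-level units, compares only adjacent units, and uses a toy bag-of-words vector in place of a real embedding model. The names `split_units`, `embed`, `cosine`, `semantic_chunks` and the `threshold` value are all made up for illustration.

```python
import math
import re
from collections import Counter


def split_units(text, delims=".!?"):
    """Split text into sentence-like units, keeping each delimiter with its unit."""
    cls = re.escape(delims)
    pattern = "[^" + cls + "]+[" + cls + "]*"
    return [u.strip() for u in re.findall(pattern, text) if u.strip()]


def embed(unit):
    """Toy bag-of-words vector (assumption: a real chunker would use an embedding model)."""
    return Counter(re.findall(r"\w+", unit.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    """Start a new chunk wherever adjacent units fall below the similarity threshold."""
    units = split_units(text)
    if not units:
        return []
    vectors = [embed(u) for u in units]
    chunks, current = [], [units[0]]
    for prev_vec, next_vec, unit in zip(vectors, vectors[1:], units[1:]):
        if cosine(prev_vec, next_vec) < threshold:
            # Low similarity between neighbours: treat this as a breakpoint.
            chunks.append(" ".join(current))
            current = []
        current.append(unit)
    chunks.append(" ".join(current))
    return chunks
```

With this sketch, two sentences about a cat followed by two sentences about stock prices would be grouped into two chunks, because the middle pair shares no words.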

Additionally, my working language is Chinese, so for word segmentation and similar operations I have to be cautious with some open-source projects, because language mismatches can cause subtle problems. I would appreciate it if some of the basic elements were made transparent or customizable, such as the conditions that determine sentence boundaries in "SentenceChunker" (English punctuation only, or Chinese punctuation as well?).

Thank you again!

@RemixaWorld RemixaWorld added the enhancement New feature or request label Dec 1, 2024
@bhavnicksm
Collaborator

Hey @RemixaWorld!

Thanks for raising an issue! 😁

I would love to work further with you on incorporating support for Chinese as well. Chonkie tries to be as open as possible, but it defaults to English. Apart from the pre-splitting step in some chunkers, the chunkers are easy to extend to languages other than English.

If you could tell me which punctuation marks you'd need for Chinese, I can add support for Chinese by default. Again, I want Chonkie to be the go-to for you, so if multilingual support would help, I'd be happy to support it 😄

Thanks! 🙏

@RemixaWorld
Author

@bhavnicksm Thanks for your reply!
As for Chonkie's ease of use: absolutely!
The conventional sentence-ending punctuation marks in Chinese are:

  • 。
  • !
  • ?

Or in conversation:

  • 。”
  • !”
  • ?”
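
As an illustration of why the conversational forms matter, here is a rough sketch of a splitter that keeps a closing quote attached to the sentence it ends. It is not how Chonkie splits sentences. `TERMINATORS`, `CLOSERS` and `split_sentences` are hypothetical names, and adding the full-width forms of "!" and "?" is an assumption on top of the marks listed above.

```python
import re

# Sentence-ending marks listed above, plus the full-width "!" and "?" (an added assumption).
TERMINATORS = "。!?\uff01\uff1f"
# Closing quote from the conversational forms above; extend as needed.
CLOSERS = "”"

_SENTENCE = re.compile(
    "[^{t}]+[{t}]+[{c}]*|[^{t}]+$".format(
        t=re.escape(TERMINATORS), c=re.escape(CLOSERS)
    )
)


def split_sentences(text):
    """Split on sentence-ending marks, keeping any trailing closing quote with its sentence."""
    return [m.group().strip() for m in _SENTENCE.finditer(text) if m.group().strip()]
```

A naive split on the bare marks alone would leave the closing quote at the start of the next sentence, which is the case this sketch tries to avoid.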

@bhavnicksm
Collaborator

bhavnicksm commented Dec 5, 2024

Hey @RemixaWorld, awesome!

Thanks for the punctuation marks! For now, I'll try to expose the delimiter punctuation on the chunkers so that they're easy to extend to new languages.

Once a proper, scalable plan for language support is finalised, it will probably look like a parameter on the chunkers, something like lang='zh'. But that might still take a while. I hope you'll keep supporting Chonkie until it reaches that point.
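
Purely as a sketch of what such a parameter could resolve to, and not a committed design: the table `DELIMS_BY_LANG` and the helper `resolve_delims` below are hypothetical and do not exist in Chonkie. The idea is that an explicit delimiter string always wins, and the language code only selects a default.

```python
# Hypothetical defaults; neither the names nor the table come from Chonkie.
DELIMS_BY_LANG = {
    "en": ".!?",
    "zh": "。!?",
}


def resolve_delims(lang="en", delim=None):
    """Return the delimiter string to split on.

    An explicit `delim` overrides the language default, so users of
    unsupported languages can still pass their own punctuation.
    """
    if delim is not None:
        return delim
    try:
        return DELIMS_BY_LANG[lang]
    except KeyError:
        raise ValueError(
            f"no default delimiters for lang={lang!r}; pass delim explicitly"
        ) from None
```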

Thanks! 😊

@RemixaWorld
Author

@bhavnicksm Sure, keep it up!

@bhavnicksm
Collaborator

Hey @RemixaWorld!

I've added preliminary support by exposing the delimiters in #81. You can now do something like this:

from chonkie import SemanticChunker

sc = SemanticChunker(delim="。!?")

chunks = sc("some text here") 

Let me know if this works for you, and if you have any other suggestions! Closing this issue for now...

Thanks! 😊
