[FEAT] Regarding the implementation principles of SemanticChunker and some flexibility requirements #76
First of all, thank you for your contributions; factoring chunking out of a complex pipeline is genuinely valuable!

Master Roshi, may I ask how SemanticChunker is implemented? Haha! As far as I know, semantic chunking is usually implemented by splitting the original text into semantic units (sentences, paragraphs, groups of m sentences, and so on), computing the similarity between neighbouring units, and treating the low-similarity points as breakpoints.

I'm not sure whether my understanding is correct?

Additionally, my working language is Chinese, so for operations like sentence segmentation I have to be cautious with some open-source projects, because a language mismatch can complicate things. I hope you can make some of the basic building blocks transparent or customizable, such as the conditions that determine sentence boundaries in SentenceChunker (English punctuation only, or Chinese punctuation as well?).

Thank you again!
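(For illustration, a minimal sketch of the breakpoint approach described above. The `embed()` stub stands in for a real sentence-embedding model, and none of the names here are Chonkie's actual implementation.)

```python
import re

import numpy as np


def embed(sentences):
    """Placeholder embedding function. Replace with a real model
    (e.g. sentence-transformers); returns one vector per sentence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 384))


def semantic_chunks(text, threshold=0.5, delim="。！？.!?"):
    # 1. Split the text into semantic units (here: sentences), keeping
    #    each delimiter attached to the sentence it ends.
    units = [u for u in re.split(f"(?<=[{re.escape(delim)}])", text) if u.strip()]
    if len(units) < 2:
        return units

    # 2. Embed each unit and L2-normalise, so a plain dot product
    #    below equals cosine similarity.
    vecs = embed(units)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    # 3. Similarity between each consecutive pair of units; a
    #    low-similarity gap becomes a chunk breakpoint.
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)

    chunks, current = [], [units[0]]
    for unit, sim in zip(units[1:], sims):
        if sim < threshold:
            chunks.append("".join(current))
            current = [unit]
        else:
            current.append(unit)
    chunks.append("".join(current))
    return chunks
```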
Hey @RemixaWorld! Thanks for raising an issue! 😁 I would love to work with you on incorporating support for Chinese as well. Chonkie always tries to be as open as possible, but it defaults to English. Except for the pre-splitting step in some chunkers, they are easily extended to languages other than English. If you could tell me which punctuation marks you'd need for Chinese, I can add support for Chinese by default. Again, I want Chonkie to be the go-to for you, so if multilingual support would help, I'd be happy to support it 😄 Thanks! 🙏
@bhavnicksm Thanks for your reply!
Hey @RemixaWorld, awesome! Thanks for the punctuation marks~ For now, I'll try to expose the separator punctuation on the chunkers so they're easy to extend to new languages. Once a proper, scalable plan for supporting languages is finalised, it will eventually look something like a parameter on the chunkers. Thanks! 😊
@bhavnicksm Sure, go for it!
Hey @RemixaWorld! Added preliminary support by exposing the delimiters in #81; you can now do something of this sort:

```python
from chonkie import SemanticChunker

sc = SemanticChunker(delim="。!?")
chunks = sc("some text here")
```

Let me know if this works for you and any other suggestions you'd have! Closing this issue for now... Thanks! 😊
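(A follow-up on the sentence-boundary question from the original post: assuming SentenceChunker exposes the same `delim` parameter that #81 added for SemanticChunker, which this thread does not confirm, Chinese sentence boundaries could be configured the same way. Verify against your installed version.)

```python
from chonkie import SentenceChunker

# Assumption: SentenceChunker accepts the same delim parameter as
# SemanticChunker after #81; check your installed chonkie version.
chunker = SentenceChunker(delim="。！？；")
chunks = chunker("这是第一句。这是第二句！这是第三句？")
```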