[FEAT] Regarding the implementation principles of SemanticChunker and some flexibility requirements #76
First of all, thank you for your contributions; factoring chunking out of a complex pipeline is genuinely valuable!

Master Roshi, may I ask how SemanticChunker is implemented? Haha! As far as I know, semantic chunking is usually implemented by splitting the original text into semantic units (sentences, paragraphs, groups of m sentences, and so on), computing the similarity between neighbouring units, and treating the low-similarity points as breakpoints.

I'm not sure whether my understanding is correct?

Additionally, my working language is Chinese, so for operations like sentence segmentation I have to be cautious with some open-source projects, because a language mismatch can complicate things. I hope you can make some of the basic building blocks transparent or customizable, such as the conditions that determine sentence boundaries in SentenceChunker (English punctuation only, or Chinese punctuation as well?).

Thank you again!
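(For illustration, a minimal sketch of the breakpoint approach described above. The `embed()` stub stands in for a real sentence-embedding model, and none of the names here are Chonkie's actual implementation.)

```python
import re

import numpy as np


def embed(sentences):
    """Placeholder embedding function. Replace with a real model
    (e.g. sentence-transformers); returns one vector per sentence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 384))


def semantic_chunks(text, threshold=0.5, delim="。！？.!?"):
    # 1. Split the text into semantic units (here: sentences), keeping
    #    each delimiter attached to the sentence it ends.
    units = [u for u in re.split(f"(?<=[{re.escape(delim)}])", text) if u.strip()]
    if len(units) < 2:
        return units

    # 2. Embed each unit and L2-normalise, so a plain dot product
    #    below equals cosine similarity.
    vecs = embed(units)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    # 3. Similarity between each consecutive pair of units; a
    #    low-similarity gap becomes a chunk breakpoint.
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)

    chunks, current = [], [units[0]]
    for unit, sim in zip(units[1:], sims):
        if sim < threshold:
            chunks.append("".join(current))
            current = [unit]
        else:
            current.append(unit)
    chunks.append("".join(current))
    return chunks
```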
Hey @RemixaWorld! Thanks for raising an issue! 😁 I would love to work with you on incorporating support for Chinese as well. Chonkie always tries to be as open as possible, but it defaults to English. Except for the pre-splitting step in some chunkers, they are easily extended to languages other than English. If you could tell me which punctuation marks you'd need for Chinese, I can add support for Chinese by default. Again, I want Chonkie to be the go-to for you, so if multilingual support would help, I'd be happy to support it 😄 Thanks! 🙏
@bhavnicksm Thanks for your reply!
Hey @RemixaWorld, awesome! Thanks for the punctuation marks~ For now, I'll try to expose the separator punctuation on the chunkers so they're easy to extend to new languages. Once a proper, scalable plan for supporting languages is finalised, it will eventually look something like a parameter on the chunkers. Thanks! 😊
@bhavnicksm Sure, go for it!
Hey @RemixaWorld! Added preliminary support by exposing the delimiters in #81; you can now do something of this sort:

```python
from chonkie import SemanticChunker

sc = SemanticChunker(delim="。!?")
chunks = sc("some text here")
```

Let me know if this works for you and any other suggestions you'd have! Closing this issue for now... Thanks! 😊
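(A follow-up on the sentence-boundary question from the original post: assuming SentenceChunker exposes the same `delim` parameter that #81 added for SemanticChunker, which this thread does not confirm, Chinese sentence boundaries could be configured the same way. Verify against your installed version.)

```python
from chonkie import SentenceChunker

# Assumption: SentenceChunker accepts the same delim parameter as
# SemanticChunker after #81; check your installed chonkie version.
chunker = SentenceChunker(delim="。！？；")
chunks = chunker("这是第一句。这是第二句！这是第三句？")
```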