[FEAT] Advanced regex-based parsing + XML + chunk metadata #148
Comments
@finnschwall could this be related to #150? The PR I issued is designed to be a stripped-down version, but my next project would be a true hierarchical chunker class that operates the way you described. The basic idea is that, by setting patterns for your key document structure, that content would be found no matter how far "above" your prefix chunk it was; you would extract a fixed window from that point and then take the remaining tokens from the prefix chunk. As an example, take a key document structure defined by section-heading patterns; a rough sketch of the idea follows below.
Is this similar enough to what you were envisioning here?
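A rough sketch of one way to read that idea, assuming Markdown-style headings as the "key document structure"; the pattern, sizes, and function name are hypothetical and not Chonkie's API (character counts stand in for tokens):

```python
import re

# Hypothetical heading pattern standing in for the "key document structure".
HEADING = re.compile(r"^#{1,3} .+$", re.MULTILINE)

def build_prefixed_chunk(document: str, chunk_start: int, chunk_text: str,
                         window_chars: int = 200, budget_chars: int = 1000) -> str:
    """Find the nearest structural match above the chunk, take a fixed
    window from that point, then fill the rest of the budget from the
    chunk itself."""
    last = None
    for m in HEADING.finditer(document):
        if m.start() > chunk_start:
            break
        last = m
    prefix = document[last.start(): last.start() + window_chars] if last else ""
    remaining = max(budget_chars - len(prefix), 0)
    return (prefix + "\n" if prefix else "") + chunk_text[:remaining]
```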
Hello :) yes that sounds very similar to what I envision/currently use.
The level-1 split (e.g. by chapter) would return one chunk if the text in chapter one is smaller than context_size. However, if the text is larger, it would not be cut off at a fixed width; rather, the entire chunk would be run through the same algorithm again, but now with different split points, i.e. level 2, a more fine-grained version.

In my experience this approach is more robust to "weird" choices of headings. E.g. in many LaTeX documents the usage of \chapter, \section, \subsection etc. varies a lot, but within a document a \chapter is always "larger" than a \subsection. The leveled approach still produces good results if, say, somebody used just three different \section commands in their 300-page book, since the \subsection etc. will still be resolved as the author intended. One could also just throw in all the different splittings without levels, but (in my experience) this produces weird results, as things get mixed that should not be mixed. E.g. you could merge two small chunks where one is the last \subsubsection of a chapter and the other is the intro text of a new chapter; those are completely disjoint content-wise. This of course only makes sense for highly structured documents where levels are meaningful/even exist. But the levels could be completely optional, and we could provide example configs for some common structured data formats (e.g. Markdown, XML, LaTeX).

The other addition currently missing in chonkie, which I think would be great, is the ability to get a form of metadata from chunks. E.g. if in the above we get three chunks, the chunk from Subchapter One has a metadata field telling me its level-1 split origin is Chapter One and its level-2 split origin is Subchapter One. This allows, e.g. for Wikipedia exports, creating a link that leads directly to the section of text the LLM used, e.g. https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)#Un-embedding. I also use this to assign chunks a page (or a page range) in a PDF, since a lot of PDFs store a table of contents that can be accessed programmatically. This allows for really cool RAG features. For example, you can build a RAG bot for a program's documentation where the user can jump directly to the page/function the bot references.
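To make the leveled splitting and origin metadata concrete, here is a minimal sketch assuming Markdown-style chapter/subchapter headings; the Chunk shape, level patterns, and size threshold are illustrative, not Chonkie's existing API, and merging of short chunks plus the fallback for oversized ones are omitted:

```python
import re
from dataclasses import dataclass, field

# Illustrative level patterns, coarsest first (e.g. "# Chapter One", "## Subchapter One").
LEVELS = [
    ("chapter", re.compile(r"^# (.+)$", re.MULTILINE)),
    ("subchapter", re.compile(r"^## (.+)$", re.MULTILINE)),
]

@dataclass
class Chunk:
    text: str
    origins: dict = field(default_factory=dict)  # e.g. {"chapter": "Chapter One"}

def split_leveled(text: str, level: int = 0, origins: dict | None = None,
                  context_size: int = 2000) -> list[Chunk]:
    """Split at the current level; only pieces that are still too large are
    re-run through the same algorithm with the next, finer level."""
    origins = dict(origins or {})
    if len(text) <= context_size or level >= len(LEVELS):
        return [Chunk(text, origins)]  # fits, or no finer level left (fallback point)

    name, pattern = LEVELS[level]
    matches = list(pattern.finditer(text))
    if not matches:
        # This level is not used in the text; skip to the next finer one.
        return split_leveled(text, level + 1, origins, context_size)

    chunks: list[Chunk] = []
    starts = [m.start() for m in matches] + [len(text)]
    if starts[0] > 0:
        # Intro text before the first heading keeps only the parent's origins.
        chunks += split_leveled(text[: starts[0]], level + 1, origins, context_size)
    for m, start, end in zip(matches, starts, starts[1:]):
        piece_origins = {**origins, name: m.group(1)}
        chunks += split_leveled(text[start:end], level + 1, piece_origins, context_size)
    return chunks
```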
📋 Quick Check
* Closely related to this
💡 Feature Description
For a research project, I wrote a chunker specifically for highly structured documents: LaTeX, Wikipedia pages (XML), Markdown, and anything that can easily be converted to Markdown (e.g. HTML, docx, etc.).
The basic idea is to utilize the structure of the documents. The motivation is that the author has likely made conscious decisions about their sectioning, and that this structure is directly suitable for RAG.
E.g. for LaTeX, first look for sections; if they are too large, look for subsections, then subsubsections, then paragraphs, and so forth. Then combine pieces that are too short and use a fallback for those that are still too large (e.g. semantic chunking).
It's similar to the recursive chunker (with added regex support) but adheres more strongly to the original document structure by going through the document at different granularities.
E.g. for LaTeX, the algorithm will only split a subsection by paragraphs if it is too large, since paragraphs belong to another granularity level.
This keeps chunks more in line with how the author wanted to split the document.
I also have some tested configs (regex rules) that work well. So far, results and user feedback for the actual citations inside a full RAG system have been positive.
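As an illustration of what such a config could look like for LaTeX; these exact rules and field names are assumptions, not the author's tested config:

```python
# Hypothetical per-level regex rules for LaTeX, coarsest granularity first.
LATEX_LEVELS = [
    {"name": "section",       "pattern": r"^\\section\{(?P<title>.+?)\}"},
    {"name": "subsection",    "pattern": r"^\\subsection\{(?P<title>.+?)\}"},
    {"name": "subsubsection", "pattern": r"^\\subsubsection\{(?P<title>.+?)\}"},
    {"name": "paragraph",     "pattern": r"\n\s*\n"},  # blank lines as paragraph breaks
]
# Pieces that are too short would be merged with neighbours; pieces that are
# still too large after the finest level would go to a fallback chunker
# (e.g. semantic chunking).
```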
It additionally allows for chunk metadata, e.g. the chapter/section title, included figures, etc.
I also wrote a page matcher that can assign page numbers to chunks by taking a PDF and looking for the chunk content inside it.
That is quite helpful for proper citations.
I think it would be a great fit for chonkie, since it only requires 2-4 small libraries (advanced regex and HTML-to-Markdown conversion), is easily customizable for other document types, and is fast. The only major dependency is PyMuPDF to read PDFs for the page matching, but that part is optional and everything else works fine without it.
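A naive sketch of the page-matching idea, assuming PyMuPDF (imported as fitz); this is not the author's matcher, which would need text normalization and fuzzier matching than a plain substring check:

```python
import fitz  # PyMuPDF, the optional dependency mentioned above

def match_chunk_to_page(pdf_path: str, chunk_text: str, probe_len: int = 80):
    """Return the 1-based number of the first page whose text contains the
    beginning of the chunk, or None if no page matches."""
    probe = " ".join(chunk_text.split())[:probe_len]  # collapse whitespace
    if not probe:
        return None
    with fitz.open(pdf_path) as doc:
        for page in doc:
            page_text = " ".join(page.get_text().split())
            if probe in page_text:
                return page.number + 1
    return None
```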
Personally, I prefer the chunks produced by this method (especially for RAG with citations), since they are very easy for a human to "understand": the chunking has essentially been done by a human. So it would be a waste for this to sit in yet another unseen GitHub project :)
However, this is a lot more than just a few lines of code, and so far it's only a scratch project, since it was written specifically (and hastily) for a single study. Writing it properly and making it fit chonkie would take some time.
So before making the effort of planning an actual PR/new API, I'd love to get some feedback on whether you think this (or parts of it) would fit into chonkie.
(If so, probably as a replacement for the existing recursive chunker?)
🛠️ Implementation Approach
Already roughly implemented. Most important files:
- Regex parsing
- PDF page matcher
- Wiki XML reader