[FEAT] Advanced regex-based parsing + XML + chunk metadata #148
Comments
@finnschwall could this be related to #150? The PR I issued is designed to be a stripped-down version, but my next project would be a true hierarchical chunker class that operates the way you described. The basic idea is that, by setting patterns for your key document structure, that content would be found no matter how far "above" your prefix chunk it was; you would extract a fixed window from that point and then take the remaining tokens from the prefix chunk. As an example, take a key document structure defined by section-heading patterns; a rough sketch of the idea follows below.
Is this similar enough to what you were envisioning here?
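A rough sketch of one way to read that idea, assuming Markdown-style headings as the "key document structure"; the pattern, sizes, and function name are hypothetical and not Chonkie's API (character counts stand in for tokens):

```python
import re

# Hypothetical heading pattern standing in for the "key document structure".
HEADING = re.compile(r"^#{1,3} .+$", re.MULTILINE)

def build_prefixed_chunk(document: str, chunk_start: int, chunk_text: str,
                         window_chars: int = 200, budget_chars: int = 1000) -> str:
    """Find the nearest structural match above the chunk, take a fixed
    window from that point, then fill the rest of the budget from the
    chunk itself."""
    last = None
    for m in HEADING.finditer(document):
        if m.start() > chunk_start:
            break
        last = m
    prefix = document[last.start(): last.start() + window_chars] if last else ""
    remaining = max(budget_chars - len(prefix), 0)
    return (prefix + "\n" if prefix else "") + chunk_text[:remaining]
```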
Hello :) yes that sounds very similar to what I envision/currently use.
The level-1 split (e.g. by chapter) would return one chunk if the text in chapter one is smaller than context_size. However, if the text is larger, it would not be cut off at a fixed width; rather, the entire chunk would be run through the same algorithm again, but now with different split points, i.e. level 2, a more fine-grained version.

In my experience this approach is more robust to "weird" choices of headings. E.g. in many LaTeX documents the usage of \chapter, \section, \subsection etc. varies a lot, but within a document a \chapter is always "larger" than a \subsection. The leveled approach still produces good results if, say, somebody used just three different \section commands in their 300-page book, since the \subsection etc. will still be resolved as the author intended. One could also just throw in all the different splittings without levels, but (in my experience) this produces weird results, as things get mixed that should not be mixed. E.g. you could merge two small chunks where one is the last \subsubsection of a chapter and the other is the intro text of a new chapter; those are completely disjoint content-wise. This of course only makes sense for highly structured documents where levels are meaningful/even exist. But the levels could be completely optional, and we could provide example configs for some common structured data formats (e.g. Markdown, XML, LaTeX).

The other addition currently missing in chonkie, which I think would be great, is the ability to get a form of metadata from chunks. E.g. if in the above we get three chunks, the chunk from Subchapter One has a metadata field telling me its level-1 split origin is Chapter One and its level-2 split origin is Subchapter One. This allows, e.g. for Wikipedia exports, creating a link that leads directly to the section of text the LLM used, e.g. https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)#Un-embedding. I also use this to assign chunks a page (or a page range) in a PDF, since a lot of PDFs store a table of contents that can be accessed programmatically. This allows for really cool RAG features. For example, you can build a RAG bot for a program's documentation where the user can jump directly to the page/function the bot references.
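To make the leveled splitting and origin metadata concrete, here is a minimal sketch assuming Markdown-style chapter/subchapter headings; the Chunk shape, level patterns, and size threshold are illustrative, not Chonkie's existing API, and merging of short chunks plus the fallback for oversized ones are omitted:

```python
import re
from dataclasses import dataclass, field

# Illustrative level patterns, coarsest first (e.g. "# Chapter One", "## Subchapter One").
LEVELS = [
    ("chapter", re.compile(r"^# (.+)$", re.MULTILINE)),
    ("subchapter", re.compile(r"^## (.+)$", re.MULTILINE)),
]

@dataclass
class Chunk:
    text: str
    origins: dict = field(default_factory=dict)  # e.g. {"chapter": "Chapter One"}

def split_leveled(text: str, level: int = 0, origins: dict | None = None,
                  context_size: int = 2000) -> list[Chunk]:
    """Split at the current level; only pieces that are still too large are
    re-run through the same algorithm with the next, finer level."""
    origins = dict(origins or {})
    if len(text) <= context_size or level >= len(LEVELS):
        return [Chunk(text, origins)]  # fits, or no finer level left (fallback point)

    name, pattern = LEVELS[level]
    matches = list(pattern.finditer(text))
    if not matches:
        # This level is not used in the text; skip to the next finer one.
        return split_leveled(text, level + 1, origins, context_size)

    chunks: list[Chunk] = []
    starts = [m.start() for m in matches] + [len(text)]
    if starts[0] > 0:
        # Intro text before the first heading keeps only the parent's origins.
        chunks += split_leveled(text[: starts[0]], level + 1, origins, context_size)
    for m, start, end in zip(matches, starts, starts[1:]):
        piece_origins = {**origins, name: m.group(1)}
        chunks += split_leveled(text[start:end], level + 1, piece_origins, context_size)
    return chunks
```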
📋 Quick Check
* Closely related to this
💡 Feature Description
For a research project, I wrote a chunker specifically for highly structured documents: LaTeX, Wikipedia pages (XML), Markdown, and anything that can easily be converted to Markdown (e.g. HTML, docx, etc.).
The basic idea is to utilize the structure of the documents. The motivation is that the author has likely made conscious decisions about their sectioning, and that this structure is directly suitable for RAG.
E.g. for LaTeX, first look for sections; if they are too large, look for subsections, then subsubsections, then paragraphs, and so forth. Then combine pieces that are too short and use a fallback for those that are still too large (e.g. semantic chunking).
It's similar to the recursive chunker (with added regex support) but adheres more strongly to the original document structure by going through the document at different granularities.
E.g. for LaTeX, the algorithm will only split a subsection by paragraphs if it is too large, since paragraphs belong to another granularity level.
This keeps chunks more in line with how the author wanted to split the document.
I also have some tested configs (regex rules) that work well. So far, results and user feedback for the actual citations inside a full RAG system have been positive.
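As an illustration of what such a config could look like for LaTeX; these exact rules and field names are assumptions, not the author's tested config:

```python
# Hypothetical per-level regex rules for LaTeX, coarsest granularity first.
LATEX_LEVELS = [
    {"name": "section",       "pattern": r"^\\section\{(?P<title>.+?)\}"},
    {"name": "subsection",    "pattern": r"^\\subsection\{(?P<title>.+?)\}"},
    {"name": "subsubsection", "pattern": r"^\\subsubsection\{(?P<title>.+?)\}"},
    {"name": "paragraph",     "pattern": r"\n\s*\n"},  # blank lines as paragraph breaks
]
# Pieces that are too short would be merged with neighbours; pieces that are
# still too large after the finest level would go to a fallback chunker
# (e.g. semantic chunking).
```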
It additionally allows for chunk metadata, e.g. the chapter/section title, included figures, etc.
I also wrote a page matcher that can assign page numbers to chunks by taking a PDF and looking for the chunk content inside it.
That is quite helpful for proper citations.
I think it would be a great fit for chonkie, since it only requires 2-4 small libraries (advanced regex and HTML-to-Markdown conversion), is easily customizable for other document types, and is fast. The only major dependency is PyMuPDF to read PDFs for the page matching, but that part is optional and everything else works fine without it.
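A naive sketch of the page-matching idea, assuming PyMuPDF (imported as fitz); this is not the author's matcher, which would need text normalization and fuzzier matching than a plain substring check:

```python
import fitz  # PyMuPDF, the optional dependency mentioned above

def match_chunk_to_page(pdf_path: str, chunk_text: str, probe_len: int = 80):
    """Return the 1-based number of the first page whose text contains the
    beginning of the chunk, or None if no page matches."""
    probe = " ".join(chunk_text.split())[:probe_len]  # collapse whitespace
    if not probe:
        return None
    with fitz.open(pdf_path) as doc:
        for page in doc:
            page_text = " ".join(page.get_text().split())
            if probe in page_text:
                return page.number + 1
    return None
```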
Personally, I prefer the chunks produced by this method (especially for RAG with citations), since they are very easy for a human to "understand": the chunking has essentially been done by a human. So it would be a waste for this to sit in yet another unseen GitHub project :)
However, this is a lot more than just a few lines of code, and so far it's only a scratch project, since it was written specifically (and hastily) for a single study. Writing it properly and making it fit chonkie would take some time.
So before making the effort of planning an actual PR/new API, I'd love to get some feedback on whether you think this (or parts of it) would fit into chonkie.
(If so, probably as a replacement for the existing recursive chunker?)
🛠️ Implementation Approach
Already roughly implemented. Most important files:
- Regex parsing
- PDF page matcher
- Wiki XML reader