Refactor Document Chunker to always use docling #430

khaledsulayman · 2024-12-05T22:10:55Z

The old DocumentChunker was a factory class that called the text splitter on Markdown files and docling on PDFs. What we actually want is to call docling first and then use the text splitter on all document types. This change refactors the DocumentChunker class to always call docling (as long as the provided documents are supported file types).

Resolves: #334

Signed-off-by: Khaled Sulayman <[email protected]>

src/instructlab/sdg/utils/chunkers.py

Signed-off-by: Khaled Sulayman <[email protected]>

Signed-off-by: Aakanksha Duggal <[email protected]>

github-actions · 2025-01-08T17:00:49Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions · 2025-01-08T19:25:45Z

e2e workflow succeeded on this PR: View run, congrats!

bbrowning

I took a first pass through this and have a few questions / comments. Nothing big, but just to clarify some comments in the code as well as some intended behavior. Thanks for the new tests here!

bbrowning · 2025-01-09T18:01:02Z

src/instructlab/sdg/utils/chunkers.py

+    # def __new__(
+    #     cls,
+    #     leaf_node,
+    #     taxonomy_path,
+    #     output_dir: Path,
+    #     server_ctx_size=4096,
+    #     chunk_word_count=1024,
+    #     tokenizer_model_name: Optional[str] = None,
+    #     docling_model_path: Optional[str] = None,
+    # ):


Do we need this comment? It looks like it should be safe to remove to keep things tidy.

Will take this away.

bbrowning · 2025-01-09T18:02:15Z

src/instructlab/sdg/utils/chunkers.py

+        if len(document_dict) > 1:
+            raise ValueError("Provided multiple document types")


Is this a new requirement, that users can only provide a single type of document per leaf node? In other words, I cannot have pdf, markdown, or other documents within a leaf node? I don't think we enforced this before.

I am unsure myself about this. @khaledsulayman can answer this better.
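To make the question above concrete, here is a minimal sketch of how a leaf node with mixed file types would trip the check. It assumes `document_dict` is keyed by file extension, which the diff does not confirm; `group_by_suffix` and `check_single_type` are hypothetical names used only for illustration and are not part of the PR.

```python
from collections import defaultdict
from pathlib import Path


def group_by_suffix(paths):
    """Group document paths by lowercase file extension (hypothetical helper)."""
    grouped = defaultdict(list)
    for p in paths:
        path = Path(p)
        grouped[path.suffix.lower()].append(path)
    return dict(grouped)


def check_single_type(paths):
    """Raise if the paths span more than one file type (assumed semantics)."""
    document_dict = group_by_suffix(paths)
    # Same condition as in the diff: more than one key means mixed types.
    if len(document_dict) > 1:
        raise ValueError("Provided multiple document types")
    return document_dict
```

Under this assumption, a leaf node containing both `intro.md` and `manual.pdf` would raise, while a node with only Markdown files would pass.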

bbrowning · 2025-01-09T18:04:17Z

src/instructlab/sdg/utils/chunkers.py

-        model_path = Path(model_name)
+        model_path = Path(
+            model_name
+        )  # TODO expect a path from the DocumentChunker constructor


Is this TODO something we want to address in this PR, or is this a comment for some future work we want to track?

bbrowning · 2025-01-09T18:04:52Z

src/instructlab/sdg/utils/chunkers.py

+                elif book_element["prov"]:
+                    current_book_page_number = book_element["prov"][0][
+                        "page"
+                    ]  # TODO export to function to handle empty ["prov"]


TODO for this PR? Or future work?
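For reference, the helper that the TODO describes could look roughly like the sketch below. It assumes `book_element` is a plain dict whose optional `"prov"` entry is a list of dicts with a `"page"` key, as the indexing in the diff suggests; the function name and the `default` parameter are hypothetical and not taken from the PR.

```python
from typing import Any, Dict, Optional


def page_number_from_element(
    book_element: Dict[str, Any], default: Optional[int] = None
) -> Optional[int]:
    """Return the page of the first provenance entry, or `default` if absent."""
    # Treat a missing key, None, and an empty list the same way.
    prov = book_element.get("prov") or []
    if not prov:
        return default
    return prov[0].get("page", default)
```

A caller could then pass the current page number as `default`, so that an element without provenance keeps the previous value.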

bbrowning · 2025-01-09T18:07:10Z

src/instructlab/sdg/utils/taxonomy.py

+    We use this to catch markdown files that may contain html elements since
+    docling does not support this."""


Is this a temporary fix until Docling supports html within markdown? Or is this something we expect to keep longer-term? We might want to open a feature request to docling for it to better handle this scenario so we can work towards removing this check entirely.

I believe this should be a temporary fix for now. I will open a feature request with docling.
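As an illustration of the kind of check being discussed, a rough sketch follows. This is not the code from the PR: the regular expression is a simple heuristic chosen here, and `contains_html` and `warn_if_html` are hypothetical names. It logs a warning rather than raising, in line with the later commit that changed the HTML error to a warning.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Heuristic: an opening, closing, or self-closing tag such as <div>, </p>, <br/>.
# It will also match tags that appear inside fenced code blocks.
_HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")


def contains_html(markdown_text: str) -> bool:
    """Return True if the text appears to contain at least one HTML tag."""
    return _HTML_TAG.search(markdown_text) is not None


def warn_if_html(path: str, markdown_text: str) -> bool:
    """Log a warning (instead of failing) when HTML is detected."""
    found = contains_html(markdown_text)
    if found:
        logger.warning("Markdown file %s appears to contain HTML elements", path)
    return found
```

Comparisons such as `a < b` and Markdown autolinks such as `<https://example.com>` do not match this pattern, but a more robust check would need a real Markdown parser.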

bbrowning · 2025-01-09T18:07:51Z

src/instructlab/sdg/utils/taxonomy.py

@@ -411,20 +429,20 @@ def map_chunks_to_icls(chunks: List, leaf_node: Dict) -> Dataset:

 def _knowledge_leaf_node_to_samples(
    leaf_node,
-    taxonomy_path,
+    taxonomy_path,  # pylint: disable=unused-argument


Are we keeping this unused param for some backwards compatibility reason?

bbrowning · 2025-01-09T18:09:26Z

tests/test_generate_data.py

+                model_family="granite",
+                model_name=os.path.join(
+                    TEST_DATA_DIR, "models/instructlab/granite-7b-lab"
+                ),


Why do we need to change the model family and name here? We don't use granite as a teacher model, so it feels odd to use it here in the test.

bbrowning · 2025-01-09T18:11:25Z

tests/testdata/test_valid_knowledge_skill.yaml

 document:
-  repo: https://github.com/luke-inglis/il-anatomy-knowledge
-  commit: cc7c6ca
+  repo: https://github.com/RedHatOfficial/rhelai-taxonomy-data


Is it ok to refer to a RHEL AI taxonomy here? Do members of the InstructLab team control this repository sufficiently to tweak it as needed if we need to adjust what we're testing?

khaledsulayman added 2 commits December 5, 2024 16:46

Replace DocumentChunker factory with docling-based chunker by default

ed02973

Signed-off-by: Khaled Sulayman <[email protected]>

Modify tests to use new DocumentChunker interface

a6d06d0

Signed-off-by: Khaled Sulayman <[email protected]>

mergify bot added the testing and ci-failure labels Dec 5, 2024

aakankshaduggal requested a review from a team December 6, 2024 17:41

RobotSail reviewed Dec 6, 2024

src/instructlab/sdg/utils/chunkers.py (outdated; resolved)

RobotSail reviewed Dec 6, 2024

src/instructlab/sdg/utils/chunkers.py (outdated; resolved)

khaledsulayman marked this pull request as draft December 6, 2024 20:25

khaledsulayman added 2 commits December 17, 2024 16:48

Check for html in markdown files and error out

9457380

Signed-off-by: Khaled Sulayman <[email protected]>

replace test_valid_knowledge_skill.yaml with example with no html

1722161

Signed-off-by: Khaled Sulayman <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Dec 18, 2024

mergify bot added ci-failure and removed ci-failure labels Jan 6, 2025

Update for ruff and pylint issues

309fd11

Signed-off-by: Aakanksha Duggal <[email protected]>

aakankshaduggal force-pushed the ks-chunking-refactor branch from f62567d to 309fd11 Compare January 6, 2025 18:59

mergify bot added ci-failure and removed ci-failure labels Jan 6, 2025

Update chunkers.py with minor cleanup

0b9a2fd

Signed-off-by: Aakanksha Duggal <[email protected]>

mergify bot added ci-failure and removed ci-failure labels Jan 7, 2025

Update test_chunkers to be more modular

d4cc458

Signed-off-by: Aakanksha Duggal <[email protected]>

aakankshaduggal force-pushed the ks-chunking-refactor branch from c84fa40 to d4cc458 Compare January 7, 2025 21:40

mergify bot added the ci-failure label Jan 7, 2025

Update HTML error to warning to avoid exiting

bab135c

Signed-off-by: Aakanksha Duggal <[email protected]>

mergify bot removed the ci-failure label Jan 8, 2025

aakankshaduggal marked this pull request as ready for review January 8, 2025 15:44

aakankshaduggal requested a review from a team January 8, 2025 15:44

bbrowning reviewed Jan 9, 2025
