Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Document Chunker to always use docling #430

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

khaledsulayman
Copy link
Member

@khaledsulayman khaledsulayman commented Dec 5, 2024

The old DocumentChunker was a factory class that called the text-splitter on markdowns and docling on PDFs. In reality, we want to call docling and then use the text-splitter on all document types. This change refactors the DocumentChunker class to always call docling (as long as the provided documents are supported filetypes).

Resolves: #334

@mergify mergify bot added testing Relates to testing ci-failure labels Dec 5, 2024
@aakankshaduggal aakankshaduggal requested a review from a team December 6, 2024 17:41
@khaledsulayman khaledsulayman marked this pull request as draft December 6, 2024 20:25
@mergify mergify bot added ci-failure and removed ci-failure labels Dec 18, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Jan 6, 2025
Signed-off-by: Aakanksha Duggal <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Jan 7, 2025
@mergify mergify bot removed the ci-failure label Jan 8, 2025
@aakankshaduggal aakankshaduggal marked this pull request as ready for review January 8, 2025 15:44
@aakankshaduggal aakankshaduggal requested a review from a team January 8, 2025 15:44
Copy link

github-actions bot commented Jan 8, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

Copy link

github-actions bot commented Jan 8, 2025

e2e workflow succeeded on this PR: View run, congrats!

Copy link
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a first pass through this and have a few questions / comments. Nothing big, but just to clarify some comments in the code as well as some intended behavior. Thanks for the new tests here!

Comment on lines +84 to +93
# def __new__(
# cls,
# leaf_node,
# taxonomy_path,
# output_dir: Path,
# server_ctx_size=4096,
# chunk_word_count=1024,
# tokenizer_model_name: Optional[str] = None,
# docling_model_path: Optional[str] = None,
# ):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this comment? It looks like it should be safe to remove to keep things tidy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take this away.

Comment on lines +109 to +110
if len(document_dict) > 1:
raise ValueError("Provided multiple document types")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a new requirement, that users can only provide a single type of document per leaf node? In other words, I cannot have pdf, markdown, or other documents within a leaf node? I don't think we enforced this before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure myself about this. @khaledsulayman can answer this better.

model_path = Path(model_name)
model_path = Path(
model_name
) # TODO expect a path from the DocumentChunker constructor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO something we want to address in this PR, or is this a comment for some future work we want to track?

elif book_element["prov"]:
current_book_page_number = book_element["prov"][0][
"page"
] # TODO export to function to handle empty ["prov"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO for this PR? Or future work?

Comment on lines +115 to +116
We use this to catch markdown files that may contain html elements since
docling does not support this."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a temporary fix until Docling supports html within markdown? Or is this something we expect to keep longer-term? We might want to open a feature request to docling for it to better handle this scenario so we can work towards removing this check entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should be a temporary fix for now. I will do the needful and open a feature request to docling.

@@ -411,20 +429,20 @@ def map_chunks_to_icls(chunks: List, leaf_node: Dict) -> Dataset:

def _knowledge_leaf_node_to_samples(
leaf_node,
taxonomy_path,
taxonomy_path, # pylint: disable=unused-argument
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we keeping this unused param for some backwards compatibility reason?

Comment on lines +316 to +319
model_family="granite",
model_name=os.path.join(
TEST_DATA_DIR, "models/instructlab/granite-7b-lab"
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to change the model family and name here? We don't use granite as a teacher model, so it feels odd to use it here in the test.

document:
repo: https://github.com/luke-inglis/il-anatomy-knowledge
commit: cc7c6ca
repo: https://github.com/RedHatOfficial/rhelai-taxonomy-data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to refer to a RHEL AI taxonomy here? Do members of the InstructLab team control this repository sufficiently to tweak it as needed if we need to adjust what we're testing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Chunking Refactor: Always use Context-Aware Chunker
4 participants