Refactor Document Chunker to always use docling #430
base: main
Signed-off-by: Khaled Sulayman <[email protected]>
Signed-off-by: Aakanksha Duggal <[email protected]>
f62567d to 309fd11
c84fa40 to d4cc458
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
I took a first pass through this and have a few questions / comments. Nothing big, but just to clarify some comments in the code as well as some intended behavior. Thanks for the new tests here!
# def __new__(
#     cls,
#     leaf_node,
#     taxonomy_path,
#     output_dir: Path,
#     server_ctx_size=4096,
#     chunk_word_count=1024,
#     tokenizer_model_name: Optional[str] = None,
#     docling_model_path: Optional[str] = None,
# ):
Do we need this comment? It looks like it should be safe to remove to keep things tidy.
Will remove this.
if len(document_dict) > 1:
    raise ValueError("Provided multiple document types")
Is this a new requirement, that users can only provide a single type of document per leaf node? In other words, can I no longer mix PDF, Markdown, and other documents within a leaf node? I don't think we enforced this before.
I am unsure myself about this. @khaledsulayman can answer this better.
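To make the question concrete, here is a minimal sketch of how a check like the one quoted above would reject a mixed leaf node. It assumes `document_dict` is keyed by file type; `group_by_suffix` and `require_single_type` are hypothetical names used only for illustration, not code from this PR.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def group_by_suffix(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Group document paths by lowercase file suffix (hypothetical helper)."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        groups[path.suffix.lower()].append(path)
    return dict(groups)


def require_single_type(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Raise if the documents span more than one file type."""
    document_dict = group_by_suffix(paths)
    if len(document_dict) > 1:
        raise ValueError("Provided multiple document types")
    return document_dict
```

Under this assumption, a leaf node containing both `a.md` and `b.pdf` raises, while one containing only Markdown files passes.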
- model_path = Path(model_name)
+ model_path = Path(
+     model_name
+ )  # TODO expect a path from the DocumentChunker constructor
Is this TODO something we want to address in this PR, or is this a comment for some future work we want to track?
elif book_element["prov"]:
    current_book_page_number = book_element["prov"][0][
        "page"
    ]  # TODO export to function to handle empty ["prov"]
TODO for this PR? Or future work?
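If the TODO is picked up, the extracted function could look roughly like the sketch below. The name `get_page_number` is hypothetical, and the sketch assumes each element is a plain dict whose optional `"prov"` entry is a list of dicts carrying a `"page"` key, as in the snippet above.

```python
from typing import Any, Dict, Optional


def get_page_number(book_element: Dict[str, Any]) -> Optional[int]:
    """Return the page of the first provenance entry, or None if there is none.

    Hypothetical helper: handles a missing, None, or empty "prov" value.
    """
    prov = book_element.get("prov") or []
    if not prov:
        return None
    return prov[0].get("page")
```

The caller would then keep the previous page number whenever the helper returns `None`, instead of indexing into an empty list.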
We use this to catch markdown files that may contain html elements since
docling does not support this."""
Is this a temporary fix until Docling supports html within markdown? Or is this something we expect to keep longer-term? We might want to open a feature request to docling for it to better handle this scenario so we can work towards removing this check entirely.
I believe this should be a temporary fix for now. I will open a feature request with docling.
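For reference, a check of this kind can be as small as the sketch below. This is not the function from the PR: `contains_html_tag` and the regular expression are assumptions chosen for illustration, and the pattern is a heuristic that also flags Markdown autolinks such as `<https://example.com>`.

```python
import re

# Heuristic: "<" or "</" followed by a letter, then anything up to ">".
_HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def contains_html_tag(text: str) -> bool:
    """Return True if the text appears to contain an HTML tag (hypothetical helper)."""
    return bool(_HTML_TAG.search(text))
```

A regex keeps the check dependency-free, at the cost of false positives. If docling later handles HTML inside Markdown, the check can be dropped without touching the rest of the pipeline.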
@@ -411,20 +429,20 @@ def map_chunks_to_icls(chunks: List, leaf_node: Dict) -> Dataset:

 def _knowledge_leaf_node_to_samples(
     leaf_node,
-    taxonomy_path,
+    taxonomy_path,  # pylint: disable=unused-argument
Are we keeping this unused param for some backwards compatibility reason?
model_family="granite",
model_name=os.path.join(
    TEST_DATA_DIR, "models/instructlab/granite-7b-lab"
),
Why do we need to change the model family and name here? We don't use granite as a teacher model, so it feels odd to use it here in the test.
document:
  repo: https://github.com/luke-inglis/il-anatomy-knowledge
  commit: cc7c6ca
  repo: https://github.com/RedHatOfficial/rhelai-taxonomy-data
Is it ok to refer to a RHEL AI taxonomy here? Do members of the InstructLab team control this repository sufficiently to tweak it as needed if we need to adjust what we're testing?
The old DocumentChunker was a factory class that called the text splitter on Markdown files and docling on PDFs. What we actually want is to call docling first and then run the text splitter on all document types. This change refactors the DocumentChunker class to always call docling, as long as the provided documents are supported file types.
Resolves: #334
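The intended flow can be sketched as below, under stated assumptions. `chunk_documents`, `SUPPORTED_SUFFIXES`, and the `convert` and `split` callables are hypothetical stand-ins for the docling conversion step and the text splitter. They are not the actual DocumentChunker API or docling calls.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Assumption for illustration only; the real set of supported types may differ.
SUPPORTED_SUFFIXES = {".md", ".pdf"}


def chunk_documents(
    paths: Iterable[Path],
    convert: Callable[[Path], str],
    split: Callable[[str], List[str]],
) -> List[str]:
    """Convert every document first, then split the converted text into chunks."""
    chunks: List[str] = []
    for path in paths:
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            raise ValueError(f"Unsupported document type: {path.suffix}")
        text = convert(path)  # stand-in for the docling conversion
        chunks.extend(split(text))  # stand-in for the text splitter
    return chunks
```

The point of the sketch is that there is no longer a per-file-type branch: every supported document takes the same convert-then-split path.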