Rag pdf 2 #955
base: dev
- Using simplified APIs
- Using pdf --> markdown extraction
- Incorporated deduping documents

Signed-off-by: Sujee Maniyam <[email protected]>

- Renamed the example to 'rag-pdf-1'
- Migrated to simpler APIs
- Using release 1.0.0.a4

Signed-off-by: Sujee Maniyam <[email protected]>
@sujee I ran into an execution error in Step 5 (doc chunk), when using the Ray notebook (see below). Looking at the notebook output in your branch, I see that it ran successfully for you. Any ideas about this, or should we ask @dolfim-ibm?

🏃🏼 STAGE-3: Processing input='output/02_dedupe_out' --> output='output/03_chunk_out'
10:33:03 INFO - doc_chunk parameters are : {'chunking_type': 'li_markdown', 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}
(orchestrate pid=17880) created [Actor(RayTransformFileProcessor, 3f358090da9fc2f7cc37f44b01000000), Actor(RayTransformFileProcessor, 6f4c40920f177346aa9009cb01000000)], alive [ActorState(actor_id='3f358090da9fc2f7cc37f44b01000000', class_name='RayTransformFileProcessor', state='ALIVE', job_id='01000000', name='', node_id='09dc080c8d65e2281f6d8db7b8910cb5fda78174e9757d51c60290fd', pid=17885, ray_namespace='e6b73e6d-9108-49e3-a3f1-f638b625ab47', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None)]
(orchestrate pid=17880) Traceback (most recent call last):
Exception Traceback (most recent call last)
Exception: ❌ Stage:3 failed
@sujee Thanks for the "storage" tip! I freed up storage on my Mac and got the Ray notebook working without error. I have now tested all the notebooks in this example successfully.
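For anyone who hits the same Stage 3 failure: since freeing up storage resolved it here, a quick disk check before launching the Ray notebook may save some debugging time. This is only a minimal sketch using the Python standard library; the helper names and the 5% free-space threshold are illustrative assumptions, not something taken from this example or from Ray's configuration.

```python
import shutil


def free_fraction(path="."):
    """Return the fraction of the disk holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_enough_disk(path=".", min_free_fraction=0.05):
    """Return True if at least `min_free_fraction` of the disk is free.

    The 0.05 default is an assumed, illustrative threshold; tune it to
    whatever your environment actually needs.
    """
    return free_fraction(path) >= min_free_fraction


if __name__ == "__main__":
    if has_enough_disk("."):
        print(f"Disk looks fine: {free_fraction('.'):.1%} free")
    else:
        print(f"Low disk space: only {free_fraction('.'):.1%} free")
```

Running this in the notebook's working directory before the pipeline stages gives an early warning instead of an opaque stage failure.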
LGTM.
@matouma @shahrokhDaijavad hold off on this one please
Wait until the latest changes by @sujee
Signed-off-by: Sujee Maniyam <[email protected]>
@matouma @shahrokhDaijavad this PR is ready for merge. thx
Signed-off-by: Sujee Maniyam <[email protected]>
Signed-off-by: Sujee Maniyam <[email protected]>
@sujee I just tested today's latest version. I tested all the notebooks, including the Ray one, and everything worked. I approve this again.
LGTM
Signed-off-by: Sujee Maniyam <[email protected]>
Why are these changes needed?
Updated the RAG-PDF example.
Related issue number (if any).
#954