Rag pdf 2 #955

Open
wants to merge 9 commits into
base: dev
Conversation

sujee
Contributor

@sujee sujee commented Jan 20, 2025

Why are these changes needed?

Updated the RAG-PDF example:

  • Renamed the example to 'rag-pdf-1'
  • Migrated to simpler APIs
  • Now using release 1.0.0.a4

Related issue number (if any).

#954

sujee and others added 3 commits January 12, 2025 23:33
- Using simplified APIs
- using pdf --> markdown extraction
- incorporated deduping documents

Signed-off-by: Sujee Maniyam <[email protected]>
- renamed the example to 'rag-pdf-1'
- Migrated to simpler APIs
- using release 1.0.0.a4

Signed-off-by: Sujee Maniyam <[email protected]>
@shahrokhDaijavad shahrokhDaijavad self-assigned this Jan 21, 2025
@shahrokhDaijavad
Member

@sujee I ran into an execution error in Step 5 (doc chunk), when using the Ray notebook (see below). Looking at the notebook output in your branch, I see that it ran successfully for you. Any ideas about this, or should we ask @dolfim-ibm?

🏃🏼 STAGE-3: Processing input='output/02_dedupe_out' --> output='output/03_chunk_out'

10:33:03 INFO - doc_chunk parameters are : {'chunking_type': 'li_markdown', 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}
10:33:03 INFO - pipeline id pipeline_id
10:33:03 INFO - code location None
10:33:03 INFO - number of workers 2 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}
10:33:03 INFO - actor creation delay 0
10:33:03 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}
10:33:03 INFO - data factory data_ is using local data access: input_folder - output/02_dedupe_out output_folder - output/03_chunk_out
10:33:03 INFO - data factory data_ max_files -1, n_sample -1
10:33:03 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
10:33:03 INFO - Running locally
2025-01-21 10:33:06,168 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(orchestrate pid=17880) 10:33:08 INFO - orchestrator started at 2025-01-21 10:33:08
(orchestrate pid=17880) 10:33:08 INFO - Number of files is 3, source profile {'max_file_size': 0.04471015930175781, 'min_file_size': 0.0028095245361328125, 'total_file_size': 0.06870079040527344}
(orchestrate pid=17880) 10:33:08 INFO - Cluster resources: {'cpus': 10, 'gpus': 0, 'memory': 3.8950469978153706, 'object_store': 1.9475234979763627}
(orchestrate pid=17880) 10:33:08 INFO - Number of workers - 2 with {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1} each
(raylet) [2025-01-21 10:33:15,124 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2844246016; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:33:25,222 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2834575360; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:33:35,321 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2835984384; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:33:45,330 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2834739200; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:33:55,348 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2833797120; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:34:05,369 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2827829248; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:34:15,370 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2825523200; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:34:25,375 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2822332416; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:34:35,384 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2830008320; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:34:45,397 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2828980224; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:34:55,494 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2831609856; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:35:05,502 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2830020608; capacity: 494384795648. Object creation will fail if spilling is required.
(raylet) [2025-01-21 10:35:15,505 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2827534336; capacity: 494384795648. Object creation will fail if spilling is required.

(orchestrate pid=17880) created [Actor(RayTransformFileProcessor, 3f358090da9fc2f7cc37f44b01000000), Actor(RayTransformFileProcessor, 6f4c40920f177346aa9009cb01000000)], alive [ActorState(actor_id='3f358090da9fc2f7cc37f44b01000000', class_name='RayTransformFileProcessor', state='ALIVE', job_id='01000000', name='', node_id='09dc080c8d65e2281f6d8db7b8910cb5fda78174e9757d51c60290fd', pid=17885, ray_namespace='e6b73e6d-9108-49e3-a3f1-f638b625ab47', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None)]

(orchestrate pid=17880) Traceback (most recent call last):
(orchestrate pid=17880) File "/opt/anaconda3/envs/data-prep-kit-1/lib/python3.11/site-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=17880) processors = RayUtils.create_actors(
(orchestrate pid=17880) ^^^^^^^^^^^^^^^^^^^^^^^
(orchestrate pid=17880) File "/opt/anaconda3/envs/data-prep-kit-1/lib/python3.11/site-packages/data_processing_ray/runtime/ray/ray_utils.py", line 129, in create_actors
(orchestrate pid=17880) raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=17880) data_processing.utils.unrecoverable.UnrecoverableException: out of 2 created actors only 1 alive
(orchestrate pid=17880) 10:35:18 ERROR - Exception during execution out of 2 created actors only 1 alive: None
(raylet) [2025-01-21 10:35:25,513 E 17869 20410782] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-01-21_10-33-04_045244_16036 is over 95% full, available space: 2829774848; capacity: 494384795648. Object creation will fail if spilling is required.
10:35:28 INFO - Completed execution in 2.41 min, execution result 1


Exception Traceback (most recent call last)
File :21

Exception: ❌ Stage:3 failed

@shahrokhDaijavad shahrokhDaijavad self-requested a review January 21, 2025 18:51
@shahrokhDaijavad
Member

@sujee Thanks for the "storage" tip! I freed up storage on my Mac and got the Ray notebook working without error. I have now tested all the notebooks in this example successfully.
@touma-I I am approving this PR because it is definitely better than the current RAG example in the repo, which it will replace.
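
For anyone who hits the same raylet warning (`/tmp/ray/... is over 95% full`), a quick preflight check before launching the Ray notebook might look like the sketch below. It is illustrative and not part of this PR: the helper names `used_fraction` and `check_disk` are hypothetical, and the 0.95 threshold is taken from the "over 95% full" message in the log above, not from any configuration inspected here.

```python
import shutil


def used_fraction(total_bytes: int, free_bytes: int) -> float:
    """Return the fraction of a filesystem in use, given capacity and free space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - free_bytes / total_bytes


def check_disk(path: str = "/tmp", threshold: float = 0.95) -> bool:
    """Return True if the filesystem holding `path` is below `threshold` usage.

    `threshold` defaults to 0.95 only because the raylet log reports
    "over 95% full"; treat it as an assumption.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    frac = used_fraction(usage.total, usage.free)
    if frac >= threshold:
        print(f"WARNING: {path} is {frac:.1%} full; free up space before starting Ray")
        return False
    print(f"OK: {path} is {frac:.1%} full")
    return True


# Figures from the log above: capacity 494384795648, available 2844246016.
# used_fraction(494384795648, 2844246016) is above the 0.95 threshold.
```

Calling `check_disk("/tmp")` at the top of the notebook would surface the problem before Stage 3 starts, instead of after actor creation times out.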

Member

@shahrokhDaijavad shahrokhDaijavad left a comment

LGTM.

@sujee
Contributor Author

sujee commented Jan 27, 2025

@matouma @shahrokhDaijavad please hold off on this one.

  • I am going to bump it up to 1.0.0
  • and only install the needed transforms

Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Waiting for the latest changes from @sujee.

@sujee
Contributor Author

sujee commented Jan 31, 2025

@matouma @shahrokhDaijavad this PR is ready for merge. thx

sujee added 2 commits January 31, 2025 10:39
Signed-off-by: Sujee Maniyam <[email protected]>
Signed-off-by: Sujee Maniyam <[email protected]>
@shahrokhDaijavad
Member

@sujee I just tested today's latest version. All the notebooks, including the Ray one, ran successfully. I approve this again.

Member

@shahrokhDaijavad shahrokhDaijavad left a comment

LGTM

Labels
None yet
Projects
None yet
2 participants