PDF Inventory where title is PMID #21

quang-ng · 2025-01-08T09:15:39Z

Work for issue #22

quang-ng · 2025-01-22T07:57:21Z

The file changes look crazy now! Create a new branch for this ticket.

leej3

Thanks @quang-ng. It looks like it is coming along nicely. There are a couple of clarifications on the objective I have requested from @joshlawrimore.

Overall I think it's almost there. I would prefer to see more code reuse though. PDF uploads and the associated population of tables have already been done. Hopefully we can reuse that.

dsst_etl/upload_pdfs_title_is_pmid.py

leej3 · 2025-01-23T13:19:45Z

+        required=True,
+        help="The database connection URL. This should be a valid SQLAlchemy database URL.",
+    )
+    parser.add_argument(


3538 pdfs are in the osm-pdf-uploads bucket

The issue specifies the above input. Perhaps the code should be able to take an S3 URL or a local path? @joshlawrimore any preferences?

I removed this argument; we only get PDF files from a predefined bucket.

Both would be preferable given my fondness for local user testing.
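To make the "S3 URL or local path" option concrete, here is a minimal sketch of how a single command-line value could be classified before any processing starts. `PdfSource` and `parse_pdf_source` are hypothetical names, not part of `dsst_etl`, and the sketch assumes POSIX-style local paths.

```python
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlparse


@dataclass(frozen=True)
class PdfSource:
    """Where to read PDFs from (hypothetical type, not part of dsst_etl)."""

    kind: str  # "s3" or "local"
    location: str  # bucket name for S3, directory path for local
    prefix: str = ""  # key prefix inside the bucket (S3 only)


def parse_pdf_source(value: str) -> PdfSource:
    """Classify a command-line value as an S3 URL or a local path."""
    parsed = urlparse(value)
    if parsed.scheme == "s3":
        if not parsed.netloc:
            raise ValueError(f"S3 URL is missing a bucket name: {value!r}")
        # The URL path, minus its leading slash, is the key prefix.
        return PdfSource("s3", parsed.netloc, parsed.path.lstrip("/"))
    if parsed.scheme in ("", "file"):
        path = parsed.path if parsed.scheme == "file" else value
        return PdfSource("local", str(Path(path)))
    raise ValueError(f"Unsupported PDF source: {value!r}")
```

With something like this, `parse_pdf_source("s3://osm-pdf-uploads")` would select the bucket from the issue, while a directory path would allow local testing without S3 access.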

dsst_etl/upload_pdfs_title_is_pmid.py

leej3 · 2025-01-23T13:45:43Z

+            document = Documents(
+                hash_data=file_hash,
+                s3uri=f"s3://{self.bucket_name}/{key}",
+                provenance_id=provenance.id,


@joshlawrimore should each of these documents have its own provenance entry?
I think the options would be:

It would seem that a single script execution could describe the upload/oddpub processing for all entries in the Document table and the Rtransparent/Oddpub/Processing_X table.
The schema may already imply the answer here so apologies if that is so! My understanding was that we would have a provenance table entry for every operation so:
An entry for upload, and oddpub execution for every pdf.

Correct. A single provenance ID corresponding to the inventory/oddpub analysis should be applied to all the document table entries. I haven't run this code locally yet, but I think that is what should happen as this is written.
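As a rough illustration of the "one provenance entry per run" answer, the sketch below uses in-memory stand-ins rather than the real SQLAlchemy models. The `Provenance` and `Document` dataclasses, the `inventory` function, and the ID counter are all hypothetical; only the field names are taken from the diff under review.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # stand-in for database-generated primary keys


@dataclass
class Provenance:
    pipeline_name: str
    comment: str
    id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Document:
    hash_data: str
    s3uri: str
    provenance_id: int


def inventory(bucket_name: str, hashed_keys: dict[str, str]) -> list[Document]:
    """Create one provenance entry and attach it to every document row."""
    # Created once per run, outside the per-document loop.
    provenance = Provenance(
        pipeline_name="PMID Inventory and Oddpub Analysis",
        comment="Creating document inventory of PDFs where the title is the PMID",
    )
    return [
        Document(
            hash_data=file_hash,
            s3uri=f"s3://{bucket_name}/{key}",
            provenance_id=provenance.id,
        )
        for key, file_hash in hashed_keys.items()
    ]
```

The point of the sketch is only the placement of the provenance creation: because it happens before the loop, every document produced by the same execution carries the same `provenance_id`.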

dsst_etl/upload_pdfs_title_is_pmid.py

leej3 · 2025-01-23T13:51:53Z

+from .config import config
+
+
+class UploadPDFsTitleIsPMID:


I'm not sure about the choice of a class here. Much of this functionality is shared with pre-existing scripts. Ideally we would reuse that functionality, making it more generalisable where necessary to fit our purposes.
It may be the appropriate level of encapsulation though so feel free to explain the pros/cons of this approach vs another.

Suggested change

-class UploadPDFsTitleIsPMID:
+class DocumentInventoryPMID:

You are not really uploading anything. I used the wrong word in issue #22. The idea is to create a document inventory where the identifier is the PMID in the title.
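For readers unfamiliar with the naming scheme, a small sketch of what "the identifier is the PMID in the title" could mean in code. `pmid_from_key` is a hypothetical helper, and the assumption that a title is a bare run of 1 to 8 digits with no leading zero is mine, not something the issue states.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumption: a PMID title is a run of 1 to 8 digits with no leading zero.
_PMID_RE = re.compile(r"[1-9]\d{0,7}")


def pmid_from_key(key: str) -> Optional[int]:
    """Return the PMID encoded in an S3 key such as '12345678.pdf', else None."""
    path = PurePosixPath(key)
    if path.suffix.lower() != ".pdf":
        return None  # not a PDF, so not part of the inventory
    if _PMID_RE.fullmatch(path.stem) is None:
        return None  # file name is not a bare PMID
    return int(path.stem)
```

Keys that do not match return `None` rather than raising, so a caller can log and skip them without aborting the whole inventory run.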

joshlawrimore · 2025-01-24T14:50:04Z

dsst_etl/upload_pdfs_title_is_pmid.py

+
+class UploadPDFsTitleIsPMID:
+    """
+    Uploads PDFs to S3 where the title is the PMID.


Suggested change

-    Uploads PDFs to S3 where the title is the PMID.
+    Inventory PDFs to S3 where the title is the PMID.

joshlawrimore · 2025-01-24T14:51:43Z

dsst_etl/upload_pdfs_title_is_pmid.py

+            version=__version__,
+            compute=get_compute_context_id(),
+            personnel=config.HOSTNAME,
+            comment="Upload PDFs where the title is the PMID",


Suggested change

-            comment="Upload PDFs where the title is the PMID",
+            comment="Creating document inventory of PDFs where the title is the PMID",

joshlawrimore · 2025-01-24T14:53:26Z

dsst_etl/upload_pdfs_title_is_pmid.py

+    def _create_provenance_entry(self):
+        # Creates a provenance entry to track the current upload process
+        provenance = Provenance(
+            pipeline_name="Oddpub Analysis",


Suggested change

-            pipeline_name="Oddpub Analysis",
+            pipeline_name="PMID Inventory and Oddpub Analysis",

joshlawrimore · 2025-01-24T15:19:03Z

Please let me know if I missed any questions. I am going to attempt to run the code on a subset of the pdfs in the osm-pdf-uploads bucket and pipe the results to a local postgres database. I will let you know how that goes.

@quang-ng, when do you take vacation?

joshlawrimore · 2025-01-24T22:01:58Z

dsst_etl/upload_pdfs_title_is_pmid.py

+
+        This method performs the following steps:
+        1. Retrieves an iterator for paginated S3 objects.
+        2. Creates a provenance entry for the current upload process.


Suggested change

-        2. Creates a provenance entry for the current upload process.
+        2. Creates a provenance entry for the current inventory process.
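Step 1 of the docstring above ("Retrieves an iterator for paginated S3 objects") might look roughly like the following. This is a sketch, not the code in the PR: `iter_pdf_keys` is a hypothetical name, and the client is passed in so the function can be exercised without AWS credentials. It relies on the boto3 S3 client's `get_paginator("list_objects_v2")` and the `Contents`/`Key` fields of each returned page.

```python
from typing import Any, Iterator


def iter_pdf_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield the key of every PDF object in the bucket, one page at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".pdf"):
                yield obj["Key"]
```

In real use the client would come from `boto3.client("s3")`, for example `iter_pdf_keys(boto3.client("s3"), "osm-pdf-uploads")`. Because the function is a generator, the caller never holds the full listing of the bucket in memory.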

quang-ng mentioned this pull request Jan 14, 2025

PDF Inventory where title is PMID #22

Open

Implement S3 inventory processing and add unit tests for OddpubWrapper

1342907

quang-ng force-pushed the irp-pdf branch from e642f28 to 3a1bef8 Compare January 22, 2025 07:50

quang-ng closed this Jan 22, 2025

quang-ng reopened this Jan 22, 2025

quang-ng force-pushed the irp-pdf branch from 3a1bef8 to 1342907 Compare January 22, 2025 09:23

quang-ng added 5 commits January 22, 2025 17:18

Remove irp_pdf.py and refactor S3 inventory processing into UploadPDFsTitleIsPMID class. Add unit tests for OddpubWrapper to ensure PDF processing and S3 inventory functionality. Enhance error handling and logging.

d823ce7

Integrate Oddpub analysis with error handling and logging for improved robustness.

529d665

Add command-line interface for PDF upload script

5e185a6

Add unit tests for UploadPDFsTitleIsPMID class. Remove obsolete test file for OddpubWrapper. Enhance error handling and logging in S3 inventory processing.

dc97b43

Refactor UploadPDFsTitleIsPMID class

c2e9f97

quang-ng marked this pull request as ready for review January 23, 2025 10:43

quang-ng changed the title ~~Uploading IRP PDFs~~ PDF Inventory where title is PMID Jan 23, 2025

Refactor PDF upload processing method in UploadPDFsTitleIsPMID class

14ec7ad

leej3 requested changes Jan 23, 2025

quang-ng added 9 commits January 24, 2025 13:37

Move main method to separate script and simplify PDF upload processing

3a7056a

Rename process_s3_bucket method to run and add comprehensive docstring

15b2264

Replace page-based iteration with batch processing of PDF files.

caf74bf

Removing duplicate hash checks

Added rollback mechanism for database operations to ensure data integrity during PDF uploads.

e6a8aa7

fix unit-test

8351bcf

fix unit-test

f897df5

Update pipeline name

ff5410a

OddpubWrapper to use log

ca8fbe3

Added a new method for Oddpub analysis, refined transaction rollback logic, and removed redundant logging.

349a997

joshlawrimore reviewed Jan 24, 2025
