Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF Inventory where title is PMID #21

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

PDF Inventory where title is PMID #21

wants to merge 16 commits into from

Conversation

quang-ng
Copy link
Collaborator

@quang-ng quang-ng commented Jan 8, 2025

Work for issue #22

@quang-ng
Copy link
Collaborator Author

the file changes look like crazy now.!!!! Create new branch for this ticket.

…sTitleIsPMID class. Add unit tests for OddpubWrapper to ensure PDF processing and S3 inventory functionality. Enhance error handling and logging.
…file for OddpubWrapper. Enhance error handling and logging in S3 inventory processing.
@quang-ng quang-ng marked this pull request as ready for review January 23, 2025 10:43
@quang-ng quang-ng changed the title Uploading IRP PDFs PDF Inventory where title is PMID Jan 23, 2025
Copy link
Contributor

@leej3 leej3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @quang-ng. It looks like it is coming along nicely. There are a couple of clarifications on the objective I have requested from @joshlawrimore.

Overall I think it's almost there. I would prefer to see more code reuse though. PDF uploads and the associated population of tables has already done. Hopefully we can reuse that.

dsst_etl/upload_pdfs_title_is_pmid.py Outdated Show resolved Hide resolved
required=True,
help="The database connection URL. This should be a valid SQLAlchemy database URL.",
)
parser.add_argument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3538 pdfs are in the osm-pdf-uploads bucket

The issue specifies the above input. Perhaps the code should be able to take an s3 url or a local path? @joshlawrimore any preferences?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this argument, we only get pdfs files from predefine bucket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both would be preferable given my fondness for local user testing

dsst_etl/upload_pdfs_title_is_pmid.py Outdated Show resolved Hide resolved
dsst_etl/upload_pdfs_title_is_pmid.py Outdated Show resolved Hide resolved
dsst_etl/upload_pdfs_title_is_pmid.py Outdated Show resolved Hide resolved
dsst_etl/upload_pdfs_title_is_pmid.py Outdated Show resolved Hide resolved
document = Documents(
hash_data=file_hash,
s3uri=f"s3://{self.bucket_name}/{key}",
provenance_id=provenance.id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshlawrimore should each of these documents have it's own provenance entry?
I think the options would be:

  • it would seem that a single script execution could describe the upload/oddpub processing for all entries in the Document table and the Rtransparent/Oddpub/Processing_X table.
    The schema may already imply the answer here so apologies if that is so! My understanding was that we would have a provenance table entry for every operation so:
    An entry for upload, and oddpub execution for every pdf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. A single provanance id corresponding to the inventory/oddpub analysis should be applies to all the document table entries. I haven't run this code locally yet, but I think that is what should happen as this is writen.

dsst_etl/upload_pdfs_title_is_pmid.py Outdated Show resolved Hide resolved
from .config import config


class UploadPDFsTitleIsPMID:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the choice of a class here. Much of this functionality is shared with pre-existing scripts. Ideally we would reuse that functionality, making it more generalisable where necessary to fit our purposes.
It may be the appropriate level of encapsulation though so feel free to explain the pros/cons of this approach vs another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class UploadPDFsTitleIsPMID:
class DocumentInventoryPMID:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not really uploading anything. I used the wrong word Issue 22. The idea is to create a document inventory where the identifer the is PMID in the title.


class UploadPDFsTitleIsPMID:
"""
Uploads PDFs to S3 where the title is the PMID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Uploads PDFs to S3 where the title is the PMID.
Inventory PDFs to S3 where the title is the PMID.

version=__version__,
compute=get_compute_context_id(),
personnel=config.HOSTNAME,
comment="Upload PDFs where the title is the PMID",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
comment="Upload PDFs where the title is the PMID",
comment="Creating document inventory of PDFs where the title is the PMID",

def _create_provenance_entry(self):
# Creates a provenance entry to track the current upload process
provenance = Provenance(
pipeline_name="Oddpub Analysis",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pipeline_name="Oddpub Analysis",
pipeline_name="PMID Inventory and Oddpub Analysis",

@joshlawrimore
Copy link
Contributor

Please let me know if I missed any questions. I am going to attempt to run the code on a subset of the pdfs in the osm-pdf-uploads bucket and pipe the results to a local postgres database. I will let you know how that goes.

@quang-ng , when do you take vacation?


This method performs the following steps:
1. Retrieves an iterator for paginated S3 objects.
2. Creates a provenance entry for the current upload process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Creates a provenance entry for the current upload process.
2. Creates a provenance entry for the current inventory process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants