PDF Inventory where title is PMID #21
base: main
The file changes look crazy now! Create a new branch for this ticket.
…sTitleIsPMID class. Add unit tests for OddpubWrapper to ensure PDF processing and S3 inventory functionality. Enhance error handling and logging.
…file for OddpubWrapper. Enhance error handling and logging in S3 inventory processing.
Thanks @quang-ng. It looks like it is coming along nicely. There are a couple of clarifications on the objective I have requested from @joshlawrimore.
Overall I think it's almost there. I would prefer to see more code reuse though. PDF uploads and the associated population of tables have already been done. Hopefully we can reuse that.
    required=True,
    help="The database connection URL. This should be a valid SQLAlchemy database URL.",
)
parser.add_argument(
3538 pdfs are in the osm-pdf-uploads bucket
The issue specifies the above input. Perhaps the code should be able to take an S3 URL or a local path? @joshlawrimore, any preferences?
I removed this argument; we only get PDF files from a predefined bucket.
Both would be preferable given my fondness for local user testing
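One way to support both inputs is to classify the argument value before any listing happens. The following is a minimal sketch, not the PR's implementation: `PdfSource` and `parse_pdf_source` are hypothetical names, and the assumption is that an S3 input is written as `s3://bucket/prefix` while anything else is treated as a local path.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class PdfSource:
    """Hypothetical container describing where the PDFs live."""

    kind: str               # "s3" or "local"
    bucket: Optional[str]   # bucket name, set only for S3 sources
    location: str           # key prefix (S3) or directory path (local)


def parse_pdf_source(value: str) -> PdfSource:
    """Classify a command-line value as an S3 URL or a local path."""
    parsed = urlparse(value)
    if parsed.scheme == "s3":
        if not parsed.netloc:
            raise ValueError(f"S3 URL has no bucket name: {value!r}")
        # The URL path becomes the key prefix, without the leading slash.
        return PdfSource("s3", parsed.netloc, parsed.path.lstrip("/"))
    # Anything that is not an s3:// URL is assumed to be a local path.
    return PdfSource("local", None, str(Path(value)))
```

Because the function raises `ValueError` on bad input, it could be passed as the `type=` callable of an `argparse` argument, so a malformed value is rejected at parse time. Local testing then only needs a directory of PDFs rather than bucket access.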
document = Documents(
    hash_data=file_hash,
    s3uri=f"s3://{self.bucket_name}/{key}",
    provenance_id=provenance.id,
@joshlawrimore should each of these documents have its own provenance entry?
I think the options would be:
- it would seem that a single script execution could describe the upload/oddpub processing for all entries in the Document table and the Rtransparent/Oddpub/Processing_X table.
The schema may already imply the answer here so apologies if that is so! My understanding was that we would have a provenance table entry for every operation so:
An entry for upload, and oddpub execution for every pdf.
Correct. A single provenance id corresponding to the inventory/oddpub analysis should be applied to all the document table entries. I haven't run this code locally yet, but I think that is what should happen as this is written.
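The agreed behaviour (one provenance entry per script run, shared by every document row) can be sketched without a database. This is an illustration under assumptions: `ProvenanceRecord`, `DocumentRecord`, and `build_inventory` are hypothetical stand-ins for the project's `Provenance` and `Documents` models, and only the field names visible in the diff (`pipeline_name`, `comment`, `s3uri`, `provenance_id`) are reused.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ProvenanceRecord:
    """Stand-in for one row of the provenance table."""

    id: int
    pipeline_name: str
    comment: str


@dataclass
class DocumentRecord:
    """Stand-in for one row of the document table."""

    s3uri: str
    provenance_id: int


def build_inventory(
    bucket_name: str, keys: Iterable[str], provenance: ProvenanceRecord
) -> List[DocumentRecord]:
    """Create one document record per key, all pointing at the same provenance."""
    return [
        DocumentRecord(s3uri=f"s3://{bucket_name}/{key}", provenance_id=provenance.id)
        for key in keys
    ]
```

The provenance record is created once, before the loop over keys, so a single run is traceable through one id no matter how many PDFs it inventories.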
from .config import config


class UploadPDFsTitleIsPMID:
I'm not sure about the choice of a class here. Much of this functionality is shared with pre-existing scripts. Ideally we would reuse that functionality, making it more generalisable where necessary to fit our purposes.
It may be the appropriate level of encapsulation though so feel free to explain the pros/cons of this approach vs another.
-class UploadPDFsTitleIsPMID:
+class DocumentInventoryPMID:
You are not really uploading anything. I used the wrong word in Issue 22. The idea is to create a document inventory where the identifier is the PMID in the title.
Removing duplicate hash checks
…rity during PDF uploads.
…logic, and removed redundant logging.
class UploadPDFsTitleIsPMID:
    """
    Uploads PDFs to S3 where the title is the PMID.
-    Uploads PDFs to S3 where the title is the PMID.
+    Inventory PDFs to S3 where the title is the PMID.
    version=__version__,
    compute=get_compute_context_id(),
    personnel=config.HOSTNAME,
    comment="Upload PDFs where the title is the PMID",
-    comment="Upload PDFs where the title is the PMID",
+    comment="Creating document inventory of PDFs where the title is the PMID",
def _create_provenance_entry(self):
    # Creates a provenance entry to track the current upload process
    provenance = Provenance(
        pipeline_name="Oddpub Analysis",
-        pipeline_name="Oddpub Analysis",
+        pipeline_name="PMID Inventory and Oddpub Analysis",
Please let me know if I missed any questions. I am going to attempt to run the code on a subset of the PDFs. @quang-ng, when do you take vacation?
This method performs the following steps:
1. Retrieves an iterator for paginated S3 objects.
2. Creates a provenance entry for the current upload process.
-2. Creates a provenance entry for the current upload process.
+2. Creates a provenance entry for the current inventory process.
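Step 1 of the docstring, iterating over paginated S3 objects, might look roughly like the sketch below. It assumes a boto3 S3 client is passed in and uses its `list_objects_v2` paginator; the function name `iter_pdf_keys` and the `.pdf` suffix filter are assumptions for illustration, not code from this PR.

```python
from typing import Any, Iterator


def iter_pdf_keys(s3_client: Any, bucket_name: str, prefix: str = "") -> Iterator[str]:
    """Yield the key of every PDF object in the bucket, one page at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    kwargs = {"Bucket": bucket_name}
    if prefix:
        # Only restrict the listing when a prefix was actually given.
        kwargs["Prefix"] = prefix
    for page in paginator.paginate(**kwargs):
        # "Contents" is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".pdf"):
                yield obj["Key"]
```

Taking the client as a parameter keeps the function free of credentials and lets a unit test substitute a small fake client, which fits the request for easier local testing.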
Work for issue #22