-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Created scripts/hhs_doi_cli.py and updated dependencies in pyproject.… #11
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we relocate the processing code to the dsst-etl folder and create a unit test for it? Here, we only call the function.
scripts/hhs_doi_cli.py
Outdated
output_exists: bool = output_csv.exists() | ||
|
||
if output_exists and new_run is False: | ||
print("Removing previously processed PDFs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we using logger instead of print?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Set up a package-level logger at dsst_etl.logger
scripts/hhs_doi_cli.py
Outdated
pdfs = list(pdfs) | ||
|
||
dicts: list[dict] = [] | ||
with open(output_csv, "a") as f_out: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too complex, could we break it into smaller functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I broke the main function into 3 and then put all functionout outside of argparse in dsst_etl.hhs_doi.py
scripts/hhs_doi_cli.py
Outdated
identifier = doi_info.get("identifier") | ||
identifier_type = doi_info.get("identifier_type") | ||
extraction_method = doi_info.get("method") | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert a logger here so we can identify which exception occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Co-authored-by: Quang Nguyen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't run the code but it looks reasonable to merge. I'm assuming this functionality will be reviewed/used in upcoming development.
I agree with @quang-ngs points:
- Using different logging levels will make things easier to debug in the long run.
- Files in the script directory should have little but a CLI... other functionality should be in the package and ideally it should be tested.
- The function is too large which will harm maintainability. The deep nesting can be avoided by turning the core functionality into its own function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!!! 🚀 🚀 🚀 🚀 🚀 🚀 🚀
…he current working directory.
…toml.
@agt24 modified the filter_cli.py CLI to include DOI extraction with pdf2doi. I further modified it and created this PR. @leej3 , If it doesn't break anything, please approve the review. I'm happy to take any suggestions.