Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Created scripts/hhs_doi_cli.py and updated dependencies in pyproject.… #11

Merged
merged 6 commits into from
Dec 2, 2024

Conversation

joshlawrimore
Copy link
Contributor

@joshlawrimore joshlawrimore commented Nov 27, 2024

…toml.

@agt24 modified the filter_cli.py CLI to include DOI extraction with pdf2doi. I further modified it and created this PR. @leej3 , If it doesn't break anything, please approve the review. I'm happy to take any suggestions.

Copy link
Collaborator

@quang-ng quang-ng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we relocate the processing code to the dsst-etl folder and create a unit test for it? Here, we only call the function.

scripts/hhs_doi_cli.py Outdated Show resolved Hide resolved
output_exists: bool = output_csv.exists()

if output_exists and new_run is False:
print("Removing previously processed PDFs")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we using logger instead of print?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Set up a package-level logger at dsst_etl.logger

pdfs = list(pdfs)

dicts: list[dict] = []
with open(output_csv, "a") as f_out:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too complex, could we break it into smaller functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I broke the main function into 3 and then put all functionout outside of argparse in dsst_etl.hhs_doi.py

identifier = doi_info.get("identifier")
identifier_type = doi_info.get("identifier_type")
extraction_method = doi_info.get("method")
except Exception:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert a logger here so we can identify which exception occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

@leej3 leej3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run the code but it looks reasonable to merge. I'm assuming this functionality will be reviewed/used in upcoming development.

I agree with @quang-ngs points:

  • Using different logging levels will make things easier to debug in the long run.
  • Files in the script directory should have little but a CLI... other functionality should be in the package and ideally it should be tested.
  • The function is too large which will harm maintainability. The deep nesting can be avoided by turning the core functionality into its own function.

Copy link
Collaborator

@quang-ng quang-ng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!!! 🚀 🚀 🚀 🚀 🚀 🚀 🚀

@joshlawrimore joshlawrimore merged commit d372ddb into main Dec 2, 2024
4 checks passed
@joshlawrimore joshlawrimore deleted the filter branch December 2, 2024 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants