
Implement data importer #21

Open · 2 tasks
amdomanska opened this issue Dec 19, 2024 · 4 comments

amdomanska commented Dec 19, 2024

The importer module should be responsible for:

  • picking up the data in a zip file (from S3-compatible storage)
  • managing the import
amdomanska (Collaborator, Author) commented:

@varadekd A similar importer has already been implemented for the EUI project. Please consult @Steven-Eardley and @J4bbi.

varadekd (Collaborator) commented:

@amdomanska Can you add a little more context to this? It will help me understand the flow where this importer will be used, and IMO it would be a good idea to have that information on the ticket so that in the future we know exactly what we did and why.

varadekd (Collaborator) commented:

@Steven-Eardley I would like your help on this, in terms of S3 and what we implemented in the EUI project.

amdomanska (Collaborator, Author) commented:

@varadekd

Here’s a breakdown of what needs to be done:

S3 Client Development:

  • You’ll need to create a client to interact with the S3-compatible storage where the data will be hosted.
  • Hrafn and Steve have experience with S3 storage integration, so feel free to reach out to them for guidance if needed.
  • We don’t have specifics on authentication yet, so please use the simplest method available (likely access keys) while keeping it modular and easy to replace if we need to adapt later; a minimal client sketch follows below.
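
As a starting point, here is a minimal sketch of what such a client could look like, assuming boto3 and access-key authentication; the bucket name, endpoint URL, and environment variable names are placeholders, not decisions.

```python
# Minimal sketch of an S3-compatible storage client (assumes boto3 and
# access-key auth; bucket, endpoint, and env var names are placeholders).
import os

import boto3


class ImportStorageClient:
    """Thin wrapper so credentials and backend can be swapped later."""

    def __init__(self, bucket, endpoint_url=None):
        self.bucket = bucket
        self._s3 = boto3.client(
            "s3",
            endpoint_url=endpoint_url,  # non-AWS S3-compatible hosts need this
            aws_access_key_id=os.environ["IMPORTER_S3_ACCESS_KEY"],
            aws_secret_access_key=os.environ["IMPORTER_S3_SECRET_KEY"],
        )

    def list_zip_keys(self, prefix=""):
        """Return the keys of all zip files under the given prefix."""
        paginator = self._s3.get_paginator("list_objects_v2")
        return [
            obj["Key"]
            for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix)
            for obj in page.get("Contents", [])
            if obj["Key"].endswith(".zip")
        ]

    def download(self, key, dest_path):
        """Download a single object to a local path and return that path."""
        self._s3.download_file(self.bucket, key, dest_path)
        return dest_path
```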

Data Import:

  • Since you worked on generating the test data, you already know its structure. No additional validation steps are required during import, as we assume the data is valid; a rough import sketch follows below.
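
For illustration only, the reading loop might look like the sketch below; it assumes the zip contains one JSON document per record, which should be adjusted to match the actual structure of the test data.

```python
# Sketch of reading records out of a downloaded zip file. The assumption that
# each record is a standalone .json entry is illustrative, not confirmed.
import json
import zipfile


def iter_records(zip_path):
    """Yield parsed records from a zip archive, with no extra validation."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as fh:
                    yield json.load(fh)
```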

Crosswalks:

  • The crosswalks developed by Hrafn are currently on his branch, waiting to be merged; here is the PR
  • The code Hrafn demoed includes all transformations, using Invenio custom fields with minimal reliance on Dacite. A rough sketch of the expected shape follows below.
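
To be clear, the real transformations are the ones in Hrafn's PR; the sketch below only illustrates the general shape of a crosswalk, and every field name in it (including the custom-field key) is hypothetical.

```python
# Hypothetical crosswalk shape: maps a source record dict onto an Invenio
# record dict. All field names and the custom-field key are illustrative.
def crosswalk_record(source):
    """Transform one source record into the Invenio record structure."""
    return {
        "metadata": {
            "title": source["title"],
            "publication_date": source.get("date"),
        },
        # Invenio custom fields are namespaced keys configured per instance.
        "custom_fields": {
            "example:source_id": source.get("id"),
        },
    }
```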

Automation and Scheduling:

  • The system must periodically (every 24 hours) check for new data in the S3 storage and, if found, import and process it.
  • We’d like to explore how this can be managed using Invenio Tasks. Hrafn and Steve may have insights on its implementation; a rough scheduling sketch follows below.
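
Since Invenio's background tasks run on Celery, one possible wiring is sketched below; it reuses the storage client and iter_records() from the earlier sketches, and the task module path, bucket name, and schedule time are placeholders.

```python
# Sketch of a daily check using Celery (which Invenio uses for background
# tasks). Module paths, bucket name, and the schedule time are placeholders.
from celery import shared_task
from celery.schedules import crontab


@shared_task
def check_for_new_data():
    """Look for new zip files in the S3 storage and queue an import for each."""
    client = ImportStorageClient(bucket="import-bucket")  # placeholder bucket
    for key in client.list_zip_keys():
        run_import.delay(key)  # one import task per file


@shared_task
def run_import(key):
    """Download one zip file and import its records."""
    client = ImportStorageClient(bucket="import-bucket")
    path = client.download(key, "/tmp/" + key.rsplit("/", 1)[-1])
    for record in iter_records(path):
        ...  # crosswalk the record and create it in Invenio


# Celery beat configuration: run the check once every 24 hours.
CELERY_BEAT_SCHEDULE = {
    "importer-daily-check": {
        "task": "importer.tasks.check_for_new_data",  # placeholder module path
        "schedule": crontab(hour=2, minute=0),  # daily at 02:00
    },
}
```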

Performance Optimization:

  • Importing and processing one file is estimated to take around three hours. To improve efficiency, implement threading to support parallel processing within a single file; a thread-pool sketch follows below.
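
One simple option is a thread pool over the records of a single file, as sketched below; the worker count and the process_record() helper are assumptions to be tuned against the real workload.

```python
# Sketch of parallel processing within one file using a thread pool.
# max_workers and process_record() are placeholders to be tuned/filled in.
from concurrent.futures import ThreadPoolExecutor


def process_record(record):
    """Placeholder: crosswalk one record and persist it in Invenio."""
    ...


def import_file(zip_path, max_workers=8):
    """Process all records of a single zip file in parallel."""
    records = list(iter_records(zip_path))  # from the import sketch above
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_record, records))
```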

Feel free to coordinate with Hrafn for Invenio-specific questions or crosswalk details and with Steve regarding S3-related tasks. Let me know if you need anything else or have further questions!
