
Implement data importer #21

Open · 2 tasks
amdomanska opened this issue Dec 19, 2024 · 4 comments

amdomanska commented Dec 19, 2024

The importer module should be responsible for:

  • picking up the data in a zip file (from S3-compatible storage)
  • managing the import
amdomanska (Collaborator, Author) commented:

@varadekd A similar importer has already been implemented for the EUI project. Please consult @Steven-Eardley and @J4bbi.

varadekd (Collaborator) commented:

@amdomanska Can you add a little more context to this? It will help me understand the flow where this importer will be used, and IMO it would be a good idea to have that information on the ticket so that in the future we know exactly what we did and why.

varadekd (Collaborator) commented:

@Steven-Eardley I would like your help on this, in terms of S3 and what we implemented in the EUI project.

amdomanska (Collaborator, Author) commented:

@varadekd

Here’s a breakdown of what needs to be done:

S3 Client Development:

  • You’ll need to create a client to interact with the S3-compatible storage where the data will be hosted.
  • Hrafn and Steve have experience with S3 storage integration, so feel free to reach out to them for guidance if needed.
  • We don’t have specifics on authentication yet, so please use the simplest method available (likely access keys) while keeping it modular and easy to replace if we need to adapt later; a minimal client sketch follows below.
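
As a starting point, here is a minimal sketch of what such a client could look like, assuming boto3 and access-key authentication; the bucket name, endpoint URL, and environment variable names are placeholders, not decisions.

```python
# Minimal sketch of an S3-compatible storage client (assumes boto3 and
# access-key auth; bucket, endpoint, and env var names are placeholders).
import os

import boto3


class ImportStorageClient:
    """Thin wrapper so credentials and backend can be swapped later."""

    def __init__(self, bucket, endpoint_url=None):
        self.bucket = bucket
        self._s3 = boto3.client(
            "s3",
            endpoint_url=endpoint_url,  # non-AWS S3-compatible hosts need this
            aws_access_key_id=os.environ["IMPORTER_S3_ACCESS_KEY"],
            aws_secret_access_key=os.environ["IMPORTER_S3_SECRET_KEY"],
        )

    def list_zip_keys(self, prefix=""):
        """Return the keys of all zip files under the given prefix."""
        paginator = self._s3.get_paginator("list_objects_v2")
        return [
            obj["Key"]
            for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix)
            for obj in page.get("Contents", [])
            if obj["Key"].endswith(".zip")
        ]

    def download(self, key, dest_path):
        """Download a single object to a local path and return that path."""
        self._s3.download_file(self.bucket, key, dest_path)
        return dest_path
```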

Data Import:

  • Since you worked on generating the test data, you already know its structure. No additional validation steps are required during import, as we assume the data is valid; a rough import sketch follows below.
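
For illustration only, the reading loop might look like the sketch below; it assumes the zip contains one JSON document per record, which should be adjusted to match the actual structure of the test data.

```python
# Sketch of reading records out of a downloaded zip file. The assumption that
# each record is a standalone .json entry is illustrative, not confirmed.
import json
import zipfile


def iter_records(zip_path):
    """Yield parsed records from a zip archive, with no extra validation."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as fh:
                    yield json.load(fh)
```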

Crosswalks:

  • The crosswalks developed by Hrafn are currently on his branch, waiting to be merged; here is the PR
  • The code Hrafn demoed includes all transformations, using Invenio custom fields with minimal reliance on Dacite. A rough sketch of the expected shape follows below.
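
To be clear, the real transformations are the ones in Hrafn's PR; the sketch below only illustrates the general shape of a crosswalk, and every field name in it (including the custom-field key) is hypothetical.

```python
# Hypothetical crosswalk shape: maps a source record dict onto an Invenio
# record dict. All field names and the custom-field key are illustrative.
def crosswalk_record(source):
    """Transform one source record into the Invenio record structure."""
    return {
        "metadata": {
            "title": source["title"],
            "publication_date": source.get("date"),
        },
        # Invenio custom fields are namespaced keys configured per instance.
        "custom_fields": {
            "example:source_id": source.get("id"),
        },
    }
```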

Automation and Scheduling:

  • The system must periodically (every 24 hours) check for new data in the S3 storage and, if found, import and process it.
  • We’d like to explore how this can be managed using Invenio Tasks. Hrafn and Steve may have insights on its implementation; a rough scheduling sketch follows below.
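
Since Invenio's background tasks run on Celery, one possible wiring is sketched below; it reuses the storage client and iter_records() from the earlier sketches, and the task module path, bucket name, and schedule time are placeholders.

```python
# Sketch of a daily check using Celery (which Invenio uses for background
# tasks). Module paths, bucket name, and the schedule time are placeholders.
from celery import shared_task
from celery.schedules import crontab


@shared_task
def check_for_new_data():
    """Look for new zip files in the S3 storage and queue an import for each."""
    client = ImportStorageClient(bucket="import-bucket")  # placeholder bucket
    for key in client.list_zip_keys():
        run_import.delay(key)  # one import task per file


@shared_task
def run_import(key):
    """Download one zip file and import its records."""
    client = ImportStorageClient(bucket="import-bucket")
    path = client.download(key, "/tmp/" + key.rsplit("/", 1)[-1])
    for record in iter_records(path):
        ...  # crosswalk the record and create it in Invenio


# Celery beat configuration: run the check once every 24 hours.
CELERY_BEAT_SCHEDULE = {
    "importer-daily-check": {
        "task": "importer.tasks.check_for_new_data",  # placeholder module path
        "schedule": crontab(hour=2, minute=0),  # daily at 02:00
    },
}
```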

Performance Optimization:

  • Importing and processing one file is estimated to take around three hours. To improve efficiency, implement threading to support parallel processing within a single file; a thread-pool sketch follows below.
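
One simple option is a thread pool over the records of a single file, as sketched below; the worker count and the process_record() helper are assumptions to be tuned against the real workload.

```python
# Sketch of parallel processing within one file using a thread pool.
# max_workers and process_record() are placeholders to be tuned/filled in.
from concurrent.futures import ThreadPoolExecutor


def process_record(record):
    """Placeholder: crosswalk one record and persist it in Invenio."""
    ...


def import_file(zip_path, max_workers=8):
    """Process all records of a single zip file in parallel."""
    records = list(iter_records(zip_path))  # from the import sketch above
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_record, records))
```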

Feel free to coordinate with Hrafn for Invenio-specific questions or crosswalk details and with Steve regarding S3-related tasks. Let me know if you need anything else or have further questions!
