Support for the Ingest of Large Data Files #1589

Open

mtribone opened this issue Jul 8, 2019 · 6 comments

mtribone (Contributor) commented Jul 8, 2019

Write a script that will grab the files from a staging server (an NFS mount), take each package, and process it into ScholarSphere.
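
A minimal sketch of what such a script might look like, assuming a hypothetical NFS mount point at `/mnt/staging`; none of the paths or names below are final.

```ruby
#!/usr/bin/env ruby
# Sketch only: walk the staging mount and hand each package directory off
# to a processing step. "/mnt/staging" is an assumed mount point.
STAGING_ROOT = "/mnt/staging"

Dir.children(STAGING_ROOT).sort.each do |package|
  package_path = File.join(STAGING_ROOT, package)
  next unless File.directory?(package_path)

  puts "Processing package #{package} from #{package_path}"
  # Actual processing (attaching files to a ScholarSphere work) would go here.
end
```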

mtribone added this to the ScholarSphere 3.9 milestone on Jul 8, 2019
DanCoughlin (Contributor) commented:

The RIP team will have an NFS mount from their local systems to a prep space on Isilon storage. This will enable them to obtain hard drives from folks, copy the relevant data over to Isilon, and begin curating it. (We may also be able to provide a Globus transfer of the data, but that will take a bit more work, and this effort does not depend on a Globus endpoint being complete.) Once the RIP team has completed the curation process, they can move the files that are to be published in ScholarSphere into a staging area for ingest.

The DSRD team will write a script that moves files from this staging area into ScholarSphere. The script will upload into ScholarSphere while bypassing the web form, which we believe will be more stable for large files. At this point we believe the threshold for this process is 10GB per file; we cannot handle anything larger. A collection may total more than 10GB, but no single file within it can exceed 10GB.
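
To make the per-file threshold concrete, here is a hedged sketch of a size check the script could run before ingesting a package; only the 10GB-per-file limit comes from the comment above, everything else (names, skip-and-report behavior) is an assumption.

```ruby
# 10GB per-file limit from the comment above; a package's total size may be
# larger, but any single file over the limit means the package is not ingested.
MAX_FILE_SIZE = 10 * 1024**3

def oversized_files(package_path)
  Dir.glob(File.join(package_path, "**", "*"))
     .select { |path| File.file?(path) && File.size(path) > MAX_FILE_SIZE }
end

# Example: skip and report rather than attempt the upload.
# oversized_files("/mnt/staging/1234xyz").each { |f| warn "Too large: #{f}" }
```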

mtribone (Contributor, Author) commented Jul 8, 2019

The work will need to be created first by RePub, and the package folder on the staging server will need to be named with the same ID as the work, so that we can programmatically ingest the data into the correct work.
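
A sketch of that lookup, assuming the directory name is the work's ID and an ActiveFedora-style `GenericWork.find`; the model is implied by the URL pattern later in this thread, but the error class and helper name are assumptions.

```ruby
# Map a staging directory to the work RePub created ahead of time.
# The rescue'd error class is an assumption about the ActiveFedora API.
def find_work_for(package_dir)
  work_id = File.basename(package_dir)   # e.g. "1234xyz"
  GenericWork.find(work_id)
rescue ActiveFedora::ObjectNotFoundError
  warn "No existing work for #{work_id}; skipping #{package_dir}"
  nil
end
```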

awead (Contributor) commented Jul 8, 2019

(Image attachment: "Image from iOS (1)")

mtribone (Contributor, Author) commented Jul 8, 2019

Start off by running the script manually instead of via a cronjob, and learn more about the process before automating it.
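
One way to keep the manual-first approach convenient is to wrap the script in a rake task that can be run by hand now and moved to cron later; the task name, file path, and environment variable below are hypothetical.

```ruby
# lib/tasks/ingest_staged.rake (hypothetical)
namespace :scholarsphere do
  desc "Ingest staged packages from the NFS mount (run manually for now)"
  task ingest_staged: :environment do
    staging_root = ENV.fetch("STAGING_ROOT", "/mnt/staging")  # assumed default
    Dir.children(staging_root).sort.each do |package|
      puts "Would ingest #{package}"   # real processing goes here
    end
  end
end
```

This could be run by hand with `bundle exec rake scholarsphere:ingest_staged` until the process is understood well enough to automate.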

awead (Contributor) commented Jul 8, 2019

Directory structure would look like:

1234xyz/
  README.md
  dataset.dat
  paper.pdf
  other.mp3

Where the work is present as https://scholarsphere.psu.edu/concern/generic_works/1234xyz
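
A sketch of walking that layout and attaching each file to the work; `IngestLocalFileJob` is a placeholder name, and the actual attachment mechanism in ScholarSphere may differ.

```ruby
# Attach every file in the "1234xyz" package to the matching work.
# IngestLocalFileJob stands in for whatever job or service the application
# actually uses to add a file to a work.
work = GenericWork.find("1234xyz")
Dir.glob("/mnt/staging/1234xyz/*").select { |p| File.file?(p) }.each do |path|
  IngestLocalFileJob.perform_later(work.id, File.expand_path(path))
end
```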

awead self-assigned this on Jul 8, 2019
awead (Contributor) commented Jul 10, 2019

Add https://github.com/ono/resque-cleaner for easier management of the jobs being created.
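
A sketch of that integration, based on the gem's README; the console calls should be verified against the installed version.

```ruby
# Gemfile
gem 'resque-cleaner'

# In a Rails console, the cleaner wraps Resque's failed-job queue
# (method names per the resque-cleaner README; verify against the installed version):
# cleaner = Resque::Plugins::ResqueCleaner.new
# cleaner.stats_by_class     # summarize failures per job class
# cleaner.clear              # prune failed jobs once reviewed
```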
