Single-file OCFL object storage (e.g., Tar, Zip) #10

Closed
ahankinson opened this issue Feb 27, 2018 · 7 comments
Labels
Confirmed: Out-of-scope Use case will not be included in the upcoming version of the spec or implementation notes.

Comments

@ahankinson
Contributor

ahankinson commented Feb 27, 2018

A multinational astronomical research initiative has several terabyte-sized datasets that it wishes to make available to researchers around the world. These datasets are published as 1 TB files, so the server filesystem is optimized for storing very large files. Their OCFL Objects are stored as ZIP files to reduce the number of small files on their storage system. They implement an OCFL server that uses the ZIP file header to seek within a file and extract a particular file with low overhead, effectively providing 'directory-like' lookups.
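
To make the 'directory-like' lookup concrete, here is a minimal sketch using Python's standard `zipfile` module. It is an illustration, not part of the use case: the helper names (`build_stored_zip`, `read_member`, `list_dir`) and the member paths are assumptions. `zipfile` reads the archive's central directory first and then seeks to the requested member, so other members are not read.

```python
import io
import zipfile


def build_stored_zip(files):
    """Pack {path: bytes} into an in-memory ZIP with no compression (ZIP_STORED)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()


def read_member(zip_bytes, path):
    """Return the bytes of a single member, located via the central directory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(path)


def list_dir(zip_bytes, prefix):
    """List member paths under a prefix, emulating a directory listing."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return sorted(name for name in zf.namelist() if name.startswith(prefix))


# Hypothetical object layout, for illustration only.
obj = build_stored_zip({
    "inventory.json": b"{}",
    "v1/content/a.txt": b"hello",
    "v1/content/b.txt": b"world",
})
print(read_member(obj, "v1/content/a.txt"))
print(list_dir(obj, "v1/content/"))
```

With a real 1 TB object the archive would be opened from disk rather than from memory, but the access pattern is the same.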

@awoods awoods added the Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. label Mar 28, 2018
@ahankinson ahankinson changed the title Large-dataset storage Single-file OCFL object storage (e.g., Tar, Zip) Apr 3, 2018
@ahankinson
Contributor Author

Notes from LDCX: Uncompressed ZIP is preferred over TAR due to a more deterministic approach to header reading and better support for path names.

Uncompressed ZIP is ISO/IEC 21320-1:2015

@zimeon
Contributor

zimeon commented Apr 4, 2018

I think this should be in-scope because the idea of a self-contained object as one resource is potentially useful for storage (and will help us think about transfer). To me this speaks to the split between the object-location part of the spec and the object-structure part of the spec. I can imagine object-location having {root}/{id-based-pairtree} under which we have either a folder called {id} or a file {id}.zip.
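
A rough sketch of that location scheme, for illustration only. The two-character split below is a simplification of a pairtree (it does no character escaping), and `pairtree_segments` and `candidate_locations` are hypothetical names, not anything defined by the spec.

```python
from pathlib import PurePosixPath


def pairtree_segments(object_id, width=2):
    """Split an identifier into fixed-width segments (simplified: no escaping)."""
    return [object_id[i:i + width] for i in range(0, len(object_id), width)]


def candidate_locations(root, object_id):
    """Return the two places an object could live: a folder {id} or a file {id}.zip."""
    base = PurePosixPath(root, *pairtree_segments(object_id))
    return base / object_id, base / f"{object_id}.zip"


folder, archive = candidate_locations("/ocfl", "abcd12")
print(folder)   # /ocfl/ab/cd/12/abcd12
print(archive)  # /ocfl/ab/cd/12/abcd12.zip
```

A resolver built this way would check which of the two candidates exists and then hand off to either a directory reader or a ZIP reader.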

Are there utilities that will use byte-range requests to effectively access ZIPs in an HTTP object store?
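
Setting aside which existing utilities do this, the mechanism can be sketched with a seekable file-like wrapper whose reads are served by range fetches. Everything named here is an assumption for illustration: `RangeReader` is hypothetical, and `fetch_range(start, end)` stands in for an HTTP GET with a `Range: bytes=start-(end-1)` header. The demo simulates the object store with an in-memory byte string.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file whose bytes come from a range-fetching callable."""

    def __init__(self, fetch_range, size):
        super().__init__()
        self._fetch = fetch_range  # hypothetical: (start, end) -> bytes, end exclusive
        self._size = size
        self._pos = 0
        self.requests = []         # log of (start, end) ranges actually fetched

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        if pos < 0:
            raise OSError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, buffer):
        end = min(self._pos + len(buffer), self._size)
        if end <= self._pos:
            return 0  # at or past end of file
        data = self._fetch(self._pos, end)
        self.requests.append((self._pos, end))
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)


# Simulated remote object: one large member and one small member, uncompressed.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("v1/content/big.bin", b"\0" * 1_000_000)
    zf.writestr("v1/content/small.txt", b"hello")
blob = buf.getvalue()

reader = RangeReader(lambda start, end: blob[start:end], len(blob))
with zipfile.ZipFile(reader) as zf:
    small = zf.read("v1/content/small.txt")
fetched = sum(end - start for start, end in reader.requests)
print(small, fetched, len(blob))
```

In this simulation only the end-of-archive records, the central directory, and the requested member are fetched, so the bytes transferred are a small fraction of the archive.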

@julianmorley
Contributor

https://gist.github.com/julianmorley/fbcff1f33a1113fb2ec6ea51fc06e46c
I've sketched out a definition for an inventory-archive.json that could track large, archive-file objects. Combined with the regular inventory.json there should be enough info to be able to locate a desired file within the archives.

Practically, we should plan on any one version of an OCFL object being stored in one or more archive files. For example, we plan to segment any one version of our large objects into 10GB zip segments.
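
One way such segmentation could be planned is sketched below. This is an assumption-laden illustration and does not reproduce the format in the gist: `plan_segments`, `segment_index`, and the segment naming pattern are hypothetical. Files are assigned greedily, in order, to segments capped at a byte limit, and the resulting index maps each file to the archive that holds it.

```python
def plan_segments(file_sizes, max_bytes):
    """Greedily group files, in the given order, into segments of at most max_bytes.

    A single file larger than max_bytes gets a segment of its own.
    """
    segments, current, used = [], [], 0
    for path, size in file_sizes.items():
        if current and used + size > max_bytes:
            segments.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        segments.append(current)
    return segments


def segment_index(segments, name_pattern="v1-{:04d}.zip"):
    """Map each file path to the (hypothetical) archive name that contains it."""
    return {
        path: name_pattern.format(number)
        for number, segment in enumerate(segments, start=1)
        for path in segment
    }


sizes = {"a.dat": 6, "b.dat": 5, "c.dat": 4, "d.dat": 12}  # toy sizes in bytes
segments = plan_segments(sizes, max_bytes=10)
print(segments)
print(segment_index(segments))
```

For the 10GB case, `max_bytes` would be `10 * 10**9` and the sizes would come from the version's file listing.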

@ahankinson ahankinson added Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes. and removed Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. labels Jun 5, 2018
@ahankinson
Contributor Author

The original use case is slightly different from the one assumed in your solution, @julianmorley. It was that an entire OCFL Object can be stored as an uncompressed ZIP file, which could then be treated as a writeable object. (I've edited the text above and clarified this a bit)

I believe you are assuming individual zipped-up version directories. I think this would be a separate valid use case, so I will file one and reference this one.

@neilsjefferies
Member

Treating a zip as a writable object is not smart: updates will result in in situ temp file writing of equivalent size to the zip, which breaks many OCFL assumptions. A zip can, however, be mounted as a file system for reading. FWIW, Sun Honeycomb object stores had the code to do that, but it was never in a release version.
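
The cost of an update can be seen in a small sketch that uses Python's `zipfile` to replace one member by copying the whole archive. The helper `replace_member` is hypothetical and is only one way to do the rewrite; it returns the size of the temp file to show that it is comparable to the archive itself.

```python
import os
import tempfile
import zipfile


def replace_member(zip_path, member, new_data):
    """Replace `member` by rewriting the whole archive; return the temp file size."""
    dir_name = os.path.dirname(os.path.abspath(zip_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=dir_name)
    os.close(fd)
    try:
        with zipfile.ZipFile(zip_path) as src, \
                zipfile.ZipFile(tmp_path, "w", compression=zipfile.ZIP_STORED) as dst:
            for info in src.infolist():
                # Every unchanged member is copied too (read fully into memory here).
                if info.filename == member:
                    data = new_data
                else:
                    data = src.read(info.filename)
                dst.writestr(info.filename, data)
        tmp_size = os.path.getsize(tmp_path)
        os.replace(tmp_path, zip_path)  # swap the rewritten archive into place
        return tmp_size
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "object.zip")
with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("v1/content/big.bin", b"x" * 100_000)
    zf.writestr("inventory.json", b"old")

tmp_size = replace_member(path, "inventory.json", b"new")
print(tmp_size)  # roughly the size of the whole archive, not of the changed file
```

Changing a three-byte file here still writes a temp file larger than the 100,000-byte member that was left untouched.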

@rosy1280
Contributor

potentially a sub-use case of #39

@rosy1280 rosy1280 added Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. Confirmed: Out-of-scope Use case will not be included in the upcoming version of the spec or implementation notes. and removed Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes. Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. labels Sep 22, 2023
@zimeon
Contributor

zimeon commented Sep 22, 2023

Editors' discussion 2023-09-22: We have not heard of an implementation where zip-per-object is desired. Treating ZIPs as writeable objects is not a good idea because the implementation will need a temp file the size of the uncompressed ZIP. See instead the zip-per-version use case, #33.

@zimeon zimeon closed this as completed Sep 22, 2023
6 participants