
Decouple storage from OCFL Object #22

Closed
rosy1280 opened this issue Jun 4, 2018 · 10 comments

@rosy1280
Contributor

rosy1280 commented Jun 4, 2018

Emory: I can't necessarily keep all the files on local disk because it is too expensive. So I'll need OCFL to be able to identify what storage root the files are in.

Stanford: I'm going to move zipped (not compressed) versions to various S3 buckets, and I need to track where those zipped versions went.

See also: OCFL/Use-Cases#10

@ntallman

ntallman commented Jun 4, 2018

Penn State: Same as Emory.

@zimeon
Contributor

zimeon commented Jun 19, 2018

I think the key question here is whether tracking the distribution and replication of OCFL objects is part of the OCFL spec or not. At Cornell we certainly have the same need to track where we keep the multiple copies we have for each object, but I see this as something outside of the object specification. I imagine separately tracking an inventory of objects.

@rosy1280 rosy1280 changed the title Decouple storage from OCFL Decouple storage from OCFL Object Jun 19, 2018
@neilsjefferies
Member

neilsjefferies commented Jul 4, 2018

Some thoughts on OCFL and decoupled storage.

I think each OCFL structure should know about and "manage" just one instance of each unique datastream. Corollary: if there is a copy of that datastream elsewhere, it should be part of another OCFL tree - so that all parts of an OCFL object will have the same level of redundancy.

We can then define the possible relationships between two OCFL trees. One is considered the source and is the copy that receives updates, which are then propagated to:

Replicas - these are identical trees. Crudely speaking, they can be synchronised using rsync-like tools. By definition, all storage is local to the filesystem on both sides. Filesystem-transparent mount points are allowed - OCFL would know nothing about them. You can have replicas of replicas.

Copies - have identical content, but storage on either side may be external, and possibly asynchronous, for some datastreams, and OCFL is aware of this. Synchronisation therefore requires more care and must be explicitly managed. You can have copies of replicas but not vice versa.

Snapshots - contain subsets of the source tree: fewer objects, limited versions of objects, or both. Storage may be local or external. Synchronisation is generally not carried out; if it is, it should be a pull from the snapshot end - the source shouldn't need to know what the selection criteria are. You can have snapshots of anything, and a copy of a snapshot - but that's it.

Each tree should know about all the copies, replicas and snapshots that are directly linked to it. But not necessarily those that are once or further removed.
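As a purely illustrative sketch (not part of any spec, and with all names invented), the derivation rules above could be encoded like this:

```python
# Illustrative only: encodes the replica/copy/snapshot derivation rules
# described above. The kind names and function are invented for this sketch.
#
# Per the stated rules:
#  - replicas may only derive from a source or another replica
#  - copies may derive from a source, a replica, or a snapshot
#    ("a copy of a snapshot - but that's it"); copy-of-copy is not
#    stated above, so it is deliberately left out here
#  - snapshots may derive from anything
ALLOWED_CHILDREN = {
    "source": {"replica", "copy", "snapshot"},
    "replica": {"replica", "copy", "snapshot"},
    "copy": {"snapshot"},
    "snapshot": {"copy", "snapshot"},
}

def can_derive(child_kind: str, parent_kind: str) -> bool:
    """Return True if a tree of child_kind may be derived from parent_kind."""
    return child_kind in ALLOWED_CHILDREN.get(parent_kind, set())
```

For example, `can_derive("replica", "copy")` is False, matching "copies of replicas but not vice versa".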

@awoods
Member

awoods commented Jul 10, 2018

@neilsjefferies : I think the definitions you are providing could be helpful, although I admit that some nuance of the distinctions between Replicas / Copies / Snapshots is probably lost on me without further clarification.

My understanding of a foundational principle of OCFL is that: if a user were to come across an OCFL structure (on a filesystem or in an object store) they would be able to make sense of that layout of files. In other words, any OCFL persistence structure must have the required elements defined by the OCFL specification:

  • OCFL Storage Root
  • OCFL Objects
  • OCFL Object Manifests
  • OCFL Object versions
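For concreteness, in the published spec those required elements correspond to a layout like the following (shown here per OCFL v1.0, which postdates this comment; the object path and file names under content/ are examples):

```
[storage_root]/
├── 0=ocfl_1.0               # storage root conformance declaration
├── ocfl_layout.json         # optional: how object IDs map to paths
└── example-object-1/        # an OCFL Object root
    ├── 0=ocfl_object_1.0    # object conformance declaration
    ├── inventory.json       # manifest + version state
    ├── inventory.json.sha512
    └── v1/
        ├── inventory.json
        ├── inventory.json.sha512
        └── content/
            └── file.txt
```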

@rosy1280
Contributor Author

I'm of the mind that we should leave this issue aside entirely for version 1 of the specification. We could easily spend days working this topic out.

@rosy1280
Contributor Author

Based on the conversation during this editorial committee meeting: https://github.com/OCFL/spec/wiki/2018.07.11-Editors-Meeting

We have decided to defer to the next version of the spec something that will allow us to decouple storage. How that will be done (within the spec or as an extension) will be determined at that time.

@julianmorley
Contributor

Noting for the future - one solution might be to have a 'remotes' directory at the object root that contains a hierarchy of files describing the state of remote copies as known to the local copy. It would be an optional directory (especially if a local object isn't aware of any remote copies!). The hierarchy of the remotes dir should support a 'true up' process, where multiple 'remotes' directories from multiple copies of an object can be compared to derive a unified view of where all copies of the object reside and at what version.
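The 'true up' step could be sketched as merging each copy's view of where the object lives; the data shape (location mapped to the highest version number known there) is invented for this example:

```python
# Hypothetical sketch of the "true up" process described above: merge the
# 'remotes' views held by several copies of an object into one unified view.
# The {location: highest known version} shape is invented, not from any spec.

def true_up(views):
    """Merge several remotes views into {location: highest known version}."""
    unified = {}
    for view in views:
        for location, version in view.items():
            # Keep the highest version any copy has recorded for a location.
            if version > unified.get(location, 0):
                unified[location] = version
    return unified
```

Comparing two views that disagree about a location would then surface the copy holding the newest version.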

@phochste

At Ghent University in Belgium we have a similar situation. We need to store tens of terabytes in our repository, but keeping it locally on disk is too expensive for us. As soon as data arrives in the archive, it is shipped to an external tape drive system managed by a nationwide archiving service. If we want to add updates to an existing package in the current system, we have only two options:

  • Create a new package with the new file layout (what we do now)
  • Download the previous package, add the file and upload it again

With OCFL, one could imagine a version update of a package if one could point to the previous version of the package in some way.

@zimeon zimeon added this to the 2.0 milestone Oct 2, 2019
@ptsefton

ptsefton commented Mar 3, 2022

@marcolarosa and I are discussing a related issue. What to do if you need to split your repository across multiple file systems? We have a situation where an institution can only supply ~200TB of storage in ~60TB chunks so the repo has to be on multiple file system mount points.

One way to deal with this (similar to the remotes idea from @julianmorley above) might be to put several OCFL repositories on the smaller mounts and create a Meta Repository: an OCFL repository containing only redirection info, which looks after path resolution for a distributed repo. You could do this without touching the OCFL spec at all by having each Object root in the meta-repo contain only a "where am I" file that points to where the Object actually resides, like BagIt's fetch mechanism - the real versioning etc. would happen "over there".

If you need to balance storage directories, an out-of-band process could copy content from one mount to another, update the "where am I" file to a new meta-version, and then delete (oops, dirty word here, I know) the original. Of course, you could maintain the index of where stuff is across multiple storage roots using an OCFL-aware repository API pointing at multiple roots.
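Resolution against such a meta-repo could look roughly like this; the file name "whereami.txt" and its one-line format are invented for this sketch and are not part of OCFL or BagIt:

```python
# Sketch of resolving a "where am I" pointer in a meta-repository, in the
# spirit of BagIt's fetch mechanism. The pointer file name and format are
# hypothetical; nothing here is defined by the OCFL spec.
import os

def resolve_object(object_root):
    """Return the real location of an object: the target named in the
    pointer file if one is present, otherwise the local object root."""
    pointer = os.path.join(object_root, "whereami.txt")
    if os.path.exists(pointer):
        with open(pointer) as f:
            # Single line naming where the object actually resides,
            # e.g. another mount point or a remote store.
            return f.read().strip()
    return object_root
```

A repository API would call this before any read or version operation, then work against the resolved location "over there".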

Deciding where to write your next object is obviously out of scope for OCFL but we'd like to work with OCFL as seamlessly as possible.

@rosy1280
Contributor Author

@ptsefton's comment is now its own ticket: OCFL/Use-Cases#43

We suspect that @phochste's comment was addressed in v1. We know @rosy1280's comment was about replication of objects, which we have confirmed as out of scope. We are unclear whether @ntallman's comment was also about replication of objects.

@phochste and @ntallman, if you feel your issue has not been addressed, please open a new ticket.


8 participants