Decouple storage from OCFL Object #22
Emory: I can't necessarily keep all the files on local disk because it is too expensive, so I'll need OCFL to be able to identify which storage root the files are in.

Stanford: I'm going to move zipped (not compressed) versions to various S3 buckets, and I need to track where those zipped versions went.

See also: OCFL/Use-Cases#10

Comments

Penn State: Same as Emory.
I think the key question here is whether tracking the distribution and replication of OCFL objects is part of the OCFL spec or not. At Cornell we certainly have the same need to track where we keep the multiple copies of each object, but I see this as something outside the object specification. I imagine tracking an inventory of objects separately.
Some thoughts on OCFL and decoupled storage. I think each OCFL structure should know about and "manage" just one instance of each unique datastream. Corollary: if there is a copy of that datastream elsewhere, it should be part of another OCFL tree, so that all parts of an OCFL object have the same level of redundancy. We can then define the possible relationships between two OCFL trees. One is considered the source; it receives updates, which are then propagated to:

- Replicas: identical trees. Crudely speaking, they can be synchronised using rsync-like tools. By definition, all storage is local to the filesystem on both sides. Filesystem-transparent mount points are allowed; OCFL would know nothing about them. You can have replicas of replicas.
- Copies: have identical content, but on either side storage may be external, and possibly asynchronous, for some datastreams, and OCFL is aware of this. Synchronisation therefore requires more care and must be explicitly managed. You can have copies of replicas but not vice versa.
- Snapshots: contain subsets of the source tree, either fewer objects, limited versions of objects, or both. Storage may be local or external. Synchronisation is generally not carried out; if it is, it should be a pull from the snapshot end, since the source shouldn't need to know what the selection criteria are. You can have snapshots of anything, and a copy of a snapshot, but that's it.

Each tree should know about all the copies, replicas and snapshots that are directly linked to it, but not necessarily those that are once or further removed. The derivation rules above are sketched in code below.
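For illustration, those derivation rules could be captured mechanically. A minimal sketch, assuming a tree is tagged with one of four roles and reading "but that's it" as allowing only copies and further snapshots to derive from a snapshot; the role names and the validator are hypothetical, not part of any spec:

```python
# Hypothetical encoding of the tree-relationship rules described above.
# "source" is the tree that receives updates; everything else derives from it.
ALLOWED_DERIVATIONS = {
    "source":   {"replica", "copy", "snapshot"},
    "replica":  {"replica", "copy", "snapshot"},  # replicas of replicas are fine
    "copy":     {"copy", "snapshot"},             # copies of replicas, but no replica of a copy
    "snapshot": {"copy", "snapshot"},             # "a copy of a snapshot - but that's it"
}

def may_derive(parent_role: str, child_role: str) -> bool:
    """Return True if a tree with parent_role may have a directly
    linked child tree with child_role."""
    return child_role in ALLOWED_DERIVATIONS.get(parent_role, set())

assert may_derive("replica", "copy")      # copies of replicas: allowed
assert not may_derive("copy", "replica")  # "...but not vice versa"
```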
@neilsjefferies: I think the definitions you are providing could be helpful, although I admit that some nuance of the distinctions between replicas, copies, and snapshots is probably lost on me without further clarification. My understanding of a foundational principle of OCFL is that if a user were to come across an OCFL structure (on a filesystem or in an object store), they would be able to make sense of that layout of files. In other words, any OCFL persistence structure must have the required elements defined by the OCFL specification.
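For illustration, a user (or tool) encountering such a structure could check for those required markers. A minimal sketch, assuming the OCFL v1.0 object conventions (a NAMASTE conformance file, an inventory, and its digest sidecar); this is a quick sniff, not a full validator:

```python
from pathlib import Path

def looks_like_ocfl_object(root: str) -> bool:
    """Quick conformance sniff for an OCFL object root; checks only the
    top-level required files, not version directories or inventory content."""
    path = Path(root)
    return (
        (path / "0=ocfl_object_1.0").is_file()     # NAMASTE conformance declaration
        and (path / "inventory.json").is_file()    # inventory describing all versions
        and any(path.glob("inventory.json.sha*"))  # digest sidecar, e.g. inventory.json.sha512
    )
```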
I'm of the mind that we should leave this issue aside entirely for version 1 of the specification. We could easily spend days working this topic out.
Based on the conversation during this editorial committee meeting (https://github.com/OCFL/spec/wiki/2018.07.11-Editors-Meeting), we have decided to defer to the next version of the spec something that will allow us to decouple storage. Whether that is done within the spec or as an extension will be determined at that time.
Noting for the future: one solution might be to have a 'remotes' directory at the object root that contains a hierarchy of files describing the state of remote copies as known to the local copy. It would be an optional directory (especially since a local object may not be aware of any remote copies!). The hierarchy of the remotes directory should support a 'true up' process, where multiple 'remotes' directories from multiple copies of an object can be compared to derive a unified view of where all copies of the object reside and at what version; a sketch follows below.
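To make the 'true up' idea concrete, here is a minimal sketch. The per-remote JSON files, their field names (`location`, `version`, `observed`), and the merge rule are all assumptions for illustration, not a proposed format:

```python
import json
from pathlib import Path

def read_remotes(object_root: str) -> dict:
    """Read every record in an object's optional 'remotes' directory."""
    records = {}
    remotes_dir = Path(object_root) / "remotes"
    if not remotes_dir.is_dir():  # a purely local object may have no remotes dir
        return records
    for f in remotes_dir.glob("*.json"):
        rec = json.loads(f.read_text())
        # e.g. {"location": "s3://bucket/obj1", "version": "v3",
        #       "observed": "2018-07-11T00:00:00Z"}
        records[rec["location"]] = rec
    return records

def true_up(*object_roots: str) -> dict:
    """Merge the remotes views from several copies of the same object into a
    unified view, keeping the most recently observed record per location."""
    unified = {}
    for root in object_roots:
        for loc, rec in read_remotes(root).items():
            if loc not in unified or rec["observed"] > unified[loc]["observed"]:
                unified[loc] = rec
    return unified
```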
At Ghent University in Belgium we have a similar situation. We need to store tens of terabytes in our repository, but keeping it locally on disk is too expensive for us. As soon as data arrives in the archive, it is shipped to an external tape drive system managed by a nationwide archiving service. If we want to add updates to an existing package in the current system, we have only two options:
With OCFL, one could imagine a version update of a package if one could point to the previous version of the package in some way.
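For illustration, this is roughly how an OCFL inventory lets a new version point at content held by a previous one: digests in a version's state refer to manifest entries, so unchanged files are not stored again. The identifier, digests, and timestamps below are placeholders:

```python
# Sketch of an OCFL inventory after a v2 update that adds one file.
# Digests act as pointers: v2's state reuses v1's manifest entry for the
# unchanged file, so only the new content needs to be shipped and stored.
inventory = {
    "id": "urn:example:package1",                  # placeholder identifier
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v2",
    "manifest": {
        "aa11...": ["v1/content/data.tiff"],       # digests truncated for readability
        "bb22...": ["v2/content/extra.xml"],
    },
    "versions": {
        "v1": {"created": "2018-01-01T00:00:00Z",
               "state": {"aa11...": ["data.tiff"]}},
        "v2": {"created": "2018-07-01T00:00:00Z",
               "state": {"aa11...": ["data.tiff"],    # unchanged, points back to v1 content
                         "bb22...": ["extra.xml"]}},  # only this file is new under v2
    },
}
```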
@marcolarosa and I are discussing a related issue: what to do if you need to split your repository across multiple filesystems. We have a situation where an institution can only supply ~200TB of storage in ~60TB chunks, so the repo has to span multiple filesystem mount points. One way to deal with this (similar to the remotes idea from @julianmorley above) might be to put several OCFL repositories on the smaller mounts and create a meta-repository: an OCFL repository containing only redirection info, which looks after path resolution for a distributed repo. You could do this without touching the OCFL spec at all by having the object roots in the meta-repo contain only a "where am I" file that points to where the object actually resides, like BagIt's fetch mechanism; the real versioning etc. would happen "over there". If you need to rebalance storage directories, an out-of-band process could copy content from one mount to another, update the "where am I" file in a new meta-version, and then delete (oops, dirty word here, I know) the original. Of course, you could instead maintain the index of where things are across multiple storage roots using an OCFL-aware repository API pointing at multiple roots. Deciding where to write your next object is obviously out of scope for OCFL, but we'd like to work with OCFL as seamlessly as possible.
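A minimal sketch of the meta-repository idea, assuming a hypothetical "where-am-i" pointer file in each object root of the meta-repo; the file name and layout are made up for illustration:

```python
from pathlib import Path

def resolve_object_root(meta_repo_root: str, object_path: str) -> Path:
    """Follow the pointer file in a meta-repository object root to the
    storage mount that actually holds the versioned OCFL object."""
    pointer = Path(meta_repo_root) / object_path / "where-am-i"  # hypothetical file name
    target = pointer.read_text().strip()  # e.g. "/mnt/store03/ocfl/ab/cd/object1"
    return Path(target)

# Rebalancing (out of band): copy the object tree to another mount, rewrite
# the where-am-i pointer in a new meta-version, then remove the original.
```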
@ptsefton's comment is now its own ticket: OCFL/Use-Cases#43. We suspect that @phochste's comment was addressed in v1. We know @rosy1280's comment was about replication of objects, which we have confirmed is out of scope. We are unclear whether @ntallman's comment was also about replication of objects. @phochste and @ntallman, if you feel your issue has not been addressed, please open a new ticket.