
Decouple storage from OCFL Object #22

Closed
rosy1280 opened this issue Jun 4, 2018 · 10 comments

@rosy1280
Contributor

rosy1280 commented Jun 4, 2018

Emory: I can't necessarily keep all the files on local disk because it is too expensive. So I'll need OCFL to be able to identify what storage root the files are in.

Stanford: I'm going to move zipped (not compressed) versions to various S3 buckets, and I need to track where those zipped versions went.

See also: OCFL/Use-Cases#10

@ntallman

ntallman commented Jun 4, 2018

Penn State: Same as Emory.

@zimeon
Contributor

zimeon commented Jun 19, 2018

I think the key question here is whether tracking the distribution and replication of OCFL objects is part of the OCFL spec or not. At Cornell we certainly have the same need to track where we keep the multiple copies we have for each object, but I see this as something outside of the object specification. I imagine separately tracking an inventory of objects.

@rosy1280 rosy1280 changed the title Decouple storage from OCFL Decouple storage from OCFL Object Jun 19, 2018
@neilsjefferies
Member

neilsjefferies commented Jul 4, 2018

Some thoughts on OCFL and decoupled storage.

I think each OCFL structure should know about and "manage" just one instance of each unique datastream. Corollary: if there is a copy of that datastream elsewhere, it should be part of another OCFL tree - so that all parts of an OCFL object will have the same level of redundancy.

We can then define the possible relationships between two OCFL trees. One is considered the source and is the copy that receives updates, which are then propagated to:

Replicas - these are identical trees. Crudely speaking, they can be synchronised using rsync-like tools. By definition, all storage is local to the filesystem on both sides. Filesystem-transparent mount points are allowed - OCFL would know nothing about them. You can have replicas of replicas.

Copies - have identical content, but storage on either side may be external, and possibly asynchronous, for some datastreams, and OCFL is aware of this. Synchronisation therefore requires more care and must be explicitly managed. You can have copies of replicas but not vice versa.

Snapshots - contain subsets of the source tree: fewer objects, limited versions of objects, or both. Storage may be local or external. Synchronisation is generally not carried out; if it is, it should be a pull from the snapshot end - the source shouldn't need to know what the selection criteria are. You can have snapshots of anything, and a copy of a snapshot - but that's it.

Each tree should know about all the copies, replicas and snapshots that are directly linked to it. But not necessarily those that are once or further removed.
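As a purely illustrative sketch (not part of any spec, and with all names invented), the derivation rules above could be encoded like this:

```python
# Illustrative only: encodes the replica/copy/snapshot derivation rules
# described above. The kind names and function are invented for this sketch.
#
# Per the stated rules:
#  - replicas may only derive from a source or another replica
#  - copies may derive from a source, a replica, or a snapshot
#    ("a copy of a snapshot - but that's it"); copy-of-copy is not
#    stated above, so it is deliberately left out here
#  - snapshots may derive from anything
ALLOWED_CHILDREN = {
    "source": {"replica", "copy", "snapshot"},
    "replica": {"replica", "copy", "snapshot"},
    "copy": {"snapshot"},
    "snapshot": {"copy", "snapshot"},
}

def can_derive(child_kind: str, parent_kind: str) -> bool:
    """Return True if a tree of child_kind may be derived from parent_kind."""
    return child_kind in ALLOWED_CHILDREN.get(parent_kind, set())
```

For example, `can_derive("replica", "copy")` is False, matching "copies of replicas but not vice versa".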

@awoods
Member

awoods commented Jul 10, 2018

@neilsjefferies : I think the definitions you are providing could be helpful, although I admit that some nuance of the distinctions between Replicas / Copies / Snapshots is probably lost on me without further clarification.

My understanding of a foundational principle of OCFL is that: if a user were to come across an OCFL structure (on a filesystem or in an object store) they would be able to make sense of that layout of files. In other words, any OCFL persistence structure must have the required elements defined by the OCFL specification:

  • OCFL Storage Root
  • OCFL Objects
  • OCFL Object Manifests
  • OCFL Object versions
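For concreteness, in the published spec those required elements correspond to a layout like the following (shown here per OCFL v1.0, which postdates this comment; the object path and file names under content/ are examples):

```
[storage_root]/
├── 0=ocfl_1.0               # storage root conformance declaration
├── ocfl_layout.json         # optional: how object IDs map to paths
└── example-object-1/        # an OCFL Object root
    ├── 0=ocfl_object_1.0    # object conformance declaration
    ├── inventory.json       # manifest + version state
    ├── inventory.json.sha512
    └── v1/
        ├── inventory.json
        ├── inventory.json.sha512
        └── content/
            └── file.txt
```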

@rosy1280
Contributor Author

I'm of the mind that we should leave this issue aside entirely for version 1 of the specification. We could easily spend days working this topic out.

@rosy1280
Contributor Author

Based on the conversation during this editorial committee meeting: https://github.com/OCFL/spec/wiki/2018.07.11-Editors-Meeting

We have decided to defer to the next version of the spec something that will allow us to decouple storage. How that will be done (within the spec or as an extension) will be determined at that time.

@julianmorley
Contributor

Noting for the future - one solution might be to have a 'remotes' directory at the object root that contains a hierarchy of files describing the state of remote copies as known to the local copy. It would be an optional directory (especially if a local object isn't aware of any remote copies!). The hierarchy of the remotes dir should support a 'true up' process, where multiple 'remotes' directories from multiple copies of an object can be compared to derive a unified view of where all copies of the object reside and at what version.
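The 'true up' step could be sketched as merging each copy's view of where the object lives; the data shape (location mapped to the highest version number known there) is invented for this example:

```python
# Hypothetical sketch of the "true up" process described above: merge the
# 'remotes' views held by several copies of an object into one unified view.
# The {location: highest known version} shape is invented, not from any spec.

def true_up(views):
    """Merge several remotes views into {location: highest known version}."""
    unified = {}
    for view in views:
        for location, version in view.items():
            # Keep the highest version any copy has recorded for a location.
            if version > unified.get(location, 0):
                unified[location] = version
    return unified
```

Comparing two views that disagree about a location would then surface the copy holding the newest version.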

@phochste

At Ghent University in Belgium we have a similar situation. We need to store tens of terabytes in our repository, but keeping it locally on disk is too expensive for us. As soon as data arrives in the archive, it is shipped to an external tape drive system managed by a nationwide archiving service. If we want to add updates to an existing package in the current system, we have only two options:

  • Create a new package with the new file layout (what we do now)
  • Download the previous package, add the file and upload it again

With OCFL, one could imagine a version update of a package if one could point to the previous version of the package in some way.

@zimeon zimeon added this to the 2.0 milestone Oct 2, 2019
@ptsefton

ptsefton commented Mar 3, 2022

@marcolarosa and I are discussing a related issue. What to do if you need to split your repository across multiple file systems? We have a situation where an institution can only supply ~200TB of storage in ~60TB chunks so the repo has to be on multiple file system mount points.

One way to deal with this (similar to the remotes idea from @julianmorley above) might be to put several OCFL repositories on the smaller mounts and create a Meta Repository: an OCFL repository containing only redirection info, which looks after path resolution for a distributed repo. You could do this without touching the OCFL spec at all by having each Object root in the meta-repo contain only a "where am I" file that points to where the Object actually resides, like BagIt's fetch mechanism - the real versioning etc. would happen "over there".

If you need to balance storage directories, an out-of-band process could copy content from one mount to another, update the "where am I" file to a new meta-version, and then delete (oops, dirty word here, I know) the original. Of course, you could maintain the index of where stuff is across multiple storage roots using an OCFL-aware repository API pointing at multiple roots.
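Resolution against such a meta-repo could look roughly like this; the file name "whereami.txt" and its one-line format are invented for this sketch and are not part of OCFL or BagIt:

```python
# Sketch of resolving a "where am I" pointer in a meta-repository, in the
# spirit of BagIt's fetch mechanism. The pointer file name and format are
# hypothetical; nothing here is defined by the OCFL spec.
import os

def resolve_object(object_root):
    """Return the real location of an object: the target named in the
    pointer file if one is present, otherwise the local object root."""
    pointer = os.path.join(object_root, "whereami.txt")
    if os.path.exists(pointer):
        with open(pointer) as f:
            # Single line naming where the object actually resides,
            # e.g. another mount point or a remote store.
            return f.read().strip()
    return object_root
```

A repository API would call this before any read or version operation, then work against the resolved location "over there".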

Deciding where to write your next object is obviously out of scope for OCFL but we'd like to work with OCFL as seamlessly as possible.

@rosy1280
Contributor Author

@ptsefton's comment is now its own ticket: OCFL/Use-Cases#43

We suspect that @phochste's comment was addressed in v1. We know @rosy1280's comment was about replication of objects, which we have confirmed as out of scope. We are unclear whether @ntallman's comment was also about replication of objects.

@phochste and @ntallman, if you feel your issue has not been addressed, please open a new ticket.


8 participants