
OCFL Object Forking #44

Open
lnielsen opened this issue Jun 12, 2019 · 24 comments
Labels
Component: Specification Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes.

Comments

@lnielsen

lnielsen commented Jun 12, 2019

In Zenodo we have a use case with two layers of versioning. A user can publish a dataset on Zenodo, which will get a DOI. The user can then publish a new version of the dataset, which will get a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we need to change the files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it 2 months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call

  • Versioning (each version gets a new DOI - at the repository level each version is a separate record)
  • Revisions (edits to a single version - at the repository level this is a single record).

In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may add only 1GB to a 100TB dataset.

The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because an OCFL object only supports deduplication within itself, not between OCFL objects, and OCFL does not allow symlinks, we cannot do this deduplication.

Example

Imagine these actions:

  1. Publish first version 10.5281/zenodo.1234 with two very large (let's just say 100TB to exaggerate) files: data-01.zip and mishap.zip
  2. Publish new version 10.5281/zenodo.4321 with one new file: data-02.zip (the files are thus: data-01.zip and data-02.zip).
  3. Remove mishap.zip from 10.5281/zenodo.1234

The OCFL objects would be:

[10.5281/zenodo.1234]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    │       ├── data-01.zip
    │       └── mishap.zip
    └── v2
        ├── inventory.json
        ├── inventory.json.sha512
        └── content


[10.5281/zenodo.4321]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    └── v1
        ├── inventory.json
        ├── inventory.json.sha512
        └── content
            ├── data-01.zip (duplicated 100TB of data!!!)
            └── data-02.zip

What I would like is to not have to duplicate data-01.zip in the 10.5281/zenodo.4321 OCFL object.

Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?

@ahankinson
Contributor

Depending on the underlying storage system you have in place, your disks may be doing this deduplication transparently.

How do you handle this case at the moment?

@lnielsen
Author

lnielsen commented Jun 12, 2019

Our storage system doesn't handle it (it's http://eos.web.cern.ch with some 400PB of disk space). Essentially, if e.g. hard links were allowed, a system operating on the OCFL objects probably wouldn't even know that the data is deduplicated.

The problem lies either with the requirement not to use hard links:

Hard and soft (symbolic) links are not portable and must not be used within 
OCFL Storage hierarchies. A common use case for links is storage deduplication. 
OCFL inventories provide a portable method of achieving the same effect by using 
digests to address content.

or with the assumed linear versioning.
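The digest-addressed deduplication the quoted passage refers to can be sketched in a few lines. This is an illustrative model only, not an OCFL implementation: file contents and paths are hypothetical, and the inventory is reduced to a `manifest` (digest to stored content paths) plus per-version `state` maps.

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    return hashlib.sha512(data).hexdigest()

# Hypothetical file contents for two versions of one OCFL object.
v1_files = {"data-01.zip": b"big dataset", "mishap.zip": b"oops"}
v2_files = {"data-01.zip": b"big dataset", "data-02.zip": b"more data"}

manifest = {}   # digest -> content paths actually stored on disk
versions = {}   # version name -> logical state (digest -> logical paths)

for vname, files in (("v1", v1_files), ("v2", v2_files)):
    state = {}
    for path, data in files.items():
        digest = sha512_hex(data)
        state.setdefault(digest, []).append(path)
        if digest not in manifest:
            # Only store bytes the object has never seen before.
            manifest[digest] = [f"{vname}/content/{path}"]
    versions[vname] = {"state": state}

# data-01.zip is stored once, under v1/content/, even though both
# version states reference its digest.
assert len(manifest) == 3
```

This is exactly why dedup works within one object but not across objects: the manifest lookup has no way to see another object's inventory.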

Note, I've been discussing this with @neilsjefferies IRL as well.

@zimeon
Contributor

zimeon commented Jun 12, 2019

I think this is a special case of reference to a file/datastream external to an OCFL object. This is partly discussed in #27 but I created #35 to separate out the idea of an external file. IMO this is out-of-scope for v1 but we should revisit when considering scope of v2.

@lnielsen
Author

A reference to a file/datastream in another OCFL object could solve the issue. My general thinking here is that a reference to a file/datastream anywhere is not a good idea; instead it should be constrained to the OCFL storage root.

I fully understand that you want to get v1 out the door. Just know that this is kind of a show stopper for using OCFL for us, so a quick v2 release afterwards would be much appreciated. We have 1.4 million OCFL objects and 300TB of data to write, so I'd prefer not having to rewrite them :-) Obviously, I'm happy to help out, in case there's anything I can do to accelerate it.

@awoods
Member

awoods commented Jul 5, 2019

Thanks, @lnielsen.
Taking a step back, for clarification, what is the rationale for your decision of:

The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object.

It is conceivable that separate versions of a single OCFL Object could have their own DOIs.

@lnielsen
Author

lnielsen commented Jul 9, 2019

@awoods It's related to the two levels of versioning that I call versioning and revisions, and the fact that they can happen in different sequences (e.g. v1.0, v2.0, v1.1 or v1.0, v1.1, v2.0).

I'll try to see if I can give a clear example 😄 and of course don't hesitate to let me know if there's something obvious that I just haven't seen.

If I change my initial example to use a single OCFL object it would look like this (after the three actions):

[multi-doi-object]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1               # 10.5281/zenodo.1234
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    │       ├── data-01.zip
    │       └── mishap.zip
    ├── v2               # 10.5281/zenodo.4321
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    │       └── data-02.zip
    └── v3               # 10.5281/zenodo.1234
        ├── inventory.json
        ├── inventory.json.sha512
        └── content
So far so good. I've managed to represent the changes in an OCFL object.

Now let's switch the order of actions from 1, 2, 3 to 1, 3, 2. My OCFL object would instead look like this:

[multi-doi-object]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1               # 10.5281/zenodo.1234
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    │       ├── data-01.zip
    │       └── mishap.zip
    ├── v2               # 10.5281/zenodo.1234
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    └── v3               # 10.5281/zenodo.4321
        ├── inventory.json
        ├── inventory.json.sha512
        └── content
            └── data-02.zip

So far so good as well. I've achieved deduplication of the big file.

The problem I see with this structure is that it's non-trivial/non-intuitive to find the latest state of a specific DOI, and thus it requires interpretation on top of OCFL in order to be understandable. The reason for using OCFL in the first place is to have a self-evident structure that requires no knowledge other than OCFL.

Similarly, I could also imagine hacks to make things work, like writing a completely new OCFL object and deleting the old one. But then performance would be an issue.

@julianmorley
Contributor

julianmorley commented Aug 6, 2019

Hi @lnielsen! We have this issue at Stanford ("In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may add only 1GB to a 100TB dataset.") and don't have a perfect solution, but have approached it in two ways:

  1. If a user inadvertently accessions personal info in an object, we have to purge the entire object from SDR and re-accession it with the same identifier and cleaned content. It's a pain to do (deletes are hard by design!) but it's the only way to truly purge sensitive data from an existing object.

  2. For incremental additions to large datasets, we try to break the dataset into smaller logical pieces (still in zip files, but not one big zip for the entire dataset). This also requires some curatorial intervention, but we've found that it provides a slightly better user experience, especially for downloading the dataset. It also increases the chance that future dataset changes impact only a handful of prior zips (or maybe even none at all!), allowing us to leverage the incremental diff feature of Moab (which OCFL also implements).

@neilsjefferies
Member

Copied from use-cases... general musing, so not completely thought out.

I can imagine a minor modification to the inventory that adds "inherits from ObjectID" type sections to the manifest. The digests that follow identify paths in other OCFL object(s). Other than that, nothing else needs to change. When copying an object, parsing the manifest tells you which additional objects it has dependencies on. It would permit version forking and inter-object deduplication. This does mean that if object versions are not stored as single units then each version has a new ID - this is not necessarily a bad thing.

...this might also be adapted to include "Inherits from external_storage_path" in some form.

@awoods awoods transferred this issue from OCFL/spec Sep 22, 2023
@neilsjefferies neilsjefferies added Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes. Component: Specification labels Sep 22, 2023
@rosy1280 rosy1280 added Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. and removed Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes. labels Sep 22, 2023
@rosy1280 rosy1280 changed the title Deduplication between OCFL objects OCFL Object Forking Sep 22, 2023
@neilsjefferies
Member

  • In spec, must be explicit that a newly created object is “inheriting” from an existing object within the same storage root (thus while we violate completeness for an object, we will still have completeness for a storage root)
    -- However, if in v2, we come up with a mechanism for multiple storage roots making one repo, we should support inheritance from any of the roots making up the repo (which might mean one has a general root reference mechanism.... But with a dire warning to be prudent)
  • We do not want to support an OCFL object referencing files from across multiple other objects. This is to prevent validation loops. Thus:
    -- Inheritance can only happen when a new object is instantiated, an entire manifest block of the source object is inherited
    -- We will only support inheritance from a single object
    -- The new object must clearly reference the version of the original object from which it is inheriting
  • Implementation notes:
    -- Validation failure if original object subsequently is purged
    -- Validation of new object must also validate the original (parent) object. This is a reason why we have the same storage root requirement.
    -- validator must have guard rails to catch loops.
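The guard rails sketched in the bullets above (single parent, purge detection, loop protection) could look roughly like the following. This is a hypothetical model, not an OCFL API: `parents` maps each object id to the id of the single object it inherits from (or `None`), as the "only support inheritance from a single object" rule implies; a real validator would read this from each object's inventory.

```python
def validate_inheritance(obj_id, parents, max_depth=1):
    """Follow inheritance links from obj_id, failing on loops, on chains
    deeper than max_depth, and on purged (missing) parent objects."""
    seen = {obj_id}
    current, depth = obj_id, 0
    while parents.get(current) is not None:
        current = parents[current]
        depth += 1
        if current in seen:
            raise ValueError(f"inheritance loop at {current}")
        if depth > max_depth:
            raise ValueError("inheritance chain deeper than one level")
        if current not in parents:
            # The parent id is unknown to the storage root: it was purged.
            raise ValueError(f"parent object {current} has been purged")
        seen.add(current)
    return depth

# A child inheriting from one intact parent validates with depth 1.
assert validate_inheritance("child", {"child": "parent", "parent": None}) == 1
```

The same-storage-root requirement is what makes the purge check possible at all: the validator can only enumerate `parents` for objects it can see.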

@rosy1280
Contributor

Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as Proposed: In Scope for version 2. You can contribute to the poll for this use case by reacting to this comment. The following reactions are supported:

  • 👍🏼 In favor of the use case
  • 👎🏼 Against the use case
  • 👀 Neutral on the use case

The poll will remain open through the end of February 2024.

@srerickson

srerickson commented Oct 31, 2023

This could be quite complicated if #42 (file deletion) also makes it into v2. Implementations would need to handle (or prevent) deletion of inherited files in the parent. Bi-directional references (both child-to-parent and parent-to-child) would make it easier to understand the downstream consequences of a file deletion.

@je4

je4 commented Nov 4, 2023

Just adding a new key "inherits", containing a list of object ids including versions, to the basic inventory structure should not be problematic and won't interfere with any other features. On the same level, there could be a "deprecates" key too.

@rosy1280
Contributor

At the time of this comment the vote tallied to +3. Confirming this as in scope for version 2 -- of course how to do that is still a question.

@rosy1280 rosy1280 added Confirmed: In-scope Use case will be included in the upcoming version of the spec or implementation notes. and removed Proposed: In-Scope Use case is up for discussion and may change the spec, implementation notes, or become an extension. labels Feb 29, 2024
@zimeon zimeon added this to the Supported in v2.0 milestone Feb 29, 2024
@rosy1280
Contributor

Object Forking Notes (File Inheritance)

These notes reference the Object Forking Use Case, which is Use Case 44. The use case is supported via content-addressable storage. This introduces the concepts of parent (the original object) and child (the object that is forked from the original object).

  • We support this by inserting one or more pointers to one or more files in one or more parent objects. This is placed in the manifest block of the inheriting child object.
  • The version state block lists the logical path as normal, allowing users to change the file name when inherited from a parent.
  • A child object can inherit arbitrary files from multiple parent objects; it's not limited to the set of files of a single version from a single parent object. This is an implementation detail.
  • However, an implementer may choose to limit this feature to all files in a specific version of a single parent object, if desired. This is also an implementation detail.
  • A child object can only inherit files from parent objects in the same storage root. OCFL has no mechanism for referencing files outside of the current storage root.
  • Inherited files cannot be included in the child's fixity block, and the verifier must look up the parent object.
  • The child object must use the same "digestAlgorithm" as all parent objects.
  • File inheritance MUST NOT inherit a file from a grandparent. i.e., the act of creating a file link involves verifying with the parent object that the file exists in that object, and is not itself a pointer to another object's file.
    • There is no benefit to inheriting a file from a grandparent, it only creates complexity and the specification aims for simplicity.
    • To prevent recursion loops, validators must only check to one level of recursion when validating any object.

When a parent object is deleted:

  • In a storage root that supports file inheritance a flag MUST be placed in the ocfl_layout.json file.
  • If you delete an object, you MUST check whether another object inherits files from that object. Implementation notes will address how to do this.
  • We will create an extension as part of version 2 allowing you to document the child objects that depend on files in the parent object.
    • Verification of child objects will fail with a descriptive error (parent object no longer exists).

When a referenced file is deleted in a parent object:

  • Tombstoning will be propagated via the verification process of the child object (i.e., the file has been deleted in parent object).
  • A soft delete or rename in the parent object does not impact the child object in any way, as the original bitstream remains on disk in the parent's content directory and referenced in the parent's inventory.
  • A child object is invalid if its current state block references a deleted file in a parent object.
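The deletion rules above can be sketched as a validity check on a child version state. This is a hypothetical model of the proposal, not spec text: `parent_tombstones` stands in for a #42-style record of digests whose bitstreams were purged from the parent; soft deletes and renames in the parent never appear in it, since those leave the bitstream and manifest entry intact.

```python
# Hypothetical tombstone record for a parent object: digests of files
# whose bitstreams have been purged (a #42-style hard deletion).
parent_tombstones = {"ffccf6...62e"}

def child_state_valid(state, tombstones):
    """A child version state is invalid if it references a digest the
    parent has tombstoned; soft deletes/renames in the parent are
    invisible here, so they never invalidate the child."""
    return all(digest not in tombstones for digest in state)

# A state inheriting an intact file is fine; one inheriting a purged
# file is flagged during verification of the child.
assert child_state_valid({"7dcc35...c31": ["foo/bar.xml"]}, parent_tombstones)
assert not child_state_valid({"ffccf6...62e": ["image.tiff"]}, parent_tombstones)
```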

Question:

  • Should the tombstones get placed in the inventory.json of the child object?
  • Or does the implementation notes address the use of tombstoning in a parent as it may make a child object invalid?

When a file is corrupted in a parent object:

  • The verification process should flag it the same as in the parent object (i.e. the file is corrupted in the parent object).

A full inventory.json example of file inheritance

{
  "digestAlgorithm": "sha512",
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "manifest": {
    "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
    "7dcc35...c31": [ { "objectid": "ark:/67890/fgh123" } ],
    "df83e1...a3e": [ { "objectid": "ark:/67890/fgh123" } ],
    "ffccf6...62e": [ { "objectid": "ark:/67890/fgh123" } ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import. bar.xml, bigdata.dat and image.tiff are inherited from a parent object.",
      "state": {
        "7dcc35...c31": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml replacing import with a local edit, remove image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff",
      "state": {
        "4d27c8...b53": [ "foo/bar.xml" ],
        "df83e1...a3e": [ "bigdata.dat" ],
        "ffccf6...62e": [ "image.tiff" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Cecilia"
      }
    }
  }
}
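A reader resolving a digest against the inventory above would need one extra step for the `objectid` entries. The following sketch assumes the pointer shape shown in the example and the no-grandparents rule from the notes; the parent inventory and shortened digests are illustrative only.

```python
# Minimal inventories keyed by object id, mirroring the example above
# (digests shortened; only the fields needed for resolution are shown).
parent_inv = {
    "id": "ark:/67890/fgh123",
    "manifest": {"7dcc35...c31": ["v1/content/foo/bar.xml"]},
}
child_inv = {
    "id": "ark:/12345/bcd987",
    "manifest": {
        "4d27c8...b53": ["v2/content/foo/bar.xml"],
        "7dcc35...c31": [{"objectid": "ark:/67890/fgh123"}],
    },
}
objects = {inv["id"]: inv for inv in (parent_inv, child_inv)}

def resolve(inv, digest):
    """Return (owning object id, content path) for a digest, following
    at most one inheritance hop, per the no-grandparents rule."""
    entry = inv["manifest"][digest][0]
    if isinstance(entry, dict):                  # pointer to a parent object
        parent = objects[entry["objectid"]]
        target = parent["manifest"][digest][0]
        if isinstance(target, dict):
            raise ValueError("grandparent inheritance is not allowed")
        return parent["id"], target
    return inv["id"], entry

assert resolve(child_inv, "7dcc35...c31") == (
    "ark:/67890/fgh123", "v1/content/foo/bar.xml")
```

Note how the union type in the manifest values forces every consumer into the `isinstance` branch; this is the implementation burden @je4 raises below.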

@je4

je4 commented Sep 20, 2024

Would it be a solution to change the manifest definition from

The value for each key MUST be an array containing the content paths of files in the OCFL Object that have content with the given digest

to

The value for each key MUST be an array containing the content paths of files in the OCFL Object that have content with the given digest, or a URI which refers to exactly one object.

This would mean that there's just a URI check (a colon in the string) needed to figure out whether the file is inside the OCFL object or remote.
Having a union as a value is quite hard to implement.
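The colon check suggested above is a one-liner. Note one assumption this sketch makes explicit: it only works if content paths themselves can never contain a colon, since any colon is taken as a URI scheme separator.

```python
def is_object_ref(entry: str) -> bool:
    """Per the suggestion above: a manifest value containing a colon is
    treated as a URI referring to another object; anything else is a
    content path inside this object. Assumes content paths are
    colon-free."""
    return ":" in entry

assert is_object_ref("ark:/67890/fgh123")
assert not is_object_ref("v2/content/foo/bar.xml")
```

The appeal is that manifest values stay plain strings, so existing parsers keep working and only resolution code needs the branch.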

@srerickson

srerickson commented Sep 20, 2024

I agree with @je4: my preference would be to avoid designs where a schema value can have more than one possible type (i.e., string or JSON object). Besides the suggestion from @je4 above, another approach would be to define manifest values as objects like:

{
  "4d27c8...b53": { "paths": ["v2/content/foo/bar.xml"] },
  "7dcc35...c31": { "id": "ark:/67890/fgh123" },
  "df83e1...a3e": { "id": "ark:/67890/fgh123" },
  "ffccf6...62e": { "id": "ark:/67890/fgh123" }
}

@srerickson

Yet another approach:

...
"manifest": {
  "4d27c8...b53": ["v2/content/foo/bar.xml"],
},
"refs": {
  "7dcc35...c31": "ark:/67890/fgh123",
  "df83e1...a3e": "ark:/67890/fgh123",
  "ffccf6...62e": "ark:/67890/fgh123"
}
...

The idea here is to add a new key in the inventory (e.g., refs) for references to other objects. Digests in the version state must be included in either the manifest or the refs block.
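The rule that every state digest must resolve via either block can be checked mechanically. A hypothetical sketch of that validation, using the `refs` shape proposed above (shortened digests, minimal inventory fields):

```python
# A minimal inventory with the proposed refs block alongside the
# unchanged manifest block.
inventory = {
    "manifest": {"4d27c8...b53": ["v2/content/foo/bar.xml"]},
    "refs": {
        "7dcc35...c31": "ark:/67890/fgh123",
        "df83e1...a3e": "ark:/67890/fgh123",
    },
    "versions": {
        "v1": {"state": {"7dcc35...c31": ["foo/bar.xml"],
                         "df83e1...a3e": ["bigdata.dat"]}},
        "v2": {"state": {"4d27c8...b53": ["foo/bar.xml"],
                         "df83e1...a3e": ["bigdata.dat"]}},
    },
}

def check_state_digests(inv):
    """Every digest used in a version state must be resolvable through
    either the manifest or the proposed refs block."""
    known = set(inv["manifest"]) | set(inv.get("refs", {}))
    for vname, version in inv["versions"].items():
        for digest in version["state"]:
            if digest not in known:
                raise ValueError(f"{vname}: unresolvable digest {digest}")

check_state_digests(inventory)  # the sketch above passes
```

Because both blocks keep the digest-keyed shape, this check is a near-trivial extension of the existing manifest/state consistency rule.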

@srerickson

After some more thought, I think I prefer an approach where the structure and semantics of the manifest block don't change in OCFL v2. Instead, references to external content can be tracked in a separate block (e.g., refs), as illustrated in my previous comment.

The current spec uses the "digest to array of paths" mapping in three places (manifest, version state, and fixity), and I think it's a design strength that the same pattern is repeated in multiple places. Changing the manifest block would be a step in the wrong direction, I feel.

@awoods
Member

awoods commented Dec 13, 2024

Thanks, @srerickson . I appreciate that perspective.

@neilsjefferies
Member

The current spec uses the "digest to array of paths" mapping in three places (manifest, version state, and fixity), and I think it's a design strength that the same pattern is repeated in multiple places. Changing the manifest block would be a step in the wrong direction, I feel.

I like this - also, the presence of a refs block clearly signposts that this feature is in use, which is a similar pattern to the presence of a tombstones block proposed for #42 and #14. Both indicate that the OCFL object is "not complete" in some sense.

@je4

je4 commented Dec 16, 2024

I am not completely sure what the target of the PID will be. For files, using a digest works perfectly. But if it's another OCFL object, then multiple inventories have to be merged. Perhaps I missed something...

@srerickson

@je4 the targets for the digests in the refs block would be other objects in the storage root. So, yes, in a sense the inventories would be merged. The spec might read something like:

Each digest in the refs block MUST be found in the root inventory manifest of the referenced object.

@je4

je4 commented Dec 17, 2024

I think this is feasible.

3.5.1
digestAlgorithm: [...] This MUST be the algorithm used in the manifest, state and refs blocks, see the section on Digests for more information about algorithms.

3.5.3.1 Version
state: [...] The keys of this JSON object are digest values, each of which MUST exactly match a digest value key in the manifest or refs of the inventory. [...]

Furthermore:

Reference chain MUST NOT be cyclic

Each digest in the refs block MUST be found among the root inventory manifest keys or fixity values of the referenced object.

These two would enable upgrade paths to a new digestAlgorithm without breaking references.

Should it be refs or references?

@zimeon
Contributor

zimeon commented Jan 9, 2025

I note that there is some related prior work in NIST's "multibag" specification.
