OCFL Object Forking #44
Depending on the underlying storage system you have in place, your disks may be doing this deduplication transparently. How do you handle this case at the moment?
Our storage system doesn't handle it (it's http://eos.web.cern.ch with some 400PB of disk space). Essentially, if e.g. hard links were allowed, a system operating on the OCFL objects probably wouldn't even know that it's deduplicated. The problem is either with the requirement on not using hard links, or with the assumed linear versioning. Note, I've been discussing this with @neilsjefferies IRL as well.
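For illustration, this is the transparent deduplication hard links would give. A minimal sketch, not specific to any storage backend; the object layout and paths are invented:

```python
import os

# Two hypothetical OCFL-like object trees sharing one payload file.
os.makedirs("obj-a/v1/content", exist_ok=True)
os.makedirs("obj-b/v1/content", exist_ok=True)

with open("obj-a/v1/content/bigdata.dat", "wb") as f:
    f.write(b"x" * 1024)

# The second object gets a directory entry, not a second copy:
# both names point at the same inode on disk.
os.link("obj-a/v1/content/bigdata.dat", "obj-b/v1/content/bigdata.dat")

a = os.stat("obj-a/v1/content/bigdata.dat")
b = os.stat("obj-b/v1/content/bigdata.dat")
assert a.st_ino == b.st_ino  # same inode: one physical copy
assert a.st_nlink == 2       # two names for it
```

A reader walking either tree sees an ordinary file, which is exactly why a system operating on the objects could not tell they are deduplicated.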
A reference to file/datastream in another OCFL object could solve the issue. My general thinking here is that a reference to file/datastream anywhere is not a good idea, but that instead it should be constrained to the OCFL storage root. I fully understand that you want to get v1 out the door. Just know that this is kind of a showstopper for using OCFL for us, so a quick v2 release afterwards would be much appreciated. We have 1.4 million OCFL objects and 300TB of data to write, so I'd prefer not having to rewrite them :-) Obviously, I'm happy to help out, in case there's anything I can do to accelerate it.
Thanks, @lnielsen.
It is conceivable that separate versions of a single OCFL Object could have their own DOIs.
@awoods It's related to the two levels of versioning that I call versioning and revisions, and to the fact that they can happen in different sequences (e.g. 1, 2, 3 vs. 1, 3, 2). I'll try to see if I can give a clear example 😄 and of course don't hesitate to let me know if there's something obvious that I just haven't seen. If I change my initial example to use a single OCFL object it would look like this (after the three actions):
So far so good. I've managed to represent the changes in an OCFL object. Now let's switch the order of actions from 1, 2, 3 to 1, 3, 2. My OCFL object would instead look like this:
So far so good as well. I've achieved deduplication of the big file. The problem I see with this structure is that it's non-trivial/non-intuitive to find the latest state of a specific DOI, and thus requires interpretation on top of OCFL in order to be understandable. The reason for using OCFL in the first place is to have a self-evident structure that requires no other knowledge than OCFL. Similarly, I could also imagine hacks to make things work, like writing a completely new OCFL object and deleting the old one. But then performance would be an issue.
Hi @lnielsen! We have this issue at Stanford ("In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.") and don't have a perfect solution, but have approached it in two ways:
Copied from use cases... general musing, so not completely thought out: I can imagine a minor modification to the inventory that adds "inherits from ObjectID" type sections to the manifest. The digests that follow identify paths in other OCFL object(s). Other than that, nothing else needs to change. When copying an object, parsing the manifest tells you which additional objects it has dependencies on. It would permit version forking and inter-object deduplication. This does mean that if object versions are not stored as single units then each version has a new ID - this is not necessarily a bad thing. This might also be adapted to include "inherits from external_storage_path" in some form.
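One way to picture this proposal. The section-key syntax below is invented purely for illustration (the comment above doesn't fix one); the point is that the dependencies fall out of a plain scan of the manifest keys:

```python
def dependencies(manifest: dict) -> set[str]:
    """Collect the object IDs a manifest inherits from."""
    prefix = "inherits from "
    return {key[len(prefix):] for key in manifest if key.startswith(prefix)}

# Hypothetical manifest: one local file plus a section of digests
# whose content lives at the named paths in another OCFL object.
manifest = {
    "4d27c8...b53": ["v2/content/foo/bar.xml"],
    "inherits from ark:/67890/fgh123": {
        "df83e1...a3e": ["v1/content/bigdata.dat"],
    },
}
print(dependencies(manifest))  # {'ark:/67890/fgh123'}
```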
Feedback on Use Cases

In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.

Polling on Use Cases

In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as ...

The poll will remain open through the end of February 2024.
This could be quite complicated if #42 (file deletion) also makes it into v2. Implementations would need to handle (or prevent) deletion of inherited files in the parent. Bi-directional references (both child-to-parent and parent-to-child) would make it easier to understand downstream consequences of a file deletion.
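A sketch of the bookkeeping such bi-directional references might enable; the index structure and names here are hypothetical, not anything proposed in the spec:

```python
def safe_to_delete(parent_id: str, digest: str, child_refs: dict) -> bool:
    """Allow deletion only if no child object still references the file.

    child_refs maps (parent_id, digest) -> set of child object IDs,
    i.e. the parent-to-child half of the bi-directional references.
    """
    return not child_refs.get((parent_id, digest), set())

child_refs = {("ark:/67890/fgh123", "df83e1...a3e"): {"ark:/12345/bcd987"}}
print(safe_to_delete("ark:/67890/fgh123", "df83e1...a3e", child_refs))  # False
```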
Just adding a new key "inherits", containing a list of object IDs including version, to the basic inventory structure should not be problematic and won't interfere with any other features. On the same level, there could be a "deprecates" key too.
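For illustration only, an inventory with these keys might look like the fragment below; the id-plus-version encoding is a guess, since the comment doesn't specify one:

```python
# Hypothetical inventory fragment (as a Python dict for brevity).
inventory = {
    "id": "ark:/12345/bcd987",
    "inherits": [{"id": "ark:/67890/fgh123", "version": "v2"}],
    "deprecates": [{"id": "ark:/12345/abc000", "version": "v1"}],
    # ...manifest, versions, etc. remain exactly as in OCFL 1.x...
}
```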
At the time of this comment the vote tallied to +3. Confirming this as in scope for version 2; how to do that is, of course, still a question.
Object Forking Notes (File Inheritance)

These notes reference the Object Forking Use Case, which is Use Case 44. The use case is supported via content-addressable storage. This introduces the concepts of a parent (the original object) and a child (the object that is forked from the original object).
When a parent object is deleted:
When a referenced file is deleted in a parent object:
Question:
When a file is corrupted in a parent object:
A full inventory.json example of file inheritance

```json
{
"digestAlgorithm": "sha512",
"head": "v3",
"id": "ark:/12345/bcd987",
"manifest": {
"4d27c8...b53": [ "v2/content/foo/bar.xml" ],
"7dcc35...c31": [ { "objectid": "ark:/67890/fgh123" } ],
"df83e1...a3e": [ { "objectid": "ark:/67890/fgh123" } ],
"ffccf6...62e": [ { "objectid": "ark:/67890/fgh123" } ]
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2018-01-01T01:01:01Z",
"message": "Initial import. bar.xml, bigdata.dat and image.tiff are inherited from a parent object.",
"state": {
"7dcc35...c31": [ "foo/bar.xml" ],
"df83e1...a3e": [ "bigdata.dat" ],
"ffccf6...62e": [ "image.tiff" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2018-02-02T02:02:02Z",
"message": "Fix bar.xml replacing import with a local edit, remove image.tiff",
"state": {
"4d27c8...b53": [ "foo/bar.xml" ],
"df83e1...a3e": [ "bigdata.dat" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Bob"
}
},
"v3": {
"created": "2018-03-03T03:03:03Z",
"message": "Reinstate image.tiff",
"state": {
"4d27c8...b53": [ "foo/bar.xml" ],
"df83e1...a3e": [ "bigdata.dat" ],
"ffccf6...62e": [ "image.tiff" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Cecilia"
}
}
}
}
```
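A sketch of how a client might resolve a logical path against an inventory shaped like the example above. This assumes exactly the structure shown (manifest entries are either content-path strings or `{"objectid": ...}` references); `load_inventory` is a hypothetical helper that fetches another object's inventory by ID, and the parent is assumed to hold the content locally rather than inheriting it in turn:

```python
def resolve(inventory: dict, version: str, logical_path: str, load_inventory):
    """Return (object_id, content_path) for a logical path in a version."""
    state = inventory["versions"][version]["state"]
    digest = next(d for d, paths in state.items() if logical_path in paths)
    entry = inventory["manifest"][digest]
    if isinstance(entry[0], str):
        # Ordinary case: content is stored in this object.
        return inventory["id"], entry[0]
    # Inherited case: follow the reference and look the same digest
    # up in the parent object's manifest.
    parent_id = entry[0]["objectid"]
    parent = load_inventory(parent_id)
    return parent_id, parent["manifest"][digest][0]
```

So for v1's `bigdata.dat` the lookup lands on digest `df83e1...a3e`, hits the `{"objectid": "ark:/67890/fgh123"}` entry, and finishes in that parent's manifest.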
Would it be a solution to change the manifest definition from
...
to
...
This would mean that there's just a URI check (a colon in the string) needed to figure out whether the file is inside the OCFL object or remote.
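A minimal sketch of that check; it is slightly stricter than "a colon anywhere" so that a content path which happens to contain a colon is not misread as a URI:

```python
def is_remote(value: str) -> bool:
    """True if the manifest value looks like a URI rather than a content path."""
    scheme, sep, _ = value.partition(":")
    return bool(sep) and "/" not in scheme  # colon present before any slash

print(is_remote("v2/content/foo/bar.xml"))  # False: local content path
print(is_remote("ark:/67890/fgh123"))       # True: URI, i.e. remote
```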
I agree with @je4: my preference would be to avoid designs where a schema value can have more than one possible type (i.e., string or JSON object). Besides the suggestion from @je4 above, another approach would be to define manifest values as objects like:

```json
{
"4d27c8...b53": { "paths": ["v2/content/foo/bar.xml"] },
"7dcc35...c31": { "id": "ark:/67890/fgh123" },
"df83e1...a3e": { "id": "ark:/67890/fgh123" },
"ffccf6...62e": { "id": "ark:/67890/fgh123" }
}
```
Yet another approach:

```
...
"manifest": {
"4d27c8...b53": ["v2/content/foo/bar.xml"],
},
"refs": {
"7dcc35...c31": "ark:/67890/fgh123",
"df83e1...a3e": "ark:/67890/fgh123",
"ffccf6...62e": "ark:/67890/fgh123"
}
...
```

The idea here is to add a new key in the inventory (e.g., refs) that maps each externally held digest to the ID of the OCFL object holding the content.
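A sketch of lookup under this proposal, assuming the shapes in the JSON above; `load_inventory` is again a hypothetical helper for fetching another object's inventory by ID:

```python
def locate(inventory: dict, digest: str, load_inventory):
    """Return (object_id, content_path) for a digest, following refs."""
    if digest in inventory.get("manifest", {}):
        # Content stored locally in this object.
        return inventory["id"], inventory["manifest"][digest][0]
    # Otherwise "refs" names the object that holds the content;
    # the same digest is then looked up in that object's manifest.
    source_id = inventory["refs"][digest]
    source = load_inventory(source_id)
    return source_id, source["manifest"][digest][0]
```

Note that this keeps the manifest's "digest to array of paths" shape untouched.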
After some more thought, I think I prefer an approach where the structure and semantics of the manifest stay unchanged. The current spec uses the "digest to array of paths" mapping in three places (manifest, version state, and fixity), and I think it's a design strength that the same pattern is repeated in multiple places. Changing the manifest block would be a step in the wrong direction, I feel.
Thanks, @srerickson. I appreciate that perspective.
I like this - also, the presence of a refs key signals at a glance that an object depends on content held in other objects.
I am not completely sure what the target of the PID will be. For files, using a digest works perfectly. But if it's another OCFL object then multiple inventories have to be merged. Perhaps I missed something...
@je4 the targets for the digests in the refs block are object IDs; the content is then found by looking up the same digest in the referenced object's own manifest, so no merging of inventories should be needed.
I think this is feasible.
Furthermore:
These two would enable upgrade paths to new ... Should it be ...
I note that there is some related prior work in NIST's "multibag" specification:
In Zenodo we have a use case where we have two layers of versioning. A user can publish a dataset on Zenodo, which will get a DOI. A new version of the dataset can be published by the user, which will get a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we have the need to change files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it 2 months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call versions and revisions.
In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.
The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because OCFL only supports deduplication within an OCFL object, and not between OCFL objects, nor does it allow symlinks, we cannot do this deduplication.
Example
Imagine these actions:

1. Publish a dataset (10.5281/zenodo.1234) with files data-01.zip and mishap.zip.
2. Publish a new version (10.5281/zenodo.4321) with file data-02.zip (files is thus: data-01.zip and data-02.zip).
3. Remove mishap.zip from 10.5281/zenodo.1234.

The OCFL objects would be:
What I would like is not having to duplicate data-01.zip in the 10.5281/zenodo.4321 OCFL object.

Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?
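For concreteness, a sketch of the version states the two objects end up with under this one-object-per-DOI mapping; the digests are invented and abbreviated. The same digest for data-01.zip appears in both objects, and since OCFL deduplicates only within an object, its bytes are stored twice:

```python
# Hypothetical version states (digest -> logical paths) per object.
zenodo_1234 = {
    "v1": {"aaa111...": ["data-01.zip"], "bbb222...": ["mishap.zip"]},
    "v2": {"aaa111...": ["data-01.zip"]},  # revision: mishap.zip removed
}
zenodo_4321 = {
    "v1": {"aaa111...": ["data-01.zip"], "ccc333...": ["data-02.zip"]},
}
```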