Package per version storage #33
This could be something along the lines of:
this still leaves three potentially small files per object (though |
potentially a sub-use case of #39 |
Hello everybody! The National Library of Norway is in the process of installing a new bit repository (HPSS) that can hold 44 PB of data. In this context, we are considering using OCFL to organize our data packages. So far, OCFL looks very good, but we are dependent on ZIP per version storage #33 being resolved to be able to use OCFL. This is because we want to limit the number of files so that it becomes more efficient to store/retrieve data from HPSS. I reckon this needs to be solved using an object extension? Do you have any thoughts on how this can be implemented? |
We have begun to think about how this can be implemented based on our needs. This is a very immature first proposal for a new object extension. We would like to discuss the following:
Arguments for allowing more than one file for each version:
What are your initial thoughts?
Example content of archived-versions.json:
{
"id": "zipped_updates_three_versions_one_file",
"versions": {
"v1": {
"created": "2019-01-01T02:03:04.000Z",
"archiveAlgorithm": {
"mime": "application/zip",
"pronomId": "x-fmt/263"
},
"digestAlgorithm": "sha512",
"files": {
"0675bdf376e92e9994612c33ea255b12f7": {
"filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-1.zip",
"fileSize": 133410430
},
"0675b1ff76e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-2.zip",
"fileSize": 520430330
},
"067ab1f376e92e9994612c33ea255b12f7": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v1/v1-3.zip",
"fileSize": 8353634100
}
}
},
"v2": {
"created": "2020-02-02T02:03:04.000Z",
"archiveAlgorithm": {
"mime": "application/zip",
"pronomId": "x-fmt/263"
},
"digestAlgorithm": "sha512",
"files": {
"5b23ffdf2709bf393a7d8883fcdf583980": {
"filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v2/v2-1.zip",
"fileSize": 42644244
}
}
},
"v3": {
"created": "2021-03-03T02:03:04.000Z",
"archiveAlgorithm": {
"mime": "application/zip",
"pronomId": "x-fmt/263"
},
"digestAlgorithm": "sha512",
"files": {
"88492082026f1a3a1c0637d6bd02214dd6": {
"filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v3/v3-1.zip",
"fileSize": 8743244
},
"3a1c0637d6bd02214dd62c5c19ee8d4bbf": {
          "filePath": "/hpss/storage-root-01/1ec1/8fe/5cd/zipped_updates_three_versions_one_file/v3/v3-2.zip",
"fileSize": 892345
}
}
}
}
} |
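A minimal Python sketch of how a client might check package files against a document shaped like the example above. It assumes every entry stores its location under `filePath` (as the first entry in the example does), that the key of each entry is the hex digest, and that `digestAlgorithm` is a name `hashlib` understands. The function names are hypothetical and not part of any OCFL specification or extension.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in chunks so large packages need not fit in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_archived_versions(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means every package matches."""
    problems = []
    for version, info in doc["versions"].items():
        algorithm = info["digestAlgorithm"]
        for digest, entry in info["files"].items():
            path = Path(entry["filePath"])  # assumed field name
            if not path.is_file():
                problems.append(f"{version}: missing {path}")
                continue
            # Compare the cheap size first, then the expensive digest.
            if path.stat().st_size != entry["fileSize"]:
                problems.append(f"{version}: size mismatch for {path}")
            elif file_digest(path, algorithm) != digest:
                problems.append(f"{version}: digest mismatch for {path}")
    return problems
```

Checking the size before the digest avoids reading a large package from tape when the mismatch is already visible from its metadata.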
I support the idea that any solution for packaged content should include support for multiple packages in a version ( I think the biggest question is where one describes the logical files vs the physical files (packages). I lean toward having the inventory describe the physical files and thus providing the infrastructure for preservation/fixity/transfer, and then create some new way to describe the logical object content in a way that doesn't make those other processes too cumbersome in the case of objects with large numbers of files. This would potentially mean significant changes in the |
If the spec adds support for zipped versions, does it necessarily need to make special mention of split zips, which are already part of the zip spec? |
I'd lean towards 'yes', based on our experiences doing something similar with Preservation Catalog. At the end of the day OCFL tracks files and their checksums. It doesn't know, for example, that a .zip file contains information that points to other zip segments, and we want a human reading the manifest to be able to see that the version directory should contain 10 files ( My early guess is that, in OCFL v2, we'll expand |
I just want to point out that we at NLN do not necessarily want to use split-zips to package small files. We may choose to package them in independent individual zip files. Then they are perhaps a little less prone to problems if one of the zip files should become corrupt. For splitting very large files, split-zips may be appropriate. I therefore see it as an advantage if we do not lock the specification to only support split-zips. |
We'll be sure to not mandate split-zips. We (Stanford) only split on versions greater than 10GB in our (non-OCFL) implementation of archival objects. Anything less than that goes into a single zip file. We'll probably include a way to specify a per-repo or per-object size at which the object-version would be split into multiple zips. |
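As an illustration of a size threshold, here is a rough Python sketch that groups a version's files into independent packages (the NLN preference described above) rather than split-zip segments. The function name `plan_packages` and the greedy grouping rule are assumptions made for this example; nothing here is prescribed by OCFL or taken from the Stanford implementation.

```python
def plan_packages(file_sizes: dict[str, int], max_bytes: int) -> list[list[str]]:
    """Group files into packages whose total size stays at or below max_bytes.

    Files are taken in sorted path order. A single file larger than
    max_bytes ends up in a package of its own.
    """
    packages: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for path, size in sorted(file_sizes.items()):
        # Close the current package if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            packages.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        packages.append(current)
    return packages
```

A version smaller than the threshold yields a single package, which matches the "anything less goes into a single zip file" behaviour described in the comment.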
+1 from the Dataverse community. We're using Bags (1 per version, versions created and archived independently over time) today and are interested in OCFL as a way to reduce storage size (via deduplication/forward deltas) but we'd like to retain the write-only, ~one-file-per-version paradigm we have today. I think that is this use case, although the archived-versions.json file discussed above, where info about all versions is one file, would not be write-only (when versions are added over time.) |
@qqmyers I think we can have an analogous mechanism to the way we treat inventories. Each version could contain a (by definition write-only) copy of the archived-versions.json but there is a separate copy elsewhere that contains the current state. |
Editors' discussion 2023-09-22:
|
I think this suggestion could be really good, and solve how our organization can use OCFL. So to be sure - is the new suggested block at top level or at version level? I have made a proposal where the new package block is at the version level. The only drawback I can think of is that it is only possible to have one checksum for each package file. But that might not be a problem. So, using the example from the OCFL specification:
The same object packed with TAR:
v1.tar will unpack to:
Example of inventory.json with new packageManifest blocks added:
|
This variant could be a bit problematic, because the inventory.json of the latest version (if available) MUST be the same as the inventory.json in the object root (https://ocfl.io/1.1/spec/#version-inventory). A solution could be to remove the MUST from the standard, or to pack only the content folder of the version, which means that all inventory.json files are aware of the package. |
I'm coming round to the idea of a separate package-inventory.json file. Then we can decide to zip or unzip a version at any time without having a new inventory.json. Its presence/absence would also be an easy indicator of the existence of packaged versions. |
My original thought was to create this as an extension, as I suggested with the archived-versions.json file. I think including this as part of the standard implementation is even better. What are your thoughts on expansion or including it in the standard implementation @neilsjefferies ? |
@ThomasEdvardsen Editors decided there was enough interest and use cases that this was in scope for OCFL V2 discussions. |
Feedback on Use Cases
In advance of version 2 of the OCFL, we are soliciting feedback on use cases. Please feel free to add your thoughts on this use case via the comments.
Polling on Use Cases
In addition to reviewing comments, we are doing an informal poll for each use case that has been tagged as
The poll will remain open through the end of February 2024. |
We (@ThomasEdvardsen, @je4, and I) have worked on a set of proposals for this use case, along with some questions. You can find them here: |
2024-02-29 Editors agree that this should be in-scope for v2. Voting at this point is +9 in favor. |
Package Per Version Notes
These notes reference the Package Per Version Use Case, which is Use Case #33. It potentially addresses the issue of lots of small files as well as splitting large files.
Package characteristics
Questions
packages.json file
Implementation Notes
packages.json example
This strategy replicates the manifest block of the
{
"digestAlgorithm": "sha512",
"type": "https://ocfl.io/1.1/spec/#packages",
"manifest": {
"abc..123": [ "v1/v1.zip" ],
"cde..123": [ "v3/v3.z01" ],
"ade..789": [ "v3/v3.z02" ],
"ces..229": [ "v3/v3.zip" ]
},
"versions": {
"v1": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["v1/v1.zip"]
},
"v3": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["v3/v3.z01", "v3/v3.z02", "v3/v3.zip"]
}
  }
}
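A small Python sketch of one consistency check a validator could run on a packages.json shaped like the example above: every path listed under a version's `packages` should have a digest in `manifest`, and every manifest path should be referenced by some version. The function name is hypothetical, and the check is an assumption about what validation might involve, not a rule from the proposal.

```python
def check_packages_json(doc: dict) -> list[str]:
    """Cross-check the manifest block against the per-version package lists."""
    manifest_paths = {p for paths in doc["manifest"].values() for p in paths}
    referenced = {p for v in doc["versions"].values() for p in v["packages"]}
    # A listed package without a digest cannot be fixity-checked.
    problems = [f"no digest for {p}" for p in sorted(referenced - manifest_paths)]
    # A digest for a package that no version lists is probably stale.
    problems += [f"unreferenced package {p}" for p in sorted(manifest_paths - referenced)]
    return problems
```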
The
|
I like the idea of having metadata about the packaging strategy within a We should think about how validation can stay hassle-free even if some parts of the object are packaged. Replacing "versions" with "folders" would follow the Folder as Package Proposal more closely and allow more flexibility.
{
"digestAlgorithm": "sha512",
"type": "https://ocfl.io/1.1/spec/#packages",
"manifest": {
"abc..123": [ "v1/content.zip" ],
"cde..123": [ "v3.z01" ],
"ade..789": [ "v3.z02" ],
"ces..229": [ "v3.zip" ],
"tuv...375": [ "extensions.zip" ]
},
"folders": {
"extensions": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["tuv...375"]
},
"v1/content": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["abc..123"]
},
"v3": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["cde..123", "ade..789", "ces..229"]
}
  }
}
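In this variant the `packages` lists hold digests rather than paths, so a reader has to go through `manifest` to find the actual files. A minimal Python sketch of that lookup follows; the function name is hypothetical, and the document layout is assumed to match the example above.

```python
def packages_for_folder(doc: dict, folder: str) -> list[str]:
    """Return the package file paths that hold the given folder, in listed order."""
    paths = []
    for digest in doc["folders"][folder]["packages"]:
        if digest not in doc["manifest"]:
            raise KeyError(f"{folder}: digest {digest} is not in the manifest")
        # A manifest entry is a list, so one digest may map to several paths.
        paths.extend(doc["manifest"][digest])
    return paths
```

For the `v3` folder in the example this would give the two segment files followed by `v3.zip`, in the order they are listed.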
|
I don't have a preference, but it would be useful for the editors to address how/whether "packaging" should apply to the extensions and logs directories.
@je4 can you elaborate on this? I'm not sure I understand what you mean. |
My thought is that neither
|
I am using a thumbnail extension, which would write thumbnails into the extension area if there were a chance that extensions or extension subfolders could be packed. Since this is not yet clear, I have to write thumbnails into the content area. |
I am looking at these things from the software perspective, which means that I like things that have no exceptions and are easy to describe. If validation or add becomes much more complex and completely different when folders are packaged, it becomes problematic. The basic idea for an implementation would be to add a virtual filesystem that hides the packaged folders from the logic of the OCFL functions (validate, add, update, ...). To achieve this, the if we restrict packaging to version folders, there could be a future problem if other parts of the OCFL object must become part of the virtual storage layer. I could even think to rename if there are folders that must not be packaged, this could be mentioned within the standard instead of restricting the JSON file. |
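The virtual filesystem idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `packaged` mapping (folder to ZIP file) is hypothetical, entry names inside the ZIP are assumed to be relative to the packaged folder, and only single-file ZIPs are handled, not split segments.

```python
import zipfile
from pathlib import Path


def read_logical(object_root: Path, logical_path: str, packaged: dict[str, str]) -> bytes:
    """Read a file by its logical path, whether or not its folder is packaged.

    `packaged` maps a folder (e.g. "v1/content") to the ZIP that replaces it,
    both relative to object_root.
    """
    for folder, archive in packaged.items():
        prefix = folder.rstrip("/") + "/"
        if logical_path.startswith(prefix):
            # The folder is packaged: read the entry out of the archive.
            with zipfile.ZipFile(object_root / archive) as zf:
                return zf.read(logical_path[len(prefix):])
    # Not packaged: fall back to the ordinary file on disk.
    return (object_root / logical_path).read_bytes()
```

Callers such as a validator would use the same logical path in both cases, which is the point of hiding the packaging behind one layer.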
It has occurred to me that we need to think carefully about what happens when a package file is corrupted or deleted. It will interact with the proposed tombstoning mechanism in some way, and may need some indication in the package inventory. |
Indeed, this has to be addressed. The same procedure that would be applied to manifest.json if files are corrupt or not available will apply to packages.json. Therefore we should first decide on Support physical file-level deletion #42 and then apply the same strategy to packages.json
In cases where there are many small files in an object or where the storage infrastructure is not efficient at handling many files, it is useful to package files using a technology such as ZIP. This is addressed for the whole object in #10. However, packaging the whole object as a ZIP/Tar etc. breaks the idea of immutability of version data. One could instead package the inventory and content for each new version as a new ZIP file.
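A minimal sketch of the per-version approach using Python's standard `zipfile` module. The function name and the choice to store entries relative to the version directory are assumptions for illustration; the archive must be written outside the directory being packaged.

```python
import zipfile
from pathlib import Path


def package_version(version_dir: Path, archive_path: Path) -> list[str]:
    """Write every file under version_dir into a new ZIP; return the entry names.

    archive_path must lie outside version_dir, otherwise the archive would
    try to include itself.
    """
    names = []
    # Mode "x" refuses to overwrite an existing archive, in keeping with
    # the immutability of version data.
    with zipfile.ZipFile(archive_path, "x") as zf:
        for path in sorted(version_dir.rglob("*")):
            if path.is_file():
                name = path.relative_to(version_dir).as_posix()
                zf.write(path, arcname=name)
                names.append(name)
    return names
```

Each new version then adds one new archive and never touches the archives of earlier versions.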