-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Richer dataset manifests #865
Comments
Do you think we should add the uploaded date / time ? |
Yeah, if that's not tracked elsewhere already, sounds like a good idea. |
For the filename, I think we will need to pass it from the client side because we are using a binary string for the upload and it is not available anywhere else, unlike the multipart form data where the filename is available. For the MIME type, I couldn't find a way to retrieve it from nim-presto or any other source, so I think we should detect it from the magic numbers in the buffer. We can use the package available at https://github.com/jiro4989/filetype, or we can implement it ourselves if we prefer not to import an external package. It's not complicated; we would just need an exhaustive list of MIME types along with their associated magic numbers. What are your thoughts on this? @gmega @benbierens |
I had a chat with @benbierens to understand the implications of storing metadata in the manifest, particularly the upload timestamp. It seems that this will generate a new CID for the same data without affecting the tree CIDs. Here’s the scenario to ensure we’re on the same page: 1- A user uploads data X at time T1, which produces CID1. I’m not sure if this is fine, so I want to ensure we are aware of this and make a decision with this understanding. |
Yes that is true. IMHO, that is fine. Maybe it would be interesting to know if Codex is announcing the tree CID on DHT? @gmega do you know? It would effect the deduplication properties in the network if not. |
Codex announced the tree CID and the manifest CID. The tree CID would be identical for the same data, but the manifest would be different depending on upload timestamp. I'm not sure how I feel about this one. I think we have stuff built on the idea that uploading the same data always gives the same CID. This is true for the tree CID but not the manifest CID, which is (and has to be) the one returned from the upload call. We could consider returning the manifest JSON from the upload call.... What do you think? |
IMHO it is alright as long both "manifest CID" and "tree CID" are announced on the network and the user can access both of these through API, which I believe is the case thanks to Ben's latest manifest API - sooo I think we are good ;-) |
I know that we've already added this, but I strongly believe that the timestamp is unnecessary and generating a new manifest just because if was uploaded at a different is a bit excessive. I'm not following the reasoning for why we want this. |
I believe the primary reason was to enhance user experience (UX): each file has an associated timestamp, making it easy to sort the files and identify the most recently uploaded ones. |
We have added this some time ago, and I have been using it. Now I think I would propose we roll it back. I've been trying to think of a way to "attach" an upload-timestamp to a manifest without it being "in" the manifest. The problem is that such "extra data" can't be hash-verified and so it's probably not wise to transmit it via the p2p layer. People could send you garbage and you have no way to check it. So, then what's the use-case for it? |
Just a note, if you want that, you need to remote all the metadata, because you could have multiple upload with different name / content-type for the same data. |
Also to note that the data themselves (eq. the big piece of the dataset), will be actually deduplicated, it is only the the manifest that will be "duplicated". But I hear you that it would be nice to get the same CID when you upload the same file twice. I am also not sure if the benefit of having a timestamp overweights this benefit. |
I agree with Arnaud that changing any metadata, not just the upload timestamp, would affect the CID and probably shouldn't be included. I personally see metadata as a SHOULD HAVE and not necessarily MUST HAVE for the protocol to operate correctly. One option would be to store the metadata information locally in the repostore and have it exchanged along with manifest blocks. My initial guess is that the relationship between the metadata and manifest can be maintained by having metadata structure contain the manifest CID. The metadata structure itself could be hashed for integrity, allowing multiple metadata infos per dataset (imagine changing the filename in the UX would create a separate metadata structure but not a new manifest...). The metadata exchanged with the manifest would be the most recent one, or worst case, none at all (it's not a MUST HAVE). |
I like Eric's ideas here. I just worry that the fields of the metadata will open up some vectors for attack/abuse of the system. It feels like there's an opportunity for some serious software-designing here. :D |
As part of the improvements required for the Codex app, we want to store in the manifests:
The text was updated successfully, but these errors were encountered: