Richer dataset manifests #865

gmega · 2024-07-31T17:11:26Z

As part of the improvements required for the Codex app, we want to store in the manifests:

MIME type;
file name;
upload timestamp.

2-towns · 2024-08-01T15:57:32Z

Do you think we should add the uploaded date / time ?

gmega · 2024-08-01T16:30:50Z

Yeah, if that's not tracked elsewhere already, sounds like a good idea.

2-towns · 2024-10-16T16:05:21Z

For the filename, I think we will need to pass it from the client side because we are using a binary string for the upload and it is not available anywhere else, unlike the multipart form data where the filename is available.

For the MIME type, I couldn't find a way to retrieve it from nim-presto or any other source, so I think we should detect it from the magic numbers in the buffer. We can use the package available at https://github.com/jiro4989/filetype, or we can implement it ourselves if we prefer not to import an external package. It's not complicated; we would just need an exhaustive list of MIME types along with their associated magic numbers.

What are your thoughts on this? @gmega @benbierens

2-towns · 2024-10-16T16:54:42Z

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options
https://www.keycdn.com/support/what-is-mime-sniffing

cc @gmega

2-towns · 2024-10-21T08:45:21Z

I had a chat with @benbierens to understand the implications of storing metadata in the manifest, particularly the upload timestamp. It seems that this will generate a new CID for the same data without affecting the tree CIDs. Here’s the scenario to ensure we’re on the same page:

1- A user uploads data X at time T1, which produces CID1.
2- The same user uploads the same data X at time T2, resulting in CID2.
3- CID1 and CID2 are different but point to the same tree CIDs (I believe).

I’m not sure if this is fine, so I want to ensure we are aware of this and make a decision with this understanding.

cc @gmega @AuHau

AuHau · 2024-10-22T13:35:17Z

Yes that is true. IMHO, that is fine.

Maybe it would be interesting to know if Codex is announcing the tree CID on DHT? @gmega do you know? It would effect the deduplication properties in the network if not.

benbierens · 2024-10-23T07:07:51Z

Codex announced the tree CID and the manifest CID. The tree CID would be identical for the same data, but the manifest would be different depending on upload timestamp.

I'm not sure how I feel about this one. I think we have stuff built on the idea that uploading the same data always gives the same CID. This is true for the tree CID but not the manifest CID, which is (and has to be) the one returned from the upload call. We could consider returning the manifest JSON from the upload call.... What do you think?

AuHau · 2024-10-25T07:25:18Z

IMHO it is alright as long both "manifest CID" and "tree CID" are announced on the network and the user can access both of these through API, which I believe is the case thanks to Ben's latest manifest API - sooo I think we are good ;-)

dryajov · 2024-12-02T21:51:01Z

I know that we've already added this, but I strongly believe that the timestamp is unnecessary and generating a new manifest just because if was uploaded at a different is a bit excessive. I'm not following the reasoning for why we want this.

2-towns · 2024-12-03T08:46:20Z

I know that we've already added this, but I strongly believe that the timestamp is unnecessary and generating a new manifest just because if was uploaded at a different is a bit excessive. I'm not following the reasoning for why we want this.

I believe the primary reason was to enhance user experience (UX): each file has an associated timestamp, making it easy to sort the files and identify the most recently uploaded ones.

benbierens · 2024-12-04T09:40:55Z

We have added this some time ago, and I have been using it. Now I think I would propose we roll it back.
I very much want to have the same CID when I upload the same data. It is costing significant time/resources being unable to identify duplicate datasets from their CIDs. Yes you can identify them by fetching the manifest info and comparing the treeCIDs. But these are extra steps I have to take manually after I have already uploaded a dataset: I get a new CID, I have to fetch the manifest content. I have to get the tree CID and then I have to compare it to every other manifest in my node to see if I actually already uploaded this one before.

I've been trying to think of a way to "attach" an upload-timestamp to a manifest without it being "in" the manifest. The problem is that such "extra data" can't be hash-verified and so it's probably not wise to transmit it via the p2p layer. People could send you garbage and you have no way to check it. So, then what's the use-case for it?

2-towns · 2024-12-04T10:07:26Z

I very much want to have the same CID when I upload the same data.

Just a note, if you want that, you need to remote all the metadata, because you could have multiple upload with different name / content-type for the same data.

AuHau · 2024-12-04T10:12:20Z

Also to note that the data themselves (eq. the big piece of the dataset), will be actually deduplicated, it is only the the manifest that will be "duplicated". But I hear you that it would be nice to get the same CID when you upload the same file twice. I am also not sure if the benefit of having a timestamp overweights this benefit.

emizzle · 2024-12-05T01:28:13Z

I agree with Arnaud that changing any metadata, not just the upload timestamp, would affect the CID and probably shouldn't be included.

I personally see metadata as a SHOULD HAVE and not necessarily MUST HAVE for the protocol to operate correctly. One option would be to store the metadata information locally in the repostore and have it exchanged along with manifest blocks. My initial guess is that the relationship between the metadata and manifest can be maintained by having metadata structure contain the manifest CID. The metadata structure itself could be hashed for integrity, allowing multiple metadata infos per dataset (imagine changing the filename in the UX would create a separate metadata structure but not a new manifest...). The metadata exchanged with the manifest would be the most recent one, or worst case, none at all (it's not a MUST HAVE).

benbierens · 2024-12-05T08:04:16Z

I like Eric's ideas here. I just worry that the fields of the metadata will open up some vectors for attack/abuse of the system. It feels like there's an opportunity for some serious software-designing here. :D

gmega added the Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details label Sep 18, 2024

gmega changed the title ~~Store file names and MIME types as part of Manifest~~ Richer dataset manifests Sep 18, 2024

gmega assigned 2-towns Oct 9, 2024

2-towns mentioned this issue Oct 18, 2024

feat: add metadata to the manifest #960

Merged

2-towns linked a pull request Oct 18, 2024 that will close this issue

feat: add metadata to the manifest #960

Merged

gmega mentioned this issue Oct 22, 2024

Manifest versioning #965

Open

gmega closed this as completed in #960 Oct 25, 2024

dryajov reopened this Dec 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Richer dataset manifests #865

Richer dataset manifests #865

gmega commented Jul 31, 2024 •

edited

Loading

2-towns commented Aug 1, 2024

gmega commented Aug 1, 2024

2-towns commented Oct 16, 2024

2-towns commented Oct 16, 2024

2-towns commented Oct 21, 2024 •

edited

Loading

AuHau commented Oct 22, 2024

benbierens commented Oct 23, 2024

AuHau commented Oct 25, 2024

dryajov commented Dec 2, 2024

2-towns commented Dec 3, 2024

benbierens commented Dec 4, 2024

2-towns commented Dec 4, 2024

AuHau commented Dec 4, 2024

emizzle commented Dec 5, 2024

benbierens commented Dec 5, 2024

Richer dataset manifests #865

Richer dataset manifests #865

Comments

gmega commented Jul 31, 2024 • edited Loading

2-towns commented Aug 1, 2024

gmega commented Aug 1, 2024

2-towns commented Oct 16, 2024

2-towns commented Oct 16, 2024

2-towns commented Oct 21, 2024 • edited Loading

AuHau commented Oct 22, 2024

benbierens commented Oct 23, 2024

AuHau commented Oct 25, 2024

dryajov commented Dec 2, 2024

2-towns commented Dec 3, 2024

benbierens commented Dec 4, 2024

2-towns commented Dec 4, 2024

AuHau commented Dec 4, 2024

emizzle commented Dec 5, 2024

benbierens commented Dec 5, 2024

gmega commented Jul 31, 2024 •

edited

Loading

2-towns commented Oct 21, 2024 •

edited

Loading