Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Richer dataset manifests #865

Open
gmega opened this issue Jul 31, 2024 · 15 comments · Fixed by #960
Open

Richer dataset manifests #865

gmega opened this issue Jul 31, 2024 · 15 comments · Fixed by #960
Assignees
Labels
Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details

Comments

@gmega
Copy link
Member

gmega commented Jul 31, 2024

As part of the improvements required for the Codex app, we want to store in the manifests:

  1. MIME type;
  2. file name;
  3. upload timestamp.
@2-towns
Copy link
Contributor

2-towns commented Aug 1, 2024

Do you think we should add the uploaded date / time ?

@gmega
Copy link
Member Author

gmega commented Aug 1, 2024

Yeah, if that's not tracked elsewhere already, sounds like a good idea.

@gmega gmega added the Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details label Sep 18, 2024
@gmega gmega changed the title Store file names and MIME types as part of Manifest Richer dataset manifests Sep 18, 2024
@2-towns
Copy link
Contributor

2-towns commented Oct 16, 2024

For the filename, I think we will need to pass it from the client side because we are using a binary string for the upload and it is not available anywhere else, unlike the multipart form data where the filename is available.

For the MIME type, I couldn't find a way to retrieve it from nim-presto or any other source, so I think we should detect it from the magic numbers in the buffer. We can use the package available at https://github.com/jiro4989/filetype, or we can implement it ourselves if we prefer not to import an external package. It's not complicated; we would just need an exhaustive list of MIME types along with their associated magic numbers.

What are your thoughts on this? @gmega @benbierens

@2-towns 2-towns linked a pull request Oct 18, 2024 that will close this issue
@2-towns
Copy link
Contributor

2-towns commented Oct 21, 2024

I had a chat with @benbierens to understand the implications of storing metadata in the manifest, particularly the upload timestamp. It seems that this will generate a new CID for the same data without affecting the tree CIDs. Here’s the scenario to ensure we’re on the same page:

1- A user uploads data X at time T1, which produces CID1.
2- The same user uploads the same data X at time T2, resulting in CID2.
3- CID1 and CID2 are different but point to the same tree CIDs (I believe).

I’m not sure if this is fine, so I want to ensure we are aware of this and make a decision with this understanding.

cc @gmega @AuHau

@AuHau
Copy link
Member

AuHau commented Oct 22, 2024

Yes that is true. IMHO, that is fine.

Maybe it would be interesting to know if Codex is announcing the tree CID on DHT? @gmega do you know? It would effect the deduplication properties in the network if not.

@benbierens
Copy link
Contributor

Codex announced the tree CID and the manifest CID. The tree CID would be identical for the same data, but the manifest would be different depending on upload timestamp.

I'm not sure how I feel about this one. I think we have stuff built on the idea that uploading the same data always gives the same CID. This is true for the tree CID but not the manifest CID, which is (and has to be) the one returned from the upload call. We could consider returning the manifest JSON from the upload call.... What do you think?

@AuHau
Copy link
Member

AuHau commented Oct 25, 2024

IMHO it is alright as long both "manifest CID" and "tree CID" are announced on the network and the user can access both of these through API, which I believe is the case thanks to Ben's latest manifest API - sooo I think we are good ;-)

@dryajov
Copy link
Contributor

dryajov commented Dec 2, 2024

I know that we've already added this, but I strongly believe that the timestamp is unnecessary and generating a new manifest just because if was uploaded at a different is a bit excessive. I'm not following the reasoning for why we want this.

@dryajov dryajov reopened this Dec 2, 2024
@2-towns
Copy link
Contributor

2-towns commented Dec 3, 2024

I know that we've already added this, but I strongly believe that the timestamp is unnecessary and generating a new manifest just because if was uploaded at a different is a bit excessive. I'm not following the reasoning for why we want this.

I believe the primary reason was to enhance user experience (UX): each file has an associated timestamp, making it easy to sort the files and identify the most recently uploaded ones.

@benbierens
Copy link
Contributor

We have added this some time ago, and I have been using it. Now I think I would propose we roll it back.
I very much want to have the same CID when I upload the same data. It is costing significant time/resources being unable to identify duplicate datasets from their CIDs. Yes you can identify them by fetching the manifest info and comparing the treeCIDs. But these are extra steps I have to take manually after I have already uploaded a dataset: I get a new CID, I have to fetch the manifest content. I have to get the tree CID and then I have to compare it to every other manifest in my node to see if I actually already uploaded this one before.

I've been trying to think of a way to "attach" an upload-timestamp to a manifest without it being "in" the manifest. The problem is that such "extra data" can't be hash-verified and so it's probably not wise to transmit it via the p2p layer. People could send you garbage and you have no way to check it. So, then what's the use-case for it?

@2-towns
Copy link
Contributor

2-towns commented Dec 4, 2024

I very much want to have the same CID when I upload the same data.

Just a note, if you want that, you need to remote all the metadata, because you could have multiple upload with different name / content-type for the same data.

@AuHau
Copy link
Member

AuHau commented Dec 4, 2024

Also to note that the data themselves (eq. the big piece of the dataset), will be actually deduplicated, it is only the the manifest that will be "duplicated". But I hear you that it would be nice to get the same CID when you upload the same file twice. I am also not sure if the benefit of having a timestamp overweights this benefit.

@emizzle
Copy link
Contributor

emizzle commented Dec 5, 2024

I agree with Arnaud that changing any metadata, not just the upload timestamp, would affect the CID and probably shouldn't be included.

I personally see metadata as a SHOULD HAVE and not necessarily MUST HAVE for the protocol to operate correctly. One option would be to store the metadata information locally in the repostore and have it exchanged along with manifest blocks. My initial guess is that the relationship between the metadata and manifest can be maintained by having metadata structure contain the manifest CID. The metadata structure itself could be hashed for integrity, allowing multiple metadata infos per dataset (imagine changing the filename in the UX would create a separate metadata structure but not a new manifest...). The metadata exchanged with the manifest would be the most recent one, or worst case, none at all (it's not a MUST HAVE).

@benbierens
Copy link
Contributor

I like Eric's ideas here. I just worry that the fields of the metadata will open up some vectors for attack/abuse of the system. It feels like there's an opportunity for some serious software-designing here. :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants