Naming blocks in the datastore #242
Comments
If we were to start storing metadata with a block, part of that could be whether it's pinned (and how - direct, indirect, multiples thereof, etc), which would make the pinning code a lot simpler and (arguably) gc way faster as it would be trivial to parallelise. |
That makes pinning maintenance horrible because every time you unpin something you need to check if the item is pinned by something else, update all the items etc. |
I would like to stress that proposal 1 does not change much from the current defaults. We are effectively indexing by multihash already when using CIDv0. If we were indexing using CIDv1s, we would also have a situation where the codecs don't give us meaningful information for most of the blocks in the store (raw leaves, dag-pb chunk leaves). In the end, any content used on IPFS needs an address; that's the CID, and you can derive everything from there. I don't see much value in the ability to know the codec for every block in the datastore, especially when we already have a list of root CIDs stored separately that we can derive information from (the pinset/MFS). So what do we expect to get from having that structural info for everything? If we choose that way, it should be because it is needed for a very specific feature that we want to support (Option 1 is there as support for the switch to base32). |
Why not store the blobs by raw multihash, and have an additional table that has just the CIDs? That way you don't burden the blob store with unnecessary information but still retain all the metadata. |
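To make the two-table idea concrete, here is a minimal in-memory sketch. It is not how go-ipfs or js-ipfs implement their datastores: `TwoTableStore`, `multihash_of`, and the `writes` counter are hypothetical names, and the CID parsing assumes binary CIDs with sha2-256 multihashes and single-byte codec varints.

```python
import hashlib


def multihash_of(cid: bytes) -> bytes:
    """Strip the CID prefix, assuming sha2-256 and single-byte varints."""
    if len(cid) == 34 and cid[:2] == b"\x12\x20":
        return cid          # binary CIDv0 is already a bare multihash
    return cid[2:]          # binary CIDv1: <version><codec><multihash>


class TwoTableStore:
    """Hypothetical store: blobs keyed by multihash, CIDs kept in a side table."""

    def __init__(self):
        self.blobs = {}     # multihash -> block bytes
        self.cids = {}      # multihash -> set of CIDs seen for that block
        self.writes = 0     # number of datastore writes performed

    def put(self, cid: bytes, block: bytes) -> None:
        mh = multihash_of(cid)
        if mh not in self.blobs:
            self.blobs[mh] = block
            self.writes += 1
        seen = self.cids.setdefault(mh, set())
        if cid not in seen:
            seen.add(cid)
            self.writes += 1

    def get(self, cid: bytes):
        return self.blobs.get(multihash_of(cid))


block = b"hello"
mh = b"\x12\x20" + hashlib.sha256(block).digest()
v0 = mh                         # CIDv0
v1_raw = b"\x01\x55" + mh       # CIDv1 with the raw codec (0x55)

store = TwoTableStore()
store.put(v0, block)
store.put(v1_raw, block)
```

In this sketch the block is stored once and both CIDs are retained, at the cost of an extra write per newly seen CID.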
At the moment when we want to GC something we have to check every direct pin and every child of every recursive pin to see if the block is pinned which seems quite similar to the above. My assumption is that GC operations will run more frequently than unpinning so it might make sense to optimise for that use case. |
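As a rough illustration of the GC cost described here, the following sketch marks every block reachable from the pins and sweeps the rest. The `links` map, the pin sets, and the function names are hypothetical stand-ins for the blockstore and pinner, not the actual go-ipfs code.

```python
def live_blocks(links, direct_pins, recursive_pins):
    """Return the keys GC must keep: direct pins plus everything under recursive pins."""
    visited = set()
    stack = list(recursive_pins)
    while stack:
        key = stack.pop()
        if key in visited:
            continue
        visited.add(key)
        # Every child of a recursively pinned block is itself kept alive.
        stack.extend(links.get(key, ()))
    return visited | set(direct_pins)


def collect_garbage(links, direct_pins, recursive_pins):
    """Delete every block not in the live set and return the removed keys."""
    live = live_blocks(links, direct_pins, recursive_pins)
    removed = [key for key in links if key not in live]
    for key in removed:
        del links[key]
    return removed


# Hypothetical store: block key -> keys of the blocks it links to.
links = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": [],
    "d": ["e"],   # pinned directly, so its child "e" is not protected
    "e": [],
    "f": [],      # unpinned
}
removed = collect_garbage(links, direct_pins={"d"}, recursive_pins={"root"})
```

The whole pinned graph is walked on every GC run, which is the cost that per-block pin metadata would be trading against.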
In our usage unpinning is much much more common than GC. We periodically call GC, whereas every time any user modifies any file they pin their new root (with a pin update) and then unpin the old one. |
That's a variant of solution 4. The concern there is that we'd now have multiple writes under different keys, every time we write a block.
I agree we'll need to do something like this eventually, but ideally only for pinned blocks. That is, I'm more comfortable with this kind of overhead when pinning, but less so when just adding random blocks (though maybe it's still fine).
There are ways to optimize this. When we unpin, we'd need to traverse all newly-unreferenced blocks and when we pin, we'd need to traverse all newly-referenced blocks. However, we can probably do this asynchronously by recording things we need to do in the datastore, and making sure we work through the backlog before we GC. |
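One way to picture the asynchronous approach is a reference count per block plus a backlog of pending updates that is drained before GC. This is only a sketch under assumptions: `RefCountedPins`, its in-memory backlog, and the `links` map are hypothetical, and a real implementation would persist the backlog in the datastore as described above.

```python
from collections import deque


class RefCountedPins:
    """Hypothetical per-block reference counts with deferred updates."""

    def __init__(self, links):
        self.links = links          # block key -> keys of linked blocks
        self.refs = {}              # block key -> reference count (absent means 0)
        self.backlog = deque()      # pending (delta, key) updates

    def pin(self, root):
        self.backlog.append((+1, root))     # cheap: just record the work

    def unpin(self, root):
        self.backlog.append((-1, root))

    def drain(self):
        while self.backlog:
            delta, key = self.backlog.popleft()
            old = self.refs.get(key, 0)
            new = old + delta
            if new == 0:
                self.refs.pop(key, None)
            else:
                self.refs[key] = new
            # Only newly referenced (0 -> 1) or newly unreferenced (1 -> 0)
            # blocks require their children to be visited.
            if (old, new) in ((0, 1), (1, 0)):
                for child in self.links.get(key, ()):
                    self.backlog.append((delta, child))

    def gc(self):
        self.drain()                # work through the backlog before collecting
        dead = [key for key in self.links if key not in self.refs]
        for key in dead:
            del self.links[key]
        return dead


links = {"old": ["shared", "y"], "new": ["shared", "x"], "shared": [], "x": [], "y": []}
pins = RefCountedPins(links)
pins.pin("old")
pins.pin("new")       # pin the new root ...
pins.unpin("old")     # ... then unpin the old one
dead = pins.gc()
```

Here `shared` survives because the new root still references it, while `old` and `y` are collected without walking the rest of the pin set.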
I agree for dag-pb. I think the main concern here is that, when we start getting more and more CBOR blocks, CIDs start becoming more useful. However, in my personal opinion, everything that is unpinned is just "cached" and being able to enumerate a cache is not a desirable feature. |
Said otherwise, we could support an "everything-pinned" mode. i.e. an additional pinset where everything that is written goes, but does not necessarily need to be enabled by default, only for the cases where cache enumeration is important. Moving to datastore-backed pinset should enable us to do this more or less easily. I guess what I mean is that we can separate this into two problems and solve the problem of cache enumeration on top of storing raw multihashes, and do it at a later point in time, rather than now (and that complexity may be better managed with that approach). |
I agree. This would actually be really nice because we'd be able to GC better. That is, we'd be able to:
|
I think this is stale at this point and we have surfaced concerns and identified how to potentially address them later as needed. Can we add this to 0.6.0 milestone? |
Resolution
|
My opinion: Since I first got involved with IPFS, CIDs have always been touted as content addressers: an interpretive codec followed by a raw data identifier. I assumed that CIDs existed only above the actual storage layer and would mostly be used for "user-facing" address resolution and interpretation. Using CIDs in the blockstore attaches unnecessary "addressing flair" (codec + future additions to CIDvX) to the raw data, and the same raw data could exist in duplicate under different CIDs with the same internal multihash. I therefore assumed that multihashes had been used from the start, since any alternative seemed unreasonable given how the whole IPFS system works. I was surprised to see that CIDs were used instead, so my opinion is that we should immediately make plans to migrate. I'm honestly a bit confused as to why this was implemented this way in the first place. |
There was a desire to not lose information. That is, being able to list and understand all blocks in a datastore as structured data is nice. See the first post for the "desired properties". On the other hand, I completely agree that referencing by multihash at this layer makes the most sense. |
If that's true, then an extra list or "store" of CID keys that have ever been seen across the whole client (and network) could help with this problem. CIDs could then act as "headers" or indexes, the way they do in filesystems today, while the raw multihash-referenced blocks act as the "raw data" on "disk". With all CIDs seen across the network stored, a protocol could later be specified that "repairs" or performs a "reverse lookup" on a raw set of blocks to find out how the data fitted together, with the help of the network, by asking peers (and possibly also the DHT) whether they have seen CIDs containing this multihash. |
go-ipfs 0.12 is when the switch of the low level datastore to use multihash keys happens: ipfs/kubo#6816 / ipfs/kubo#8344 Do we have any specs or docs that need updating, or can this issue be closed? |
Context: https://github.com/ipfs/ipfs/issues/337
Currently, both js-ipfs and go-ipfs index blocks in the datastore by CID. Unfortunately, it's possible for a single block to have multiple CIDs due to (a) different CID versions and (b) due to different multicodecs (e.g., dag-cbor v. cbor v. raw).
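To illustrate how one block ends up with several CIDs, here is a small sketch that builds binary CIDs by hand. It assumes sha2-256 and multicodec codes below 0x80 (so each varint is a single byte); the helper names are hypothetical, and a real implementation would use a CID library instead.

```python
import hashlib

SHA2_256 = 0x12   # multihash code for sha2-256
DAG_PB = 0x70     # multicodec code for dag-pb
RAW = 0x55        # multicodec code for raw


def multihash_sha256(block: bytes) -> bytes:
    """Multihash layout: <hash function code><digest length><digest>."""
    digest = hashlib.sha256(block).digest()
    return bytes([SHA2_256, len(digest)]) + digest


def cid_v0(block: bytes) -> bytes:
    """A binary CIDv0 is the bare sha2-256 multihash; dag-pb is implied."""
    return multihash_sha256(block)


def cid_v1(block: bytes, codec: int) -> bytes:
    """A binary CIDv1 is <version 1><codec><multihash>."""
    if codec >= 0x80:
        raise ValueError("this sketch only handles single-byte codec varints")
    return bytes([0x01, codec]) + multihash_sha256(block)


block = b"hello"
mh = multihash_sha256(block)
cids = {cid_v0(block), cid_v1(block, DAG_PB), cid_v1(block, RAW)}
```

All three CIDs are distinct datastore keys, yet each one ends with the same multihash `mh`.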
The primary concern is (a). As we switch to returning CIDv1 (for multibase support), we still want to be able to look up blocks that were fetched/added as CIDv0.
Currently, when looking up a CID, both go-ipfs and js-ipfs will first attempt to look up the CID under the original CID version, then under the other CID version. However, this can cost us two lookups.
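The difference between the fallback lookup and a multihash-keyed lookup can be sketched as follows. This is an illustration only: `CountingStore` and the helper functions are hypothetical, and the CID handling assumes binary CIDs with sha2-256 multihashes and single-byte codec varints.

```python
import hashlib

DAG_PB = 0x70  # multicodec code for dag-pb


def is_v0(cid: bytes) -> bool:
    return len(cid) == 34 and cid[:2] == b"\x12\x20"


def other_version(cid: bytes):
    """Return the equivalent CID in the other version, or None if there is none."""
    if is_v0(cid):
        return bytes([0x01, DAG_PB]) + cid
    if cid[0] == 0x01 and cid[1] == DAG_PB and is_v0(cid[2:]):
        return cid[2:]
    return None


def to_multihash(cid: bytes) -> bytes:
    return cid if is_v0(cid) else cid[2:]


class CountingStore:
    """Hypothetical key-value store that counts read operations."""

    def __init__(self):
        self.data = {}
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.data.get(key)


def get_by_cid(store, cid):
    """CID-keyed store: try the CID as given, then the other version."""
    block = store.get(cid)
    if block is None:
        alt = other_version(cid)
        if alt is not None:
            block = store.get(alt)
    return block


def get_by_multihash(store, cid):
    """Multihash-keyed store: one read regardless of CID version or codec."""
    return store.get(to_multihash(cid))


block = b"hello"
v0 = b"\x12\x20" + hashlib.sha256(block).digest()
v1 = other_version(v0)

cid_store = CountingStore()
cid_store.data[v0] = block                  # block was added as CIDv0
found_cid = get_by_cid(cid_store, v1)       # but requested as CIDv1

mh_store = CountingStore()
mh_store.data[to_multihash(v0)] = block
found_mh = get_by_multihash(mh_store, v1)

print(cid_store.gets, mh_store.gets)        # prints: 2 1
```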
Primary use-case:
Ensure that CIDv1 and CIDv0 can be used interchangeably, especially in the gateway.
Proposals:
Desired properties:
a. CID Versions differ: 1, 2 & 3
b. CID Codecs differ: 1 & 3.
c. Hash functions differ: 3.
a. Time: 1, 2, & 4.
b. Space: 1 & 2.
The current consensus is option 1 (multihash) (ipfs/kubo#6815, ipfs/js-ipfs#2415). The proposal here is to consider option 2 (and at least look at the others).
However, option 1 doesn't give us property 2 as it discards the codec on write. @jbenet has objected strongly to this as we will be discarding structural information about the data. This information will still be stored in the pin set but we'll be losing this information for unpinned data.
We have also run into this same issue when trying to write a reverse datastore migration, migrating back from multihashes to CIDs: we need to somehow recover the codecs. The reverse migration in Option 2 would simply be: do nothing.
We need to consider the pros/cons of switching to option 1 before proceeding.
Side note: why should we even care about (b), multiple codecs for the same block?
IPLD objects as "files"
Dumb block transport
I might want to take an unbalanced DAG (blockchain), a DAG with weird codecs (git, eth, etc.), etc. and sync it with another node. It would be nice if I could take this DAG, treat all the nodes as raw nodes, then build a well-balanced "overlay-dag" using simple, well-supported codecs.
This might, for example, be useful for storing DAGs with custom codecs in pinning services.