Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

CAR format spec doc #230

Closed
wants to merge 5 commits into from
Closed

CAR format spec doc #230

wants to merge 5 commits into from

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jan 3, 2020

Builds on @mishmosh's work in #229 for a complete spec.

Best reviewed as rendered MD: https://github.com/ipld/specs/blob/rvagg/car/block-layer/content-addressable-archives.md

I'm confident in the technical details, but when reviewing I'd like you to consider the following items that I've codified:

  1. The name, as per discussion in stub for car file format #229, what should it be? I've gone with "Content Addressable aRchive" but am happy to be overridden if there is existing lore about the name I'm not aware of.
  2. The roots don't need to have a strict relationship with the blocks other than them existing somewhere in the CAR. They may describe complete or partial DAGs comprising all blocks in a CAR or they may describe DAGs formed by some of the blocks in a CAR with the rest being disconnected, or there may be no roots at all (as per https://github.com/ipfs/go-car/issues/19). I've done this in my JavaScript CAR implementation but go-car doesn't and perhaps it breaks some original intentions of the format?
  3. An empty CAR is not ruled out, i.e. no roots, no blocks, just a header block: {version:1,roots:[]}. We could rule it out.
  4. Duplicate blocks are not ruled out: "The possibility of duplicate blocks in a single CAR (such as for padding) is currently not specified." They may be useful for easy padding (also discussion of padding included in doc), we've discussed the possibility of padding for lining up blocks to Filecoin window boundaries but haven't gone far down that rabbit hole. Of course padding can also be done without duplicate blocks too, I just don't see a reason to codify strict no-duplicates.

/R @mishmosh @whyrusleeping @dignifiedquire @Kubuxu @Stebalien

mishmosh and others added 2 commits January 3, 2020 15:37
@rvagg rvagg requested review from mikeal, vmx and warpfork January 3, 2020 05:03
Copy link
Contributor

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misunderstanding how Filecoin intends to use CARs, the main issue we need to address is that CARs need to be, at a minimum, deterministic. Specifically, we need to specify exactly how to compute a "CAR of a DAG (selector?)" such that we always get the same result, byte by byte.

* The `version` is always a value of `1`. Future iterations of this specification may make use of `version` to introduce variations of the format.
* The `roots` array must contain **zero or more** CIDs, each of which must be present somewhere in the remainder of the CAR.

An empty `roots` array is valid, but indicates that the CAR does not necessarily contain blocks forming a coherent DAG. Such a CAR may, for instance, form part of a collection of archives that, when combined, contain an entire DAG.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this? Ideally, the CAR header would be enough to reconstruct the CAR.

It would be nice to make this somewhat deterministic, at least. Something like: if some roots are listed, all roots must be listed.

* **Streaming**: the CAR format is ideal for dumping blocks via streaming reads as the Header can be loaded first and minimal state is required for ongoing parsing.
* **Individual block reads**: as the CAR format contains no index information, reads require either a partial scan to discover the location of a required block or an external index must be maintained and referenced for a seek and partial read of that data. See below regarding indexing.
* **DAG traversal**: without an external index, traversal of a DAG specified by a "root" CID is not possible without dumping all blocks into a more convenient data store or by partial scans to find each block as required, which will likely be too inefficient to be practical.
* **Modification**: CARs may be appended after initial write as there is no constraint in the Header regarding total length. Care must be taken in appending if a CAR is intended to contain coherent DAG data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum, we need to define a canonical CAR variant such that a CAR of a DAG is deterministic.

(especially if we intend to use them in Filecoin sectors)


The roots specified by the Header of a CAR must appear somewhere in its Data section, however there is no requirement that the roots define entire DAGs, nor that all blocks in a CAR must be part of DAGs described by the root CIDs in the Header. Therefore, the roots must not be used alone to determine or differentiate the contents of a CAR.

The CAR format contains no internal means, beyond the IPLD block formats and their CIDs, to verify or differentiate contents. Where such a requirement exists, this must be performed externally, such as creating a digest of the entire CAR.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, CAR originally stood for for certified archive because it was supposed to be self-certified/validating.


### Indexing and Seeking Reads

The CAR format contains no internal indexing, any indexing must be stored externally to a CAR. However, such indexing is possible and makes seeking reads practical. An index storing byte offset (of section start or block data start) and length (of section or block data), keyed by CID, will enable a single block read by seeking to the offset and reading the block data. The format of any such index is not specified here and is left up to CAR format parsing implementations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also be stored in the header, couldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, a version 2 should include indexing information, it would just need additional up-front processing to make a CAR with an index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't bump the version for that. Version bumps should be for layout changes.

But otherwise, yeah, we can deal with that later.


### Header

The first bytes of the CAR format hold a varint, this unsigned integer specifies the number of bytes beyond the varint itself that contain the _Header_ block. This Header block is a byte array DAG-CBOR (CBOR with tag 42 for CIDs) encoded object holding the version number and array of roots. As an [IPLD Schema](../schemas/):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we could just make this entire thing recursive. That is, the entire CAR could be:

[ varint | CID | BLOCK ] *

Where the first block is always the header object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would make a nice symmetry. Unfortunately the version information is in the header block and there's no CID before that block, so if we had a v2 that introduced a CID for the header block then our parsers would need to be able to figure out whether the bytes after the initial varint are a CID or the start of a CBOR block (map['version':x,...). Is the difference between the start of a CID and the start of a CBOR map distinct enough for such branching?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was to make that version 1. If we also required that the root block list all root blocks, the CAR would actually be a single DAG and would be really easy to validate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. Reading the header would be more work (as you would also need to read its CID), but that should be negligible compared to making parsing easier as it's less of a special case.

@rvagg
Copy link
Member Author

rvagg commented Jan 6, 2020

Responding to most of @Stebalien's inline comments here because they revolve around this key point:

CARs need to be, at a minimum, deterministic

That would explain "certified".

In https://github.com/ipfs/go-car/issues/19 we originally discussed not having any roots and allowing it to be a store of arbitrary blocks. The primary motivation was that we want to produce CARs that have a max size containing data for very large data structures (e.g. if we store very large DAGs in Filecoin we're limited by sector size). So, can we have a parent CAR that contains one or more roots and its DAG spans outside of the CAR to blocks contained "elsewhere" (the connection piece of this puzzle is yet to be determined).

If deterministic is a strict requirement for CARs then I'll change a bunch of this spec to make that the case:

  1. mandatory roots
  2. blocks must be part of a DAG described by the roots
    a. can there be unreferenced parts of a DAG? i.e. can the nodes have links to content outside of this DAG? This would rule out a lot of IPLD data structures.
  3. no modification because there's only one-true-dag for roots
  4. need to describe DAG walking algorithm/order to make sure the blocks go in the same order regardless of codecs and implementation

But that doesn't help with the size problem for very large DAGs.

Or do we make multiple CAR modes? Where we could introduce a "certified" mode that has a fully self-contained DAG starting at its roots, and a "lax" mode that is just a format for storing arbitrary blocks.

@mikeal
Copy link
Contributor

mikeal commented Jan 6, 2020

I don’t think it’s all that practical to ensure the format is deterministic without making huge sacrifices in other areas.

For instance, how does the format ensure the blocks are in a particular order? We could add such a requirement, but then anyone who writes to the format won’t be able to stream to it, they’ll need to prepare all the data some other way and then write blocks in an order that ensured to be compliant with the spec.

We need to reconcile the use cases we’re building this for before we make statements like “must” about features we don’t even have yet. There’s a lot of tradeoffs to be made here and the use case that has pushed us to even write up the spec and build another implementation isn’t visible to anyone but those of us currently working on it.

@Stebalien
Copy link
Contributor

Read:

2.a. can there be unreferenced parts of a DAG? i.e. can the nodes have links to content outside of this DAG? This would rule out a lot of IPLD data structures.

That's an interesting question. One solution would be to use selectors. Looks like there was an attempt: https://github.com/ipfs/go-car/pull/18.

But that doesn't help with the size problem for very large DAGs.

Being able to CAR a selector should help.

I don’t think it’s all that practical to ensure the format is deterministic without making huge sacrifices in other areas.

Being able to deterministically create a CAR from a CID (or a selector?) is a hard requirement for systems like Filecoin. Allowing non-canonical/non-deterministic CARs is probably fine.

For instance, how does the format ensure the blocks are in a particular order? We could add such a requirement, but then anyone who writes to the format won’t be able to stream to it, they’ll need to prepare all the data some other way and then write blocks in an order that ensured to be compliant with the spec.

Add blocks to the dag in-order, starting at the root, sorted in link order. As far as I know, we already have this feature.

@Stebalien
Copy link
Contributor

Note: if you need a "bag of blocks" format, it might make sense to create two formats:

  1. A bag of blocks format.
  2. A CAR format, built on top of the bag of blocks format, that specifies a specific, deterministic layout.

@hannahhoward
Copy link
Collaborator

Hey IPLDers — we had been assuming in Filecoin land that the CAR would tend toward “simple bag of blocks” and writing the CAR from a root + selector in a deterministic way would be an out of band concern for Filecoin to handle. But that assumes Filecoin does the CARification and the entry point for storing something is a cid + selector & blockstore to read from (which is where we’re headed)

@hannahhoward
Copy link
Collaborator

Part of the reason we were headed this way is selector traversal is itself not deterministic and we didn’t want to impose that requirement on selectors in general even if we were going provide a special case when we write car files. I guess though if none of this is in the CAR spec it better be in the Filecoin spec and currently it’s not :)

@Stebalien
Copy link
Contributor

How are we transferring the data? I assumed we'd be shipping the dag as a bunch of blocks (using, e.g., bitswap), then constructing the CAR on the remote (miner) side. If so, we need to deterministically construct it.

@hannahhoward
Copy link
Collaborator

@Stebalien your understanding of the transfer mechanics is correct and you are right that CARifying absolutely has to be deterministic for filecoin’s case. Our assumption was though because Filecoin would be doing the CARification itself that deterministic construction would be a Filecoin special case. For reference while https://github.com/ipfs/go-car/issues/18 exists my current assumption is we will move this code into Filecoin itself because we may want our writing process to store even more additional information (outside the file itself) like direct byte offsets to a given cid in the car file.

@mikeal
Copy link
Contributor

mikeal commented Jan 6, 2020

Add blocks to the dag in-order, starting at the root, sorted in link order. As far as I know, we already have this feature.

Sure, but that’s not enforced by the format, which is what we’re talking about here. Yes, of course different applications will need to build these in a deterministic way, but it’s a stretch to say that it’s the role of the format to enforce that constraint, which it sounds like you aren’t actually recommending based on “Allowing non-canonical/non-deterministic CARs is probably fine.”

The reason this is an important distinction for us is that we are currently writing tools to prepare data to be stored in Filecoin in .car files and there isn’t really a canonical representation of the data (local to the .car file) because the data is so large that a single file is sharded over multiple .car files, so each .car file doesn’t even have a local DAG to determine “ordering.” In order to understand the ordering you need data that is outside that .car file, which means we wouldn’t be able to build something that could validate any ordering at the format layer.

@Stebalien
Copy link
Contributor

@hannahhoward the spec needs to specify how we go from a CID to a CAR, as far as I know. Otherwise, we're going to have compatibility issues.

@Stebalien
Copy link
Contributor

Sure, but that’s not enforced by the format, which is what we’re talking about here. Yes, of course different applications will need to build these in a deterministic way, but it’s a stretch to say that it’s the role of the format to enforce that constraint, which it sounds like you aren’t actually recommending based on “Allowing non-canonical/non-deterministic CARs is probably fine.”

I've just been saying that there must exist a canonical CAR format. That's the only must.

Additionally, it would be really nice if CARs were completely self-certifying (as anything else is not a Certified ARchive by definition).

@hannahhoward
Copy link
Collaborator

@mikeal did you see my comment above describing the entry point for the filecoin system?

Filecoin does the CARification and the entry point for storing something is a cid + selector & blockstore to read from

Are you expecting to provide car files directly? The current design of the code assumes data is not already put in CAR and filecoin has control over how that happens.

@mikeal
Copy link
Contributor

mikeal commented Jan 7, 2020

@hannahhoward I think we might be talking about different car files being created for different purposes.

We’re preparing large amounts of data to be stored in Filecoin. It’s not really practical for Filecoin to create those .car files, we have to build a lot of custom code in order to take a few PB of data and create the graphs and prepare the .car files, and we have to build all of this to run concurrently because even with 10K running in parallel it takes weeks to prepare the data.

@Stebalien
Copy link
Contributor

@hannahhoward

Off topic: Doesn't graphsync rely on deterministic selector evaluation order for efficient validation?

On topic: How we go from a CID+Selector to a CAR is a critical part of the Filecoin spec. If a client and storage miner don't do the exact same thing, they won't be able to form a deal. If we just say "do whatever the code does" now, we'll have to turn that code into a spec as-is when we finish the spec. That's likely going to lead to a convoluted spec with a lot of edge cases. We can hand-wave network protocols and fix them later but hand-waving piece formation is going to be a problem later.


### Header

The first bytes of the CAR format hold a varint, this unsigned integer specifies the number of bytes beyond the varint itself that contain the _Header_ block. This Header block is a byte array DAG-CBOR (CBOR with tag 42 for CIDs) encoded object holding the version number and array of roots. As an [IPLD Schema](../schemas/):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. Reading the header would be more work (as you would also need to read its CID), but that should be negligible compared to making parsing easier as it's less of a special case.


#### CID

Following the Length, the CID of the block is included in raw byte form. A decoder reading a Section must decode the CID according to CID byte encoding rules, which don't provide a stable length. See https://github.com/multiformats/cid for details on the encoding of a CID. CIDv0 and CIDv1 are currently supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support CIDv0 or do we support it because we can? I don't see a reason to use CIDv0 these days and I'd be happy to phase them out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support CIDv0, we use it everywhere in IPFS. Long ago we could have probably introduced some kind of "use CIDv1 everywhere but always encode DagPB CIDs as CIDv0 when encoding a DagPB block" rule, but that ship has sailed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to create .car files you need some processing anyway. So doing a v0 to v1 conversion shouldn't be that hard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that forming a block full of CIDv1 CIDs produces a different CID than a block formed using CIDv0 CIDs. I'm concerned about erasing the CIDv1/v0 information.

But... this does bring up a good point. We might reference the same block as CIDv1 and CIDv0. Given that, I think you're right. It might make sense to specify that CID "headers" (i.e., not internal CIDs), must be normalized to CIDv1.

There's a second (unlikely) issue where multiple blocks might be valid under different codecs. This is one of the reasons we're changing our blockstores to refer to blocks by multihash, instead of by CID.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CIDv0 is supported in current CAR code, but iirc the only public use of them is shipping FIL genesis blocks around so I guess it could be broken. Would need some consensus from folks responsible for go-car to make that happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a CIDv1 requirement on the headers we should make sure to document it well and require that CAR readers will accept a CIDv0 as the read input and convert to CIDv1 when looking in the .car file. This implementation detail is likely to be overlooked if not well documented.

@mikeal
Copy link
Contributor

mikeal commented Jan 17, 2020

Apologies for the large comment but this is potentially controversial and I want to make sure I capture all of the details.

I’d like to unwind/challenge the requirements being surfaced regarding determinism. To be clear, I am not questioning the need for determinism in many places where .car files are used, the need for determinism in these use cases is quite clear. What I don’t think is clear, and highly problematic, is trying to achieve and potentially ensure determinism at the .car file rather than by other means. In fact, I would argue that we already are achieving determinism by other means and that determinism is simply carrying over into the .car files we’re creating.

Currently, the .car file library in Go ensures determinism at creation time by performing a deterministic read operation that writes the .car file in a predictable and consistent manner. It can only ensure determinism by performing a deterministic read-on-write and has no way to do this as a write-only operation. So, just to be clear, we do not currently have or rely upon a guarantee than any .car file we receive is deterministic.

Instead, we rely on deterministic creation of .car files by different nodes. If a node were to generate a .car file in an indeterministic manner the operation it was trying to accomplish with other nodes would simply fail.

Another way to describe this behavior that I think could bring us a lot more flexibility and better guarantees is that we have deterministic graph readers. Or, even better, what we have are deterministic selectors.

We only have deterministic .car files to the extent that we have .car files created from deterministic selectors.

Even outside of .car file usage, selectors already must be deterministic in order to work in a trust-less environment because the client receiving the results of a selector must be able to validate the data it is receiving as a stream without excessive buffering, which requires that the ordering be consistent between different implementations. I recall this being part of the selector/graphsync specification process. We’re about to do some more selector spec work in IPLD and we could easily expand and solidify this.

Some problems with trying to place a deterministic requirement on .car files are:

  • Lack of a streaming interface. When we encode data to be added to a .car file we must first encode the entire graph and write it into a blockstore which we then read and perform another write operation on in order to output a deterministic .car file. This is a full write and full read of the entire graph we may otherwise be able to avoid.
  • No way to validate a given .car file is deterministic without a full read.
  • Assuming we go with the recommendations here and ensure that “a .car file can be deterministically re-constructed given the root node” we have an additional problem.
    • We cannot ensure determinism of partial graphs. We can construct selectors for partial graphs but since the requirement is for nothing but a root node to be used in re-construction 🤷‍♂️ Unless we’re willing to write additional data into the .car file about the selector that created a partial graph we won’t ever have a way to validate a .car file without out-of-band information. This is further evidence that we can’t really have deterministic .car files apart from selectors.

By comparison, leaning on deterministic selectors has the following benefits:

  • We can validate determinism block-by-block throughout any read operation.
  • We can re-construct the result of any selector deterministically given the selector.
  • We can ensure these guarantees whenever a read operation is performed.
  • Given a .car file and a selector we can validate the .car file is a deterministic result of the selector.
  • We can create better streaming interfaces for .car file creation.

Finally, it’s worth noting that this would require zero code changes. Like I said, this is already what we’re doing, this is really just about how we talk about determinism and where we try to expand and improve these guarantees in the future.

In my view, it’s actually somewhat dangerous to say we have guarantees that we aren’t validating and could quite easily not be true. In the places I’m aware of in Filecoin where determinism is required for .car files it’s only required for specific operations that could be accomplished by other means and deals could easily be created for indeterministic .car files. This makes me worry about assumptions being made about determinism up the stack that aren’t necessarily true. But everywhere that we use a selector, we can ensure determinism, and if we shift our focus to “determinism via selectors” we won’t find later that we assumed there was a guarantee we don’t actually have.

Tying this all back together, my ideal representation of a “deterministic .car file” is a (selector + car file) in some format. The requirements for determinism should migrate to selectors and we should ensure that current and future selector specifications leave no room for ambiguity on issues like ordering.

If Filecoin, or anyone else, wants a deterministic guarantee they should specify both the selector and the .car file (or some other unique identifier for the .car file like CommP). Which, again, is what we’re already doing in practice.

@whyrusleeping
Copy link
Contributor

@mikeal that makes a large amount of sense to me

@Stebalien
Copy link
Contributor

I agree. The only requirement we have is that there exists a well-defined, deterministic way to go from a selector to a CAR. The CAR format itself doesn't have to require or guarantee any particular order of blocks.

@hannahhoward
Copy link
Collaborator

+1 to what @mikeal is saying.

In terms of future work:

  • Having ways to make selector traversal deterministic is important but probably not a universal requirement for selector traversal for all cases. It seems to me describing some options for deterministic traversal in the selector spec would be helpful, and then we could reference what is used in places where determinism is critical.

For example, I believe the current determinism to be:

  • depth first
  • explore lists in ascending index order
  • explore maps in order of the ordinal value of the fields in the map (tracked internally)

While the above is currently the only traversal type used anywhere, it would be good to have it documented in the selector spec, and referenced in the two places it's vital:

  • Graphsync (should be referenced in graphsync spec probably)
  • Writing CARs in the context of Filecoin (should be referenced in FC spec)

@rvagg
Copy link
Member Author

rvagg commented Jan 20, 2020

I don't want to derail this conversation but I think it's worth noting so we don't leave this snippet as a hand-wave:

explore maps in order of the ordinal value of the fields in the map (tracked internally)

We dove into this discussion today in our meeting, https://github.com/ipld/team-mgmt/blob/8d612e3ce1222de9dd531ca14bdeeb662bd2a3a1/meeting-notes/2020/2020-01-20--ipld-sync.md#notes - map ordering will likely be defined in our strict DAG-* codecs (dag-cbor, dag-json) but will have to be exposed through the encoding in a way that selectors can make use of, which will be interesting in some languages where you can't control natural map ordering (Go for example, but in Go if you want ordering you need a side-channel anyway so this is all being baked into go-ipld-prime anyway).

The best place for further discussion on this, for now, is probably in #227 where I've noted it (it's collecting some CBOR-specific loose vs strict thoughts).

@rvagg
Copy link
Member Author

rvagg commented Jan 21, 2020

I've pushed an update that I hope puts this in a place where it can land, see last commit or read rendered @ https://github.com/ipld/specs/blob/rvagg/car/block-layer/content-addressable-archives.md

Summary:

  • Noted the previously used "Certified ARchive" name, but kept "Content Addressable aRchive" since the move away from strict determinism makes "certified" harder to justify (I think?).
  • Added a section on Determinism which suggests how it might be achieved but basically says it's beyond the scope of this spec
  • Added a collection of "Unresolved Items" at the bottom:
    • Zero roots - the spec says "zero or more" roots in the header but then also says that it's not strictly resolved that zero is OK
    • Zero blocks - likewise the spec says "zero or more blocks", but that's also not strictly resolved
    • CID version - there's discussion above about limiting the CIDs in this format (not inside blocks) to CIDv1, that's unresolved but a possible future part of this spec
    • Root CID block existence - it's not yet specified whether encoders or decoders should validate that a CID in the roots array actually shows up in the archive, this is a nontrivial burden that maybe we're not prepared to dictate
    • Duplicate blocks - like root CID existence, it's not specified whether encoders or decoders should validate the existence of duplicates

If you feel strongly that one of those unresolved items should be resolved then please speak up, otherwise let's treat this as needing final reviews.

@vmx
Copy link
Member

vmx commented Jan 21, 2020

  • CID version - there's discussion above about limiting the CIDs in this format (not inside blocks) to CIDv1, that's unresolved but a possible future part of this spec

I feel strongly about it as I think restricting it to CIDv1 would make things easier and the spec more precise. I also haven't really seen a reason for not restricting it to CIDv1. Though if that again leads to discussions, it could be added in a subsequent PR.

@rvagg
Copy link
Member Author

rvagg commented Jan 22, 2020

re CIDs I'm mostly concerned about it being a one-way operation. If you have a datastore with blocks in it and some are CIDv0, the round-trip through a CAR will mean they all get converted to CIDv1. If they are used as string keys then anything trying to reference a block will have to know that it has to up-convert to CIDv1 to fetch it. Is that going to work? tbh I'm not completely in sync with how this will work in practice and if it'll matter; maybe this up-conversion on look-up is being done actively now anyway?

@mikeal
Copy link
Contributor

mikeal commented Jan 22, 2020

If you have a datastore with blocks in it and some are CIDv0, the round-trip through a CAR will mean they all get converted to CIDv1. If they are used as string keys then anything trying to reference a block will have to know that it has to up-convert to CIDv1 to fetch it. Is that going to work? tbh I'm not completely in sync with how this will work in practice and if it'll matter; maybe this up-conversion on look-up is being done actively now anyway?

The actual block data won’t be modified so this should be fine. Keep in mind, this is essentially a decision the CAR format is making for internal representation and implementations can handle translation of older (or perhaps some day newer) CID versions.

Requiring CIDv1 adds some extra work to CAR implementors but once that work is done there’s no externally visible transformation because:

  • CAR file put() operations will transparently convert v0 to v1.
  • CAR file get() operations will also transparently convert v0 to v1.

This means that when you read block data that references v0 links you can just pass them right off to the CAR file and it’ll pull them out properly. References inside the block to v0 are fine and will remain unchanged. Any external string references to v0 are also fine because they will be converted on get().

The only caveat is that if you parse the entire CAR file’s CID’s and do your own comparison you’ll need to do the conversion yourself as well.

@rvagg
Copy link
Member Author

rvagg commented Jan 22, 2020

Right, my point was that I think existing datastores retain CID version during storage, there's no translation right now is there? For instance, if someone goes to ipfs.io/ipfs/Qm.... then under the hood is it doing a CIDv1 translation of that to look in datastore caches already (same on the client end when doing a DHT query)? If not, then it would seem like the CID bytes are being treated strictly as a key, rather than the content address that it is, so are we ready to be enforcing conversion in all of these other code edges?
I'm fine with changing it if others feel strongly enough about it, I just want to check that we're not putting the cart before the horse in doing so.

@ribasushi
Copy link
Contributor

if someone goes to ipfs.io/ipfs/Qm.... then under the hood is it doing a CIDv1 translation of that to look in datastore caches already

I believe implementing such dual-lookup this is the actual plan, but can not find the issue I've seen that...

@vmx
Copy link
Member

vmx commented Jan 22, 2020

Right, my point was that I think existing datastores retain CID version during storage, there's no translation right now is there?

There are plans that the datastores use only the Multihash and not the full CID: ipfs/js-ipfs#2415.

@mikeal
Copy link
Contributor

mikeal commented Jan 22, 2020

It’s my understanding that part of the transition existing data stores are making from v0 to v1 CID’s is moving the storage layer to use the multihash.

We should consider doing the same in a future version of CAR (for block identifiers, not for root node identifiers) but that is probably too disruptive at the moment.

Copy link
Member

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the "only use CIDv1" would be a quick, easy, non-controversial change that doesn't need much discussions. It turns out this is not true, hence I propose opening a separate issue for it and get this version of the spec merged ASAP.

@mikeal
Copy link
Contributor

mikeal commented Jan 27, 2020

I had a sync up with @Stebalien and @jbenet last Friday and came to a few conclusions.

  1. Let’s back out everything that is incompatible with what has already been serialized as CAR and ship it. We need a spec for the CAR files in the wild we might have to deal with so let’s get it written down.

  2. Let’s start on a minor improvement to the format with a major version bump that includes:

  • v1 CID internals
  • an improved/extended header format that will allow us to include a CBOR selector alongside each root CID. when used, this would establish determinism for the CAR file that can be validated on read.

@rvagg
Copy link
Member Author

rvagg commented Jan 28, 2020

I've done some minor cleanups in the latest commit. I've also changed the roots array from being, in essence, "zero or more, but we haven't strictly resolved that" to being "one or more, but we haven't strictly resolved that". And while I was at it, I did the same for the number of blocks, "one or more blocks" must be present.

We're leaning a lot on an "unresolved issues" section here, so we'll want to focus on those things for an iteration of this spec. (@ribasushi also raised the possibility of identity CIDs with zero-length blocks, but I'll leave that out of the for now for further consideration).

PLEASE REVIEW all, this is intended to reflect current state, a future PR can evolve to a version 2.

Copy link
Member

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that happy with shipping a half-baked format with a spec with things we are pretty sure we want to change in the future. But the spec itself is great, thanks @rvagg!

@mikeal
Copy link
Contributor

mikeal commented Jan 31, 2020

Sorry for the last minute addition.

We should note that not all implementations properly support multiple roots (filecoin) and that it is highly recommended you only use a single root in this version of the spec.

@rvagg
Copy link
Member Author

rvagg commented Feb 3, 2020

@mikeal: see c901366, I've turned the "zero roots" section into a brief rundown of implementations and usage and a recommendation to stick to a single root.

@mikeal
Copy link
Contributor

mikeal commented Feb 3, 2020

looks great, let’s get it merged.

@rvagg rvagg closed this Feb 3, 2020
@rvagg rvagg deleted the rvagg/car branch February 3, 2020 21:12
@vmx
Copy link
Member

vmx commented Feb 12, 2020

I've no idea why this PR doesn't show any merge. But in case someone is wondering, this got merged with commit 71ef2bd.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants