Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP-402: Partial CAR Support on Trustless Gateways #402

Merged
merged 21 commits into from
Jul 27, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 7 additions & 7 deletions ipip-template.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,13 +36,6 @@ interoperable implementations.
When modifying an existing specification file, this section should provide a
summary of changes. When adding new specification files, list all of them.

## Test fixtures

List relevant CIDs. Describe how implementations can use them to determine
specification compliance. This section can be skipped if IPIP does not deal
with the way IPFS handles content-addressed data, or the modified specification
file already includes this information.

## Design rationale

The rationale fleshes out the specification by describing what motivated
Expand All @@ -67,6 +60,13 @@ Explain the security implications/considerations relevant to the proposed change

Describe alternate designs that were considered and related work.

## Test fixtures

List relevant CIDs. Describe how implementations can use them to determine
specification compliance. This section can be skipped if IPIP does not deal
with the way IPFS handles content-addressed data, or the modified specification
file already includes this information.

### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
19 changes: 13 additions & 6 deletions src/http-gateways/path-gateway.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ editors:
url: https://hacdias.com/
xref:
- url
- trustless-gateway
tags: ['httpGateways', 'lowLevelHttpGateways']
order: 0
---
Expand Down Expand Up @@ -214,11 +215,13 @@ These are the equivalents:
- `format=cbor` → `Accept: application/cbor`
- `format=ipns-record` → `Accept: application/vnd.ipfs.ipns-record`

<!-- TODO Planned: https://github.com/ipfs/go-ipfs/issues/8769
- `selector=<cid>` can be used for passing a CID with [IPLD selector](https://ipld.io/specs/selectors)
- Selector should be in dag-json or dag-cbor format
- This is a powerful primitive that allows for fetching subsets of data in specific order, either as raw bytes, or a CAR stream. Think “HTTP range requests”, but for IPLD, and more powerful.
-->
### `dag-scope` (request query parameter)

Only used on CAR requests, same as :ref[dag-scope] from :cite[trustless-gateway].

### `entity-bytes` (request query parameter)

Only used on CAR requests, same as :ref[entity-bytes] from :cite[trustless-gateway].

# HTTP Response

Expand Down Expand Up @@ -592,7 +595,11 @@ The following response types require an explicit opt-in, can only be requested w
- Raw Block (`?format=raw`)
- Opaque bytes, see [application/vnd.ipld.raw](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw).
- CAR (`?format=car`)
lidel marked this conversation as resolved.
Show resolved Hide resolved
- Arbitrary DAG as a verifiable CAR file or a stream, see [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car).
- A CAR file or a stream that contains all blocks required to trustlessly verify the requested content path query, see [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car) and :cite[trustless-gateway].
- **Note:** by default, block order in CAR response is not deterministic,
blocks can be returned in different order, depending on implementation
choices (traversal, speed at which blocks arrive from the network, etc).
An opt-in ordered CAR responses MAY be introduced in a future IPIP.
- TAR (`?format=tar`)
- Deserialized UnixFS files and directories as a TAR file or a stream, see :cite[ipip-0288].
- IPNS Record
Expand Down
190 changes: 180 additions & 10 deletions src/http-gateways/trustless-gateway.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ description: >
Trustless Gateways are a minimal subset of Path Gateways that allow light IPFS
clients to retrieve data behind a CID and verify its integrity without delegating any
trust to the gateway itself.
date: 2023-03-30
date: 2023-06-20
maturity: reliable
editors:
- name: Marcin Rataj
Expand All @@ -17,25 +17,33 @@ tags: ['httpGateways', 'lowLevelHttpGateways']
order: 1
---

Trustless Gateway is a minimal _subset_ of :cite[path-gateway]
Trustless Gateway is a _subset_ of :cite[path-gateway]
that allows light IPFS clients to retrieve data behind a CID and verify its
integrity without delegating any trust to the gateway itself.

The minimal implementation means:

- data is requested by CID, only supported path is `/ipfs/{cid}`
- no path traversal or recursive resolution, no UnixFS/IPLD decoding server-side
- response type is always fully verifiable: client can decide between a raw block or a CAR stream
- no UnixFS/IPLD deserialization
- for CAR files:
- the behavior is identical to :cite[path-gateway]
- for raw blocks:
- data is requested by CID, only supported path is `/ipfs/{cid}`
- no path traversal or recursive resolution

# HTTP API

A subset of "HTTP API" of :cite[path-gateway].

## `GET /ipfs/{cid}[?{params}]`
## `GET /ipfs/{cid}[/{path}][?{params}]`

Downloads data at specified CID.
Downloads verifiable data for the specified **immutable** content path.

## `HEAD /ipfs/{cid}[?{params}]`
Optional `path` is permitted for requests that specify CAR format (`application/vnd.ipld.car`).

For RAW requests, only `GET /ipfs/{cid}[?{params}]` is supported.

## `HEAD /ipfs/{cid}[/{path}][?{params}]`

Same as GET, but does not return any payload.

Expand All @@ -45,13 +53,13 @@ Downloads data at specified IPNS Key. Verifiable :cite[ipns-record] can be reque

## `HEAD /ipns/{key}[?{params}]`

same as GET, but does not return any payload.
Same as GET, but does not return any payload.

lidel marked this conversation as resolved.
Show resolved Hide resolved
# HTTP Request

Same as in :cite[path-gateway], but with limited number of supported response types.

## HTTP Request Headers
## Request Headers

### `Accept` (request header)

Expand All @@ -67,12 +75,174 @@ Below response types SHOULD to be supported:
Gateway SHOULD return HTTP 400 Bad Request when running in strict trustless
mode (no deserialized responses) and `Accept` header is missing.

## Request Query Parameters

### :dfn[dag-scope] (request query parameter)

Optional, `dag-scope=(block|entity|all)` with default value `all`, only available for CAR requests.

Describes the shape of the DAG fetched the terminus of the specified path whose blocks
are included in the returned CAR file after the blocks required to traverse
path segments.

- `block` - Only the root block at the end of the path is returned after blocks
required to verify the specified path segments.

- `entity` - For queries that traverse UnixFS data, `entity` roughly means return
blocks needed to verify the terminating element of the requested content path.
For UnixFS, all the blocks needed to read an entire UnixFS file, or enumerate a UnixFS directory.
For all queries that reference non-UnixFS data, `entity` is equivalent to `block`

- `all` - Transmit the entire contiguous DAG that begins at the end of the path
query, after blocks required to verify path segments

When present, returned `Etag` must include unique prefix based on the passed scope type.

### :dfn[entity-bytes] (request query parameter)

The optional `entity-bytes=from:to` parameter is available only for CAR
requests.

It implies `dag-scope=entity` and serves as a trustless equivalent of an HTTP
Range Request.

When the terminating entity at the end of the specified content path:

- can be interpreted as a continuous array of bytes (such as a UnixFS file), a
Gateway MUST return only the minimal set of blocks necessary to verify the
specified byte range of that entity.

- When dealing with a sharded UnixFS file (`dag-pb`, `0x70`) and a non-zero
`from` value, the UnixFS data and `blocksizes` determine the
corresponding starting block for a given `from` offset.

- cannot be interpreted as a continuous array of bytes (such as a DAG-CBOR/JSON
map or UnixFS directory), the parameter MUST be ignored, and the request is
equivalent to `dag-scope=entity`.

Allowed values for `from` and `to` follow a subset of section 14.1.2 from
:cite[rfc9110], where they are defined as offset integers that limit the
returned blocks to only those necessary to satisfy the range `[from,to]`:

- `from` value gives the byte-offset of the first byte in a range.
- `to` value gives the byte-offset of the last byte in the range;
that is, the byte positions specified are inclusive.

The following additional values are supported:

- `*` can be substituted for end-of-file
- `entity-bytes=0:*` is the entire file (a verifiable version of HTTP request for `Range: 0-`)
- Negative numbers can be used for referring to bytes from the end of a file
lidel marked this conversation as resolved.
Show resolved Hide resolved
- `entity-bytes=-1024:*` is the last 1024 bytes of a file
(verifiable version of HTTP request for `Range: -1024`)
- It is also permissible (unlike with HTTP Range Requests) to ask for the
range of 500 bytes from the beginning of the file to 1000 bytes from the
end: `entity-bytes=499:-1000`

Copy link
Member

@rvagg rvagg May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When the entity is a sharded UnixFS file, and the `from` value is _not_ `0`, the
claimed encoded size (`Tsize`) values of the individual shards that make up the
file will be implicitly trusted to determine the block that contains that
offset. Because this cannot be fully verified for correctness without having all
blocks from the start of the file content, consumers of this data must trust
that the producer of the data has properly encoded the blocks of the UnixFS
sharded file. If a stronger guarantees are required, byte ranges should be
avoided, or should be anchored with a `from` of `0`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh look! Real-world use proving this problem: ipfs/js-ipfs-unixfs#335

Copy link
Contributor

@aschmahmann aschmahmann May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg dag-pb TSize has nothing to do with entity-bytes at all. The UnixFS filesize and blocksizes (protobuf elements 3 and 4 of the Data field of the dag-pb nodes for UnixFS files) does.

AFAIK the only thing TSize is potentially used by for anyone in the context of the HTTP Gateway API is in the context of the trusted component of the API and giving rough estimates of file/directory sizes in directory HTMLs when the gateway doesn't want to use resources to grab the first block of each directory element to figure out if it's a file/directory and for files what the filesize is. However, what goes into a directory HTML file is non-normative (see

## Generated HTML with directory index
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, different value, same problem though; the value is encoded by the producer of the blocks and has to be implicitly trusted to get the right byte range

Copy link
Contributor

@aschmahmann aschmahmann May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the issue with malformed graphs still exists along with the trust problems you mentioned. It shows up here, but it also likely belongs as an implementers note on the trusted gateway spec as well.

I was just calling out that:

  1. TSize is the wrong field
  2. Unlike TSize filesize and blocksizes have actual definitions which means when they're not obeyed the UnixFS graph is malformed unlike the argument about TSize where people seem more hesitant to give it an actual definition (although it seems to have been defined as cumulative graph size since day 1 ipfs/go-merkledag@63e4477#diff-10837cc3557cec0045183193a03f17af589591f2be0753262027534bb8f64ad9R38-R39)

Edit: Also, IIUC because filesize and blocksizes have actual definitions implementations don' t actually have to fall into this trap (even if common ones today do). As long as ReadAll(<UnixFS file cid>)[X:Y] matches Read(<UnixFS file cid>, X,Y) this problem goes away. So this isn't really a spec issue here, or necessarily in UnixFS (although notes to implementers seem reasonable), since these issues only happen with malformed UnixFS data.

Copy link
Member

@lidel lidel Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added filesize / blocksizes note in 0953cb6, but agree, the implicit trust in DAG being correctly created is a concern beyond this IPIP or even this API.

It is the same class of problems as "I do trust data match the CID i requested, but where do I get trusted CIDs from?" – IMO beyond the scope of this IPIP.

A Gateway MUST augment the returned `Etag` based on the passed `entity-bytes`.

A Gateway SHOULD return an HTTP 400 Bad Request error when the requested range
cannot be parsed as valid offset positions.

In more nuanced error scenarios, a Gateway MUST return a valid CAR response
that includes enough blocks for the client to understand why the requested
`entity-bytes` was incorrect or why only a part of the requested byte range was
returned:
lidel marked this conversation as resolved.
Show resolved Hide resolved

- If the requested `entity-bytes` resolves to a range that partially falls
outside of the entity's byte range, the response MUST include the subset of
blocks within the entity's bytes.
- This allows clients to request valid ranges of the entity without needing
to know its total size beforehand, and it does not require the Gateway to
buffer the entire entity before returning the response.

- If the requested `entity-bytes` resolves to a zero-length range or falls
fully outside of the entity's bytes, the response is equivalent to
`dag-scope=block`.
- This allows client to produce a meaningful error (e.g, in case of UnixFS,
leverage `Data.blocksizes` information present in the root `dag-pb` block).

- In streaming scenarios, if a Gateway is capable of returning the root block
but lacks prior knowledge of the final component of the requested content
path being invalid or absent in the DAG, a Gateway SHOULD respond with HTTP 200.
- This behavior is a consequence of HTTP streaming limitations: blocks are
not buffered, by the time a related parent block is being parsed and
returned to the client, the HTTP status code has already been sent to the
client.

# HTTP Response

Below MUST be implemented **in addition** to "HTTP Response" of :cite[path-gateway].

## HTTP Response Headers
## Response Headers

### `Content-Type` (response header)

MUST be returned and include additional format-specific parameters when possible.

If a CAR stream was requested, the response MUST include the parameter specifying CAR version.
For example: `Content-Type: application/vnd.ipld.car; version=1`

### `Content-Disposition` (response header)

MUST be returned and set to `attachment` to ensure requested bytes are not rendered by a web browser.

## Response Payload

### Block Response

An opaque bytes matching the requested block CID
([application/vnd.ipld.raw](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw)).

The Body hash MUST match the Multihash from the requested CID.

### CAR Response

A CAR stream for the requested
[application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car)
content type, path and optional `dag-scope` and `entity-bytes` URL parameters.

#### CAR version

Value returned in
[`CarV1Header.version`](https://ipld.io/specs/transport/car/carv1/#header)
field MUST match the `version` parameter returned in `Content-Type` header.

#### CAR roots

The behavior associated with the
[`CarV1Header.roots`](https://ipld.io/specs/transport/car/carv1/#header) field
is not currently specified.

Clients MAY ignore it.

:::issue

As of 2023-06-20, the behavior of the `roots` CAR field remains an [unresolved item within the CARv1 specification](https://web.archive.org/web/20230328013837/https://ipld.io/specs/transport/car/carv1/#unresolved-items).

:::

#### CAR determinism

The default CAR header and block order in a CAR response is not specified and is non-deterministic.

Clients MUST NOT assume that CAR responses are deterministic (byte-for-byte identical) across different gateways.

Clients MUST NOT assume that CAR includes CIDs and their blocks in the same order across different gateways.

:::issue

In controlled environments, clients MAY choose to rely on undocumented CAR determinism,
subject to the agreement of the following conditions between the client and the
gateway:
- CAR version
- content of [`CarV1Header.roots`](https://ipld.io/specs/transport/car/carv1/#header) field
- order of blocks
- status of duplicate blocks

In the future, there may be an introduction of a convention to indicate aspects
of determinism in CAR responses. Please refer to
[IPIP-412](https://github.com/ipfs/specs/pull/412) for potential developments
in this area.

:::
Loading