Replies: 2 comments 2 replies
I'm not 100% sure that the Dagger caching part is accurate. First off, it's not really Dagger but Buildkit doing the caching. Secondly, I don't think the claim about "reusing cache on the same machine" is true. In fact, there have been lots of improvements in how caching works in Buildkit (for example, it can talk directly to GitHub's caching API for Actions, although I've had a lot of issues with that implementation; it can also push cache layers to remote registries for reuse between builds/machines).

I'm not an expert in how Buildkit caching works either, but my understanding is that as long as a layer doesn't change (which is a function of the build instruction in a `Dockerfile` and the entire context - files, build args, etc. passed to Buildkit), Buildkit can reuse it during a build. The major difference I see is that while Nix knows what needs building upfront, Buildkit needs to go through the build process step by step and verify whether each layer can be reused from the cache or needs to be rebuilt. Also, every subsequent layer will be rebuilt if Buildkit gets a cache miss for any of the layers.

There is also another feature in Buildkit which allows you to mount volumes during a build. It's useful for reusing caches that you don't want in the image after the build (e.g. package manager caches).
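With Buildkit's Dockerfile frontend that looks something like this (the paths and packages are just an example):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11
COPY requirements.txt .
# the pip download cache lives in a cache mount that can be reused across
# builds, but never ends up in the resulting image
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```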
My $.02 based on my experience using Nix with Bass (which uses Buildkit under the hood, same as Dagger): I think your comparison reflects a difference in defaults/expectations regarding hermetic builds between the two ecosystems, but I don't think it's an accurate assessment of the technical underpinnings of caching in Buildkit, which are far closer to Nix than you might expect. For this comparison I'll focus on Buildkit, since I don't yet know Dagger fully, and as far as I know Dagger doesn't intend to represent things any differently.

Caveat 1: I'm the opposite, with my background being mainly Buildkit. I'm still learning Nix, and I've been using NixOS as my daily driver as of a few weeks ago.

Caveat 2: I'm only really familiar with Buildkit from an API consumer's perspective, so this is all based on observations made while building Bass. Happy to be corrected on any of this by parties that know more about either. 😄

Nix is a specialization of Buildkit

Mostly tongue-in-cheek clickbait, but it seems to me that Buildkit could be used to implement something much like Nix (at least the building/caching parts; it couldn't be a host OS). Here are my justifications for this mental arrangement. Note that this isn't a qualitative assessment; I'm just looking at it through a lens of "is X a Y, or is Y an X?"
Buildkit is just a toolkit. It doesn't enforce the same strong opinions as Nix because its initial intended use case is `docker build`. Where I suspect Dagger and Nix might differ in the long run is that Dagger might remain willing to compromise and not require users to make everything reproducible, for the sake of a simpler onboarding experience. I don't really know. But I think we should absolutely support reproducible builds if it's not the plan already.

Input precision

The fundamental building block of Buildkit's API is an LLB operation. LLB operations are bundled together to form an LLB definition, which is very similar to a Nix derivation: it describes a result, and it contains all of its inputs recursively along with the relationships between them, forming a DAG (hence DAGger).

Buildkit caches by the hash of each operation within the LLB definition. Caches are re-used across different LLB definitions that embed them. Concurrent LLB solve() API calls will synchronize such that there will be no duplicate work for each individual operation. Buildkit will automatically multiplex build output progress over the API. Buildkit also tracks usage (time + count) for cache hits, which you can use to prioritize pruning.

While Nix forces all inputs to be precise, Buildkit merely works best if all inputs are precise. If you're aiming for reproducible builds it doesn't make sense to run a naked `git clone`.

At first glance, the first difference here is that Nix forces you to be precise by disabling the network during a build. To fetch something in Nix you have to use primitives for fetching from Git/HTTP, and those primitives require the user to be precise with versions, checksums, etc.

```nix
with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "treason";
  builder = pkgs.runCommand "yolo" {} ''
    ${pkgs.git}/bin/git clone https://github.com/vito/bass
  '';
}
```

```console
❯ nix-build treason.nix
these 2 derivations will be built:
/nix/store/4mwrz0kn1hiq3zr99yslnl888rlpp1bq-yolo.drv
/nix/store/pnys2brnvsajq6fxby7rkg6klqciak1i-light-treason.drv
building '/nix/store/4mwrz0kn1hiq3zr99yslnl888rlpp1bq-yolo.drv'...
Cloning into '/nix/store/n370r05rvz896h9nc3zxcn2324n6h9qm-yolo'...
fatal: unable to access 'https://github.com/vito/bass/': Could not resolve host: github.com
error: builder for '/nix/store/4mwrz0kn1hiq3zr99yslnl888rlpp1bq-yolo.drv' failed with exit code 128;
last 2 log lines:
> Cloning into '/nix/store/n370r05rvz896h9nc3zxcn2324n6h9qm-yolo'...
> fatal: unable to access 'https://github.com/vito/bass/': Could not resolve host: github.com
For full logs, run 'nix log /nix/store/4mwrz0kn1hiq3zr99yslnl888rlpp1bq-yolo.drv'.
error: 1 dependencies of derivation '/nix/store/pnys2brnvsajq6fxby7rkg6klqciak1i-light-treason.drv' failed to build
```

But this is just an opinion, and one that Buildkit is also capable of representing. It's possible to set a command's network to `none`. Disabling network access by default would obviously be a major disruption to the Docker community, but it's something Dagger or any platform on top of it could support.

Strictly speaking, afaik there's nothing Nix can do to prevent a derivation from yielding a different result on every run either. It's just running commands at the end of the day; there are other sources of randomness. It does timestamp normalization, which goes a long way, but you can do that with Buildkit too: Bass normalizes outputs to 1985 prior to snapshotting. In any case, Buildkit's default caching behavior at least encourages you to be precise so your caches don't grow stale. It isn't a strong guarantee, but platforms built on top of it can still try to enforce it.

Host filesystem isolation

The most foundational difference from Nix with Buildkit is that every command you run comes with one extra input: its root filesystem. Nix derivations write to the host filesystem in one big monolithic `/nix/store` directory. Buildkit also stores snapshots in one big monolithic "snapshots" directory, but the hierarchy is less flat, since it also contains entire root filesystems. Its snapshot filesystem hierarchy isn't directly content-addressed; it has a database instead and tracks snapshots by a numeric ID, managed by the Buildkit daemon and presumably keyed on the digest of the operation that created it. I wouldn't expect to ever look at this path directly.

In practice, requiring the user to define a root filesystem is what leads to a lot of non-reproducible builds. Buildkit doesn't provide a safe way to fetch the packages you need; that would mean "blessing" a particular package repository, which is a tall order for a generic toolkit. So lots of folks just do what they know: YOLO from Ubuntu or Alpine and `apt-get install`/`apk add` whatever they need.

This is where I think Buildkit stands to benefit the most from Nix - or more specifically, from Nixery. Since you can also use any old mount point as the rootfs input for a Run operation, you can also build a rootfs directory using whatever tool you want. You just need to solve the bootstrapping problem of fetching said tool. The latter two options are where I landed with Bass. It's possible to run Nixery in Buildkit and use it to fetch images for a later thunk, and it's also possible to use Nix itself to build a rootfs. So it turns out the solution to reproducible builds with Buildkit is to use Nix. 😜

Side note: it's also possible to use a persistent cache mount for reusing a local `/nix/store` between builds.

Multiple outputs

Each input mount point, including the rootfs, has a corresponding output: a snapshot of changes made to the mount point. When you chain Run operations, each one's rootfs output becomes the next one's rootfs input. But this applies to all the other mount points, too. Bass, for example, has each thunk produce its own output directory independent of the rest of the rootfs. Thunks are effectively a content-addressed (and now network-addressed) output directory, much more similar to Nix derivations, since they're just an output directory mount rather than a full filesystem snapshot. But under the hood they just compile to an LLB definition. The important thing here is that Bass makes use of both, since it propagates both the output directory and the rootfs output from a parent thunk to the child thunk.

fin

This ended up being a lot of words, but I wanted to show how similar these two systems really are.
It would be awesome if we could unearth opportunities to make them work nicely together, since I think it could really be mutually beneficial!
This discussion arose a while ago from some comments on the TVL & Dagger doc. TL;DR, we use Nix and are planning to experiment with compiling our CI pipelines (which currently target Buildkite) to Dagger/Cloak.
In the document we noted that we need a way to mount the folder `/nix/store` from any host executing build containers into the containers, mutably, at the same location. There was some confusion in the discussion surrounding this about what the `/nix/store` is and how it differs from what Dagger does, and I think it's worthwhile to start this discussion around why each of the two systems caches, what they cache, and how they determine cache status.

My model of how this works in Dagger/Buildkit might be wrong, so let's discuss!
As a Nixer, I'll start with Nix :)
Caching in Nix
If we zoom out a lot, you can think of Nix as a language for defining filesystem transformations. I'll explain using a simplified model, and use some pseudo-Rust code to try and make this more clear.
Nix's path addressing model
In Nix, the fundamental "building block" is something called a "derivation", which you can broadly speaking think of as a function of this type:
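```rust
// pseudo-Rust: the names are made up and the model is heavily simplified
fn derivation(name: &str, inputs: Vec<Input>, instructions: BuildInstructions) -> Output {
    // 1. hash all inputs, 2. compute the output StorePath from those hashes,
    // 3. run the build instructions in an isolated environment,
    // 4. return the Output
    todo!()
}
```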
Where each `Input` can be a number of different things, but reducing it to the basics, something like:
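```rust
use std::path::PathBuf;

// reduced to the two kinds of input that matter for this explanation
enum Input {
    // local source files: a single file or a whole tree of files
    Files(PathBuf),
    // the output path of another derivation
    DerivationOutput(StorePath),
}
```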
The `BuildInstructions` are something like a shell script to execute in the build context (just imagine it is a string).

The `Output` is something like:
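```rust
struct Output {
    // the store path the finished build result lives at
    path: StorePath,
}

// just a newtype over a plain string
struct StorePath(String);
```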
And a `StorePath` is simply a string like `/nix/store/dxbx340akykivwghyxns5dy9wi98q43s-dagger`. The components of this string have specific meanings:

- `/nix/store/` - the store prefix under which all build outputs live
- `dxbx340akykivwghyxns5dy9wi98q43s` - a hash derived from all inputs of the derivation
- `dagger` - a human-readable name
What the input hash is produced from depends on the type of input. This isn't actually how it works, but you could imagine that there is a function like this:
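```rust
// hash_file_tree, hash_string and Hash are made-up helpers
fn input_hash(input: &Input) -> Hash {
    match input {
        // hash the actual file contents (recursively, for a tree of files)
        Input::Files(path) => hash_file_tree(path),
        // a store path already has the hash of *its* inputs baked into it,
        // so hashing the path string is enough
        Input::DerivationOutput(store_path) => hash_string(&store_path.0),
    }
}
```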
And then somewhere in our virtual `derivation` function from above there would be a call to this:
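```rust
// inside our virtual derivation(): the output location is derived purely
// from the inputs, *before* anything gets built (make_store_path and
// hash_all are made up, like everything else here)
let input_hashes: Vec<Hash> = inputs.iter().map(input_hash).collect();
let output_path = make_store_path(hash_all(&input_hashes, &instructions), name);
```

Knowing all this, we can now analyse a small snippet of (pseudo-)Nix code:

```nix
# (pseudo-)Nix - not a complete derivation, just the shape of one
stdenv.mkDerivation {
  name = "my-package";
  src = ./path/to/package/source;
  buildInputs = [ someDerivation ];
}
```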
The inputs would construct us an `Input::Files(./path/to/package/source)` and an `Input::DerivationOutput(<StorePath of someDerivation>)`, which could look like:
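```rust
// the local source tree, which gets hashed by its contents
Input::Files(PathBuf::from("./path/to/package/source"))
```

and

```rust
// the dependency's output path; the hash embedded in it already pins
// everything that went into building someDerivation
Input::DerivationOutput(StorePath(
    "/nix/store/<input hash of someDerivation>-someDerivation".to_string(),
))
```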
It can be assumed that if we follow the dependencies of `someDerivation` recursively, at some point all leaves of this "input hashing tree" will point to actual file hashes, more or less (in practice inputs end up being all sorts of wild things, like git repositories - where Nix expects the commit and content hashes to be fully pinned. You can't just fetch `master`!)

(Side note: Another effect of this model is that Nix build outputs are absolutely guaranteed to never overlap! This means we do not have to care about the order in which things are produced, copied, written etc.)
Derivation caching
What follows from this is that Nix can, given some Nix code that contains arbitrarily complex build instructions and dependency graphs, always determine the expected output location! This means that in the Nix caching model, after the language is evaluated, Nix can determine all relevant input/output paths statically.
When it comes to build time (we call this "realisation" of a derivation), Nix can determine whether something needs to be built more or less like this:
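```rust
// store_contains and build are made up; the point is that the output path
// is known before the build even starts
fn realise(drv: &Derivation) -> StorePath {
    let output_path = drv.output_path(); // known statically, after evaluation
    if !store_contains(&output_path) {
        // only now do we actually run the build instructions
        build(drv);
    }
    output_path
}
```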
In practice there is more logic of course, as a Nix instance might be configured to ask a remote binary cache if it has a given path and download it from there.
The effect of this caching mechanism is that Nix can avoid doing builds if it knows that an equivalent (i.e. repeatably produced from the exact same sources) store path already exists.
Nix caching summary
Why? - Nix caches build outputs to avoid having to do unnecessary rebuilds if an equivalent output already exists.
What? - Nix caches input-hash-addressed filesystem paths. These can be a single file or a tree of files, it does not matter.
How? - Nix enforces that all input hashes to a build are known and prevents builds from randomly interfacing with the outside world. This means that given the same Nix build definitions and source files, Nix can know with certainty whether a build can be skipped and substituted with the output it has already seen.
The Nix caching model does not rely on any state on a builder machine. Caches are reusable between machines executing the same build definitions, and the only effect of losing a cached output is that an equivalent output will be built again.
Note that when using Nix as a package manager, this effect is how binary packages are distributed. Nix the package manager is essentially a "from-source" package manager (think Gentoo) rather than a "binary distribution" package-manager (think dpkg/apt on Debian). It just so happens that if we centrally execute as many of these source builds as we can and make the cache publicly accessible, it automatically becomes a binary distribution in the majority of cases - with no extra work to determine how artifacts are stored and distributed.
Caching in Dagger
The caching model described above for Nix is drastically, but somewhat subtly, different from the caching model of Dagger/Buildkit. While Nix creates a caching model that is reusable between machines, Dagger is primarily interested in caching build steps on a builder machine.
Note: I don't know how the internals of the Dagger/Buildkit caching work, so most of the below is assumptions:
Dagger's cache model
When executing a build described in Dagger, the runtime creates caching steps along the way on the machine executing the build (as each build essentially produces something like the "layers" of a `Dockerfile` build). For example:
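Thinking of it in `Dockerfile` terms, just as an illustration of the layer caching (this is not Dagger's actual syntax):

```dockerfile
# resolved to whatever the tag happens to point at when this layer is built
FROM ubuntu:22.04
# cache key: roughly the command string plus the parent layer, so the exact
# package versions depend on when this layer was last (re)built
RUN apt update && apt install -y build-essential git
# same story: a new HEAD upstream does not invalidate the cached layer
RUN git clone https://github.com/example/project /src
RUN make -C /src
```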
This model has the crucial effect of caching being stateful, i.e. without being able to share the cache between machines, running the exact same build on a different machine with an empty cache can yield a non-equivalent output. Examples of this would be if the `HEAD` commit of the git repository changed between the builds, or if a command like `apt update && apt install ...` runs before and after an update to the upstream package repository.

Dagger caching summary
Why? - Dagger caches build steps to avoid having to redo potentially expensive work, when reusing a previous artifact is "good enough".
What? - Dagger caches the filesystem layers that were produced by each build step.
How? - Dagger uses all available information from the build instructions to determine cache keys, but does not attempt to guarantee that the exact same builds are repeatable.
In essence, Dagger has a caching model that is much more tailored to caching of command executions where the exact mechanisms of the commands are not very important. For example, the job of a step might be to just move a bunch of files to a different location, and this can be cached regardless of which version of the `mv` or `cp` tool is used (which would be encoded as part of the hash in Nix).

Summary
Mostly due to these caching model differences, Dagger and Nix - despite some superficial similarities - do not fill the same niche. Nix provides repeatable builds, Dagger provides portable graph execution instructions. Nix can represent things that Dagger can't (e.g. "output equivalence") and Dagger steps can do things that Nix builds can't (e.g. "trigger a deploy by calling an API").
In fact the two systems can work pretty well together. Nix can describe the build structure of everything from a small package build to the build graph of an entire monorepo, and Dagger can be the substrate on which both the pure Nix build steps (i.e. realisation of derivations) as well as any "impure" interfacing steps with the real-world (deploys, artifact publication, ...) can take place.
To make this interaction work well, integrations in either direction must be careful not to assume the same caching model on the other side. For example, a Dagger build graph generated from Nix must ensure that it can force the reexecution of steps that it knows have changed, and Dagger must be able to allow Nix to efficiently cache using its own model across its execution boundaries (e.g. a bind-mounted, mutable Nix-store).
Open for discussions, questions, whatever!