buildah 1.38.1 build "FROM oci-archive:" failed #5952

Open
henrywang opened this issue Jan 29, 2025 · 23 comments

@henrywang

quay.io/buildah/stable:v1.38.1 failed to build an image with FROM oci-archive:./out.ociarchive and got this error:

[2/2] STEP 1/3: FROM oci-archive:./out.ociarchive
Error: creating build container: creating temp directory: archive file not found: "/builds/redhat/rhel/bifrost/compose-bootc/out.ociarchive"

quay.io/buildah/stable:v1.38.0 does not have this issue.
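
The failing build has roughly this two-stage shape, reconstructed from the log above (the builder stage command is a placeholder; the actual reproducer appears later in this thread):

FROM registry.access.redhat.com/ubi9/ubi:latest AS builder
# Stage 1 writes an OCI archive into the build context through a
# writable bind mount; "some-build-tool" stands in for the real step.
RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
    some-build-tool --output /buildcontext/out.ociarchive

# Stage 2 then fails on buildah 1.38.1 when resolving this path:
FROM oci-archive:./out.ociarchive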

@vrothberg
Member

vrothberg commented Jan 29, 2025

Detected in the Image Mode pipelines which cannot use v1.38.1 for now.

https://gitlab.com/redhat/rhel/bifrost/compose-bootc/-/merge_requests/208/diffs

@vrothberg
Member

Another MR only bumping buildah fails as well.

@vrothberg
Member

@henrywang do we have an easy reproducer?

@cgwalters

cgwalters commented Jan 29, 2025

I think this is a regression from 25a3b38

This feature is really key as a way to generate chunked/reproducible containers. We've been using it for quite some time in rpm-ostree, and we have plans in the immediate future to promote it more for custom base images (ref https://gitlab.com/fedora/bootc/tracker/-/issues/32 and coreos/rpm-ostree#5221 specifically).

Way back in coreos/rpm-ostree#4688 (comment) there was a soft commitment to support this.

My understanding is that there are security implications (not mentioned in the commit message) that motivated 25a3b38.

Can we elaborate on those now that the commit is public (I don't see a corresponding PR with discussion)?

It seems to me that what we always wanted here is a way to write not to the build context, but just to some internal filesystem (a tmpfs or a tempdir bind mount) whose lifetime is scoped to the entire build, right?

@mheon
Member

mheon commented Jan 29, 2025

The relevant CVE is GHSA-5vpc-35f4-r8w6

A core part of this was the ability to write to the host filesystem from mounts, which is why it was disabled - so what you desire is a large part of a high-severity CVE. I don't see us turning this back on. Adding an option with the understanding that enabling it is explicitly insecure and exposes the host to potential escape by a malicious Dockerfile is a possibility, but that also feels very bad.
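
To make the problem shape concrete, here's a hedged sketch (image and paths are illustrative) of how a writable bind of the context with shared propagation lets a RUN instruction write to the host:

FROM registry.fedoraproject.org/fedora:latest
# The file created below lands in the host's build context directory,
# i.e. an untrusted Containerfile has written to the host filesystem.
RUN --mount=type=bind,rw=true,src=.,dst=/ctx,bind-propagation=shared \
    touch /ctx/written-from-inside-the-build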

@cgwalters

@henrywang do we have an easy reproducer?

Here's a trivial example of a "build" that actually just copies from a registry into a local oci: directory via skopeo. The point is that an "oci" directory is something any tool running in the container build can synthesize however it wants: without, e.g., being subject to floating-timestamp content injected by buildah itself, and more generally with total control over the layout of the target container image.

FROM registry.access.redhat.com/ubi9/ubi:latest  as builder
RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
  dnf -y install skopeo && skopeo copy docker://busybox oci:/buildcontext/out.oci

FROM oci:./out.oci
# Need to reference builder here to force ordering. But since we have to run
# something anyway, we might as well cleanup after ourselves.
RUN --mount=type=bind,from=builder,src=.,target=/var/tmp \
    --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
      rm /buildcontext/out.oci -rf
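
For completeness: as noted later in this thread, builds like this already have to run with elevated privileges, so the invocation looks something like the following (flags taken from the comment below about production builds; exact flags may vary by setup):

podman build --security-opt=label=disable --cap-add=all --device /dev/fuse -t localhost/out .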

@cgwalters

Adding an option with the understanding that enabling it is explicitly insecure and exposes the host to potential escape by a malicious Dockerfile is a possibility, but that also feels very bad.

Yes, I agree adding an option isn't what we want here. Let's instead dig more into the final paragraph of my previous comment:

It seems to me that what we always wanted here is a way to write not to the build context, but just to some internal filesystem (a tmpfs or a tempdir bind mount) whose lifetime is scoped to the entire build, right?

To flesh this out slightly...maybe we could use RUN --mount=type=cache for this since it's already defined to be a writable, persistent-across-run-invocations area. IOW we change to:

FROM registry.access.redhat.com/ubi9/ubi:latest  as builder
RUN --mount=type=cache,sharing=private,dst=/output \
  dnf -y install skopeo && skopeo copy docker://busybox oci:/output/out.oci

# Here I made up the syntax //<cachename>/ to reference a cache
FROM oci://output/out.oci

Note that if we take this step and formalize things more, it actually gets a lot cleaner: the builder can know the ordering here, and we can declare that the cached OCI directory is automatically consumed (since why would you leave it around?).

@nalind
Member

nalind commented Jan 29, 2025

Note that if we take this step and formalize things more, it actually gets a lot cleaner: the builder can know the ordering here, and we can declare that the cached OCI directory is automatically consumed (since why would you leave it around?).

Caches are not scoped to the lifetime of a given build.

@cgwalters

Caches are not scoped to the lifetime of a given build.

Yes, I know - its lifetime is longer, which isn't needed here, but it should still work. In my proposal you'd just end up with an empty directory/image as the cache; we could actually detect that case and just discard it.

Do you have any alternative proposals?

@cgwalters

Maybe another possible fix would be to expand the lifetime of the overlayfs created from a RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared to the entire build, instead of each RUN invocation?

I guess though a downside of this might be that it's different from what Docker is doing today, which could confuse people.

But at the cost of getting more hacky, we could do that scope expansion only if we detect there's a FROM oci: in the build?

@cgwalters

I should emphasize again, though: our production builds use this feature today. Having a working version of this (ideally compatible with the existing Containerfile, although that wouldn't be a requirement) relatively soon (e.g. < 1 month) would be quite helpful.

I think in the short term we can just hold off on applying the buildah update, but that's obviously not sustainable. Among other things, a big part of the entire point of this is that we can tell people the base image build is just "podman build" today, and after updating that no longer works.

@cgwalters

cgwalters commented Jan 29, 2025

Maybe another possible fix would be to expand the lifetime of the overlayfs created from a RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared to the entire build, instead of each RUN invocation?

It looks like bind-propagation is a buildah-specific extension not supported by Docker; so perhaps we could also key off that as a lifetime extension hint? Though when I test things out, a build works without it - so either it was never required or something else changed in buildah.

I guess the more I think about it, the more I'd say:

  • Short term, detect FROM oci: in the build and lifetime expand any RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared to the entire build (mounts identified by source+destination pair)
  • Medium term, try to come up with a design that's cleaner and can ideally be added to docker (and other builders) as well, for which I think I like the FROM oci://output/out.oci or FROM --mount=type=cache,target=/output oci:/output/out.oci or so.

@nalind
Member

nalind commented Jan 29, 2025

I would have (and have previously) suggested solving this at the pipeline level.

Anyway, the reason the FROM fails is that the OCI archive is no longer being written to the actual build context directory by the previous stage, so it's literally not where the FROM instruction says it is. Unless it's a remote build being done via podman, a -v flag at the command line can be used instead of a --mount=type=bind in the Containerfile to expose the build context directory to RUN instructions.
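
As a hedged sketch of that command-line approach (tag and paths are illustrative), the -v flag exposes the real context directory to every RUN instruction, so files written there land where the FROM instruction expects:

podman build -v "$PWD:/buildcontext" -t localhost/out .

The Containerfile's RUN instructions can then write to /buildcontext directly, without any --mount flag (SELinux setups may additionally need a :z volume option or label disabling).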

When we teach the --build-context command line flag about OCI layouts (and possibly archives), the value on the FROM line will be able to be replaced by the nickname given as part of the flag's argument. That'll move us a bit further away from custom behavior in Containerfiles, which is a direction I've generally come to discourage, because the target we're trying to be compatible with moves.
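
Purely as a hypothetical sketch of that future flow (neither the OCI-layout support nor this exact syntax exists yet; the nickname is made up):

podman build --build-context mybase=oci:/path/to/layout -t localhost/out .

with the Containerfile then referring to the nickname:

FROM mybase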

@nalind
Member

nalind commented Jan 29, 2025

It looks like bind-propagation is a buildah-specific extension not supported by Docker

It looks like that flag is there because the set of accepted flags for buildah run --mount was modeled after podman run's --mount flag, and the logic for parsing its key-value arguments list was reused when it was later implemented for RUN in Containerfiles.

@cgwalters

I would have (and have previously) suggested solving this at the pipeline level.

Can you elaborate on that? Previously in this comment you said:

and I'm not sure we want to sign up for that when the standard advice for jobs that can't be done with a Dockerfile is to use buildah's other commands.

In what we're doing here we aren't using buildah in any way, because we want fine-grained control over the bit-for-bit output of the OCI archive.

But your point more generally here I would phrase as "don't try to do it via Containerfile" - which definitely makes sense as a starting point, except I'd just reiterate that going from "podman build" to "run this script/Makefile" is a large leap.

Actually, today this trick doesn't work on e.g. macOS (or more generally with podman-machine); I didn't dig in, but I think it has to do with podman-remote always copying the build context. In theory, though, it could work in the podman-remote/podman-machine cases, which is quite important because there's a lot of infrastructure built on top of "podman build". If we have a custom script/pipeline, then especially in podman-remote cases we'd have to do some careful juggling to make sure our build works there.

Hmm...I guess in theory we could implement our build process as "spawn a container image with write access to containers-storage:" (or, to isolate it more, "spawn a container image with write access to an empty tempdir, which at the end of the build should contain an oci directory with one image").
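
A hedged sketch of that first shape (image name and script are placeholders; see the caveat below about how fragile sharing the storage can be):

podman run --rm --privileged \
  -v /var/lib/containers/storage:/var/lib/containers/storage \
  quay.io/example/base-image-builder:latest /usr/bin/build.sh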

But we just went through a large effort to switch our Konflux builds over to the "stock" buildah task they expose. Reworking everywhere we do a build, from the documented local developer entrypoint to Konflux would not be a small thing for us.

It's big enough that if I had to I'd rather spend that engineering time on implementing a more sustainable way to do this in buildah myself.

@cgwalters

When we teach the --build-context command line flag about OCI layouts (and possibly archives), the value on the FROM line will be able to be replaced by the nickname given as part of the flag's argument.

Hmm...but here we're generating an entire OCI layout inside a build stage; we can't provide it as a build context at the start of the build.

@nalind
Member

nalind commented Jan 29, 2025

I would have (and have previously) suggested solving this at the pipeline level.

Can you elaborate on that? Previously in this comment you said:

and I'm not sure we want to sign up for that when the standard advice for jobs that can't be done with a Dockerfile is to use buildah's other commands.

In what we're doing here we aren't using buildah in any way, because we want fine-grained control over the bit-for-bit output of the OCI archive.

I'm pretty sure that at that point I was referring to "FROM oci-archive:", and if you're feeling charitable, depending on it.

But your point more generally here I would phrase as "don't try to do it via Containerfile" - which definitely makes sense as a starting point, except I'd just reiterate that going from "podman build" to "run this script/Makefile" is a large leap.

Agreed, it's not as simple a story to tell.

Actually, today this trick doesn't work on e.g. macOS (or more generally with podman-machine); I didn't dig in, but I think it has to do with podman-remote always copying the build context. In theory, though, it could work in the podman-remote/podman-machine cases, which is quite important because there's a lot of infrastructure built on top of "podman build". If we have a custom script/pipeline, then especially in podman-remote cases we'd have to do some careful juggling to make sure our build works there.

Hmm...I guess in theory we could implement our build process as "spawn a container image with write access to containers-storage:" (or, to isolate it more, "spawn a container image with write access to an empty tempdir, which at the end of the build should contain an oci directory with one image").

I have personally tripped over the overlay driver's mount_program option having to be specified, or not specified, more often than I'd like, to the point that I tend to want to avoid sharing the storage space. That's before you start worrying about different storage drivers.

The -t flag is supposed to accept any valid containers-transports(5) value, if that's of help.
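
If it does, the sketch below (untested, paths illustrative) would write the build result straight to an OCI layout or archive without a separate push/copy step:

buildah build -t oci:/tmp/out-layout .
# or, for an archive:
buildah build -t oci-archive:/tmp/out.ociarchive .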

@cgwalters

cgwalters commented Jan 29, 2025

Can you indicate your level of support for the "short+medium term" options listed in this comment, ranging from say "yes, I'd implement it" to "would take a patch" to "no"? Most especially the short-term one of lifetime-extending only if we detect FROM oci:.

Actually let me just repeat it here, and expand/elaborate:

Option lifetime-extend:

  • Detect FROM oci: in the container build and lifetime expand the overlayfs created by any RUN --mount=type=bind,rw=true,src=<local path> to the entire build
  • Though now that I think about this more: since I'm pretty sure the FROM oci: part was actually relying on picking up the OCI archive from the host/default mount namespace by default, we may need to add some special casing here to say that the file path used for that resolves against the last RUN --mount=type=bind,rw=true,src=. (note the special casing of .?)

Advantages:

  • Compatibility with existing Containerfiles we use today

Option FROM oci://stage/path

In this proposal, the new syntax //stage/ as part of FROM oci://stage/path acts the same as the from= option of RUN --mount=type=cache.

Advantages:

  • Much clearer and less hacky in general IMO and I think we could try to get this into Docker too

Disadvantages:

  • Not compatible with the Containerfiles we use today; it would be a bit of a pain to say "you need to update podman/buildah", but we could deal with it

@nalind
Member

nalind commented Jan 29, 2025

Option lifetime-extend:

  • Detect FROM oci: in the container build and lifetime expand the overlayfs created by any RUN --mount=type=bind,rw=true,src=<local path> to the entire build

  • Though now that I think about this more: since I'm pretty sure the FROM oci: part was actually relying on picking up the OCI archive from the host/default mount namespace by default, we may need to add some special casing here to say that the file path used for that resolves against the last RUN --mount=type=bind,rw=true,src=. (note the special casing of .?)

Advantages:

  • Compatibility with existing Containerfiles we use today

Having just done the changes to make this an overlay, I can say that the number of places in the source code that this impacts would be high. I also don't relish creating more differences in behavior that a user has to remember and keep track of, so I would lean toward not doing this.

Option FROM oci://stage/path

In this proposal, the new syntax //stage/ as part of FROM oci://stage/path acts the same as the from= option of RUN --mount=type=cache.

Advantages:

  • Much clearer and less hacky in general IMO and I think we could try to get this into Docker too

Disadvantages:

  • Not compatible with the Containerfiles we use today; it would be a bit of a pain to say "you need to update podman/buildah", but we could deal with it

oci: already has a meaning in this context, and the syntax for it is incompatible with this, so I don't see us doing that.

@cgwalters

oci: already has a meaning in this context, and the syntax for it is incompatible with this, so I don't see us doing that.

How would the syntax be incompatible? The leading :// seems to me to be sufficiently unlikely to occur for existing users. But if we wanted to avoid all potential for ambiguity, how about FROM --mount=type=cache,target=/output oci:/output/out.oci?
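
Spelled out as a complete Containerfile for clarity (hypothetical, made-up syntax combining the earlier type=cache example with a FROM that can see the cache mount):

FROM registry.access.redhat.com/ubi9/ubi:latest AS builder
RUN --mount=type=cache,sharing=private,dst=/output \
  dnf -y install skopeo && skopeo copy docker://busybox oci:/output/out.oci

# Hypothetical: FROM grows a --mount flag so the cache mount is
# visible when the oci: path is resolved
FROM --mount=type=cache,target=/output oci:/output/out.oci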

@cgwalters

Agreed, it's not as simple a story to tell.

It's more than a story, it's about the ecosystem. While there certainly are multiple build systems to build container images, Dockerfile is very, very popular as I think everyone here knows. And that means that actual infrastructure has been built on top of it.

At Red Hat, for example, there's been an investment with Konflux in things like "hermetic" builds, where builds can't fetch arbitrary things from the network, only a restricted set of inputs. There are also policies around which container images can be used in tasks.

Both of these things are today applied on top of Dockerfile/Containerfile (though there's also lockfiles for rpms and other source content needed). As soon as "build this project" becomes more freeform than that, it needs custom logic to match these types of policies.

Another related thing: there are some parts of Dockerfile that function perfectly fine and don't need reinventing, such as LABEL.

The fact that we were able to fit custom container images into the "shape" of a Dockerfile, continue to reuse existing pieces like LABEL, and still produce reproducible tarballs in a more intelligent way than Dockerfile allows has been extremely useful.

@nalind
Member

nalind commented Jan 30, 2025

oci: already has a meaning in this context, and the syntax for it is incompatible with this, so I don't see us doing that.

How would the syntax be incompatible? The leading :// seems to me to be sufficiently unlikely to occur for existing users.

There is no leading :// for oci:, as there isn't one for oci-archive:. The portion after that ":" is a filesystem path name, and that's a problem.

But if we wanted to avoid all potential for ambiguity, how about FROM --mount=type=cache,target=/output oci:/output/out.oci?

The ideal for me here is an approach that doesn't depend on implementing custom syntax, and that doesn't break when we fix compliance bugs.

@cgwalters

doesn't depend on implementing custom syntax,

Right, it would clearly make sense to try to get this into Docker at some point, though last I looked, the ecosystem there is around buildx, which is what a whole lot of people who want to build containers in a more intelligent way are using. But maybe they'd be amenable to the FROM --mount=type=cache or whatever. Anyways, that's not a short term thing.


One thing I realized in a conversation just now with the team: because of this writable mount to the build context and some other reasons, we already require the build to be run with --security-opt=label=disable --cap-add=all --device /dev/fuse...which is basically --privileged. So the CVE isolation doesn't apply to this current use case anyway.

So here's yet another, hopefully more targeted short term proposal:

Change buildah to detect when the input build is privileged in this way (CAP_SYS_ADMIN in the outer userns, which I'm sure could be leveraged into a full breakout); if it's not privileged, there's no behavior change from current git.

  • extend the lifetime of the most recent host bind mount that references . to the end of the build
  • change FROM oci:/FROM oci-archive: so that, if the referenced image is not found in the host mount namespace, it looks in that most recent overlayfs

or so
