
Add rpm-ostree compose build-chunked-oci #5221

Closed
cgwalters opened this issue Jan 13, 2025 · 18 comments · Fixed by #5222

Comments

@cgwalters
Member

cgwalters commented Jan 13, 2025

xref https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2295033343

Basically

RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
    rpm-ostree compose build-oci [--rootfs=/] /buildcontext/out.oci

We have the two core pieces of this in `ostree commit` plus `rpm-ostree compose container-encapsulate`. However, doing this from the mounted rootfs gets ugly: we obviously don't want to traverse into mount points like /proc, and ostree never learned to skip mount points since we always told people to use a separate rootfs.

Worse, things like /tmp (and /var/tmp) won't be mount points, but we definitely don't want to traverse into those either.

Hmm. A core thing driving us to the "nested container" was precisely so we could operate on a separate rootfs. But it'd clearly be really useful if we didn't need nested container perms.

I think these heuristics would be OK to start (rough sketch after the list):

  • Walk the toplevel /
  • Skip mount points (perhaps consult the rpmdb for the canonical permissions)
  • Skip /tmp
  • Also, error if /boot is nonempty (xref bootc container lints)
  • Walk everything else except /var/tmp? Obviously we could just try to avoid using /var/tmp ourselves, but that seems...messy to commit to.
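
A rough sketch of those heuristics in shell, purely illustrative (the real logic would live in rpm-ostree itself; comparing device numbers is just one way to detect mount points):

set -euo pipefail
# Error out if /boot is nonempty (xref bootc container lints)
if [ -n "$(ls -A /boot 2>/dev/null)" ]; then
    echo "error: /boot is nonempty" >&2
    exit 1
fi
root_dev=$(stat -c %d /)
for ent in /*; do
    # Never traverse /tmp; /var/tmp would be filtered during the recursive walk
    [ "$ent" = /tmp ] && continue
    # Skip mount points: a different device number than / means a mount boundary
    if [ -d "$ent" ] && [ "$(stat -c %d "$ent")" != "$root_dev" ]; then
        continue
    fi
    echo "would include: $ent"
done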

Hmm maybe it'd add some flexibility if we supported something like setfattr -n user.rpmostree.skip -v 1 so that things could be omitted without actually being deleted at that time. Then in fact one could add that xattr to even /usr/bin/rpm-ostree to avoid shipping it. And while we're here, probably honor CACHEDIR.TAG.
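
For illustration only, opting content out without deleting it could look like this; the user.rpmostree.skip xattr name is just the proposal above, not an implemented interface:

# Hypothetical: mark a file so the commit step would omit it
setfattr -n user.rpmostree.skip -v 1 /usr/bin/rpm-ostree
# CACHEDIR.TAG convention: a directory carrying this tag file would be skipped
mkdir -p /var/cache/example
printf 'Signature: 8a477f597d28d172789f06886806bc55\n' > /var/cache/example/CACHEDIR.TAG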


EDIT: Actually of course it's possible to use COPY --from=rootfs / /rootfs or even better

RUN --mount=from=rootfs,target=/rootfs --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
    rpm-ostree compose build-oci --rootfs=/rootfs /buildcontext/out.oci

to access the cleanly split-out root from the intermediate-stage image. So let's just aim for that.


Dependencies:

[ ] containers/bootc#1032
[ ] #5225

cgwalters added a commit to cgwalters/bootc that referenced this issue Jan 14, 2025
A lot of technical debt here. A long time ago I added this hacky bit
to inject var/tmp into the container stream even if it wasn't in the
ostree commit.

Today things shipped by `rpm-ostree compose image` like FCOS
don't have `var/tmp` in the commit.

But then more recently we started shipping `/var/tmp`
in base images directly.

Now I'm working on coreos/rpm-ostree#5221
where we're rechunking from a rootfs that does have var/tmp
and that ends up in the ostree commit.

The path comparison here was wrong because the tar stream we
generate has paths starting with `./`, and a literal comparison
doesn't match: `./var/tmp` != `var/tmp`.

Add a canonicalization helper and use it for this.

Signed-off-by: Colin Walters <[email protected]>
cgwalters added a commit to cgwalters/ostree-rs-ext that referenced this issue Jan 14, 2025
A lot of technical debt here. A long time ago I added this hacky bit to inject var/tmp into the container stream even if it wasn't in the ostree commit.

Today things shipped by `rpm-ostree compose image` like FCOS don't have `var/tmp` in the commit.

But then more recently we started shipping `/var/tmp` in base images directly.

Now I'm working on coreos/rpm-ostree#5221 where we're rechunking from a rootfs that does have var/tmp and that ends up in the ostree commit.

The path comparison here was wrong because the tar stream we generate has paths starting with `./`, and a literal comparison doesn't match: `./var/tmp` != `var/tmp`.

Add a canonicalization helper and use it for this.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Jan 14, 2025
@antheas

antheas commented Jan 15, 2025

Would it be possible to build a variant in ostree-rs-ext directly and allow using ostree-ext-cli instead as an option?

It would remove the fedora dependency for one, and for another, the chunking algorithm in rpm-ostree leaves something to be desired.

Also, you need to make sure you allow the user to force hardlinks, or bail out, when creating the ostree commit in the image, for space considerations (see https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2297736215). This way, the only additional copy created is the .oci artifact.

Then, if the user wants to avoid the third and fourth copies, they can load the resulting oci artifact and do the cleanup outside the container.

@antheas

antheas commented Jan 15, 2025

I am also looking at simplifying the rechunk action and this would be a great opportunity to do it. It would solve the permissions issues for one.

Also note that this step is VERY performance sensitive, as it runs after every build and can take as long as the build itself.

For example, all the following lines do is add 5-6 minutes to the build. Also, when you load an oci image produced by ostree-rs-ext into podman, it grows by around half a gig in our use case due to inferior compression:

FROM oci:./out.oci
# Need to reference builder here to force ordering. But since we have to run
# something anyway, we might as well cleanup after ourselves.
RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
      rm /buildcontext/out.oci -rf

It is much cheaper to use skopeo to upload instead.
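
For example, something like this, assuming out.oci is an OCI layout directory (use the oci-archive: transport for a tar archive); the registry reference is a placeholder:

skopeo copy oci:./out.oci docker://registry.example.com/myorg/myimage:latest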

@cgwalters
Member Author

Hi @antheas thanks for your comments! rechunk is very cool. However a structural issue is that I need to ship something that "my" team maintains, lifecycled/versioned with the rest of our tools. Adding a new container image would be a nontrivial lift at this time, and the core part I care about is already mostly running ostree/rpm-ostree code anyways. We could ship the code in an existing container, but...

A lot more on related bits below:

and for second, the chunking algorithm in rpm-ostree leaves something to be desired.

Yes, you're right: for larger images (usually desktops) it is certain we'll want something better and more configurable, as you've implemented in rechunk. My PoV here is to start by letting users not regress from the tooling we use to build fedora-bootc derivatives today.

Bigger picture future

Would it be possible to build a variant in ostree-rs-ext directly and allow using ostree-ext-cli instead as an option? It would remove the fedora dependency for one,

On the bootc side, we already support ingesting images without ostree-in-container at all. The tooling here will default to generating that in order to maintain backwards compatibility. But, as we think towards next steps, I think where we want to go is maintaining a "generic" rechunker that could also apply to docker/podman app images (or really even flatpak-OCI). I don't think it'd be really hard to take the code we have and slightly generalize it, dropping most of the ostree stuff (although we will want to support e.g. selinux labeling and other things at build time).

I'd love to collaborate on that in say a repo in github.com/containers - although we'd need to work through some details.

Back to details: Workflow

Today rechunk is primarily defined to run as a GitHub Action. While clean support for GHA is important, for what I ship it must be straightforward to use in as many container build environments as possible. This is why the design (in https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2295033343) starts by using a buildah/podman feature so that the build can just be a Containerfile, meaning anything that knows how to consume a Containerfile can operate on it. Now of course this is a podman-ecosystem-specific feature (though nothing would stop us from arguing to add it to docker/moby).

In the end, of course, with what we have in this code you aren't at all required to use that Containerfile flow; as you say, one can also write the ociarchive and then upload it from there instead of injecting it into containers-storage. I'll look at documenting that flow as well.

Size issues

Today we end up with 4 uncompressed copies at the end of the Containerfile flow (intermediate image, ostree repo, oci archive, containers-storage). I think with the ostree flow I can teach things not to create a temporary ostree repo and just compute the checksums in memory. That'd drop out one whole copy. We could probably teach podman to have something like:

FROM oci-archive+consume:./out.oci where it'd delete the layers from the oci archive when converting back to containers-storage. That'd drop out another copy.

I think the only way to then fix the "intermediate image" copy would be some explicit way to flush that out like

RUN rpm-ostree experimental compose build-chunked-oci --ostree --rootfs=/rootfs --output /buildcontext/out.oci
FLUSH rootfs

(Making up a FLUSH instruction to explicitly say "this stage is no longer needed" - or maybe with podman build --no-cache it should eagerly flush by default for things like this)

@antheas

antheas commented Jan 15, 2025

Hi @antheas thanks for your comments! rechunk is very cool. However a structural issue is that I need to ship something that "my" team maintains, lifecycled/versioned with the rest of our tools. Adding a new container image would be a nontrivial lift at this time, and the core part I care about is already mostly running ostree/rpm-ostree code anyways. We could ship the code in an existing container, but...

Of course. I never said rechunk is enterprise ready. You should re-use your IP with rpm-ostree to make something that works today. I was just noting some roadblocks I faced and you will inevitably face as well.

...in the process perhaps nudging some of those PRs towards ostree-rs-ext so I can also use them :)

The two major roadblocks we have faced are storage and speed, and I don't think those will be unique to GHA. Bazzite builds in 10 minutes; rechunk takes 7 minutes to run and 3 minutes to upload, and I think those scale linearly with each other. So with rechunk we are at 2x.

I think the bandwidth savings from rechunking make it worth it just from a deployment time perspective at 2x. However, if you're not careful and instead of 2x you start talking about 3x or 4x... It gets a bit harder to justify. Same with all your builders suddenly requiring 4x the storage.

There is also a third issue. Since you are not using rpm-ostree to unwrap the image as an ostree commit, you open a can of worms that has been haunting me. It's the 1_prune.sh file. Essentially, there is a drift in deployment behavior between the original and the rechunked one. And depending on the serialization of the previous layers, you might lose some directory/file permissions/owners etc.

Hopefully, your ostree-rs-ext PRs from 1-2 months ago helped with that, and now OCI-serialized commits have correct permissions.

There are a few things I really want to do in rechunk next time I clean it up in a few months, while keeping or lowering the 2x overhead. I want to hopefully remove the need for the 1_prune.sh file; the SSSD changes have been hell, really: not only does F40 need different caps than F41, but they changed the caps twice, causing two breakages. I also really don't like the fact that I do a dirty podman mount of the original image and then mount that externally into the new image. It is sketchy and breaks rootless podman.

@cgwalters
Member Author

@vrothberg and I were chatting and maybe one thing that could help here would be for us to ship the last half of the reference code as /usr/share/doc/rpm-ostree/Containerfile.custom. Then the UX could look like this:

FROM quay.io/fedora/fedora-bootc:41 as rootfs
RUN <<EORUN
set -xeuo pipefail
# Needed by some scripts
mkdir -p /var/lib/rpm-state
# Remove various things with dnf
dnf -y remove python3 linux-firmware rpm-ostree
# And drop out dnf (TODO: teach dnf to be able to do this)
rpm -e dnf5 libdnf5 libdnf5-cli libsolv librepo libmodulemd
rm /var/lib/rpm-state -rf
EORUN

#include /usr/share/doc/rpm-ostree/Containerfile.custom

or whatever; in practice they probably won't have /usr/share/doc/rpm-ostree, but at least they could copy-paste the second half from the docs and rarely need to change it. The "api" is that they have to name their first stage rootfs, and that's it.
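
For illustration, the shared second half could look roughly like this, assembled from the snippets earlier in the thread; the builder base image and exact flags here are assumptions, not a shipped interface:

# Hypothetical "second half": consumes the stage named rootfs above
FROM quay.io/fedora/fedora-bootc:41 as builder
RUN --mount=from=rootfs,target=/rootfs --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
    rpm-ostree experimental compose build-chunked-oci --ostree --rootfs=/rootfs --output /buildcontext/out.oci

FROM oci:./out.oci
# Clean up the temporary OCI artifact from the build context
RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
    rm /buildcontext/out.oci -rf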

@antheas

antheas commented Jan 15, 2025

I thought about it some more.

If the fix for ostree-rs-ext dropping special attrs, permissions, etc. is deployed, and we teach ostree to complete the steps that are left over in 1_prune.sh (such as the polkit user and sysroot), I can remove the mutations I currently do and use this pattern myself, which will be a lot nicer. It is something you will have to do anyway to deploy this.

Then, if it's also an in-memory model, we can drop the extra copy and maybe it will be faster.

As for the final copy re: FROM oci-archive+consume:, I think it is something that storage-conscious consumers skip by uploading the oci directly, so I would not worry about it for now.

If all those are in place, I also think I will go this route and drop/simplify the GitHub action.

Sidenote: the rechunk algorithm is not something that would take more than a couple of days to rewrite in Rust, but I would still prefer you leave some leeway or an avenue for prototyping/using something better.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Jan 15, 2025
@cgwalters
Member Author

There is also a third issue. Since you are not using rpm-ostree to unwrap the image as an ostree commit, you open a can of worms that has been haunting me. It's the 1_prune.sh file. Essentially, there is a drift in deployment behavior between the original and the rechunked one. And depending on the serialization of the previous layers, you might lose some directory/file permissions/owners etc.

Thank you. Indeed, skimming that I see this approach is affected by several things. Taking just one of them, the security.capability xattr:
https://github.com/hhd-dev/rechunk/blob/0aee79e67dc9354779924ccc0fcdd7ef0b874d0d/1_prune.sh#L190

Yes, the problem I think is that we're not synthesizing those xattrs into the tar stream even for the existing base image. I will look at doing this, though it opens up a bit more of a can of worms around security.selinux.

@jlebon
Member

jlebon commented Jan 15, 2025

There is also a third issue. Since you are not using rpm-ostree to unwrap the image as an ostree commit, you open a can of worms that has been haunting me. It's the 1_prune.sh file. Essentially, there is a drift in deployment behavior between the original and the rechunked one. And depending on the serialization of the previous layers, you might lose some directory/file permissions/owners etc.

Wow, yes tricky. There's clearly a lot to dig through there and unwind. I think a target should probably be to have a CI test which compares a base image built by rpm-ostree compose image and one which extends that base image with just the rechunking and verifies that the diff is minimal/well understood.

@cgwalters
Member Author

Taking just one of them, the security.capability xattr:

containers/bootc#1032

@cgwalters
Member Author

Also, when you load an oci image produced by ostree-rs-ext into podman, it grows by around half a gig in our use case due to inferior compression

This is interesting. It's quite possible that the gzip level (or the actual vendored version of gzip) differs from what we happen to use in ocidir. A big-picture issue there, though, is that because gzip is pretty stable, we haven't changed layer compression much over time; the moment we rev the compression, it can blow out the cache entirely. cc docker-library/official-images#17720 (comment)

@vrothberg
Collaborator

@vrothberg and I were chatting and maybe one thing that could help here would be for us to ship the last half of the reference code as /usr/share/doc/rpm-ostree/Containerfile.custom

The rationale behind it is as follows:

  • The workflow already depends on Podman/Buildah specific features.
  • Less copy-paste for users.
  • The Containerfile/Dockerfile must end with .in for the #include directive to work (see the buildah-build docs); many moons ago I added support for .in files, on which we run cpp (sketch below).
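
A minimal sketch of that flow; the include path is the hypothetical location discussed above, and the base image and package removal are placeholders:

Containerfile.in:

FROM quay.io/fedora/fedora-bootc:41 as rootfs
RUN dnf -y remove rpm-ostree && dnf clean all
#include "/usr/share/doc/rpm-ostree/Containerfile.custom"

built with, for example:

buildah bud -f Containerfile.in -t localhost/custom .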

@cgwalters
Member Author

Yeah this is interesting; comparing the container filesystem vs the rpmdb, but ignoring missing files and pure timestamp changes:

$ podman run --rm -ti quay.io/fedora/fedora-coreos:stable bash -c 'rpm -qaV  | grep -Fv ".......T. " |grep -v missing'
S.5....T.    /usr/lib/os-release
.M.......    /boot
....L....    /home
....L....    /media
....L....    /mnt
....L....    /opt
....L....    /root
....L....    /srv
.M.......    /usr/bin
.M.......    /usr/lib
.M.......    /usr/lib/games
.M.......    /usr/lib64
.M.......    /usr/lib64/X11
.M.......    /usr/lib64/bpf
.M.......    /usr/lib64/games
.M.......    /usr/lib64/pm-utils
.M.......    /usr/lib64/pm-utils/module.d
.M.......    /usr/lib64/pm-utils/power.d
.M.......    /usr/lib64/pm-utils/sleep.d
....L....    /usr/local
.M.......    /usr/sbin
.M.......    /usr/share/empty
.M.......    /etc/grub.d
S.5....T.  c /etc/selinux/semanage.conf
S.5....T.  c /etc/default/useradd
.......TP    /usr/bin/newgidmap
.......TP    /usr/bin/newuidmap
.M.......  n /etc/nftables
.M.......  n /etc/nftables/osf
.M....GT.    /usr/libexec/utempter/utempter
.M.......    /etc/audit
.M.......    /etc/audit/rules.d
.M....GT.    /usr/bin/write
.M.......    /etc/ssh/sshd_config.d
....L....    /usr/lib/systemd/system/default.target
.M.......    /etc/ssh/sshd_config.d
.......TP    /usr/bin/arping
.......TP    /usr/bin/clockdiff
......G..    /etc/polkit-1/rules.d
.....UG..    /etc/sssd
.....UG..    /etc/sssd/conf.d
.....UG..    /etc/sssd/pki
......GTP    /usr/libexec/sssd/sssd_pam
......GTP    /usr/libexec/sssd/krb5_child
......GTP    /usr/libexec/sssd/ldap_child
S.5....T.    /etc/selinux/targeted/contexts/files/file_contexts.subs_dist
.M.......    /usr/lib/containers/storage/overlay-images
.M.......    /usr/lib/containers/storage/overlay-layers
S.5....T.    /etc/rpm-ostreed.conf
......GTP    /usr/libexec/sssd/selinux_child
.M.......    /etc/lvm/devices
.M.......    /usr/libexec/initscripts/legacy-actions/auditd
.M.......    /etc/sudoers.d
.M.....T.    /usr/bin/sudoreplay
.M.......    /usr/share/mdadm

The things that are just TP will be fixed by containers/bootc#1032

That said, there is a workaround we can do here which is to pull that data from the source ostree commit (or from the rpmdb). But it shouldn't be too hard to get our base images rebuilt.
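
For example, the expected capabilities can be read back out of the rpmdb with a query like this (the package names are just examples from the list above):

rpm -q --queryformat '[%{FILENAMES} %{FILECAPS}\n]' iputils shadow-utils | grep -v ' $'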

The next thing I want to dig into here is the ones that are .M....... like /usr/share/empty - that's just strange.

@cgwalters
Member Author

The next thing I want to dig into here is the ones that are .M....... like /usr/share/empty - that's just strange.

It's this:

// Add the "user writable" permission bit for a directory.

I'd long since forgotten about that one... not sure I'd say it's a real problem, but I guess nowadays we could probably revert it... or honestly, you know what the real fix is: composefs, where we maintain the permissions out of band.

@jlebon
Member

jlebon commented Jan 17, 2025

EDIT: Actually of course it's possible to use COPY --from=rootfs / /rootfs or even better

Is the main thing we're trying to avoid with this having to filter out mountpoints? The UX does suffer a bit as a result. We definitely should support the non-/ case as well, though it seems beneficial to also support --rootfs=/ in the common case where the tool is part of the built image. That reduces it to only one extra stage, for the FROM oci: trick.

I think where we want to go is maintaining a "generic" rechunker that could also apply to docker/podman app images

Yes, definitely. It feels like that work could happen in buildah (and be exposed via podman as well)? That would also be in line with the above, given that podman is currently guaranteed to be in your package set.

@cgwalters
Member Author

Is the main thing we're trying to avoid with this to have to filter out mountpoints?

That's part of it, but it's not hard to avoid crossing mountpoints. But the real thing we want to support is shipping images that drop out rpm-ostree (or in general, tools that we may use at build time to generate the OCI). The demonstration flow I have there shows that working. It'd likely be possible of course to require rpm-ostree (or whatever build tool) to have no external dependencies other than the most minimal base image (probably say just glibc) and then one could cp /usr/bin/rpm-ostree /run/rpm-ostree && dnf -y remove rpm-ostree && /run/rpm-ostree build-chunked-oci ... but...yeah a bit ugly.
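
Spelled out as a build step, that ugly workaround might look something like this (purely illustrative; it assumes the binary only needs glibc, and the subcommand flags are copied from the sketches above):

# Hypothetical: stash the binary, drop the package, then build the chunked OCI from /
RUN --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared \
    cp /usr/bin/rpm-ostree /run/rpm-ostree && \
    dnf -y remove rpm-ostree && \
    /run/rpm-ostree experimental compose build-chunked-oci --ostree --rootfs=/ --output /buildcontext/out.oci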

But yes...it probably wouldn't be too hard to make --rootfs=/ supported for the people who mostly want to e.g. swap kernels but still optimize the image.

cgwalters added a commit to cgwalters/bootc that referenced this issue Jan 17, 2025
Internal only change, but prep for future work; I want to track
the version lints were introduced in, support clearly differentiating
between warnings and fatal errors, etc.

Specifically motivated by custom base image work
coreos/rpm-ostree#5221
but also obviously useful in general.

Signed-off-by: Colin Walters <[email protected]>
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Jan 20, 2025
@p5

p5 commented Jan 21, 2025

Could I ask where you envision this being used?
Is this solely for creating base images via composes?

Or would it also be useful for custom images?

  • "I have a few very large layers at the end of my derived image" ✔ - I think so?
  • "I want more cache hits during updates of my derived image" ❌ - Don't think so?

@cgwalters
Member Author

Is this solely for creating base images via composes?

No.

Or would it also be useful for custom images?

Yes. It's for creating derived images that "feel like" base images. A lot more in
https://gitlab.com/fedora/bootc/tracker/-/issues/32

Although, as the docs say, nothing technically requires you to run through a container derivation flow to create the rootfs, so you can certainly use this to make a base image however you want while still picking up some of the logic we have. Also, with bootc (especially after containers/bootc#887), nothing requires you to use this tool at all!

"I have a few very large layers at the end of my derived image"
"I want more cache hits during updates of my derived image"

Yes, these are two of the issues that might drive one to use this approach.

@p5

p5 commented Jan 28, 2025

I've been trying to play around with this recently, though I keep hitting errors where the build-chunked-oci command complains about unsupported regular files being present.

Generating commit...
/proc/self/fd/9/etc/selinux/targeted/contexts/files/file_contexts.bin: Old compiled fcontext format, skipping
/proc/self/fd/9/etc/selinux/targeted/contexts/files/file_contexts.homedirs.bin: Old compiled fcontext format, skipping
error: Unsupported regular file error.log at toplevel

Perhaps there are some validation steps that could be done beforehand (as part of bootc container lint or the build-chunked-oci command itself) that alert the user to the full path of these unsupported files. At the moment, the error isn't the most useful, since I would have expected an /error.log based on the message, though it's not present.

https://github.com/rsturla/eternal-images/actions/runs/13013519960/job/36296857271?pr=349#step:7:6955
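
As a stopgap, one can list stray regular files at the toplevel of the rootfs being committed before invoking the command (here assuming the --rootfs=/rootfs layout from the examples above); a better fix would be for the tool or bootc container lint to print the full path itself:

find /rootfs -maxdepth 1 -type f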
