Clean up after unexpectedly terminated build #25102

Open
wants to merge 1 commit into
base: main

@Honny1 Honny1 commented Jan 23, 2025

The podman system prune command can remove build containers that were created during the build but were not removed because the build terminated unexpectedly.

By default, build containers are not removed to prevent interference with builds in progress. Use the --build flag when running the command to remove build containers as well.

Reproducer:

  • Containerfile:
FROM ubi8/ubi
RUN truncate -s 10G out
RUN echo "Hi"
RUN sleep infinity
  • Test script run.sh:
#!/usr/bin/env bash

podman build -f Containerfile -t podmanleaker &
sleep 60 && kill -9 $! 
  • Measure the size of the current images, containers, etc. before the build:
podman unshare du -sh ~/.local/share/containers/
  • Run the test script (note: requires about 32 GB of disk space):
./run.sh
  • Measure the size again after the build has been terminated:
podman unshare du -sh ~/.local/share/containers/
  • Clean up the leftovers from the build:
podman system prune --build -f
  • Measure the size once more after the cleanup:
podman unshare du -sh ~/.local/share/containers/

The size should be the same as the first measurement but could be different if a base image is present in the system.

Fixes: #14523
Fixes: #23683
Fixes: https://issues.redhat.com/browse/RHEL-62009

Does this PR introduce a user-facing change?

The `podman system prune` command now supports removing build containers with the new `--build` option. 

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress and release-note labels Jan 23, 2025

openshift-ci bot commented Jan 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign giuseppe for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Honny1 Honny1 force-pushed the prune branch 2 times, most recently from 596c6bb to c607887 Compare January 23, 2025 10:34
@github-actions github-actions bot added the kind/api-change label Jan 23, 2025
if err != nil {
	return stageContainersPruneReports, err
}
if _, err := os.Stat(filepath.Join(path, "buildah.json")); errors.Is(err, fs.ErrNotExist) {

use fileutils.Exists()

@nalind Is there another (better?) way to check if a storage container is from buildah?
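
The existence check under discussion can be sketched with the standard library alone. This is only an illustration: `hasBuildahMetadata` is a hypothetical helper name, the idea that a `buildah.json` file marks a container directory as created by a build is taken from the snippet under review, and the actual patch would presumably call `fileutils.Exists()` as suggested above rather than `os.Stat` directly.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// hasBuildahMetadata reports whether dir contains a buildah.json file.
// Hypothetical helper for illustration; it is not part of Podman.
func hasBuildahMetadata(dir string) (bool, error) {
	_, err := os.Stat(filepath.Join(dir, "buildah.json"))
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		// A missing file is an expected answer, not a failure.
		return false, nil
	}
	// Anything else (for example a permission error) goes back to the caller.
	return false, err
}

func main() {
	dir, err := os.MkdirTemp("", "stage-container-")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	before, _ := hasBuildahMetadata(dir)
	if err := os.WriteFile(filepath.Join(dir, "buildah.json"), []byte("{}"), 0o600); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	after, _ := hasBuildahMetadata(dir)
	fmt.Println("before:", before, "after:", after)
}
```

Separating "file is missing" from "stat failed" keeps a permission problem from being silently treated as "not a build container".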

size, err := r.store.ContainerSize(container.ID)
if err != nil {
	report.Err = err
	logrus.Warnf("Failed to get size of build stage container %s: %v", container.ID, err)

Do not both report the error to the caller and print a warning.
Reporting the errors back to the caller is good enough. The caller can then log once.
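
A minimal sketch of that pattern, assuming a simplified report type: `pruneReport`, `pruneAll`, `collectErrors`, and `errInUse` are hypothetical names for illustration and do not mirror the actual libpod types. The pruning function only records errors, and the caller decides how, and how often, to log them.

```go
package main

import (
	"errors"
	"fmt"
)

// errInUse is a sample failure used by the demo below (hypothetical).
var errInUse = errors.New("container is in use")

// pruneReport is a simplified, hypothetical stand-in for a per-container report.
type pruneReport struct {
	ID   string
	Size uint64
	Err  error
}

// pruneAll stores any failure in the report and never logs by itself.
func pruneAll(ids []string, remove func(id string) (uint64, error)) []pruneReport {
	reports := make([]pruneReport, 0, len(ids))
	for _, id := range ids {
		size, err := remove(id)
		reports = append(reports, pruneReport{ID: id, Size: size, Err: err})
	}
	return reports
}

// collectErrors gathers the per-container errors into one error, or nil if there are none.
func collectErrors(reports []pruneReport) error {
	var errs []error
	for _, r := range reports {
		if r.Err != nil {
			errs = append(errs, fmt.Errorf("container %s: %w", r.ID, r.Err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	remove := func(id string) (uint64, error) {
		if id == "busy" {
			return 0, errInUse
		}
		return 1024, nil
	}
	reports := pruneAll([]string{"stale", "busy"}, remove)
	// The caller logs once, in one place.
	if err := collectErrors(reports); err != nil {
		fmt.Println("prune finished with errors:", err)
	}
}
```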

Comment on lines 1299 to 1300
report.Err = err
logrus.Warnf("Failed to remove build stage container %s: %v", container.ID, err)

same here


reclaimedSpace := (uint64)(0)
found := true
for found {

Why is this a loop? Running once should be enough.

Comment on lines 79 to 103
stageContainersPruneReports, err := ic.Libpod.PruneStageContainers()
if err != nil {
	return nil, err
}
if len(stageContainersPruneReports) > 0 {
	found = true
}
reclaimedSpace += reports.PruneReportsSize(stageContainersPruneReports)
systemPruneReport.ContainerPruneReports = append(systemPruneReport.ContainerPruneReports, stageContainersPruneReports...)

// Prune Images
imagePruneOptions := entities.ImagePruneOptions{
	External:   true,
	BuildCache: true,
}
imageEngine := ImageEngine{Libpod: ic.Libpod}
imagePruneReports, err := imageEngine.Prune(ctx, imagePruneOptions)
if err != nil {
	return nil, err
}
if len(imagePruneReports) > 0 {
	found = true
}
reclaimedSpace += reports.PruneReportsSize(imagePruneReports)
systemPruneReport.ImagePruneReports = append(systemPruneReport.ImagePruneReports, imagePruneReports...)

This seems inconsistent with the documentation. From the docs you wrote, it sounds like this only prunes the build containers, but it prunes everything else as well, including other containers and images.
I don't think this is desirable.
If we want that behavior, then the code should not cause a conflict with the --build option and should only do this in addition to the existing code in SystemPrune(), instead of duplicating the image logic here.

Comment on lines 621 to 625
hasNone, result := none.GrepString("<none>")
Expect(result).To(HaveLen(1))
Expect(hasNone).To(BeTrue())

This leads to horrible "Expected false to be true" errors.

Please use the proper matchers, something like Expect(none.OutputToString()).To(ContainSubstring("none"))

Comment on lines 626 to 628
dirents, err := os.ReadDir(containerStorageDir)
Expect(err).ToNot(HaveOccurred())
Expect(dirents).To(HaveLen(6))

Does this matter? This peeks at rather internal storage details, which can change and would then require the test to be updated often. Would it not be better to run `buildah containers` instead, and then at the end ensure it is removed there?

after := podmanTest.Podman([]string{"images", "-a"})
after.WaitWithDefaultTimeout()
Expect(after).Should(ExitCleanly())
Expect(len(after.OutputToStringArray())).To(BeNumerically(">", 1))

Not sure what this is supposed to check. AFAIK, in the test/e2e setup we will always have extra images from the additional store shown.

Comment on lines 645 to 649
hasNoneAfter, result := after.GrepString("<none>")
Expect(result).To(BeEmpty())
Expect(hasNoneAfter).To(BeFalse())
hasNotLeakerImager, _ := after.GrepString("notleaker")
Expect(hasNotLeakerImager).To(BeTrue())

also needs to use proper matchers

Comment on lines 651 to 654
// still have: volatile-containers.json, containers.json, containers.lock and container dir
dirents, err = os.ReadDir(containerStorageDir)
Expect(err).ToNot(HaveOccurred())
Expect(dirents).To(HaveLen(4))

Same comment: I would use `buildah containers` to check.

Luap99 commented Jan 23, 2025

Also, this doesn't answer what happens if the build process is still running. Another thing this does not clean up is the networking. I'm not sure there is a good way to do that, but if you kill the build while it runs with bridge networking, the interfaces, firewall rules, and IPAM db allocations are still leaked. They should be cleaned up after a reboot, so it is not as bad as the storage leak and certainly not a blocker for this PR. I just mention it because there are other leaks to consider too.

Honny1 commented Jan 23, 2025

@Luap99 I have incorporated your review. I've changed the approach and the --build flag allows the removal of build/stage containers for the podman system prune command, so it works the same as --volumes. The networking should also be cleaned.
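
The "works the same as --volumes" behavior can be pictured as an opt-in step added to the regular prune sequence, rather than as a separate code path. The sketch below is only an illustration under that assumption: `pruneOptions` and `plannedSteps` are hypothetical names, and the step order shown is not taken from Podman's implementation.

```go
package main

import "fmt"

// pruneOptions is a simplified, hypothetical option set; only the two
// opt-in flags discussed in this thread are modeled.
type pruneOptions struct {
	Volumes bool // corresponds to --volumes
	Build   bool // corresponds to the new --build
}

// plannedSteps lists what a prune run would touch. The default steps always
// run; each flag only adds one step. The order is illustrative.
func plannedSteps(opts pruneOptions) []string {
	steps := []string{"containers", "pods", "images", "networks"}
	if opts.Build {
		steps = append(steps, "build containers")
	}
	if opts.Volumes {
		steps = append(steps, "volumes")
	}
	return steps
}

func main() {
	fmt.Println(plannedSteps(pruneOptions{}))
	fmt.Println(plannedSteps(pruneOptions{Build: true}))
}
```

Modeling the flag as additive means existing invocations without `--build` keep their behavior, which matches the default described in the PR description.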

@Honny1 Honny1 marked this pull request as ready for review January 23, 2025 15:21
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Jan 23, 2025
@Honny1 Honny1 requested a review from Luap99 January 23, 2025 15:46
@Luap99 Luap99 left a comment

Thanks, code-wise this seems good to me. Just a minor comment on the man page, and we need to make sure the test does not run forever.


Removes any build containers that were created during the build, but were not removed because the build was unexpectedly terminated.

> **This is not safe operation and should be executed only when no builds are in progress. It can interfere with builds in progress.**

I think the general style in the man pages was to write Note: ... or something like that. I don't think we have used > elsewhere. I don't mind it for the web view but I have not yet looked at how it looks in the rendered man page.

Comment on lines 606 to 614
if build.LineInOutputContains("Please use signal 9") {
	break
}

This is great, but maybe add a small sleep, something like 10ms, so that it does not busy-poll, which eats a lot of resources.

Second, this needs a timeout. If the build process fails and never prints the output, this would loop forever, causing hard-to-debug issues in CI.
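
Both requests, a short sleep between checks and an upper bound on the total wait, can be combined in a small polling helper. This is a sketch using only the standard library: `waitFor` and `errTimeout` are hypothetical names, and the `ready` closure stands in for the `build.LineInOutputContains("Please use signal 9")` check from the test under review. In a Ginkgo/Gomega suite, Gomega's `Eventually` with a timeout would be another way to express the same idea.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the condition never became true in time.
var errTimeout = errors.New("timed out waiting for condition")

// waitFor calls cond every interval until it returns true or until timeout
// has elapsed. Sleeping between checks avoids busy polling, and the deadline
// guarantees that the loop ends.
func waitFor(cond func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()
	// Stand-in for checking the build output; becomes true after about 50ms.
	ready := func() bool { return time.Since(start) > 50*time.Millisecond }

	if err := waitFor(ready, 10*time.Millisecond, 2*time.Second); err != nil {
		fmt.Println("build never reached the expected state:", err)
		return
	}
	fmt.Println("expected output seen, the build can be killed now")
}
```

The 10ms interval follows the reviewer's suggestion; the 2s timeout is an arbitrary value for the demo and would need to be sized for a real build in CI.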

The `podman system prune` command is able to remove build containers that were created during the build, but were not removed because the build terminated unexpectedly.

By default, build containers are not removed to prevent interference with builds in progress. Use the **--build** flag when running the command to remove build containers as well.

Fixes: https://issues.redhat.com/browse/RHEL-62009

Signed-off-by: Jan Rodák <[email protected]>