Large (10+GB) Docker images cause excess memory usage when pulled from local Docker daemon #1214
Comments
Thanks for the info. I also found that the Docker daemon doesn't really give you useful logs when you're pulling an image this way. We did notice some front-loading performance hits we couldn't explain (it took a few seconds longer than expected to start pulling the image from the daemon). This might just be an inefficiency in how crane interacts with the Docker daemon :-<. For now, to regain the existing functionality, you can keep passing the flag or set the field. I think we'll need to do some testing to reproduce this; I assume it's just as bad with
It does look like swap is 0. I wonder if we could play with turning it on so there's room to grow, just to see what it does and how it compares to the crane caching/pulling performance.
di2me is a reliable way to get Zarf OOM killed if all of the images exist in the Docker cache. Bash snippet below to pull all of the images into Docker from the zarf.yaml; adjust the
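As a rough stand-in for that bash snippet (which isn't reproduced above), here's a hedged Go sketch that reads image references from a zarf.yaml and shells out to `docker pull` for each one. The zarf.yaml structure is simplified and the filename is assumed:

```go
// Hypothetical stand-in for the referenced bash snippet: pull every image
// listed in a zarf.yaml into the local Docker cache before running
// `zarf package create`.
package main

import (
	"fmt"
	"os"
	"os/exec"

	"gopkg.in/yaml.v3"
)

// Only the fields needed here; a real zarf.yaml has many more.
type zarfPackage struct {
	Components []struct {
		Images []string `yaml:"images"`
	} `yaml:"components"`
}

func main() {
	data, err := os.ReadFile("zarf.yaml")
	if err != nil {
		panic(err)
	}
	var pkg zarfPackage
	if err := yaml.Unmarshal(data, &pkg); err != nil {
		panic(err)
	}
	for _, component := range pkg.Components {
		for _, image := range component.Images {
			fmt.Println("pulling", image)
			cmd := exec.Command("docker", "pull", image)
			cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
			if err := cmd.Run(); err != nil {
				panic(err)
			}
		}
	}
}
```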
It is specifically this line that runs when it initializes the image and then loads the tarball into memory: https://github.com/google/go-containerregistry/blob/main/pkg/v1/daemon/image.go#L56
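For context, that buffered path boils down to reading the entire `docker save` stream into memory before handing it to the tarball reader. A simplified sketch of that pattern (not the actual go-containerregistry source, and the image reference is a placeholder):

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"

	"github.com/docker/docker/client"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	ref, err := name.ParseReference("example.com/big-image:latest") // placeholder
	if err != nil {
		panic(err)
	}

	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Ask the daemon for a `docker save`-style tarball of the image.
	rc, err := cli.ImageSave(context.Background(), []string{ref.Name()})
	if err != nil {
		panic(err)
	}
	defer rc.Close()

	// Buffered approach: the whole (potentially multi-GB) tarball is read
	// into RAM up front, which is what drives the OOM kills on large images.
	buf, err := io.ReadAll(rc)
	if err != nil {
		panic(err)
	}

	opener := func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(buf)), nil
	}
	img, err := tarball.Image(opener, nil)
	if err != nil {
		panic(err)
	}

	size, err := img.Size()
	if err != nil {
		panic(err)
	}
	fmt.Println("image size:", size)
}
```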
I do not believe this would happen on a local tarball since that instead uses golang syscalls to do the work without loading everything into memory.
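For comparison, a minimal sketch of loading an image from a tarball on disk with go-containerregistry's `tarball` package (the path is a placeholder); the file is read from disk as needed rather than buffered wholesale:

```go
package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	// ImageFromPath reads from the tar file on disk as layers are accessed,
	// so the whole archive does not need to be held in memory.
	img, err := tarball.ImageFromPath("crane-image-test.tar", nil) // placeholder path
	if err != nil {
		panic(err)
	}

	digest, err := img.Digest()
	if err != nil {
		panic(err)
	}
	fmt.Println("digest:", digest)
}
```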
I think we'll need to whip up a super simple golang example of this breaking in crane and then open an issue on https://github.com/google/go-containerregistry/issues
Did some testing on this over the weekend. I wanted to confirm there wasn't anything else gross/odd with our setup, so I pulled the code into a single Go file. I tested both the tar method we use today and OCI and had very similar results. It looks like we can avoid the memory issue by running with the unbuffered option, but it is also much slower 😭.

- 20 gig image: docker save took 2.5 mins, buffered took 7.5 mins, and I didn't even let unbuffered finish.
- 1 gig image: buffered took 18s, unbuffered took 1m20s, but memory was also substantially better.

Not sure where this leaves us, except that maybe crane's implementation isn't the best one and a fallback to

Stats from the 20 gig image (buffered) as it ran:

Reference dockerfile:
Reference main.go:

package main
import (
"errors"
"fmt"
"io"
"os"
"github.com/defenseunicorns/zarf/src/pkg/message"
"github.com/defenseunicorns/zarf/src/pkg/utils"
"github.com/google/go-containerregistry/pkg/logs"
"github.com/google/go-containerregistry/pkg/name"
v1 "github.com/google/go-containerregistry/pkg/v1"
"github.com/google/go-containerregistry/pkg/v1/daemon"
"github.com/google/go-containerregistry/pkg/v1/tarball"
)
var (
images = os.Args[1:]
path = "crane-image-test.tar"
)
func main() {
// Logging setup
message.SetLogLevel(message.TraceLevel)
spinner := message.NewProgressSpinner("Loading %d images", len(images))
defer spinner.Stop()
logs.Warn.SetOutput(spinner)
logs.Progress.SetOutput(spinner)
// Create a map of the image to be written to the tarball
refToImage := map[name.Reference]v1.Image{}
imgMap := make(map[string]v1.Image)
for _, image := range images {
spinner.Updatef("Loading image %s", image)
// Parse the image reference
ref, err := name.ParseReference(image)
if err != nil {
spinner.Fatal(err)
}
// Load the image from the local docker daemon
img, err := daemon.Image(ref, daemon.WithUnbufferedOpener())
if err != nil {
spinner.Fatal(err)
}
refToImage[ref] = img
imgMap[image] = img
}
spinner.Success()
// Progress bar setup
var (
progress = make(chan v1.Update, 200)
progressBar *message.ProgressBar
title string
)
// OCI variant
// p, err := layout.FromPath("test-oci")
// if err != nil {
// p, err = layout.Write("test-oci", empty.Index)
// if err != nil {
// spinner.Fatal(err)
// }
// }
// for _, img := range refToImage {
// if err = p.AppendImage(img); err != nil {
// spinner.Fatal(err)
// }
// }
go func() {
_ = tarball.MultiRefWriteToFile(path, refToImage, tarball.WithProgress(progress))
}()
for update := range progress {
switch {
case update.Error != nil && errors.Is(update.Error, io.EOF):
progressBar.Success("Pulling %d image (%s)", len(images), utils.ByteFormat(float64(update.Total), 2))
return
case update.Error != nil:
spinner.Fatal(update.Error)
default:
title = fmt.Sprintf("Pulling %d images (%s of %s)", len(images),
utils.ByteFormat(float64(update.Complete), 2),
utils.ByteFormat(float64(update.Total), 2),
)
if progressBar == nil {
progressBar = message.NewProgressBar(update.Total, title)
}
progressBar.Update(update.Complete, title)
}
}
}
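For the buffered comparison described in the comment above, the only change to the example is dropping the unbuffered option. A minimal standalone sketch (the image reference is a placeholder):

```go
package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/daemon"
)

func main() {
	ref, err := name.ParseReference("example.com/big-image:latest") // placeholder
	if err != nil {
		panic(err)
	}

	// Buffered variant: without daemon.WithUnbufferedOpener(), the tarball
	// from the daemon is held in memory, which is faster but is what drives
	// the OOM behavior on very large images.
	img, err := daemon.Image(ref)
	if err != nil {
		panic(err)
	}

	size, err := img.Size()
	if err != nil {
		panic(err)
	}
	fmt.Println("image size:", size)
}
```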
## Description

Updates local image behavior, changing back to crane first with docker as an optional fallback when needed. Large images from docker will still be slow, but should not cause OOM issues, and a new warning was added to communicate an alternate approach to speed up local dev with large images. Also cleans up some of the exec package presentation logic.

## Related Issue

Fixes #1214
Fixes #1267

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [ ] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow) followed

Co-authored-by: Wayne Starr <[email protected]>
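The crane-first-with-daemon-fallback shape described here could look roughly like the sketch below. This is illustrative only, not Zarf's actual implementation; the helper name and image reference are made up:

```go
package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/daemon"
)

// pullImage tries a registry pull first (crane) and only falls back to the
// local Docker daemon if the registry pull fails.
func pullImage(src string) (v1.Image, error) {
	if img, err := crane.Pull(src); err == nil {
		return img, nil
	}

	fmt.Println("warning: registry pull failed, falling back to the local Docker daemon (may be slow for large images)")

	ref, err := name.ParseReference(src)
	if err != nil {
		return nil, err
	}
	return daemon.Image(ref)
}

func main() {
	img, err := pullImage("ghcr.io/example/app:latest") // placeholder image
	if err != nil {
		panic(err)
	}

	digest, err := img.Digest()
	if err != nil {
		panic(err)
	}
	fmt.Println("digest:", digest)
}
```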
Environment
Device and OS: Fedora 37 in AWS EC2, self hosted GitHub Actions runner (not containerized)
App version: v0.23.5
Kubernetes distro being used: N/A
Other: Really big Docker images (12-20GB) in local Docker daemon
Steps to reproduce
1. `docker build` a ~20GB container image.
2. `docker push` (shouldn't matter, it isn't used).
3. `zarf package create` referencing the just-built image in the local Docker daemon. Log output should include `loading image from docker daemon`.

(See https://github.com/naps-dev/mockingbird/pull/3 for the GitHub Actions pipeline that triggers the issue.)
Expected result
Zarf is not OOM killed.
Large images work fine when not pulled from the local registry (when using `--no-local-images` or on older Zarf versions).

Actual Result
Zarf is OOM killed due to high memory usage.
Visual Proof (screenshots, videos, text, etc)
Zarf logs from `/tmp` don't say much except confirming that it is pulling from the Docker daemon: `loading image from docker daemon`. I didn't turn on any specific debug flags though.

/tmp/zarf-2023-01-17-22-54-56-3348030351.log
Journalctl logs, including kernel OOM logs. Also a mention of `dockerd` and a possibly relevant error.

journalctl
Severity/Priority
Additional Context
With Zarf v0.23.5, using the new feature "Add support for local docker image and image archives on package create" (#1173), I'm seeing OOM kills when it pulls really large images out of the local Docker daemon.
For context: I'm building KubeVirt virtual machine containerDisk images (~18 GB raw, 12GB after docker targz compression) in a GitHub actions pipeline. (Basically scratch Docker images that contain a VM image.) I then package the Docker image plus a Helm chart with KubeVirt manifests with Zarf. This new feature should be great, I can avoid needlessly re-pulling my just-built image. Except I'm getting OOM killed.
I quickly saw Zarf use around 4.5GB of memory and keep climbing until it was OOM killed. It completes without issue if I pass `--no-local-images`.