
Large (10+GB) Docker images cause excess memory usage when pulled from local Docker daemon #1214

Closed
bburky opened this issue Jan 18, 2023 · 6 comments · Fixed by #1332

@bburky
Contributor

bburky commented Jan 18, 2023

Environment

Device and OS: Fedora 37 in AWS EC2, self hosted GitHub Actions runner (not containerized)
App version: v0.23.5
Kubernetes distro being used: N/A
Other: Really big Docker images (12-20GB) in local Docker daemon

Steps to reproduce

  1. docker build a ~20GB container image
  2. Optionally, docker push (shouldn't matter, it isn't used).
  3. zarf package create referencing the just-built image in the local Docker daemon. Log output should include loading image from docker daemon.
  4. Zarf will be OOM killed

(See https://github.com/naps-dev/mockingbird/pull/3 for the GitHub Actions pipeline that triggers the issue.)

Expected result

Zarf is not OOM killed.

Large images work fine when not pulled from the local Docker daemon (when using --no-local-images or on older Zarf versions).

Actual Result

Zarf is OOM killed due to high memory usage.

Visual Proof (screenshots, videos, text, etc)

Zarf logs from /tmp don't say much except confirming that it is pulling from the Docker daemon: loading image from docker daemon. I didn't turn on any specific debug flags though.

/tmp/zarf-2023-01-17-22-54-56-3348030351.log
  DEBUG   2023-01-17T22:54:56Z  -  packager.createComponentPaths({
            "name": "mockingbird",
            "description": "Mockingbird Sandbox",
            "required": true,
            "only": {
              "cluster": {}
            },
            "import": {
              "path": ""
            },
            "scripts": {},
            "manifests": [
              {
                "name": "mockingbird-helm",
                "namespace": "mockingbird",
                "files": [
                  "manifests/namespace.yaml",
                  "manifests/gitrepository.yaml",
                  "manifests/helmrelease.yaml"
                ]
              }
            ],
            "images": [
              "123456789123.dkr.ecr.us-east-1.amazonaws.com/mockingbird:dockerize"
            ],
            "repos": [
              "https://github.com/naps-dev/mockingbird.git"
            ]
          })
└ (/home/runner/work/zarf/zarf/src/pkg/packager/create.go:227)
  •  Loading 3 K8s manifests
  DEBUG   2023-01-17T22:54:56Z  -  Active spinner already exists
└ (/home/runner/work/zarf/zarf/src/pkg/message/spinner.go:24)
  DEBUG   2023-01-17T22:54:56Z  -  Unable to load an existing git credentials file: invalid argument
└ (/home/runner/work/zarf/zarf/src/pkg/utils/auth.go:88)
  DEBUG   2023-01-17T22:54:56Z  -  Unable to load an existing netrc file: invalid argument
└ (/home/runner/work/zarf/zarf/src/pkg/utils/auth.go:167)
  DEBUG   2023-01-17T22:54:56Z  -  Repo already cloned, fetching upstream changes...
└ (/home/runner/work/zarf/zarf/src/internal/packager/git/pull.go:69)
  DEBUG   2023-01-17T22:54:56Z  -  git.fetch([]config.RefSpec(nil))
└ (/home/runner/work/zarf/zarf/src/internal/packager/git/pull.go:71)
  DEBUG   2023-01-17T22:54:56Z  -  Attempting to find ref: []config.RefSpec(nil) for https://github.com/naps-dev/mockingbird.git
└ (/home/runner/work/zarf/zarf/src/internal/packager/git/pull.go:71)
  DEBUG   2023-01-17T22:54:56Z  -  Unable to load an existing git credentials file: invalid argument
└ (/home/runner/work/zarf/zarf/src/pkg/utils/auth.go:88)
  DEBUG   2023-01-17T22:54:56Z  -  Unable to load an existing netrc file: invalid argument
└ (/home/runner/work/zarf/zarf/src/pkg/utils/auth.go:167)
  DEBUG   2023-01-17T22:54:56Z  -  Already fetched requested ref
└ (/home/runner/work/zarf/zarf/src/internal/packager/git/fetch.go:76)
  •  Loading 3 K8s manifests
  •  Loading 3 K8s manifests
  DEBUG   2023-01-17T22:54:56Z  -  Active spinner already exists
└ (/home/runner/work/zarf/zarf/src/pkg/message/spinner.go:24)
  DEBUG   2023-01-17T22:54:56Z  -  loading image from docker daemon: 123456789123.dkr.ecr.us-east-1.amazonaws.com/mockingbird:dockerize
└ (/home/runner/work/zarf/zarf/src/internal/packager/images/pull.go:58)
  •  Loading 3 K8s manifests

Journalctl logs including kernel OOM logs. Also a mention of dockerd and a possibly relevant error.

journalctl
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: dockerd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-500
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: CPU: 1 PID: 872 Comm: dockerd Not tainted 6.0.10-300.fc37.x86_64 #1
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Call Trace:
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  <TASK>
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  dump_stack_lvl+0x44/0x5c
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  dump_header+0x4a/0x1ff
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  oom_kill_process.cold+0xb/0x10
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  out_of_memory+0x1be/0x4f0
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  __alloc_pages_slowpath.constprop.0+0xccf/0xe20
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  __alloc_pages+0x209/0x230
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  folio_alloc+0x17/0x50
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  __filemap_get_folio+0x149/0x440
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  filemap_fault+0x151/0x990
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  __do_fault+0x36/0x130
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  do_fault+0x1da/0x440
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  __handle_mm_fault+0x645/0xe10
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  ? __rseq_handle_notify_resume+0x96/0x460
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  handle_mm_fault+0xae/0x290
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  do_user_addr_fault+0x1cd/0x690
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  exc_page_fault+0x70/0x170
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  asm_exc_page_fault+0x22/0x30
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: RIP: 0033:0x5564f00c181d
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Code: Unable to access opcode bytes at RIP 0x5564f00c17f3.
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: RSP: 002b:00007f3a1d4abc20 EFLAGS: 00010202
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: RAX: 0000000000000000 RBX: 0000000000002710 RCX: 00005564f00c181d
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f3a1d4abc20
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: RBP: 00007f3a1d4abc30 R08: 000000000246963e R09: 0000000000000000
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007f3a1d4ab660
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: R13: 0000000000000016 R14: 000000c0000029c0 R15: 00007f3a1ccac000
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel:  </TASK>
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Mem-Info:
                                                     free:24840 free_pcp:62 free_cma:0
                                                     kernel_misc_reclaimable:0
                                                     mapped:2 shmem:7 pagetables:6378 bounce:0
                                                     slab_reclaimable:26368 slab_unreclaimable:32615
                                                     unevictable:0 dirty:0 writeback:0
                                                     active_file:53 inactive_file:6099 isolated_file:36
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: active_anon:346420 inactive_anon:436631 isolated_anon:0
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 active_anon:1385680kB inactive_anon:1746524kB active_file:312kB inactive_file:24076kB unevictable:0kB isolated(anon):0kB isolated(file):256kB mapped:8kB dirty:0kB writeback:0kB shmem:28kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 94208kB writeback_tmp:0kB kernel_stack:3312kB pagetables:25512kB all_unreclaimable? yes
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 DMA free:14336kB boost:0kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: lowmem_reserve[]: 0 2964 7597 7597 7597
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 DMA32 free:44396kB boost:0kB min:26320kB low:32900kB high:39480kB reserved_highatomic:0KB active_anon:595256kB inactive_anon:792880kB active_file:8kB inactive_file:10412kB unevictable:0kB writepending:0kB present:3129256kB managed:3063720kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: lowmem_reserve[]: 0 0 4632 4632 4632
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 Normal free:40628kB boost:0kB min:41128kB low:51408kB high:61688kB reserved_highatomic:0KB active_anon:790424kB inactive_anon:953644kB active_file:168kB inactive_file:13960kB unevictable:0kB writepending:0kB present:4935680kB managed:4752300kB mlocked:0kB bounce:0kB free_pcp:248kB local_pcp:248kB free_cma:0kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 DMA32: 329*4kB (UME) 387*8kB (UME) 183*16kB (UME) 406*32kB (UME) 188*64kB (UME) 64*128kB (UME) 13*256kB (ME) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 44396kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 Normal: 281*4kB (UE) 384*8kB (UE) 953*16kB (UME) 210*32kB (UME) 36*64kB (UE) 57*128kB (UME) 15*256kB (UME) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 40628kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 6210 total pagecache pages
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 8 pages in swap cache
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Free swap  = 0kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Total swap = 7830524kB
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 2020232 pages RAM
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 0 pages HighMem/MovableOnly
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 62387 pages reserved
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 0 pages cma reserved
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: 0 pages hwpoisoned
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Tasks state (memory values in pages):
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    571]     0   571    60380       17   372736      330          -250 systemd-journal
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    596]     0   596     8117        5    86016     1072         -1000 systemd-udevd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    662]   999   662     4082       43    65536      208          -900 systemd-oomd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    663]   193   663     5752       15    77824      970             0 systemd-resolve
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    664]     0   664     3964        1    77824      219             0 systemd-userdbd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    665]     0   665     4471       17    57344      175         -1000 auditd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    692]    81   692     2301        0    57344      190          -900 dbus-broker-lau
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    693]    81   693     1360        0    53248      235          -900 dbus-broker
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    697]     0   697     4857       17    77824      888             0 systemd-logind
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    698]     0   698     4041       21    73728      235             0 systemd-machine
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    700]   996   700    21271       14    73728      163             0 chronyd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    724]     0   724    63716       41   126976      648             0 NetworkManager
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    737]     0   737    13699        3    81920      218             0 gssproxy
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    739]     0   739   336912      123   286720     3068          -999 containerd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    865]     0   865   380720       88   462848     6966          -500 dockerd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    867]     0   867     3507       21    61440      284         -1000 sshd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    869]     0   869      718       12    40960       27             0 agetty
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [    871]     0   871     1360        5    49152       41             0 agetty
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225511]     0 225511     4088        1    73728      454             0 sshd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225515]  1000 225515     5262       20    81920      779           100 systemd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225519]  1000 225519    44267        3   110592     2108           100 (sd-pam)
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225526]  1000 225526     4069        0    73728      464             0 sshd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225527]  1000 225527     1816        3    49152      166             0 bash
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225678]   998 225678   615834        0   208896     1453             0 polkitd
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225697]  1000 225697     1750        0    49152       66             0 runsvc.sh
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225699]  1000 225699    79473        0   401408     1546             0 node
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 225706]  1000 225706   923208      693   536576    13625             0 Runner.Listener
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 241956]  1000 241956   926408      932   532480    12723             0 Runner.Worker
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 242171]  1000 242171     1751        0    53248       74           500 bash
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 242173]  1000 242173  3348327   780200 21868544  1907483           500 zarf
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 242211]     0 242211     4051       17    73728      208             0 systemd-userwor
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 242214]     0 242214     4051        8    73728      211             0 systemd-userwor
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: [ 242225]     0 242225     4051       16    69632      208             0 systemd-userwor
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker.service,mems_allowed=0,global_oom,task_memcg=/system.slice/actions.runner.naps-dev.naps-dev-runner.service,task=zarf,pid=242173,uid=1000
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Out of memory: Killed process 242173 (zarf) total-vm:13393308kB, anon-rss:3120800kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:21356kB oom_score_adj:500
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal chronyd[700]: Forward time jump detected!
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal systemd[1]: actions.runner.naps-dev.naps-dev-runner.service: A process of this unit has been killed by the OOM killer.
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: Shutting down runner listener
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: Sending SIGINT to runner listener to stop
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: Sending SIGKILL to runner listener
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal chronyd[700]: Can't synchronise: no selectable sources
Jan 18 00:36:53 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: Exiting...
Jan 18 00:36:55 ip-999-999-999-999.ec2.internal kernel: oom_reaper: reaped process 242173 (zarf), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 18 00:36:56 ip-999-999-999-999.ec2.internal dockerd[865]: time="2023-01-18T00:36:56.738782256Z" level=error msg="Can't add file /var/lib/docker/tmp/docker-export-2503784040/334ad8b9f8ae5746dedade7440101515d6b927666899deedea87137d7800098c/layer.tar to tar: io: read/write on closed pipe"
Jan 18 00:36:56 ip-999-999-999-999.ec2.internal dockerd[865]: time="2023-01-18T00:36:56.738958290Z" level=error msg="Can't close tar writer: io: read/write on closed pipe"
Jan 18 00:36:56 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: 2023-01-18 00:36:56Z: Job convert-vm completed with result: Canceled
Jan 18 00:36:56 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: Runner listener exited with error code 0
Jan 18 00:36:56 ip-999-999-999-999.ec2.internal runsvc.sh[225699]: Runner listener exit with 0 return code, stop the service, no retry needed.
Jan 18 00:36:56 ip-999-999-999-999.ec2.internal systemd[1]: actions.runner.naps-dev.naps-dev-runner.service: Failed with result 'oom-kill'.

Severity/Priority

Additional Context

With Zarf v0.23.5, using the new feature "Add support for local docker image and image archives on package create" (#1173), I'm seeing OOM kills when it pulls really large images out of the local Docker daemon.

For context: I'm building KubeVirt virtual machine containerDisk images (~18 GB raw, ~12 GB after Docker's tar/gzip compression) in a GitHub Actions pipeline. (These are basically scratch Docker images that contain a VM image.) I then use Zarf to package the Docker image plus a Helm chart with KubeVirt manifests. This new feature should be great because I can avoid needlessly re-pulling my just-built image, except I'm getting OOM killed.

Zarf quickly reached around 4.5 GB of memory usage and kept climbing until it was OOM killed. It completes without issue if I pass --no-local-images.

@jeff-mccoy
Contributor

Thanks for the info. I also found that the docker daemon doesn't really give you useful logs when you're pulling an image this way. We did notice some front-loading performance hits we couldn't explain (it took a few seconds longer than expected to start pulling the image from the daemon). This might just be an inefficiency in how crane interacts with the docker daemon :-<.

For now, to regain the existing behavior, you can keep passing --no-local-images, or set the field package.create.no_local_images in zarf-config.toml so you no longer have to specify the flag.

I think we'll need to do some testing to reproduce this. I assume it's just as bad with docker save and referencing the archive that way? Guessing yes; it looks like crane is trying to load the whole image into memory (rather inefficiently at that):

Jan 18 00:36:53 ip-999-999-999-999.ec2.internal kernel: Out of memory: Killed process 242173 (zarf) total-vm:13393308kB, anon-rss:3120800kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:21356kB oom_score_adj:500

It does look like free swap is 0. I wonder if we could play with turning it on so the process has room to grow, just to see what it does and how it compares to the crane caching/pulling performance.

@corang
Contributor

corang commented Jan 18, 2023

di2me is a reliable way to get zarf OOM killed if all of the images already exist in the Docker cache. The Bash snippet below pulls all of the images from the zarf.yaml into Docker; adjust the -P flag to control how many docker pull processes run at once (faster downloads, but you risk getting rate limited).
yq e '.components.[].images' zarf.yaml | grep '-' | yq 'unique' | sed 's/- //g' | xargs -n 1 -P 1 docker pull

@Racer159
Contributor

It is specifically this line, run when the image is initialized, that loads the tarball into memory: https://github.com/google/go-containerregistry/blob/main/pkg/v1/daemon/image.go#L56
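
For illustration, here is a rough, hedged sketch of what that buffered path amounts to (paraphrased, not the library's actual code): the docker save stream from the daemon is read fully into memory before the tarball is parsed. The Docker client setup and the alpine:latest reference are just placeholders.

package main

import (
	"bytes"
	"context"
	"io"
	"log"

	"github.com/docker/docker/client"
	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	// Hedged sketch: roughly what a buffered read from the daemon amounts to,
	// not the actual go-containerregistry implementation.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}

	// docker save streams the full image as a tarball...
	rc, err := cli.ImageSave(context.Background(), []string{"alpine:latest"})
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// ...and buffering means the entire tarball sits in memory at once,
	// which for a 12-20 GB image is enough to trigger the OOM killer.
	buf, err := io.ReadAll(rc)
	if err != nil {
		log.Fatal(err)
	}

	img, err := tarball.Image(func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(buf)), nil
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	_ = img
}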

@Racer159
Contributor

I do not believe this would happen with a local tarball, since that path instead uses Go file I/O to do the work without loading everything into memory.
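
For comparison, a minimal hedged sketch of the local tarball path (assuming an image.tar produced by docker save already exists on disk); tarball.ImageFromPath opens the file and reads layers on demand, so the full archive never needs to be held in memory:

package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	// Hedged sketch: load an image from a docker save'd tarball on disk.
	// Layers are read from the file on demand rather than buffered up front.
	img, err := tarball.ImageFromPath("image.tar", nil)
	if err != nil {
		log.Fatal(err)
	}

	digest, err := img.Digest()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("loaded image with digest", digest)
}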

@jeff-mccoy
Contributor

I think we'll need to whip up a super simple Go example of this breaking in crane and then open an issue on https://github.com/google/go-containerregistry/issues

@jeff-mccoy
Contributor

Did some testing on this over the weekend; I wanted to confirm there wasn't anything else gross/odd with our setup, so I pulled the code into a single Go file. Tested with both the tar method we use today and OCI, with very similar results. It looks like we can avoid the memory issue by running with the unbuffered option, but it is also much slower 😭.

I tested a 20 GB image: docker save took 2.5 minutes, buffered took 7.5 minutes, and I didn't even let unbuffered finish.

On a 1 GB image, buffered took 18s and unbuffered took 1m20s, but memory usage was also substantially better. I'm not sure where this leaves us, except that maybe crane's implementation isn't the best one and a fallback to docker save might actually be more sane here.

Screenshot 2023-01-23 at 10 22 25 AM

Buffered memory:
Screenshot 2023-01-23 at 10 20 03 AM

Unbuffered memory:
Screenshot 2023-01-23 at 10 21 53 AM

Stats from the 20 GB image, buffered, as it ran:
Screenshot 2023-01-22 at 3 13 37 AM
Screenshot 2023-01-22 at 2 44 19 AM
Screenshot 2023-01-22 at 2 44 10 AM

Reference dockerfile:

FROM alpine:latest
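# Write 20 GiB of random (incompressible) data so the image layer stays ~20 GB even after compression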
RUN dd if=/dev/urandom of=test.bin bs=1M count=20K

Reference main.go:

package main

import (
	"errors"
	"fmt"
	"io"
	"os"

	"github.com/defenseunicorns/zarf/src/pkg/message"
	"github.com/defenseunicorns/zarf/src/pkg/utils"
	"github.com/google/go-containerregistry/pkg/logs"
	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/daemon"
	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

var (
	images = os.Args[1:]
	path   = "crane-image-test.tar"
)

func main() {
	// Logging setup
	message.SetLogLevel(message.TraceLevel)
	spinner := message.NewProgressSpinner("Loading %d images", len(images))
	defer spinner.Stop()

	logs.Warn.SetOutput(spinner)
	logs.Progress.SetOutput(spinner)

	// Create a map of the image to be written to the tarball
	refToImage := map[name.Reference]v1.Image{}
	imgMap := make(map[string]v1.Image)

	for _, image := range images {
		spinner.Updatef("Loading image %s", image)

		// Parse the image reference
		ref, err := name.ParseReference(image)
		if err != nil {
			spinner.Fatal(err)
		}

		// Load the image from the local docker daemon
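		// WithUnbufferedOpener streams the docker save tarball from the daemon instead of
		// buffering it fully in memory (the default), trading pull speed for lower memory usage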
		img, err := daemon.Image(ref, daemon.WithUnbufferedOpener())
		if err != nil {
			spinner.Fatal(err)
		}

		refToImage[ref] = img
		imgMap[image] = img
	}

	spinner.Success()

	// Progress bar setup
	var (
		progress    = make(chan v1.Update, 200)
		progressBar *message.ProgressBar
		title       string
	)

	// OCI variant
	// p, err := layout.FromPath("test-oci")
	// if err != nil {
	// 	p, err = layout.Write("test-oci", empty.Index)
	// 	if err != nil {
	// 		spinner.Fatal(err)
	// 	}
	// }

	// for _, img := range refToImage {
	// 	if err = p.AppendImage(img); err != nil {
	// 		spinner.Fatal(err)
	// 	}
	// }

	go func() {
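		// Write all images into a single tarball in the background; progress updates arrive on the channel consumed below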
		_ = tarball.MultiRefWriteToFile(path, refToImage, tarball.WithProgress(progress))
	}()

	for update := range progress {
		switch {
		case update.Error != nil && errors.Is(update.Error, io.EOF):
			progressBar.Success("Pulling %d image (%s)", len(images), utils.ByteFormat(float64(update.Total), 2))
			return

		case update.Error != nil:
			spinner.Fatal(update.Error)
		default:
			title = fmt.Sprintf("Pulling %d images (%s of %s)", len(images),
				utils.ByteFormat(float64(update.Complete), 2),
				utils.ByteFormat(float64(update.Total), 2),
			)
			if progressBar == nil {
				progressBar = message.NewProgressBar(update.Total, title)
			}
			progressBar.Update(update.Complete, title)
		}
	}

}

jeff-mccoy added a commit that referenced this issue Feb 8, 2023
## Description

Updates local image behavior, changing back to crane first with docker
as an optional fallback when needed. Large images from docker will still
be slow, but should not cause OOM issues and a new warning was added to
communicate an alternate approach to speed up local dev with large
images. Also cleans up some of the exec package presentation logic.

## Related Issue

Fixes #1214 
Fixes #1267

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [ ] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed

---------

Co-authored-by: Wayne Starr <[email protected]>
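
As a rough illustration of the "crane first with docker as an optional fallback" idea described above (a hedged sketch with a hypothetical helper, not the code merged in #1332):

package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"

	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/daemon"
	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

// pullLocalImage is a hypothetical helper sketching a "crane first, docker save
// fallback" flow; it is not the implementation merged in #1332.
func pullLocalImage(reference string) (v1.Image, error) {
	ref, err := name.ParseReference(reference)
	if err != nil {
		return nil, err
	}

	// Try crane's daemon API first.
	if img, err := daemon.Image(ref); err == nil {
		return img, nil
	}

	// Fall back to shelling out to docker save and reading the tarball lazily
	// from disk, which avoids buffering the whole image in memory. The caller
	// is responsible for removing the temp file once finished with the image.
	tmp := filepath.Join(os.TempDir(), "zarf-docker-save-fallback.tar")
	if out, err := exec.Command("docker", "save", "-o", tmp, reference).CombinedOutput(); err != nil {
		return nil, fmt.Errorf("docker save fallback failed: %s: %w", out, err)
	}
	return tarball.ImageFromPath(tmp, nil)
}

func main() {
	img, err := pullLocalImage("alpine:latest")
	if err != nil {
		log.Fatal(err)
	}
	digest, _ := img.Digest()
	log.Println("loaded", digest)
}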
github-project-automation bot moved this from New Requests to Done in Zarf Project Board Feb 8, 2023
Noxsios pushed a commit that referenced this issue Mar 8, 2023