
feature: Per-image pull locking for bandwidth efficiency #1911

Open
wking opened this issue Dec 1, 2018 · 44 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. planning Accepted and planned for future release


@wking
Contributor

wking commented Dec 1, 2018


Description

If there is an existing libpod pull in flight for a remote image, new pulls of that image should block until the in-flight pull completes (it may error out) to avoid shipping the same bits over the network twice.

Steps to reproduce the issue:

  1. In one terminal:

    $ podman pull docker.io/library/centos:6
    Trying to pull docker.io/library/centos:6...Getting image source signatures
    Copying blob sha256:9bfcefca2b8da38bbfb8b6178a75f05245688b83fda45578bcdf51f56e4a5a9e
     66.60 MB / 66.60 MB [=====================================================] 13s
    Copying config sha256:0cbf37812bff083eb2325468c10aaf82011527c049d66106c3c74298ed239aaf
     2.60 KB / 2.60 KB [========================================================] 0s
    Writing manifest to image destination
    Storing signatures
    0cbf37812bff083eb2325468c10aaf82011527c049d66106c3c74298ed239aaf
  2. In another terminal, launched once blob sha256:9bfce... is maybe 10 MB into its pull:

    $ podman run --rm docker.io/library/centos:6 echo hi
    Trying to pull docker.io/library/centos:6...Getting image source signatures
    Copying blob sha256:9bfcefca2b8da38bbfb8b6178a75f05245688b83fda45578bcdf51f56e4a5a9e
     66.60 MB / 66.60 MB [======================================================] 8s
    Copying config sha256:0cbf37812bff083eb2325468c10aaf82011527c049d66106c3c74298ed239aaf
     2.60 KB / 2.60 KB [========================================================] 0s
    Writing manifest to image destination
    Storing signatures
    hi

Describe the results you received:

As you can see from the console output, both commands pulled the same blobs in parallel, each downloading its own copy.

Describe the results you expected:

I'd rather have seen the second command print a message about blocking on an existing pull, idle while that pull went through, and then run the command using the blobs pushed into local storage by that first pull.

Additional information you deem important (e.g. issue happens only occasionally):

My end goal is to front-load image pulls for a script that uses several images. Something like:

    podman pull image2 image3 &
    podman run image1 some stuff
    podman run image2 other stuff
    wait
    podman run image3 more stuff

That way, image2 and image3 can be trickling in over a slow network while I'm spending CPU time running the image1 container, etc.

I don't really care about locking parallel manifest pulls, etc., because those are small; this just has to be for layers (possibly only for layers over a given size threshold). Of course, I'm fine with manifest/config locking if it's easier to just drop the same locking logic onto all CAS blobs. It doesn't have to be coordinated across multiple layers either. If process 1 ends up pulling layer 1, and then process 2 comes along, sees the lock on layer 1, and decides to pull layer 2 while it's waiting for the lock to lift on layer 1, that's fine. Process 1 might find layer 2 locked when it gets around to it, and they may end up leapfrogging through the layer stack. That means individual layers might come down a bit more slowly, which would have a negative impact on time-to-launch if you were limited by unpack-time. But I imagine most cases will be limited by network bandwidth, so unpacking-time delays wouldn't be a big deal.

Locking would allow for a denial of service attack by a user on the same machine with access to the lock you'd use, because they could acquire the lock for a particular layer and then idle without actually working to pull that layer down. I'm not concerned about that in my own usage, but we might want the locks to be soft, and have the caller be able to shove in and start pulling in parallel anyway if they get tired of waiting (you could scale the wait time by blob size, after guessing at some reasonable bandwidth value?).
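
For illustration only, here is a minimal sketch of that soft-lock idea (not podman or containers/image code), assuming a hypothetical softLockBlob helper, an flock(2)-based per-blob lock file, and a guessed bandwidth constant that only scales the wait:

    // Hypothetical sketch only: softLockBlob, the lock path, and the bandwidth
    // guess are invented names, not podman or containers/image code.
    package main

    import (
        "fmt"
        "os"
        "time"

        "golang.org/x/sys/unix"
    )

    const assumedBandwidth = 10 << 20 // guess: 10 MiB/s, used only to scale the wait

    // softLockBlob tries to take an exclusive flock on lockPath. If another
    // process holds it, wait roughly as long as pulling blobSize bytes should
    // take; after that, stop waiting so the caller can pull in parallel anyway.
    func softLockBlob(lockPath string, blobSize int64) (*os.File, bool, error) {
        f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
            return nil, false, err
        }
        deadline := time.Now().Add(time.Duration(blobSize/assumedBandwidth+5) * time.Second)
        for {
            lerr := unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB)
            if lerr == nil {
                return f, true, nil // we own the lock; pull the blob, then Close() to release
            }
            if lerr != unix.EWOULDBLOCK || time.Now().After(deadline) {
                f.Close()
                return nil, false, nil // holder looks stuck (or a real error): pull anyway
            }
            time.Sleep(500 * time.Millisecond)
        }
    }

    func main() {
        lock, owned, err := softLockBlob("/tmp/blob-sha256-example.lock", 66*1024*1024)
        if err != nil {
            panic(err)
        }
        if owned {
            defer lock.Close() // closing the descriptor releases the flock for any waiters
        }
        fmt.Println("proceeding with pull; lock owned:", owned)
    }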

And I realize that this is probably mostly an issue for one of libpod's dependencies, but I haven't spent the time to track this one down, and there didn't seem to be an existing placeholder issue in this repo. Please link me to the upstream issue if this already has a placeholder there.

Output of podman version:

$ podman version
Version:       0.11.2-dev
Go Version:    go1.10.3
Git Commit:    "6df7409cb5a41c710164c42ed35e33b28f3f7214"
Built:         Fri Nov 30 21:32:33 2018
OS/Arch:       linux/amd64

Output of podman info:

$ podman info
host:
  BuildahVersion: 1.5-dev
  Conmon:
    package: podman-0.10.1.3-1.gitdb08685.el7.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.12.0-dev, commit: 0e7abf050824203d775a20c26ed14f716f7d1aa7-dirty'
  Distribution:
    distribution: '"rhel"'
    version: "7.5"
  MemFree: 9789857792
  MemTotal: 16289890304
  OCIRuntime:
    package: runc-1.0.0-52.dev.git70ca035.el7_5.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 6291070976
  SwapTotal: 8254386176
  arch: amd64
  cpus: 8
  hostname: trking.remote.csb
  kernel: 3.10.0-891.el7.x86_64
  os: linux
  rootless: false
  uptime: 986h 7m 57.59s (Approximately 41.08 days)
insecure registries:
  registries: []
registries:
  registries:
  - registry.access.redhat.com
store:
  ContainerStore:
    number: 10
  GraphDriverName: overlay
  GraphOptions:
  - overlay.override_kernel_check=true
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 12
  RunRoot: /var/run/containers/storage
@rhatdan
Member

rhatdan commented Dec 1, 2018

@mtrmac PTAL
I would figure this would have to be in containers/image, since doing it for just podman would still allow conflicts with CRI-O, Skopeo, and Buildah.

@mtrmac
Collaborator

mtrmac commented Dec 1, 2018

Well, this is where #NOBIGFATDAEMONS makes things more difficult :)

  • We would need some kind of inter-process locking mechanism, like the one being built for container state — but with a much larger namespace (blob digests), so we can’t just use a preallocated dense integer table. Maybe that involves a separate, separately-locked map from digests to lock numbers, or something like that (i.e. it is doable in principle, but better implementation ideas would be nice; a toy sketch of one possible digest-to-slot mapping follows this list). That locking mechanism should probably live in, or very close to, containers/storage, since that is where the data structure actually being locked lives.
  • It’s not quite obvious which digest to lock on:
    • We can lock on the compressed digest, but that won’t prevent concurrent pulls of differently-compressed identical layers.
    • We can lock on the uncompressed digest if it is known (e.g. from the blob info cache), but that won’t always be the case.
    • Locking both is more risky WRT race conditions and the like.
      I guess the compressed digest would be a starting point.
  • As far as two pulls “racing each other”, that would require layer copy parallelization; right now the code is sequential, so blocking on a lock would just block the pull until the other process finishes. (Sure, that’s still an improvement because the data is only downloaded and uncompressed once.)
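
For illustration, a toy sketch of the first bullet's bounded-table idea (not the actual containers/storage design): hash each blob digest onto a small preallocated table of lock slots, with in-process mutexes standing in for whatever inter-process primitive would really back them. A collision just means two unrelated blobs share a slot, which costs some parallelism but stays correct.

    // Hypothetical illustration only, not the containers/storage design.
    package main

    import (
        "fmt"
        "hash/fnv"
        "sync"

        "github.com/opencontainers/go-digest"
    )

    const lockTableSize = 256 // dense, preallocated table of lock slots

    // lockIndex picks the slot for a digest; every process computes the same
    // index, so they contend on the same slot without a shared digest->lock map.
    func lockIndex(d digest.Digest) uint32 {
        h := fnv.New32a()
        h.Write([]byte(d.String()))
        return h.Sum32() % lockTableSize
    }

    // In-process stand-in for the real inter-process locks.
    var lockTable [lockTableSize]sync.Mutex

    func main() {
        d := digest.Digest("sha256:9bfcefca2b8da38bbfb8b6178a75f05245688b83fda45578bcdf51f56e4a5a9e")
        slot := lockIndex(d)
        lockTable[slot].Lock()
        fmt.Println("pulling blob", d.Encoded()[:12], "while holding lock slot", slot)
        lockTable[slot].Unlock()
    }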

@wking
Contributor Author

wking commented Dec 1, 2018

Maybe that involves a separate, separately-locked, map from digests to lock numbers, or something like that...

I was imagining flocks on ${PATH_TO_EVENTUAL_BLOB_FILE}.lock or some such. While the lock space is large, there will probably never be more than a few dozen acquired at any one time.
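
For illustration, a toy sketch of that shape (not real podman/containers/image code), with a hypothetical withBlobLock helper and a made-up path standing in for the ${PATH_TO_EVENTUAL_BLOB_FILE} placeholder:

    // Toy sketch of flock-per-blob-path; blobPath is a stand-in placeholder.
    package main

    import (
        "os"

        "golang.org/x/sys/unix"
    )

    // withBlobLock blocks until it holds an exclusive flock on <blobPath>.lock,
    // runs pull, then releases the lock by closing the descriptor.
    func withBlobLock(blobPath string, pull func() error) error {
        f, err := os.OpenFile(blobPath+".lock", os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
            return err
        }
        defer f.Close() // closing releases the flock
        if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
            return err
        }
        // A waiter that wakes up here should re-check local storage first:
        // the previous holder may already have stored the blob.
        return pull()
    }

    func main() {
        _ = withBlobLock("/tmp/blob-example", func() error {
            return nil // real code would download and store the blob here
        })
    }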

@rhatdan
Member

rhatdan commented Mar 8, 2019

@vrothberg Do we have this now?

@vrothberg
Member

There was a very similar discussion recently over at #2551.

@rhatdan, if you want this to happen, we can work on it, but we need to allocate a bigger chunk of time.

@rhatdan
Member

rhatdan commented Mar 8, 2019

I am always interested in anything that improves performance.

@vrothberg
Member

I can start looking into this in the next sprint 👍

@rhatdan
Member

rhatdan commented Apr 13, 2019

@vrothberg Could you update this issue with your progress?

@vrothberg
Member

Sure. The blob-locking mechanism in containers/storage is done. Each blob, when being copied into containers-storage, will receive a dedicated lock file in the storage driver's tmp directory. That's the central mechanism for serialization and synchronization.

The progress-bar library received some backported features we needed to update the bars on the fly, and we're already making use of them.

Currently, we are working on rewriting the backend code for containers-storage a bit in order to write the layer to storage in ImageDestination.PutBlob(...). Otherwise, the code is subject to ABBA deadlocks, where two processes acquire the same pair of locks in opposite orders. The CIs for this rewrite are now green, so we can reuse this work for the copy detection.
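
For readers unfamiliar with the term: an ABBA deadlock happens when two processes take the same two locks in opposite orders and each ends up waiting on the other. A generic sketch of the textbook defense (unrelated to the actual containers/storage code) is to acquire locks in one global order:

    // Generic illustration of the ABBA hazard and the usual fix; not c/storage code.
    package main

    import (
        "fmt"
        "sort"
        "sync"
    )

    var locks = map[string]*sync.Mutex{
        "image-store": new(sync.Mutex),
        "layer-store": new(sync.Mutex),
    }

    // lockAll sorts the requested lock names, so two callers asking for the same
    // pair can never hold them in opposite orders (the ABBA deadlock).
    func lockAll(names ...string) (unlock func()) {
        sort.Strings(names)
        for _, n := range names {
            locks[n].Lock()
        }
        return func() {
            for i := len(names) - 1; i >= 0; i-- {
                locks[names[i]].Unlock()
            }
        }
    }

    func main() {
        unlock := lockAll("layer-store", "image-store")
        fmt.Println("both stores locked in a deterministic order")
        unlock()
    }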

Note that I lost at least one working week cleaning up breaking builds and tests (unrelated to those PRs) when trying to test the PRs in buildah, libpod and cri-o (and had to do this multiple times).

@baude
Member

baude commented May 29, 2019

What is the latest on this, @vrothberg?

@vrothberg
Member

Plenty of tears and sweat over here: containers/image#611

It's working and CI is green but it turned into something really big, so I expect reviewing to still take a while.

@baude
Member

baude commented Aug 2, 2019

Going to close; reopen if needed.

@baude closed this as completed Aug 2, 2019
@vrothberg
Member

Reopening as it's still valid and the PR at c/image has stalled.

@vrothberg
Member

@baude @rhatdan if that's still a desired feature, we should revive containers/image#611 and prioritize it or get it on our radar again.

@rhatdan
Member

rhatdan commented Oct 2, 2019

Yes, I think this is something we should fix.

@github-actions

github-actions bot commented Nov 5, 2019

This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.

@vrothberg
Member

Still something desirable but no progress. I'll add the label.

@rhatdan
Member

rhatdan commented Feb 17, 2020

Ping to bring this back to people's consciousness.

@vrothberg
Member

Needs a priority :^)

@wking
Contributor Author

wking commented May 18, 2020

cri-o/cri-o#3409 fixed this issue one level up, for folks who are coming at this down that pipe. As mentioned there, there will be continued work on getting a fix down lower in the stack for other containers/image consumers, but fixing at those levels is more complicated.

@vrothberg
Member

We want to tackle this item this year. We broke it into separate pieces and I am positive we'll get it done.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Apr 15, 2021

@vrothberg did we get some of this with your containers/image fixes?

@vrothberg
Member

@vrothberg did we get some of this with your containers/image fixes?

Not entirely: we got the first part addressed, but the blob-copy detection is not yet done.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented May 17, 2021

Still open.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@vrothberg
Member

Still desired but we need to plan for this work since it'll consume some time.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@Nukesor

Nukesor commented May 14, 2024

Hey, thanks for your work on this issue. Can we get an update on the current progress?

We're running into this issue on a self-hosted GitLab Runner instance with a pre-baked development image.
That image is pretty big (~7 GB), and once a new pipeline starts, all 20 jobs try to pull the same image at the same time, completely congesting the link.

Since there's no real way of communicating between different jobs in GitLab Runner, a manual lock is sadly not an option for us.
