feature: Per-image pull locking for bandwidth efficiency #1911
Comments
@mtrmac PTAL
Well, this is where #NOBIGFATDAEMONS makes things more difficult :)
I was imagining flocks on
@vrothberg Do we have this now?
Anything to improve performance, I am always interested in.
I can start looking into this in the next sprint 👍
@vrothberg Could you update this issue with your progress?
Sure. The blob-locking mechanism in containers/storage is done. Each blob, when being copied into containers-storage, will receive a dedicated lock file in the storage driver's tmp directory. That's the central mechanism for serialization and synchronization. The progress-bar library received some backported features we needed to update the bars on the fly, and we're already making use of them. Currently, we are working on rewriting the backend code for containers-storage a bit in order to write the layer to storage. Note that I lost at least one working week cleaning up breaking builds and tests (unrelated to those PRs) when trying to test the PRs in buildah, libpod, and cri-o (and had to do this multiple times).
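For illustration, here is a minimal sketch of what such per-blob lock files could look like, written against plain flock(2) via golang.org/x/sys/unix rather than the actual containers/storage locking API; the directory path, file-naming scheme, and helper names are assumptions for this sketch, not the real implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// lockBlob creates (or opens) a dedicated lock file for one blob digest under
// the storage driver's tmp directory and takes an exclusive lock; a second
// process calling this for the same digest blocks until the first releases it.
func lockBlob(tmpDir, digest string) (*os.File, error) {
	path := filepath.Join(tmpDir, digest+".lock") // hypothetical naming scheme
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func unlockBlob(f *os.File) {
	unix.Flock(int(f.Fd()), unix.LOCK_UN) // also released automatically when the process exits
	f.Close()
}

func main() {
	f, err := lockBlob(os.TempDir(), "sha256-0123abcd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "lock:", err)
		os.Exit(1)
	}
	defer unlockBlob(f)
	// ... copy the blob into containers-storage while holding the lock ...
	fmt.Println("blob copied while holding its lock file")
}
```

Because flock-style locks die with the process, a crashed pull cannot wedge later pulls, which is one reason lock files in a tmp directory are attractive here.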
What is the latest on this, @vrothberg?
Plenty of tears and sweat over here: containers/image#611. It's working and CI is green, but it turned into something really big, so I expect reviewing to still take a while.
Going to close; reopen if needed.
Reopening as it's still valid and the PR at c/image has stalled.
@baude @rhatdan if that's still a desired feature, we should revive containers/image#611 and prioritize it or get it on our radar again.
Yes, I think this is something we should fix.
This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.
Still something desirable but no progress. I'll add the label.
Ping to bring this back to people's consciousness.
Needs a priority :^)
cri-o/cri-o#3409 fixed this issue one level up, for folks who are coming at this down that pipe. As mentioned there, there will be continued work on getting a fix down lower in the stack for other containers/image consumers, but fixing at those levels is more complicated.
We want to tackle this item this year. We broke it into separate pieces and I am positive we'll get it done.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
@vrothberg did we get some of this with your containers/image fixes?
No, we got the first part addressed. The blob-copy detection is not yet done.
A friendly reminder that this issue had no activity for 30 days.
Still open.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
Still desired but we need to plan for this work since it'll consume some time.
A friendly reminder that this issue had no activity for 30 days.
Hey, thanks for your work on this issue. Can we get an update on the current progress? We're running into this issue on a self-hosted GitLab Runner instance with a pre-baked development image. Since there's no real way of communicating between different jobs in GitLab Runner, a manual lock is sadly not a solution for us.
/kind feature
Description
If there is an existing libpod pull in flight for a remote image, new pulls of that image should block until the in-flight pull completes (it may error out) to avoid shipping the same bits over the network twice.
Steps to reproduce the issue:
In one terminal:
In another terminal, launched once blob sha256:9bfce... is maybe 10MB into its pull:
Describe the results you received:
As you can see from the console output, both commands seem to have pulled both layers in parallel.
Describe the results you expected:
I'd rather have seen the second command print a message about blocking on an existing pull, idle while that pull went through, and then run the command using the blobs pushed into local storage by that first pull.
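As a minimal sketch of that expected flow from the second puller's side, assuming a per-blob flock as in the maintainer comment above: block on the lock, then re-check local storage, since the first pull may have succeeded or errored out. `blobExistsLocally` and `pullBlob` are hypothetical helpers, not podman or containers/image APIs.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// Hypothetical helpers standing in for local-storage lookup and the network pull.
func blobExistsLocally(digest string) bool { return false }
func pullBlob(digest string) error         { fmt.Println("downloading", digest); return nil }

// pullWithLock blocks on the per-blob lock while another process pulls, then
// re-checks local storage: the first pull may have succeeded or errored out.
func pullWithLock(lockPath, digest string) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	fmt.Println("waiting for in-flight pull of", digest)
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil { // blocks here
		return err
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

	if blobExistsLocally(digest) {
		fmt.Println("reusing blob pulled by the other process")
		return nil
	}
	return pullBlob(digest) // the other pull failed or never happened; pull it ourselves
}

func main() {
	if err := pullWithLock("/tmp/sha256-0123abcd.lock", "sha256:0123abcd"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```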
Additional information you deem important (e.g. issue happens only occasionally):
My end goal is to front-load image pulls for a script that uses several images. Something like:
That way, `image2` and `image3` can be trickling in over a slow network while I'm spending CPU time running the `image1` container, etc.
I don't really care about locking parallel manifest pulls, etc., because those are small; this just has to be for layers (possibly only for layers over a given size threshold). Of course, I'm fine with manifest/config locking if it's easier to just drop the same locking logic onto all CAS blobs.
It doesn't have to be coordinated across multiple layers either. If process 1 ends up pulling layer 1, and then process 2 comes along, sees the lock on layer 1, and decides to pull layer 2 while it's waiting for the lock to lift on layer 1, that's fine. Process 1 might find layer 2 locked when it gets around to it, and they may end up leapfrogging through the layer stack. That means individual layers might come down a bit more slowly, which would have a negative impact on time-to-launch if you were limited by unpack time. But I imagine most cases will be limited by network bandwidth, so unpacking-time delays wouldn't be a big deal.
Locking would allow for a denial of service attack by a user on the same machine with access to the lock you'd use, because they could acquire the lock for a particular layer and then idle without actually working to pull that layer down. I'm not concerned about that in my own usage, but we might want the locks to be soft, and have the caller be able to shove in and start pulling in parallel anyway if they get tired of waiting (you could scale the wait time by blob size, after guessing at some reasonable bandwidth value?).
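Here is a sketch of what such a soft lock could look like, assuming flock plus a deadline scaled by blob size and a guessed bandwidth; none of the names or numbers below come from podman.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// A guessed transfer rate used only to scale the wait deadline.
const assumedBytesPerSecond = 1 << 20 // 1 MiB/s

// waitForBlob polls a non-blocking flock until it succeeds or a deadline
// scaled by the blob size passes. A false result means the current holder is
// slow or idle, and the caller may choose to pull in parallel anyway.
func waitForBlob(lockPath string, blobSize int64) (bool, error) {
	deadline := time.Now().Add(time.Duration(blobSize/assumedBytesPerSecond+5) * time.Second)
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return false, err
	}
	defer f.Close()
	for time.Now().Before(deadline) {
		if unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB) == nil {
			unix.Flock(int(f.Fd()), unix.LOCK_UN)
			return true, nil // the holder finished (or died); safe to check storage or pull
		}
		time.Sleep(500 * time.Millisecond)
	}
	return false, nil
}

func main() {
	got, err := waitForBlob(os.TempDir()+"/sha256-0123abcd.lock", 50<<20)
	fmt.Println("holder finished before deadline:", got, "err:", err)
}
```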
And I realize that this is probably mostly an issue for one of libpod's dependencies, but I haven't spent the time to track down this one, and there didn't seem to be an existing placeholder issue in this repo. Please link me to the upstream issue if this already has a placeholder there.
Output of `podman version`:

Output of `podman info`: