
Jib should avoid parallel image downloads for same image #2007

Open
briandealwis opened this issue Sep 16, 2019 · 16 comments
@briandealwis
Member

Although jib is thread-safe, it should be smarter.

Jib doesn't currently lock the base image cache when downloading a base image, but instead downloads into a temporary directory and then attempts to move the downloaded image into place. There is no locking to block other threads, so a Maven project with N modules that build images from the same base image (like gcr.io/distroless/java) may result in N simultaneous pulls of the same image. Maybe we should provide a component to centralize downloading images?

Originally posted by @briandealwis in #1904 (comment)
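A minimal sketch of what such centralized coordination could look like, assuming a shared lock file inside the base image cache directory (nothing below exists in Jib today; the helper and file name are purely illustrative):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Jib: hold an OS-level lock on a well-known file in the
// cache directory so only one build process downloads a given base image at a time.
// Note: FileLock coordinates across processes; within a single JVM an overlapping lock()
// throws OverlappingFileLockException, so intra-JVM callers would also need a Java-level lock.
class BaseImageDownloadCoordinator {

  static void downloadIfAbsent(Path baseImageCacheDir, Runnable download) throws IOException {
    Path lockFile = baseImageCacheDir.resolve("download.lock"); // illustrative file name
    try (FileChannel channel =
            FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.lock()) { // blocks until any concurrent downloader finishes
      // Re-check the cache here: if another process completed the download while we were
      // waiting on the lock, skip the pull entirely.
      download.run();
    }
  }
}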

@raizoor

raizoor commented Sep 27, 2019

@briandealwis,

In version >= 1.5.0, Jib verifies whether the base image already exists on the target registry and skips the download, doesn't it? If that's correct, a centralized folder for base images is only necessary when you don't target a registry.

@TadCordle
Contributor

TadCordle commented Sep 27, 2019

That is true, but when you don't target a registry, Jib still caches base images locally to skip unnecessary pulls. So I think this issue is really only a problem for clean multi-module builds.

@chanseokoh
Member

And this only matters when you enable Maven parallel builds (-T <thread count>) or run multiple Jib processes simultaneously on the same host, and only when the base image layers have never been cached. It does not affect correctness even in those cases.

@mpeddada1
Contributor

@chanseokoh assuming this can be closed after #2780?

@chanseokoh
Member

@mpeddada1 this issue is a different one. #2780 was only in the context of multi-arch support.

@nhoughto

I've been trying to incorporate Jib into our build process, and one problem I'm encountering is that Jib doesn't seem to detect and de-dupe in-flight build work. If N requests to build the same new image come in at the same time, it does the work N times (because the base/application cache is a miss for all of them).

Obviously the cache directories exist to avoid repeating work that has already been done, but when the work hasn't been done yet and many requests to do the same thing are being processed concurrently, there is potential for detecting and optimising that scenario, I think.

My scenario is a monorepo with N images that share the base and first few layers, with only the last layer differing per image. A request to build all of them at once means Jib builds the shared layers N times rather than once, which takes a lot longer than doing the same thing under Docker, since Docker de-dupes that work.

My plan was to coordinate the calls to Jib so that one build request goes in first and the subsequent requests get cache hits, but that really just solves my own problem, and I thought there might be a desire to solve this in Jib itself, as it's likely something others are hitting. This isn't just related to image downloads, but to the full set of things Jib might do.
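A rough sketch of that kind of coordination, with a hypothetical buildImage() standing in for an actual Jib invocation (Jib itself does not do this; module names are illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: run one build to completion so it populates Jib's base and
// application layer caches, then let the remaining builds run in parallel as cache hits.
class WarmThenParallel {
  public static void main(String[] args) throws InterruptedException {
    List<String> images = Arrays.asList("svc-a", "svc-b", "svc-c", "svc-d");

    buildImage(images.get(0)); // warm-up build: pulls and caches the shared base layers

    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (String image : images.subList(1, images.size())) {
      pool.submit(() -> buildImage(image)); // these should now hit the local layer cache
    }
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.MINUTES);
  }

  // Placeholder for a real Jib Core or plugin invocation.
  static void buildImage(String name) {
    System.out.println("building " + name);
  }
}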

There is a secondary question about how this would work with the Maven/Gradle plugins, which make things a bit harder (they could be in different JVMs and thus might not be able to coordinate intra-JVM), but ignore that for now.

@chanseokoh
Member

chanseokoh commented Dec 14, 2021

@nhoughto thanks for the feedback. As mentioned in this issue, what you said makes sense, and the current behavior is not ideal. It's just that addressing it would be a pretty complex task for multiple reasons and may require a substantial overhaul of the async infrastructure (unless there are some quick hacks that enable a few relatively easy performance enhancements).

More importantly, I gave this some thought a long time ago, and de-duping in-flight build work is a fundamental design change that brings both pros and cons. Notably, in the current implementation all the lines of parallel work are very fine-grained, which enables great parallelism: for example, Jib can start uploading parts of the base image or the application binaries while it is still downloading other parts of the base image or building other application layers. In many cases this is much faster than centralized locking that delays/blocks all threads until Jib verifies that the layers those threads would download or build are eventually going to be identical. So the issue really needs deep insight, although there might be some easy areas for improvement. Unfortunately, we have other priorities, and improving this aspect is not on our roadmap.
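To illustrate the fine-grained parallelism point above (this is not Jib's actual code, just a toy model): each layer-level step is its own asynchronous task, so a push can start the moment its layer is ready, regardless of what other layers are doing. A single coarse "download the base image" lock would force all of these to wait.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model, not Jib's classes: per-layer pull/build and push steps chained as
// independent futures, so work on different layers overlaps freely.
class FineGrainedPipeline {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<CompletableFuture<String>> pipeline = new ArrayList<>();
    for (String layer : new String[] {"base-layer-1", "base-layer-2", "app-classes"}) {
      pipeline.add(
          CompletableFuture.supplyAsync(() -> pull(layer), pool)  // download or build the layer
              .thenApplyAsync(FineGrainedPipeline::push, pool));  // push as soon as it is ready
    }
    pipeline.forEach(CompletableFuture::join); // wait for all layers
    pool.shutdown();
  }

  static String pull(String layer) { return layer + " pulled"; }
  static String push(String blob) { return blob + " and pushed"; }
}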

@nhoughto

nhoughto commented Dec 14, 2021 via email

@nhoughto

Digging into my scenarios for this ticket more, the slowness I was seeing was really due to the DockerDaemonImage behaviour more than anything. The base and app caches are pretty effective at everything other than the first run, but the Docker daemon integration is very wasteful if you are coming from a docker build . world and switching to Jib.

Because DockerDaemonImage essentially ends up just calling docker load -i image.tar, there is zero dedupe / cache checking like there would be with docker build, and the full image is passed to the Docker daemon each time. In my scenario, where I have N containers being built concurrently that share most of their layers, this results in zero layer reuse and N full tars pushed to the daemon, so even small changes incur a large penalty. Not much Jib can do about that, but maybe a note on DockerDaemonImage that it is functional but has some caveats would help.

To work around it I've effectively written a JibContainerBuilder -> Dockerfile serialiser, so I'm back to indirectly using docker build . where required and Jib everywhere else (from one JibContainerBuilder definition).
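For reference, a very rough sketch of such a serialiser, assuming jib-core's ContainerBuildPlan API (toContainerBuildPlan() and the getters used below are available in recent jib-core releases; layer contents / COPY instructions are deliberately omitted):

import com.google.cloud.tools.jib.api.JibContainerBuilder;
import com.google.cloud.tools.jib.api.buildplan.ContainerBuildPlan;
import java.util.Map;

// Sketch only: emit a partial Dockerfile from a JibContainerBuilder's build plan.
class DockerfileSerializer {
  static String toDockerfile(JibContainerBuilder builder) {
    ContainerBuildPlan plan = builder.toContainerBuildPlan();
    StringBuilder dockerfile = new StringBuilder("FROM ").append(plan.getBaseImage()).append('\n');
    for (Map.Entry<String, String> env : plan.getEnvironment().entrySet()) {
      dockerfile.append("ENV ").append(env.getKey()).append('=').append(env.getValue()).append('\n');
    }
    if (plan.getEntrypoint() != null) {
      dockerfile
          .append("ENTRYPOINT [\"")
          .append(String.join("\", \"", plan.getEntrypoint()))
          .append("\"]\n");
    }
    // COPY lines for each layer would need to be generated from the layer sources.
    return dockerfile.toString();
  }
}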

@elefeint
Contributor

Thank you for the investigation and detailed notes!

@chanseokoh
Member

Because DockerDaemonImage essentially ends up just calling docker load -i image.tar, there is zero dedupe / cache checking like there would be with docker build, and the full image is passed to the Docker daemon each time

Yeah, this is very unfortunate. Unlike container registries, the Docker daemon (Docker Engine API) is very limited in that it doesn't provide a way to check, pull, or push individual layers. It's something we've long lamented. On the other hand, pushing to a registry with Jib is super fast thanks to Jib's strong reproducibility, so spinning up a local registry could be an option, which is as easy as:

docker run -d -p 5000:5000 --restart always --name registry registry:2
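For example, with Jib Core the build target would change from DockerDaemonImage to a RegistryImage pointing at that local registry. A hedged sketch assuming a recent jib-core version; the image name and layer path are illustrative:

import com.google.cloud.tools.jib.api.Containerizer;
import com.google.cloud.tools.jib.api.Jib;
import com.google.cloud.tools.jib.api.RegistryImage;
import com.google.cloud.tools.jib.api.buildplan.AbsoluteUnixPath;
import java.nio.file.Paths;
import java.util.Arrays;

// Sketch: push to the throwaway local registry so unchanged layers are deduplicated
// by the registry, instead of loading a full tarball into the Docker daemon.
class LocalRegistryPush {
  public static void main(String[] args) throws Exception {
    Jib.from("gcr.io/distroless/java")
        .addLayer(Arrays.asList(Paths.get("target/app.jar")), AbsoluteUnixPath.get("/app"))
        .containerize(
            Containerizer.to(RegistryImage.named("localhost:5000/my-app"))
                .setAllowInsecureRegistries(true)); // the local registry speaks plain HTTP
  }
}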

@nhoughto

nhoughto commented Jan 7, 2022

More findings on the actual original issue here, parallel downloads of the same base image: this is actually causing me failures rather than just wasting cycles/bandwidth. Specifically, against Docker Hub with a parallelism of 8 (and more than 8 projects sharing the same base image), some of the base image downloads would fail with an UNAUTHORIZED error whilst others would succeed, leading to an overall failure of the build.
Once the base image is cached this obviously isn't a problem, but when it's not, the build fails consistently.

I'm not sure whether this is a Docker Hub API thing, rate limiting or similar, or whether it's a Jib race condition around authentication (is anything shared across threads/instances in Jib?). The logs don't show any errors relating to rate limiting, so it feels a bit like a Jib race condition.

Either way, it seems like there are meaningful problems with letting Jib download the same base image in parallel, and doing some de-duping would solve both problems. As a workaround, at the moment I'm doing some locking before calls to Jib to ensure each base image is only pulled once, but it would be nice if this was solved upstream 👍
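The workaround looks roughly like this, with a hypothetical runJibBuild standing in for the actual Jib call (in-JVM only; a shared lock file would be needed to coordinate across JVMs):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a caller-side workaround, not part of Jib: serialize builds that share a base
// image so the first one warms the layer cache and the rest never race on the pull.
// Note: this serializes entire builds that share a base image, which is acceptable as a workaround.
class PerBaseImageLock {
  private static final ConcurrentMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

  static void buildWithLock(String baseImage, Runnable runJibBuild) {
    ReentrantLock lock = LOCKS.computeIfAbsent(baseImage, key -> new ReentrantLock());
    lock.lock();
    try {
      runJibBuild.run(); // e.g. a Jib Core containerize() call
    } finally {
      lock.unlock();
    }
  }
}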

@meltsufin
Contributor

@nhoughto Would you be interested in making a contribution for this?

@nhoughto

nhoughto commented Jan 8, 2022

Yep, can do. Any tips on a preferred approach?

@chanseokoh
Member

chanseokoh commented Jan 10, 2022

You can see detailed registry interactions by enabling HTTP debug logging. (I think you can still keep -Djib.serialize=true, since I believe what you are doing is running multiple Jib builds concurrently.) That way, you should be able to pinpoint where and when exactly the server returns UNAUTHORIZED and what auth information was given to the server.

BTW, as I alluded to in #2007 (comment), this isn't something we can jump into working on right away.

@chanseokoh
Member

chanseokoh commented Aug 18, 2022

@nhoughto apologies, it's been a while, but your observation was keen and correct. I totally forgot about the following known issue:

// TODO: passing noAuthRegistryClient may be problematic. It may return 401 unauthorized
// if layers have to be downloaded.
// https://github.com/GoogleContainerTools/jib/issues/2220
return new ImagesAndRegistryClient(images, noAuthRegistryClient);

According to another user's analysis, it seems that you may run into the UNAUTHORIZED issue when a base image is not yet cached and there are parallel downloads going on.
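Until that is addressed, one hedged caller-side mitigation (purely illustrative, not a fix for #2220) is to retry a failed containerize() once, on the assumption that a concurrent build has finished caching the base image by then:

import com.google.cloud.tools.jib.api.Containerizer;
import com.google.cloud.tools.jib.api.JibContainerBuilder;
import com.google.cloud.tools.jib.api.RegistryException;
import java.util.function.Supplier;

// Sketch: retry once on a registry error such as the spurious 401; a fresh
// Containerizer is created per attempt via the supplier.
class RetryOnce {
  static void containerizeWithRetry(
      JibContainerBuilder builder, Supplier<Containerizer> containerizers) throws Exception {
    try {
      builder.containerize(containerizers.get());
    } catch (RegistryException e) {
      // A concurrent build may have populated the base image cache by now; try again.
      builder.containerize(containerizers.get());
    }
  }
}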
