Crash with --experimental_remote_download_outputs=toplevel
#8508
@keith there's no retries for the HTTP protocol yet. I believe this error happens independent of --experimental_remote_download_outputs=toplevel, no?
Regardless of retries, what's the expected behavior in this case?
@keith to fall back to local execution. It's a bug either way.
I can't reproduce this with 0.27.1 ... can you?
Do you have a specific fix in mind that would have addressed this? I'll have to merge this again on our side and test well enough to see if it breaks again.
It looks like I can't test this in our case because we hit this:
So if you think this is fixed on your side, I guess we can close this.
@keith in order to remove this error you need to clear your remote cache. Which remote cache are you using?
This is from an internal cache implementation. But in theory anything could be evicted from other caches during a build as well, right?
@buchgr do you know how Bazel would handle errors from the cache, even if the key itself hasn't been evicted and is just temporarily inaccessible? Specifically, if we returned a 500 in this case, would that be better than returning a 404?
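To make the distinction the question raises concrete, here is a minimal, purely hypothetical sketch of how a remote-cache client could treat the two status codes differently. This is not Bazel's actual implementation; the function name, return values, and retry limit are all invented for illustration.

```python
# Hypothetical sketch: a 404 means the entry is definitively gone (fall
# back to building locally), while a 5xx means transient server trouble
# (retry with backoff, then fall back rather than crash).

def handle_cache_response(status: int, attempt: int, max_retries: int = 5) -> str:
    """Return the action to take for a given HTTP status from the cache."""
    if status == 200:
        return "use-cached-output"
    if status == 404:
        # Permanent cache miss: only re-running the action locally can
        # regenerate the output.
        return "fallback-to-local"
    if 500 <= status < 600:
        # Transient failure: retry a bounded number of times, then give up
        # and build locally.
        return "retry" if attempt < max_retries else "fallback-to-local"
    return "error"

print(handle_cache_response(500, attempt=0))  # retry
print(handle_cache_response(500, attempt=5))  # fallback-to-local
print(handle_cache_response(404, attempt=0))  # fallback-to-local
```

Under this model a 500 is arguably "better" than a 404 for a temporarily inaccessible entry, because it signals that the entry may come back.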
@buchgr I enabled this flag set on Envoy and we've started seeing this crash as well: https://circleci.com/gh/envoyproxy/envoy/249586. We're pointing to the GCS remote cache. Any ideas on how we can mitigate this?
@keith disable garbage collection on GCS. The issue is that GCS does not understand the action graph of what it's caching and evicts entries purely based on time. So it might have a cache entry for an action A1 but have deleted all the outputs of this action. If an action A2 now needs the outputs of A1, then Bazel can't download them and also can't re-run action A1, and thus has to print this error. I'll have to write a blog post about this and discuss potential ways of mitigation. For GCS specifically, the only currently viable strategy is to disable garbage collection and wipe the whole cache from time to time. I imagine it would be easy enough to write a cloud function (or something) that reads in the action graph from GCS and properly evicts items from time to time.
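The eviction strategy Jakob suggests can be sketched as a mark-and-sweep over the cache: collect every output digest still referenced by an action-cache entry, then delete only unreferenced CAS blobs, so an action's entry and its outputs disappear together. The dict-based "buckets" below are stand-ins for real storage; this is not Bazel or GCS code.

```python
# Mark-and-sweep sketch: only evict CAS blobs that no surviving
# action-cache entry references. The data structures are simplified
# stand-ins, not the real remote-cache wire format.

def sweep_cas(action_cache: dict, cas: dict) -> dict:
    """Keep only CAS blobs referenced by some action result."""
    live = set()
    for result in action_cache.values():
        live.update(result["output_digests"])
    return {digest: blob for digest, blob in cas.items() if digest in live}

action_cache = {
    "a1": {"output_digests": {"d1", "d2"}},  # action A1 still cached
}
cas = {"d1": b"out1", "d2": b"out2", "d3": b"orphan"}

cas = sweep_cas(action_cache, cas)
print(sorted(cas))  # ['d1', 'd2'] -- the orphaned blob d3 is evicted
```

Time-based eviction breaks exactly because it can delete `d1` while `a1` survives, producing the cache-hit-without-outputs state described above.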
Pretty sure this is still a problem FWIW |
Thanks Keith, we should probably document what Jakob described. |
Is there any active work on builds without the bytes to make it gracefully handle download errors like the normal build does? I just uprev'd to v3.4.1 and tried experimenting with this feature. It is not a garbage collection issue in my case because we only clean up on a cron once a month, although I suspect it is a similar problem. For example:
I'm collecting all issues for build-without-the-bytes in the tracking issue (#6862). I'm making slow progress getting through the backlog. If you have issues that are easy to repro, I can take a look. |
I am unable to reproduce this issue. |
FYI I still see a similar stack to the one above here:
but I can file a new issue if you think it's unrelated.
Is this with the gRPC remote cache? AFAIK, the HTTP remote cache only supports HTTP 1.1. Also, can you provide the full stack trace? I think the interesting part is below where you cut it off. |
This is with GCP RBE via gRPC. Super long log: https://gist.github.com/keith/53adb5908b9419443d682398d7dcb832 |
This seems to be working as expected from Bazel's point of view. You should bring this up with the RBE team? @EricBurnett @bergsieker
So, I created a remote execution service that artificially causes GOAWAY like the one you are seeing. Bazel retries the error 5 times (for a total of 6 calls) with exponentially increasing delays between calls. The default exponential backoff configuration is still fairly aggressive, with delays of 100ms, 200ms, 400ms, 800ms, and 1600ms. If the remote system is overloaded for a period of time, then it is not entirely surprising if it returns GOAWAY 6 times in ~3.1 seconds.
If this is a common issue, then it looks like a service misconfiguration to me. For example, the service could reduce the number of concurrent streams the client is allowed to open, or it could use flow control to reduce the rate of incoming packets. We could also increase the backoff times in Bazel; they look pretty aggressive to me.
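The retry schedule described in that comment can be computed directly. The constants below mirror the delays quoted above (5 retries, starting at 100 ms, doubling each time); they are taken from the comment, not from Bazel's configuration source.

```python
# Exponential backoff schedule matching the delays quoted above:
# 100, 200, 400, 800, 1600 ms across 5 retries (6 calls total).

def backoff_delays_ms(initial_ms: int = 100, multiplier: int = 2, retries: int = 5):
    """Delay before each retry: initial_ms * multiplier**i."""
    return [initial_ms * multiplier ** i for i in range(retries)]

delays = backoff_delays_ms()
print(delays)              # [100, 200, 400, 800, 1600]
print(sum(delays) / 1000)  # 3.1 -> all 6 calls fit inside ~3.1 seconds
```

This shows why a service that stays overloaded for just a few seconds exhausts all of Bazel's retries.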
@ulfjack I have a reproducible example, although I haven't been able to narrow it down to a small example repro I can share here. In order to hopefully collect more information I built bazel from source and replaced the
I've been able to confirm that |
@ulfjack -- I have created a small repo that reproduces the issue by building an open source project (protobufs): https://github.com/bjacklyn/bazel-remote-download-failure-repro I can file a separate GitHub issue for this. It would be great if you can confirm that you are able to reproduce what I am seeing ^ though. There seem to be a lot of dials and knobs to turn in order to hit the issue, but if my understanding is correct the root cause seems straightforward: the channel is closed after it has been released, which causes the next caller that acquires it to fail.
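The failure mode described there (a connection closed after being returned to a pool, so the next acquirer gets a dead channel) can be sketched with a toy pool that checks liveness on acquire and discards closed channels instead of handing them out. This is purely illustrative and is not Bazel's actual gRPC channel pool.

```python
# Toy model of the described race: a channel is closed *after* release,
# so a naive pool would hand a dead channel to the next caller. This pool
# defends by dropping closed channels on acquire.
from collections import deque

class Channel:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class ChannelPool:
    def __init__(self):
        self._idle = deque()
    def release(self, ch: Channel) -> None:
        self._idle.append(ch)
    def acquire(self) -> Channel:
        while self._idle:
            ch = self._idle.popleft()
            if not ch.closed:      # drop channels closed while idle
                return ch
        return Channel()           # open a fresh channel instead of failing

pool = ChannelPool()
ch = Channel()
pool.release(ch)
ch.close()                # closed after release -- the race in question
fresh = pool.acquire()
print(fresh.closed)       # False: the dead channel was discarded
```

A pool without that liveness check would return `ch` here, and the next RPC on it would fail exactly as described.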
Bazel should run the local action to rebuild the missing entry. I will work on this. |
Hit this on 3.7.2 with
We use greenhouse from kubernetes/test-infra as a cache. It also only evicts based on last-access time, without understanding the action graph. It indeed looks like this only happens when a build is started and, while it is running, the cache evicts entries.
We are still hitting this with 4.0.0, with Google RBE gRPC remote cache:
@coeuvre is there a fix that we can cherry-pick into 4.1?
Sorry to be late. It looks like there are different issues posted on this thread.
@brentleyjones, can you please share more details about the error? (e.g. enable
@keith Do you mind if we close this issue and track it in #8250?
That was with |
Happy to track it wherever you'd prefer!
@brentleyjones Feel free to open a new issue with your problem. Let's track this issue in #8250. Closing. |
Description of the problem / feature request:
When using --experimental_remote_download_outputs=toplevel, and having a cache return a 500, Bazel crashes. I would expect it to recover in this case and fall back to retrying the request or, worst case, doing the build locally. Without this flag it does seem to be resilient to this.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
bazel run TARGET when pointing at an unhealthy HTTP cache and passing --experimental_remote_download_outputs=toplevel
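As a repro aid (not part of the original report), an "unhealthy HTTP cache" can be simulated with Python's standard library: a server that answers 500 to every request. The port and URL below are arbitrary choices for illustration.

```python
# A stand-in "unhealthy" HTTP cache: every GET/PUT/HEAD gets a 500.
# Point Bazel at it with --remote_cache=http://localhost:<port>.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class AlwaysFailingCache(BaseHTTPRequestHandler):
    """Answers HTTP 500 to every request, mimicking an unhealthy cache."""
    def _fail(self):
        self.send_response(500)
        self.end_headers()
    do_GET = do_PUT = do_HEAD = _fail
    def log_message(self, *args):  # keep the repro output quiet
        pass

def start_unhealthy_cache(port: int = 0) -> HTTPServer:
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("localhost", port), AlwaysFailingCache)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = start_unhealthy_cache()
print(f"unhealthy cache on http://localhost:{server.server_address[1]}")
# then: bazel run TARGET --remote_cache=http://localhost:<port> \
#         --experimental_remote_download_outputs=toplevel
server.shutdown()
```

With the flag set, the report says Bazel crashes against such a cache; without it, Bazel tolerates the failures.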
What operating system are you running Bazel on?
macOS
What's the output of bazel info release?
release 0.26.0
cc @buchgr