Intermittent "failed to get state for index" errors #6111
Comments
I have been seeing this over the past few days too: https://github.com/dagger/dagger/actions/runs/6937495840/job/18871658791?pr=6136 |
Just hit this again while releasing |
I also get "failed to compute cache key" errors in CI on K8 with the runner, but I get a different "source":
|
Whelp! This error is happening a lot in Python tests:
That copy comes from here: dagger/sdk/python/runtime/main.go, lines 202 to 206 at d3a11cf
It seems to be more frequent with the tests around the lock file. See anything wrong here? dagger/core/integration/module_python_test.go, lines 638 to 711 at d3a11cf
Maybe this is a simpler case to debug and figure this out. |
Just FYI I'm attempting to debug this and the other (possibly related) |
More details are in #7128, but I ended up being able to reproduce this very consistently locally and figured out the bug upstream. The fix is in moby/buildkit#4887. I will leave this open until we've picked up the upstream fix, done a release with it, upgraded the CI runners, and confirmed that the errors are gone. |
Yes! FTR, the new CI runners for Engine & CLI are All other CI runners are still bunching around a single Engine, so we can expect this to continue being an issue. 💪 |
The worst of this seems to be resolved by #7295, so I'm going to take this out of the milestone, but it's worth leaving open until the upstream fix is resolved. |
We've been running with this fix in place for a while and haven't seen the error since (it used to happen once or a few times a day). So I'll close this out until proven otherwise. |
These sorts of errors have been happening occasionally (about once a day) in our CI, roughly since we switched everything over to our shared runners:
(run)
The exact message varies, but it seems to always happen on FileOps and include "failed to compute cache key: failed to get state for index 0".
There is an upstream issue for this here: moby/buildkit#3635
If we can repro the error and get useful data, we will move to the upstream issue (or just send a PR with a fix). I'm creating this to track it in case anyone hits it and searches for the error message.
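For anyone triaging CI logs for this, here is a minimal sketch of a matcher for the message above. It only assumes that the substring quoted in this issue is stable. The helper name `isCacheKeyFlake` and the second sample line are hypothetical and are not part of dagger or buildkit.

```go
package main

import (
	"fmt"
	"strings"
)

// cacheKeyFlakeMarker is the substring quoted in this issue. The trailing
// index is left out on purpose, since the exact message is reported to vary.
const cacheKeyFlakeMarker = "failed to compute cache key: failed to get state for index"

// isCacheKeyFlake reports whether a log line or error message contains the
// marker. Hypothetical helper for log triage, not a dagger or buildkit API.
func isCacheKeyFlake(msg string) bool {
	return strings.Contains(msg, cacheKeyFlakeMarker)
}

func main() {
	lines := []string{
		// Message as quoted in this issue.
		"failed to compute cache key: failed to get state for index 0",
		// Made-up unrelated failure, for contrast only.
		"exit code 1: unrelated failure",
	}
	for _, l := range lines {
		fmt.Printf("%t %s\n", isCacheKeyFlake(l), l)
	}
}
```

Matching on the substring rather than the whole line is deliberate, because the surrounding text differs between occurrences.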
The fact that it seems to have started only once we switched to the shared runners is notable. My immediate gut reaction is that it probably has something to do with concurrent solves that overlap in vertexes, since that has proven to be a fairly tricky and error-prone codepath in buildkit, and it is something that would not have happened before switching to the shared runners.