panic: failed to get edge #2303
Hello, I am experiencing this same panic. In order to try and identify what the issue is, I have built buildkit from source so that I can add extra logging and experiment.

In the error scenario it is failing to find an entry in the actives map. I have removed the panic and, whilst the jobs fail when this (race condition?) happens, the daemon no longer crashes. Therefore I am currently looking to see if entries in actives are being removed whilst they are still in use.

If anyone has any suggestions of things to try then I'd love to hear them!

Thanks, James
I think I have managed to confirm that the entries in the actives map are being deleted whilst they are still in use.
So the digest that eventually fails is deleted just before it is used. Can anyone suggest the best approach to fixing this? If I can, I'd like to try. Thanks!
@jamesalucas We would need to understand what triggers the removal of these entries.
Is any of the following likely in your case:
We have a lot of concurrent builds which:
I'll check if this line you mentioned appears in the logs.
@maxlaverse I checked, but no, that line does not appear in the logs.
We run a few different sorts of builds, but for the most common one, which also has the highest concurrency, the invocation looks like this:
We are using buildkit via Earthly, but I assume Earthly's invocation is similar. However, I have changed the following:
I've added a bit more logging and I can see that there are 2 scenarios in which I am seeing errors:
Example output for case 1:
Example output for case 2:
@tonistiigi are you able to suggest what area might need to change to fix this? I have had a couple of attempts but am not certain how to go about addressing this, so any hints would be appreciated!
First, can you confirm where this comes from: Line 292 in b2ff444
Then: Line 402 in b2ff444
For the first case it looks weird, as even when the edge is replaced it should not invalidate the digest in the actives array. It should still be able to make queries through the old digest. So I wonder if the discarding in here is correct but there is something else that wrongly requests it later. Maybe because it is an input to another merged edge?

The edge can only be accessed through its vertex digest. A vertex is released by reference-counting jobs: when the last job goes away, it is released. That code looks quite trivial (except for the merged-edges handling).

One case that comes to mind is what would happen if the first job (J1) goes away while E1 is referenced by (V2) but E1 itself is not fully processed yet. Now it could ask for its inputs, but these may already be released because they were loaded with the original job? But none of that is confirmed by your traces, as you seem to be getting the error from the same edge that was part of the merge.

Another thought is whether these merges are really valid at all and maybe there is some loop in here that confuses things. There is loop detection in Lines 173 to 174 in b2ff444
I know this is all very confusing. We have a debug mode in the scheduler: Line 17 in b2ff444
For other extra logging, what might help is if you have logs for every vertex load and deletion, every job start/delete, and what the inputs of each vertex are. That should give an understanding, when the merge happens, of which object is from which job and whether the merge should not have happened or the deletion should not have happened. Also, while logging, print the current reference count for the vertex. On merge, print the reference count of the edges, and on deletion as well.
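Purely as an illustration of the shape of that logging (this is not existing buildkit code; the package, function, and field names are assumptions chosen for this sketch), a helper like the following could be called from the vertex load/delete and job start/discard paths:

```go
// Illustrative only: the kind of extra debug logging suggested above,
// using logrus. All names here are assumptions, not buildkit's actual code.
package debuglog

import "github.com/sirupsen/logrus"

// LogVertexEvent records one solver state change together with the data
// suggested above: the vertex digest, its current job reference count,
// and the digests of its inputs.
func LogVertexEvent(event, vertexDigest string, jobRefCount int, inputDigests []string) {
	logrus.WithFields(logrus.Fields{
		"event":  event, // e.g. "load", "delete", "merge"
		"vertex": vertexDigest,
		"jobs":   jobRefCount,
		"inputs": inputDigests,
	}).Debug("solver state change")
}
```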
If you are wondering what this "merge" logic is at all, there is some description in https://github.com/moby/buildkit/blob/master/docs/solver.md#merging-edges
@tonistiigi thanks very much for the assistance. Here's what I have found so far...
Yes, that's correct; I have a 'No active vertex' log statement on that line, which results in the following output:
I have added a log statement there and it doesn't appear in my logs, so I can confirm it's not going down this path.
I've added as much of the suggested logging as possible and have had error case 2 happen. What the logs appear to suggest is that:
I don't fully understand what should happen, but is it possible that when the mergeTo happens, the job that contained the source digest needs to be added to the state's jobs map so that it doesn't get removed when the original job finishes, and is kept until subsequent jobs finish? I did look at whether I could do that, but I can't see how you would get the ID of the job that created the source of the merge from within mergeTo. Here is what I think is the relevant output from the log:
Perhaps I need to log all jobs in the state next.
Which job is that digest from?
Where are these being logged?
@tonistiigi a4ace11a6c482e05b55fee0e1f5720b42e7f2e2271a0956c039bdf2e17ee910e is from the first job. Apologies, I forgot to attach the whole log; I've attached it here, but it's quite large so I tried to include only the bits I thought were relevant. The
In our CI environment we have 10 runners on the same machine, all using buildkit running in Docker via Earthly. It appears that one call to Earthly will result in a job being created in buildkit, and subsequent calls from other runners will create subsequent jobs in buildkit. I guess that is all to be expected, but edges are merged between earlier and later jobs. If the earlier job finishes before the later one, its edges are removed and then we get this error, as the later job is now referencing an edge that was merged to by the first (and subsequent) jobs?

I have made a change which appears to fix it, but I don't doubt there is a better way... I have added some additional lines to setEdge so that when two edges are merged, the jobs from one are copied onto the other. This then prevents them from being removed until both jobs have finished. I have doubts that jobs from the target need to be copied onto the source, but given that error scenario 1 above seems to suggest it is necessary, I thought I would start out conservative, see if it fixes it, and if so work back to a better solution.
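The actual diff was not captured in this thread; the following is only a small self-contained Go model of the idea just described, with every name invented for illustration: vertex state lives in an "actives" map keyed by digest and is released when the last job referencing it goes away, and copying job references across a merge keeps the target state alive until all dependent jobs finish.

```go
package main

import "fmt"

type job struct{ name string }

type state struct {
	digest string
	jobs   map[*job]struct{}
}

type solver struct {
	actives map[string]*state
}

// copyJobsOnMerge mirrors the change described above: when two edges are
// merged, copy the job references in both directions (conservatively) so
// neither state is released while any dependent job remains.
func (s *solver) copyJobsOnMerge(src, dst *state) {
	for j := range src.jobs {
		dst.jobs[j] = struct{}{}
	}
	for j := range dst.jobs {
		src.jobs[j] = struct{}{}
	}
}

// discard drops a job's reference from every state and releases states with
// no remaining referencing job, modelling the reference counting that
// normally deletes entries from actives.
func (s *solver) discard(j *job) {
	for d, st := range s.actives {
		delete(st.jobs, j)
		if len(st.jobs) == 0 {
			delete(s.actives, d)
		}
	}
}

func main() {
	j1, j2 := &job{name: "J1"}, &job{name: "J2"}
	target := &state{digest: "digest-target", jobs: map[*job]struct{}{j1: {}}}
	source := &state{digest: "digest-source", jobs: map[*job]struct{}{j2: {}}}
	s := &solver{actives: map[string]*state{
		target.digest: target,
		source.digest: source,
	}}

	// An edge of source is merged into target, then the first job finishes.
	s.copyJobsOnMerge(source, target)
	s.discard(j1)

	// Without copyJobsOnMerge the target state would already be gone here,
	// and a later lookup by the second job would fail ("failed to get edge").
	_, ok := s.actives[target.digest]
	fmt.Println("target still reachable after J1 finished:", ok) // true
}
```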
This results in the edges being kept until every job using them has been removed, but I suspect there is a better fix?!
I've been running this since yesterday and although the number has reduced, I'm still seeing the error.
I have had numerous attempts at fixing this but so far every one has failed. I have checked the reference counts and they appear to be maintained correctly, but it still appears that some edges are removed whilst other jobs remain dependent upon them.
@jamesalucas Maybe you now have some insights on how to write a reproducer for this. It is fine also if a code change needs to be made to make it reproduce more often.
Ran into this today on v0.9.2
It doesn't usually happen, but today it happened on two separate occasions, for about 5 of the images each time.
We have 3 retries for each image; about 4 of the images succeeded on one of the retries, while one image failed all 3 retries.
I can't help with a reproducer unfortunately, as our setup isn't a simple use case, but I hope I can help by adding some further examples which trigger the condition, and by describing how we get here. Similar to other reporters, we:
Here is an example stack trace, which was building 5 parallel targets at the time:

A bit more detail on the caching strategy that we use, since this bug seems to revolve around caching usage... and perhaps we're doing registry caching wrong? Our image library is around 8 stages deep and 15-20 wide, culminating in around 60 final run stages/images. Buildkit always crashes if we simply ask it to build "all final stages" (1300+ steps), so we instead use multi-stage build pipelines to carefully and progressively iterate down through the entire "family tree" of stages, one "stage-layer" level at a time, so as not to overwhelm and crash buildkit.
I'm not sure if this is how large-scale registry caching is intended to be used, but it is the approach we have ended up with.
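For reference, the general shape of such a cache configuration can be expressed with buildkit's Go client options. This is only a hypothetical sketch: the registry refs, stage names, and the mode=max setting are placeholders, not the reporter's actual setup.

```go
package example

import "github.com/moby/buildkit/client"

// cacheOpts sketches the pattern described above: each stage build exports a
// cache to its own ref and imports the caches of the stages it may share
// layers with. All refs here are placeholders.
func cacheOpts() client.SolveOpt {
	return client.SolveOpt{
		CacheExports: []client.CacheOptionsEntry{{
			Type:  "registry",
			Attrs: map[string]string{"ref": "registry.example.com/cache/stage-a", "mode": "max"},
		}},
		CacheImports: []client.CacheOptionsEntry{
			{Type: "registry", Attrs: map[string]string{"ref": "registry.example.com/cache/base"}},
			{Type: "registry", Attrs: map[string]string{"ref": "registry.example.com/cache/stage-a"}},
			{Type: "registry", Attrs: map[string]string{"ref": "registry.example.com/cache/stage-b"}},
		},
	}
}
```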
I can reproduce this failure within a few seconds now. I think it is also related to #3714. I was running into this issue when starting to load-test various scenarios. The issue seems to be related to a race when solving Dockerfiles in parallel where the requests share common vertices but differ in part of their input. I am reproducing this via buildkit in a Docker container:
Then I run the repro case with:
With this I am randomly seeing one of these errors from buildkitd:
or
Also probably related, if we remove the
I have no idea where this issue is specifically, and I am not familiar with the solver logic, so hopefully someone can use this to narrow down what is really going on.
Created a slightly simpler reproduction using client.Solve and llb.Def directly:

The issue seems to be related to
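The linked reproduction itself is not captured in this thread; as a rough illustration of the pattern it describes (many concurrent client.Solve calls over LLB definitions that share common vertices but differ slightly), a sketch might look like the following. The buildkitd address, image, and commands are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/moby/buildkit/client"
	"github.com/moby/buildkit/client/llb"
	"golang.org/x/sync/errgroup"
)

func main() {
	ctx := context.Background()

	c, err := client.New(ctx, "tcp://127.0.0.1:1234") // placeholder buildkitd address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	eg, ctx := errgroup.WithContext(ctx)
	for i := 0; i < 20; i++ {
		i := i
		eg.Go(func() error {
			// Shared base vertex plus a per-request step, so the solves
			// overlap on common vertices while still differing.
			st := llb.Image("alpine:3.18").
				Run(llb.Shlex("sh -c 'sleep 1'")).Root().
				Run(llb.Shlex(fmt.Sprintf("echo request-%d", i))).Root()

			def, err := st.Marshal(ctx)
			if err != nil {
				return err
			}
			_, err = c.Solve(ctx, def, client.SolveOpt{}, nil)
			return err
		})
	}
	if err := eg.Wait(); err != nil {
		log.Fatal(err)
	}
}
```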
I'm hitting this when using bake in GHA and building containers that share lots of stages but differ in UID/GID, and it seems to fail around

My workaround has been to split up the build into different steps, which is not ideal because it sort of goes against the whole point of using bake... but a working build beats a broken one. Also, I have not yet had this happen when running bake locally (without splitting anything up).
Still hitting this error, though I've got a few more details (still no repro :/). I've reduced the number of stages/images that I build and the average reliability of I'd like to add that the targets I'm building are all using registry cache with
I'm not sure which buildctl command triggered the panic. We are building a few different images in parallel. The general command format is:

Version:
Config:
Error: