controller: avoid "context canceled" error on cleanup #1746

Merged (1 commit) on May 10, 2023

ktock
Collaborator

@ktock ktock commented Apr 18, 2023

Fixes #1640 (comment)

The second .Build request we create always seems to exit with context cancelled - this means the buildkit logs are full of time="2023-04-18T12:02:37Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled". This now happens on every build, since we always spin up a second Build right away in getResultAt, instead of waiting until invoke is called.

Reproducer: buildx build . --builder=dev --detach=false

This is because buildx exits before Build() (invoked in a goroutine) completes and returns.
This commit fixes the code to ensure completion of Build() on cleanup.
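
The race described above can be illustrated with a minimal sketch. It is not the buildx code: `fakeBuild` and `runAndWait` are hypothetical names, and the sleep only stands in for whatever finalization `Build()` performs. The point is that the caller blocks on a channel that the goroutine closes when the call returns, so the process cannot exit first.

```go
package main

import (
	"fmt"
	"time"
)

// fakeBuild is a hypothetical stand-in for a long-running call such as Build().
// It sets *finished just before returning.
func fakeBuild(finished *bool) {
	time.Sleep(50 * time.Millisecond) // simulated finalization work
	*finished = true
}

// runAndWait starts fakeBuild in a goroutine and blocks until it has returned.
// The deferred close(done) runs after fakeBuild returns, so receiving from
// done guarantees the write to finished is visible to the caller.
func runAndWait() bool {
	finished := false
	done := make(chan struct{})
	go func() {
		defer close(done)
		fakeBuild(&finished)
	}()
	<-done // without this receive, the caller could return before fakeBuild does
	return finished
}

func main() {
	fmt.Println("build finished before exit:", runAndWait())
}
```

Removing the `<-done` receive reproduces the shape of the bug: `runAndWait` would return while `fakeBuild` is still running.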

cc @jedevc

@ktock ktock marked this pull request as draft April 18, 2023 13:20
@ktock ktock marked this pull request as ready for review April 18, 2023 13:31
@jedevc
Collaborator

jedevc commented Apr 18, 2023

I'm not sure why we get different results between the local and remote controller? I'm guessing one gets cancelled, and the other doesn't?

Did you manage to work this out?

// wait for Build() completion(or timeout) to ensure the Build's finalizing and avoiding an error "context canceled"
select {
case <-buildDoneCh:
case <-time.After(5 * time.Second):
Collaborator

I don't think we should have a delay here, I think this should block?

Actually, instead of introducing another channel, could we keep the custom context we had before but use context.WithCancelCause and then use a custom error type to cancel it in the implementation of gwDone, then we can just detect that error when loading from ctx.Err() and return a nil error in that case? I think that might be a neater implementation that doesn't require we go and add another channel.

With the current refactor gwCtx isn't cancelled after gwDone is called, so we need to make sure it is.

Collaborator Author

> I'm not sure why we get different results between the local and remote controller? I'm guessing one gets cancelled, and the other doesn't?
>
> Did you manage to work this out?

With this patch, the behaviour of both controllers seems to be the same. With --sbom=false, they clean up successfully without the "context canceled" error. With --sbom=true, they cause a panic (#1640 (comment)) on buildkitd.

> I don't think we should have a delay here, I think this should block?

BuildKit's client.Build() seems to block indefinitely when buildkitd panics, so a timeout is needed if we want to avoid buildx blocking indefinitely in that case.

> context.WithCancelCause

Fixed to use context.WithCancelCause. We still need a channel to ensure gwDone returns only after Build() returns. Otherwise the buildx process (running the local controller) can exit after gwDone completes but before Build() completes, which results in BuildKit reporting a cancellation error.

Collaborator

Yup, this seems better to me, thanks 🎉

@jedevc
Collaborator

jedevc commented Apr 19, 2023

That seems a lot better from the buildkit side, thanks!

I do get some weird errors on the buildx side when using the local controller though - I switch away from the process created by on-error, and then press Ctrl-d to exit the monitor:

------
 > [shell 10/10] RUN sleep 1 && date +%s && exit 1:
#0 1.042 1681899483
------
Launching interactive container. Press Ctrl-a-c to switch to monitor console
Interactive container was restarted with process "4i4lpkrn9rigkdqcfjg7yd4g4". Press Ctrl-a-c to switch to the new container
/ # ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr    work
/ # Switched IO
(buildx) ERROR: failed to exec process: context canceled
WARNING: failed to kill process: process lsfdpidkv1d00hiwmcerycf8e:wvuwhyxj39foeiu715889j6j7 has ended, not sending message &moby_buildkit_v1_frontend.ExecMessage_Signal{Signal:(*moby_buildkit_v1_frontend.SignalMessage)(0xc0009f4bd0)}

or, sometimes:

------
 > [shell 10/10] RUN sleep 1 && date +%s && exit 1:
#0 1.038 1681899507
------
WARNING: No output specified with remote driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
Launching interactive container. Press Ctrl-a-c to switch to monitor console
Interactive container was restarted with process "pr38ukrbgbcwamf9wy3ln3af9". Press Ctrl-a-c to switch to the new container
/ # ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr    work
/ # Switched IO
(buildx) ERROR: failed to exec process: context canceled

@jedevc
Collaborator

jedevc commented Apr 19, 2023

Heads-up, it looks like these issues are resolved in #1750, by simplifying to only use a single client.Build call, so we only ever have one gateway per build.

@jedevc
Collaborator

jedevc commented May 10, 2023

Rebased onto master to resolve conflicts.

@jedevc jedevc added this to the v0.11.0 milestone May 10, 2023
@jedevc jedevc merged commit 2eeef18 into docker:master May 10, 2023
@ktock ktock deleted the resultcleanup branch May 15, 2023 12:35