
Fixing WaitGroup related panics in the Go SDK #7661

Merged

merged 5 commits from t0yv0/7275 into master on Jul 29, 2021

Conversation

t0yv0 (Member) commented Jul 27, 2021

Description

Fixes #7275

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrant bumping the Pulumi Service API version

t0yv0 (Member, Author) commented Jul 27, 2021

Discussion on possible approaches to fix this:

  • The fact that String(x).Apply(foo) is unaware of pulumi.Context and does not participate in the join group is a considerable footgun, but do we need breaking changes to fix it? Say, allow only ctx.String(), with ctx being a pulumi.Context, to construct these? Or somehow infer the "current active pulumi.Context" automagically? The very controversial https://github.com/jtolio/gls comes to mind again. (A sketch of the footgun follows this list.)

  • For the particular case of the Docker resource, we had a chance in RegisterOutputs to make triple-sure that the work of computing the resource outputs is registered with the context's join group. Perhaps we could admit a fix along those lines, which would make sure Wait() waits on the ApplyT that calculates the resource outputs in this case.

  • We could add some locking to the WaitGroup and handle the situation of Wait() and Add(1) racing in WaitGroup state 0 specifically for the case of logging. That is, we'd accept that there is a race, by design, and that it's indeterminate whether the ctx.Log work is in or out of the set of "all work" that has to be awaited, but we could resolve the situation gracefully, for example by ignoring the race and ignoring whether the logs propagated or not.
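To make the first bullet concrete, here is a minimal sketch of the footgun, assuming the v3 Go SDK (the program is illustrative, not from this PR): the ApplyT callback below is not registered with any context's join group, so the program can finish before it runs.

```go
package main

import "github.com/pulumi/pulumi/sdk/v3/go/pulumi"

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Built from a bare String value, this output never sees ctx:
		// nothing registers its async callback with the context's join
		// group, so it may still be pending when the program exits.
		pulumi.String("hello").ToStringOutput().ApplyT(func(s string) string {
			return s + ", world"
		})
		return nil
	})
}
```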

@t0yv0 t0yv0 requested review from pgavlin, justinvp and EvanBoyle July 27, 2021 20:07
t0yv0 (Member, Author) commented Jul 27, 2021

Copying from the ticket, the minimal Go program that helped me understand when/why the Go runtime throws a panic:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Note that calls with a positive delta that occur when the counter is
// zero must happen before a Wait.

func main() {
	wg := &sync.WaitGroup{}
	wg.Add(1)

	go func() {
		fmt.Printf("Adding\n")
		wg.Add(-1) // counter drops to 0; Wait below may begin returning
		wg.Add(1)  // ...and this Add races it, tripping the misuse panic
		fmt.Printf("Added\n")
	}()

	wg.Wait()
	time.Sleep(10) // keep main alive briefly so the racing goroutine runs
}
```

Note that, unfortunately, the panic is attributed to the Wait() call, not the Add call.
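For contrast, a variant of the same program that respects that rule (a sketch): the new Add(1) happens before the deferred Done releases the counter, so it cannot race a Wait that is already returning.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	wg := &sync.WaitGroup{}
	wg.Add(1)

	go func() {
		defer wg.Done()
		// Add for the follow-up work while the counter is still
		// positive; only then let the deferred Done release it.
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("follow-up work")
		}()
	}()

	wg.Wait() // returns only after both goroutines are done
}
```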

pgavlin (Member) commented Jul 27, 2021

Oof. This is pretty bad. Aside from the panic, this provides an easy route for asynchronous work to go untracked.

One thing we could consider to ameliorate the problem of untracked asynchronous work is a fast path in ApplyT for the case where the receiver is already resolved. This fast path would run the callback promptly and, if the result of the callback does not contain any Outputs, create and return an already-resolved output.
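A sketch of that fast path on a toy stand-in type (cell and apply are illustrative names, not the SDK's real Output machinery):

```go
package main

import "fmt"

// cell stands in for an Output: either already resolved, or holding
// a channel that will deliver the value later.
type cell struct {
	resolved bool
	value    int
	future   <-chan int
}

// apply maps f over the cell. Fast path: if the cell is already
// resolved, run f synchronously and return an already-resolved cell,
// so no untracked goroutine is spawned at all.
func apply(c cell, f func(int) int) cell {
	if c.resolved {
		return cell{resolved: true, value: f(c.value)}
	}
	out := make(chan int, 1)
	go func() { out <- f(<-c.future) }() // slow path: async work
	return cell{future: out}
}

func main() {
	fast := apply(cell{resolved: true, value: 2}, func(x int) int { return x * 3 })
	fmt.Println(fast.value) // 6, computed synchronously
}
```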

> We could add some locking to the WaitGroup and handle the situation of Wait() and Add(1) racing in WaitGroup state 0 specifically for the case of logging. That is, we'd accept that there is a race, by design, and that it's indeterminate whether the ctx.Log work is in or out of the set of "all work" that has to be awaited, but we could resolve the situation gracefully, for example by ignoring the race and ignoring whether the logs propagated or not.

This seems like the best way to resolve the panic. I think that if we do not want to refactor, we will need to atomically swap in a new WaitGroup that is never Waited, immediately prior to calling Wait on the original WaitGroup.
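A sketch of the swap idea, using Go 1.19's atomic.Pointer for brevity (this is the shape of the suggestion, not the fix that was merged; a real implementation would also need to close the window between Load and Add):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var cur atomic.Pointer[sync.WaitGroup]
	cur.Store(&sync.WaitGroup{})

	// Trackers Add/Done against the current WaitGroup.
	wg := cur.Load()
	wg.Add(1)
	go func() { defer wg.Done(); fmt.Println("tracked work") }()

	// Shutdown: swap in a fresh WaitGroup that is never Waited, then
	// Wait only on the old one. Late Adds land on the fresh group and
	// can no longer race the Wait below.
	old := cur.Swap(&sync.WaitGroup{})
	old.Wait()
}
```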

t0yv0 (Member, Author) commented Jul 27, 2021

OK, let me try something along these lines.

@t0yv0 t0yv0 changed the title from "WIP on fixing WaitGroup related panics in the Go SDK" to "Fixing WaitGroup related panics in the Go SDK" on Jul 28, 2021
@t0yv0 t0yv0 marked this pull request as ready for review July 28, 2021 21:15
t0yv0 (Member, Author) commented Jul 28, 2021

@pgavlin PTAL.

Catching/recovering from panics on sync.WaitGroup proved unworkable: it throws panics in Add in addition to Wait, and it was not clear to me whether it leaves the WaitGroup in a good state after a panic. My attempts to work that out ended in live-lock or deadlock in the tests.

Because of this, I rolled our own version of sync.WaitGroup that does not have the race detection. In the situation where we previously panicked, the thread doing Add now waits on the lock held by the thread doing Wait, and the thread doing Wait completes successfully. Essentially, the Pulumi program may finish with the extra racing work ignored. (A sketch of the idea follows.)
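The merged code isn't quoted in this thread; the following is a minimal sketch of such a WaitGroup clone, with the workGroup name and all details chosen for illustration. A racing Add simply serializes behind the mutex instead of tripping sync.WaitGroup's misuse detection.

```go
package main

import (
	"fmt"
	"sync"
)

// workGroup behaves like sync.WaitGroup but has no add-vs-wait race
// detection: every operation takes the same mutex.
type workGroup struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
}

func newWorkGroup() *workGroup {
	w := &workGroup{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

func (w *workGroup) Add(delta int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.count += delta
	if w.count < 0 {
		panic("workGroup: negative counter")
	}
	w.cond.Broadcast()
}

func (w *workGroup) Done() { w.Add(-1) }

// Wait blocks until the counter reaches zero. An Add racing the final
// decrement either lands before Wait samples the counter or after Wait
// has already returned; neither case panics.
func (w *workGroup) Wait() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.count > 0 {
		w.cond.Wait()
	}
}

func main() {
	w := newWorkGroup()
	w.Add(1)
	go func() { defer w.Done(); fmt.Println("work") }()
	w.Wait()
}
```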

Unfortunately this does not discriminate between the use for ctx.Log and the use for Output. It seems tricky to do so; it would perhaps involve rebuilding the race detection in our clone of sync.WaitGroup.

I'd tackle the fast path for ApplyT in a separate PR.

pgavlin (Member) left a comment

Okay, this seems reasonable to me. Thanks for the tests here!

@t0yv0 t0yv0 merged commit 5788bef into master Jul 29, 2021
@pulumi-bot pulumi-bot deleted the t0yv0/7275 branch July 29, 2021 16:39
abhinav added a commit that referenced this pull request Jan 13, 2023
`time.Sleep(10)` is perceived as a bug because there's no unit.
If the author was coming from another programming language,
they could have meant milliseconds or seconds (unlikely).

Make the nanosecond unit explicit.

Issue caught by staticcheck:

```
go/pulumi/context_test.go:93:14: SA1004: sleeping for 10 nanoseconds is probably a bug; be explicit if it isn't (staticcheck)
```

Refs #7661
Refs #11808
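
For reference, the explicit form that satisfies SA1004 (a sketch; the commit's actual diff may differ):

```go
package main

import "time"

func main() {
	// Explicit unit: 10 nanoseconds, rather than a bare untyped 10.
	time.Sleep(10 * time.Nanosecond)
}
```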
abhinav added a commit that referenced this pull request Jan 14, 2023
bors bot added a commit that referenced this pull request Jan 14, 2023
11868: sdk/context/test: Explicitly sleep for 10 nanoseconds r=abhinav a=abhinav


---

I'm not sure if this was actually intended to be nanoseconds.
The test still passes with milliseconds and microseconds,
although with milliseconds
it takes a lot longer because of the many iterations.

CC @t0yv0, in case you remember the unit you wanted here.


Co-authored-by: Abhinav Gupta <[email protected]>