
Fixing WaitGroup related panics in the Go SDK #7661

Merged

merged 5 commits from t0yv0/7275 into master on Jul 29, 2021

Conversation

t0yv0 (Member) commented Jul 27, 2021

Description

Fixes #7275

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrant bumping the Pulumi Service API version

t0yv0 (Member, Author) commented Jul 27, 2021

Discussion on possible approaches to fix this:

  • The fact that String(x).Apply(foo) is unaware of pulumi.Context and does not participate in the join group is a considerable footgun, but do we need breaking changes to fix it? Say, allow only ctx.String(), with ctx being a pulumi.Context, to construct these? Or somehow infer the "current active pulumi.Context" automagically? The very controversial https://github.com/jtolio/gls comes to mind again. (A sketch of the footgun follows this list.)

  • For the particular case of the Docker resource, we had a chance in RegisterOutputs to make triple-sure that the work of computing the resource outputs is registered with the context's join group. Perhaps we could admit a fix along those lines, which would make sure Wait() waits on the ApplyT that calculates the resource outputs in this case.

  • We could add some locking to the WaitGroup and handle the situation of Wait() and Add(1) racing in WaitGroup state 0 specifically for the case of logging. That is, we'd accept that there is a race, by design, and that it's indeterminate whether the ctx.Log work is in or out of the set of "all work" that has to be awaited, but we could resolve the situation gracefully, for example by ignoring the race and ignoring whether the logs propagated or not.
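To make the first bullet concrete, here is a minimal sketch of the footgun, assuming the v3 Go SDK (the program is illustrative, not from this PR): the ApplyT callback below is not registered with any context's join group, so the program can finish before it runs.

```go
package main

import "github.com/pulumi/pulumi/sdk/v3/go/pulumi"

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Built from a bare String value, this output never sees ctx:
		// nothing registers its async callback with the context's join
		// group, so it may still be pending when the program exits.
		pulumi.String("hello").ToStringOutput().ApplyT(func(s string) string {
			return s + ", world"
		})
		return nil
	})
}
```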

@t0yv0 t0yv0 requested review from pgavlin, justinvp and EvanBoyle July 27, 2021 20:07
t0yv0 (Member, Author) commented Jul 27, 2021

Copying from the ticket, the minimal Go program that helped me understand when/why the Go runtime throws a panic:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Note that calls with a positive delta that occur when the counter is
// zero must happen before a Wait.

func main() {
	wg := &sync.WaitGroup{}
	wg.Add(1)

	go func() {
		fmt.Printf("Adding\n")
		wg.Add(-1) // counter drops to 0; Wait below may begin returning
		wg.Add(1)  // ...and this Add races it, tripping the misuse panic
		fmt.Printf("Added\n")
	}()

	wg.Wait()
	time.Sleep(10) // keep main alive briefly so the racing goroutine runs
}
```

Note that, unfortunately, the panic is attributed to the Wait() call, not the Add call.
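For contrast, a variant of the same program that respects that rule (a sketch): the new Add(1) happens before the deferred Done releases the counter, so it cannot race a Wait that is already returning.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	wg := &sync.WaitGroup{}
	wg.Add(1)

	go func() {
		defer wg.Done()
		// Add for the follow-up work while the counter is still
		// positive; only then let the deferred Done release it.
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("follow-up work")
		}()
	}()

	wg.Wait() // returns only after both goroutines are done
}
```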

pgavlin (Member) commented Jul 27, 2021

Oof. This is pretty bad. Aside from the panic, this provides an easy route for asynchronous work to go untracked.

One thing we could consider to ameliorate the problem of untracked asynchronous work is a fast path in ApplyT for the case where the receiver is already resolved. This fast path would run the callback promptly and, if the result of the callback does not contain any Outputs, create and return an already-resolved output.
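A sketch of that fast path on a toy stand-in type (cell and apply are illustrative names, not the SDK's real Output machinery):

```go
package main

import "fmt"

// cell stands in for an Output: either already resolved, or holding
// a channel that will deliver the value later.
type cell struct {
	resolved bool
	value    int
	future   <-chan int
}

// apply maps f over the cell. Fast path: if the cell is already
// resolved, run f synchronously and return an already-resolved cell,
// so no untracked goroutine is spawned at all.
func apply(c cell, f func(int) int) cell {
	if c.resolved {
		return cell{resolved: true, value: f(c.value)}
	}
	out := make(chan int, 1)
	go func() { out <- f(<-c.future) }() // slow path: async work
	return cell{future: out}
}

func main() {
	fast := apply(cell{resolved: true, value: 2}, func(x int) int { return x * 3 })
	fmt.Println(fast.value) // 6, computed synchronously
}
```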

> We could add some locking to the WaitGroup and handle the situation of Wait() and Add(1) racing in WaitGroup state 0 specifically for the case of logging. That is, we'd accept that there is a race, by design, and that it's indeterminate whether the ctx.Log work is in or out of the set of "all work" that has to be awaited, but we could resolve the situation gracefully, for example by ignoring the race and ignoring whether the logs propagated or not.

This seems like the best way to resolve the panic. I think that if we do not want to refactor, we will need to atomically swap in a new WaitGroup that is never Waited, immediately prior to calling Wait on the original WaitGroup.
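A sketch of the swap idea, using Go 1.19's atomic.Pointer for brevity (this is the shape of the suggestion, not the fix that was merged; a real implementation would also need to close the window between Load and Add):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var cur atomic.Pointer[sync.WaitGroup]
	cur.Store(&sync.WaitGroup{})

	// Trackers Add/Done against the current WaitGroup.
	wg := cur.Load()
	wg.Add(1)
	go func() { defer wg.Done(); fmt.Println("tracked work") }()

	// Shutdown: swap in a fresh WaitGroup that is never Waited, then
	// Wait only on the old one. Late Adds land on the fresh group and
	// can no longer race the Wait below.
	old := cur.Swap(&sync.WaitGroup{})
	old.Wait()
}
```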

t0yv0 (Member, Author) commented Jul 27, 2021

OK, let me try something along these lines.

@t0yv0 t0yv0 changed the title from "WIP on fixing WaitGroup related panics in the Go SDK" to "Fixing WaitGroup related panics in the Go SDK" on Jul 28, 2021
@t0yv0 t0yv0 marked this pull request as ready for review July 28, 2021 21:15
t0yv0 (Member, Author) commented Jul 28, 2021

@pgavlin PTAL.

Catching/recovering from panics on sync.WaitGroup proved unworkable: it throws panics in Add in addition to Wait, and it was not clear to me whether it leaves the WaitGroup in a good state after a panic. My attempts to work that out ended in live-lock or deadlock in the tests.

Because of this, I rolled our own version of sync.WaitGroup that does not have the race detection. In the situation where we previously panicked, the thread doing Add now waits on the lock held by the thread doing Wait, and the thread doing Wait completes successfully. Essentially, the Pulumi program may finish with the extra racing work ignored. (A sketch of the idea follows.)
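The merged code isn't quoted in this thread; the following is a minimal sketch of such a WaitGroup clone, with the workGroup name and all details chosen for illustration. A racing Add simply serializes behind the mutex instead of tripping sync.WaitGroup's misuse detection.

```go
package main

import (
	"fmt"
	"sync"
)

// workGroup behaves like sync.WaitGroup but has no add-vs-wait race
// detection: every operation takes the same mutex.
type workGroup struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
}

func newWorkGroup() *workGroup {
	w := &workGroup{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

func (w *workGroup) Add(delta int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.count += delta
	if w.count < 0 {
		panic("workGroup: negative counter")
	}
	w.cond.Broadcast()
}

func (w *workGroup) Done() { w.Add(-1) }

// Wait blocks until the counter reaches zero. An Add racing the final
// decrement either lands before Wait samples the counter or after Wait
// has already returned; neither case panics.
func (w *workGroup) Wait() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.count > 0 {
		w.cond.Wait()
	}
}

func main() {
	w := newWorkGroup()
	w.Add(1)
	go func() { defer w.Done(); fmt.Println("work") }()
	w.Wait()
}
```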

Unfortunately this does not discriminate between the use for ctx.Log and the use for Output. It seems tricky to do so; it would perhaps involve rebuilding the race detection in our clone of sync.WaitGroup.

I'd tackle the fast path for ApplyT in a separate PR.

pgavlin (Member) left a comment

Okay, this seems reasonable to me. Thanks for the tests here!

@t0yv0 t0yv0 merged commit 5788bef into master Jul 29, 2021
@pulumi-bot pulumi-bot deleted the t0yv0/7275 branch July 29, 2021 16:39
abhinav added a commit that referenced this pull request Jan 13, 2023
`time.Sleep(10)` is perceived as a bug because there's no unit.
If the author was coming from another programming language,
they could have meant milliseconds or seconds (unlikely).

Make the nanosecond unit explicit.

Issue caught by staticcheck:

```
go/pulumi/context_test.go:93:14: SA1004: sleeping for 10 nanoseconds is probably a bug; be explicit if it isn't (staticcheck)
```

Refs #7661
Refs #11808
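
For reference, the explicit form that satisfies SA1004 (a sketch; the commit's actual diff may differ):

```go
package main

import "time"

func main() {
	// Explicit unit: 10 nanoseconds, rather than a bare untyped 10.
	time.Sleep(10 * time.Nanosecond)
}
```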
abhinav added a commit that referenced this pull request Jan 14, 2023
bors bot added a commit that referenced this pull request Jan 14, 2023
11868: sdk/context/test: Explicitly sleep for 10 nanoseconds r=abhinav a=abhinav


---

I'm not sure if this was actually intended to be nanoseconds.
The test still passes with milliseconds and microseconds,
although with milliseconds
it takes a lot longer because of the many iterations.

CC @t0yv0, in case you remember the unit you wanted here.


Co-authored-by: Abhinav Gupta <[email protected]>