tarogarden: caretaker data race #296

Closed
jharveyb opened this issue Apr 25, 2023 · 6 comments · Fixed by #323
@jharveyb
Contributor

Found while trying to emulate CI tasks locally.

WARNING: DATA RACE
Read at 0x00c00043639c by goroutine 1067:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Cancel()
      /home/jhb/taro/taro/tarogarden/caretaker.go:147 +0x1d7
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).stateStep.func1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:733 +0x5b1

Previous write at 0x00c00043639c by goroutine 966:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).advanceStateUntil()
      /home/jhb/taro/taro/tarogarden/caretaker.go:236 +0x404
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).taroCultivator()
      /home/jhb/taro/taro/tarogarden/caretaker.go:271 +0x3a4
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start.func1.1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:124 +0x39

Goroutine 1067 (running) created at:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).stateStep()
      /home/jhb/taro/taro/tarogarden/caretaker.go:713 +0x320d
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).advanceStateUntil()
      /home/jhb/taro/taro/tarogarden/caretaker.go:223 +0x3a7
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).taroCultivator()
      /home/jhb/taro/taro/tarogarden/caretaker.go:271 +0x3a4
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start.func1.1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:124 +0x39

Goroutine 966 (running) created at:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start.func1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:124 +0xc4
  sync.(*Once).doSlow()
      /usr/local/go/src/sync/once.go:74 +0x101
  sync.(*Once).Do()
      /usr/local/go/src/sync/once.go:65 +0x46
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start()
      /home/jhb/taro/taro/tarogarden/caretaker.go:122 +0x4c
  github.com/lightninglabs/taro/tarogarden.(*ChainPlanter).gardener()
      /home/jhb/taro/taro/tarogarden/planter.go:482 +0x87e
  github.com/lightninglabs/taro/tarogarden.(*ChainPlanter).Start.func1.1()
      /home/jhb/taro/taro/tarogarden/planter.go:284 +0x39
==================
FAIL	github.com/lightninglabs/taro/tarogarden	837.179s

@jharveyb
Contributor Author

I've reproduced this in a separate run; curious as to why this doesn't appear in CI.

@jharveyb
Contributor Author

Parsing the logs:

The 'main' goroutine for the caretaker, 966 here, is advancing the state machine via advanceStateUntil(), which writes to b.cfg.Batch.BatchState before returning.

A secondary goroutine, 1067 here, is started while in the BatchStateBroadcast state to wait for TX confirmation and handle possible cancellation or error signals. stateStep() returns BatchStateBroadcast after starting that secondary goroutine (non-blocking), and that returned state is then written to b.cfg.Batch.BatchState by advanceStateUntil().

The secondary goroutine will call b.Cancel() if a cancellation signal is received, and b.Cancel() reads the current BatchState to use in a switch.

So if a cancellation signal is received before the main goroutine finishes writing BatchStateBroadcast, we would have a data race.
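
To make the interleaving concrete, here's a minimal, self-contained sketch of the same access pattern using simplified stand-in types (not the actual taro code): one goroutine plays the confirmation watcher reading the batch state on a cancel signal, the other plays the state machine writing it, with no synchronization between them. Run with `go run -race`, the detector reports the same kind of read/write pair as the trace above.

```go
package main

import (
	"fmt"
	"time"
)

// BatchState is a stand-in for tarogarden's batch state enum.
type BatchState uint8

const (
	// Illustrative values only, not the real enum.
	BatchStatePending BatchState = iota
	BatchStateBroadcast
)

// MintingBatch holds the state that both goroutines touch without a lock.
type MintingBatch struct {
	BatchState BatchState
}

func main() {
	batch := &MintingBatch{BatchState: BatchStatePending}
	cancelReq := make(chan struct{}, 1)
	cancelReq <- struct{}{} // a cancel request is already pending

	// Stand-in for the confirmation-watcher goroutine spawned by stateStep:
	// on a cancel signal it reads the current state, as Cancel() does.
	go func() {
		<-cancelReq
		fmt.Println("cancel saw state:", batch.BatchState) // racy read
	}()

	// Stand-in for advanceStateUntil finishing the transition: it writes
	// the new state while the watcher goroutine is already running.
	batch.BatchState = BatchStateBroadcast // racy write

	time.Sleep(10 * time.Millisecond)
}
```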

I'm not sure what the best way to mitigate this would be, but wrapping the state in a Mutex feels excessive. One option is making the logic that watches for a transaction confirmation blocking, which is already suggested via a TODO by @Roasbeef.

I think that shouldn't change other planter & caretaker behavior much, since after we reach the terminal state BatchStateBroadcast we wait on the confirmation info or a quit / cancel signal. But it would change what batch state is observed from outside the caretaker, e.g. by the planter.ListBatches call.

https://github.com/lightninglabs/taro-private/blob/681a44daa53aac996ef40c82c22f677f13847574/tarogarden/caretaker.go#L287

@jharveyb
Contributor Author

What I was thinking of for mitigation (a sketch follows the list):

- Add a sync.Mutex to the MintingBatch struct.
- Add two methods to MintingBatch for reading and writing the state, so the lock/unlock logic only lives in one place; reading should return a copy and not a pointer.
- Replace all direct accesses to the batch state with those read & write helpers.
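
A sketch of those helpers, with hypothetical method names (State / UpdateState) and a trimmed-down struct; the real MintingBatch has more fields than shown:

```go
package tarogarden

import "sync"

// BatchState is a stand-in for the real batch state enum.
type BatchState uint8

// MintingBatch is a trimmed-down stand-in for the real struct.
type MintingBatch struct {
	// batchStateMtx guards batchState, which is written by the caretaker's
	// state machine and read by the confirmation-watcher goroutine.
	batchStateMtx sync.Mutex
	batchState    BatchState

	// ... other batch fields elided ...
}

// State returns a copy of the current batch state under the lock.
func (m *MintingBatch) State() BatchState {
	m.batchStateMtx.Lock()
	defer m.batchStateMtx.Unlock()

	return m.batchState
}

// UpdateState assigns a new batch state under the lock.
func (m *MintingBatch) UpdateState(state BatchState) {
	m.batchStateMtx.Lock()
	defer m.batchStateMtx.Unlock()

	m.batchState = state
}
```

Direct reads like b.cfg.Batch.BatchState would then become b.cfg.Batch.State(), and direct assignments would become b.cfg.Batch.UpdateState(...).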

@Roasbeef
Member

Roasbeef commented May 11, 2023

Looking at the latest code, I see CancelReqChan is used to handle the incoming cancel. I think we can maybe just make BatchState an atomic var (https://pkg.go.dev/sync/atomic#Int32)?
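
A sketch of the atomic-variable approach, assuming Go 1.19+ for atomic.Int32; the field name is illustrative, not the final taro API:

```go
package tarogarden

import "sync/atomic"

// MintingBatch is a trimmed-down stand-in for the real struct.
type MintingBatch struct {
	// batchState is read by the confirmation watcher (via Cancel) and
	// written by the caretaker's state machine, so it is stored atomically
	// rather than as a plain field.
	batchState atomic.Int32

	// ... other batch fields elided ...
}
```

advanceStateUntil would then end each transition with a Store, and Cancel would switch on a Load, with no extra locking.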

@jharveyb
Contributor Author

Nice find! That would work well, though atomic.Int32 was added in Go 1.19 and we're still on 1.18 for the project (but we build with 1.20?).

@jharveyb
Contributor Author

One drawback of using the atomic is that now the list of constants can't be the same type as the actual variable, but that seems acceptable.
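
To make that concrete, continuing the hypothetical sketch above: the constants keep the BatchState type while the shared field is an atomic.Int32, so every store and load converts at the boundary. Keeping the casts inside two small helpers confines the conversion to one place.

```go
// BatchState keeps its own named type so the constants stay descriptive...
type BatchState uint8

const (
	// Illustrative subset only; the real enum has more states and may use
	// different values.
	BatchStatePending BatchState = iota
	BatchStateBroadcast
)

// ...while the atomic field traffics in int32, so the casts live in two
// helpers and the rest of the caretaker never sees the raw representation.
func (m *MintingBatch) State() BatchState {
	return BatchState(m.batchState.Load())
}

func (m *MintingBatch) UpdateState(state BatchState) {
	m.batchState.Store(int32(state))
}
```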

@Roasbeef transferred this issue from another repository May 19, 2023