tarogarden: caretaker data race #296

Closed
jharveyb opened this issue Apr 25, 2023 · 6 comments · Fixed by #323
@jharveyb
Contributor

Found while trying to emulate CI tasks locally.

WARNING: DATA RACE
Read at 0x00c00043639c by goroutine 1067:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Cancel()
      /home/jhb/taro/taro/tarogarden/caretaker.go:147 +0x1d7
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).stateStep.func1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:733 +0x5b1

Previous write at 0x00c00043639c by goroutine 966:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).advanceStateUntil()
      /home/jhb/taro/taro/tarogarden/caretaker.go:236 +0x404
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).taroCultivator()
      /home/jhb/taro/taro/tarogarden/caretaker.go:271 +0x3a4
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start.func1.1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:124 +0x39

Goroutine 1067 (running) created at:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).stateStep()
      /home/jhb/taro/taro/tarogarden/caretaker.go:713 +0x320d
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).advanceStateUntil()
      /home/jhb/taro/taro/tarogarden/caretaker.go:223 +0x3a7
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).taroCultivator()
      /home/jhb/taro/taro/tarogarden/caretaker.go:271 +0x3a4
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start.func1.1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:124 +0x39

Goroutine 966 (running) created at:
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start.func1()
      /home/jhb/taro/taro/tarogarden/caretaker.go:124 +0xc4
  sync.(*Once).doSlow()
      /usr/local/go/src/sync/once.go:74 +0x101
  sync.(*Once).Do()
      /usr/local/go/src/sync/once.go:65 +0x46
  github.com/lightninglabs/taro/tarogarden.(*BatchCaretaker).Start()
      /home/jhb/taro/taro/tarogarden/caretaker.go:122 +0x4c
  github.com/lightninglabs/taro/tarogarden.(*ChainPlanter).gardener()
      /home/jhb/taro/taro/tarogarden/planter.go:482 +0x87e
  github.com/lightninglabs/taro/tarogarden.(*ChainPlanter).Start.func1.1()
      /home/jhb/taro/taro/tarogarden/planter.go:284 +0x39
==================
FAIL	github.com/lightninglabs/taro/tarogarden	837.179s

@jharveyb
Contributor Author

I've reproduced this in a separate run; curious as to why this doesn't appear in CI.

@jharveyb
Contributor Author

Parsing the logs:

The 'main' goroutine for the caretaker, 966 here, is advancing the state machine via advanceStateUntil(), which writes to b.cfg.Batch.BatchState before returning.

A secondary goroutine, 1067 here, is started while in the BatchStateBroadcast state to wait for TX confirmation and handle possible cancellation or error signals. stateStep() returns BatchStateBroadcast after starting that secondary goroutine (non-blocking), and that returned state is then written to b.cfg.Batch.BatchState by advanceStateUntil().

The secondary goroutine will call b.Cancel() if a cancellation signal is received, and b.Cancel() reads the current BatchState to use in a switch.

So if a cancellation signal is received before the main goroutine finishes writing BatchStateBroadcast, we would have a data race.
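
To make the interleaving concrete, here's a minimal, self-contained sketch of the same access pattern using simplified stand-in types (not the actual taro code): one goroutine plays the confirmation watcher reading the batch state on a cancel signal, the other plays the state machine writing it, with no synchronization between them. Run with `go run -race`, the detector reports the same kind of read/write pair as the trace above.

```go
package main

import (
	"fmt"
	"time"
)

// BatchState is a stand-in for tarogarden's batch state enum.
type BatchState uint8

const (
	// Illustrative values only, not the real enum.
	BatchStatePending BatchState = iota
	BatchStateBroadcast
)

// MintingBatch holds the state that both goroutines touch without a lock.
type MintingBatch struct {
	BatchState BatchState
}

func main() {
	batch := &MintingBatch{BatchState: BatchStatePending}
	cancelReq := make(chan struct{}, 1)
	cancelReq <- struct{}{} // a cancel request is already pending

	// Stand-in for the confirmation-watcher goroutine spawned by stateStep:
	// on a cancel signal it reads the current state, as Cancel() does.
	go func() {
		<-cancelReq
		fmt.Println("cancel saw state:", batch.BatchState) // racy read
	}()

	// Stand-in for advanceStateUntil finishing the transition: it writes
	// the new state while the watcher goroutine is already running.
	batch.BatchState = BatchStateBroadcast // racy write

	time.Sleep(10 * time.Millisecond)
}
```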

I'm not sure what the best way to mitigate this would be, but wrapping the state in a Mutex feels excessive. One option is making the logic that watches for a transaction confirmation blocking, which is already suggested via a TODO by @Roasbeef.

I think that shouldn't change other planter & caretaker behavior much, since after we reach the terminal state BatchStateBroadcast we wait on the confirmation info or a quit / cancel signal. But it would change what batch state is observed from outside the caretaker, e.g. by the planter.ListBatches call.

https://github.com/lightninglabs/taro-private/blob/681a44daa53aac996ef40c82c22f677f13847574/tarogarden/caretaker.go#L287

@jharveyb
Contributor Author

What I was thinking of for mitigation (a sketch follows the list):

- Add a sync.Mutex to the MintingBatch struct.
- Add two methods to MintingBatch for reading and writing the state, so the lock/unlock logic only lives in one place; reading should return a copy and not a pointer.
- Replace all direct accesses to the batch state with those read & write helpers.
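
A sketch of those helpers, with hypothetical method names (State / UpdateState) and a trimmed-down struct; the real MintingBatch has more fields than shown:

```go
package tarogarden

import "sync"

// BatchState is a stand-in for the real batch state enum.
type BatchState uint8

// MintingBatch is a trimmed-down stand-in for the real struct.
type MintingBatch struct {
	// batchStateMtx guards batchState, which is written by the caretaker's
	// state machine and read by the confirmation-watcher goroutine.
	batchStateMtx sync.Mutex
	batchState    BatchState

	// ... other batch fields elided ...
}

// State returns a copy of the current batch state under the lock.
func (m *MintingBatch) State() BatchState {
	m.batchStateMtx.Lock()
	defer m.batchStateMtx.Unlock()

	return m.batchState
}

// UpdateState assigns a new batch state under the lock.
func (m *MintingBatch) UpdateState(state BatchState) {
	m.batchStateMtx.Lock()
	defer m.batchStateMtx.Unlock()

	m.batchState = state
}
```

Direct reads like b.cfg.Batch.BatchState would then become b.cfg.Batch.State(), and direct assignments would become b.cfg.Batch.UpdateState(...).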

@Roasbeef
Member

Roasbeef commented May 11, 2023

Looking at the latest code, I see CancelReqChan is used to handle the incoming cancel. I think we can maybe just make BatchState an atomic var (https://pkg.go.dev/sync/atomic#Int32)?
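
A sketch of the atomic-variable approach, assuming Go 1.19+ for atomic.Int32; the field name is illustrative, not the final taro API:

```go
package tarogarden

import "sync/atomic"

// MintingBatch is a trimmed-down stand-in for the real struct.
type MintingBatch struct {
	// batchState is read by the confirmation watcher (via Cancel) and
	// written by the caretaker's state machine, so it is stored atomically
	// rather than as a plain field.
	batchState atomic.Int32

	// ... other batch fields elided ...
}
```

advanceStateUntil would then end each transition with a Store, and Cancel would switch on a Load, with no extra locking.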

@jharveyb
Contributor Author

Nice find! That would work well, though atomic.Int32 was added in Go 1.19 and we're still on 1.18 for the project (but we build with 1.20?).

@jharveyb
Contributor Author

One drawback of using the atomic is that now the list of constants can't be the same type as the actual variable, but that seems acceptable.
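
To make that concrete, continuing the hypothetical sketch above: the constants keep the BatchState type while the shared field is an atomic.Int32, so every store and load converts at the boundary. Keeping the casts inside two small helpers confines the conversion to one place.

```go
// BatchState keeps its own named type so the constants stay descriptive...
type BatchState uint8

const (
	// Illustrative subset only; the real enum has more states and may use
	// different values.
	BatchStatePending BatchState = iota
	BatchStateBroadcast
)

// ...while the atomic field traffics in int32, so the casts live in two
// helpers and the rest of the caretaker never sees the raw representation.
func (m *MintingBatch) State() BatchState {
	return BatchState(m.batchState.Load())
}

func (m *MintingBatch) UpdateState(state BatchState) {
	m.batchState.Store(int32(state))
}
```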

@Roasbeef transferred this issue from another repository May 19, 2023