Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: TestHalfOnlineLossOfQuorumRecovery failed [data race in asyncWriteToOtelAndSystemEventsTable] #103698

Closed
cockroach-teamcity opened this issue May 21, 2023 · 5 comments
Assignees
Labels
A-observability-inf branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 21, 2023

Epic: CRDB-28893

cli.TestHalfOnlineLossOfQuorumRecovery failed with artifacts on release-23.1 @ 1443aa2b0a4b906ba2b252da53e04097ed75051a:

      github.com/cockroachdb/cockroach/pkg/server/decommission.go:393 +0x63b
  github.com/cockroachdb/cockroach/pkg/server.maybeRunLossOfQuorumRecoveryCleanup.func3()
      github.com/cockroachdb/cockroach/pkg/server/loss_of_quorum.go:175 +0x2d0
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x1f6

Goroutine 775001 (running) created at:
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:461 +0x619
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:332 +0x67e
  github.com/cockroachdb/cockroach/pkg/server.maybeRunLossOfQuorumRecoveryCleanup()
      github.com/cockroachdb/cockroach/pkg/server/loss_of_quorum.go:155 +0x41
  github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart()
      github.com/cockroachdb/cockroach/pkg/server/server.go:2013 +0x66ec
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:617 +0x8f
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).RestartServerWithInspect.func1()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1654 +0xcd
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).RestartServerWithInspect()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1665 +0x3e4
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).RestartServerWithInspect()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1633 +0x359
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).RestartServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1593 +0x1bc4
  github.com/cockroachdb/cockroach/pkg/cli.TestHalfOnlineLossOfQuorumRecovery()
      github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:555 +0x1bc5
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:444 +0x1144
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:584 +0xd4
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:390 +0x850
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:584 +0xd4
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:390 +0x850
  github.com/cockroachdb/cockroach/pkg/cli.TestHalfOnlineLossOfQuorumRecovery()
      github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:473 +0x853
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:330 +0xe10
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:330 +0xe10
  github.com/cockroachdb/cockroach/pkg/cli.TestHalfOnlineLossOfQuorumRecovery()
      github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:469 +0x837
  testing.tRunner()
      GOROOT/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      GOROOT/src/testing/testing.go:1493 +0x47
==================

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-28142

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team labels May 21, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone May 21, 2023
@erikgrinaker
Copy link
Contributor

Data race in the event log:

WARNING: DATA RACE
Read at 0x00c00f219de8 by goroutine 775017:
  github.com/cockroachdb/cockroach/pkg/util/log/logpb.(*CommonEventDetails).AppendJSONFields()
      github.com/cockroachdb/cockroach/pkg/util/log/logpb/bazel-out/k8-fastbuild/bin/pkg/util/log/logpb/json_encode_generated.go:27 +0x1db
  github.com/cockroachdb/cockroach/pkg/util/log/eventpb.(*NodeDecommissioning).AppendJSONFields()
      github.com/cockroachdb/cockroach/pkg/util/log/eventpb/bazel-out/k8-fastbuild/bin/pkg/util/log/eventpb/json_encode_generated.go:3230 +0x59
  github.com/cockroachdb/cockroach/pkg/sql.prepareEventWrite()
      github.com/cockroachdb/cockroach/pkg/sql/event_log.go:711 +0x67d
  github.com/cockroachdb/cockroach/pkg/sql.asyncWriteToOtelAndSystemEventsTable.func1()
      github.com/cockroachdb/cockroach/pkg/sql/event_log.go:634 +0x1ef
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x1f6

Previous write at 0x00c00f219de8 by goroutine 775001:
  github.com/cockroachdb/cockroach/pkg/sql.insertEventRecords()
      github.com/cockroachdb/cockroach/pkg/sql/event_log.go:533 +0x149
  github.com/cockroachdb/cockroach/pkg/sql.InsertEventRecords()
      github.com/cockroachdb/cockroach/pkg/sql/event_log.go:494 +0xc4
  github.com/cockroachdb/cockroach/pkg/server.(*Server).Decommission()
      github.com/cockroachdb/cockroach/pkg/server/decommission.go:393 +0x63b
  github.com/cockroachdb/cockroach/pkg/server.maybeRunLossOfQuorumRecoveryCleanup.func3()
      github.com/cockroachdb/cockroach/pkg/server/loss_of_quorum.go:175 +0x2d0
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x1f6

@erikgrinaker
Copy link
Contributor

@knz This seems to be a legit race in asyncWriteToOtelAndSystemEventsTable which was introduced in #86174. insertEventRercords mutates the given events here:

event.CommonDetails().EventType = eventType

Concurrently with asyncWriteToOtelAndSystemEventsTable reading the event here:

_, infoBytes = event.AppendJSONFields(false /* printComma */, infoBytes)

We'll probably have to pass owned struct copies rather than pointers to avoid races here.

@erikgrinaker erikgrinaker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 22, 2023
@erikgrinaker erikgrinaker changed the title cli: TestHalfOnlineLossOfQuorumRecovery failed sql: data race in asyncWriteToOtelAndSystemEventsTable May 22, 2023
@erikgrinaker erikgrinaker changed the title sql: data race in asyncWriteToOtelAndSystemEventsTable cli: TestHalfOnlineLossOfQuorumRecovery failed [data race in asyncWriteToOtelAndSystemEventsTable] May 22, 2023
@knz
Copy link
Contributor

knz commented Jul 7, 2023

I have investigated this and sadly it is surprising hard to troubleshoot.
At face value, the race condition does not make sense: asyncWriteToOtelAndSystemEventsTable is only run after the write is guaranteed to have completed. It's the very definition of properly sequenced accesses.

But maybe there's something else. I'm running a stress to see.

@knz
Copy link
Contributor

knz commented Jul 7, 2023

Ok I got it - the same event object is used across multiple calls from server/decommission.go.

I'll send a PR.

craig bot pushed a commit that referenced this issue Jul 7, 2023
106396: server: de-flake a decommission test race condition r=abarganier a=knz

Informs #103698 (will fix when backported to 23.1).
Epic: CRDB-28893

When a Decommission request is sent that addresses multiple nodes simultaneously, a race condition existed in the code that logs the decommission event to the event log.

This is because the `sql.InsertEventRecords` API expects to take ownership over the events. The `Decommission` handler was violating the expectation by passing the same event references to multiple subsequent calls.

This was not visible in practice however, because the racy writes were always overwriting the same value to the same field.

This patch fixes it by allocating a new event for each subsequent node decommission.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz
Copy link
Contributor

knz commented Jul 7, 2023

fixed in #106401

@knz knz closed this as completed Jul 7, 2023
@exalate-issue-sync exalate-issue-sync bot reopened this Jul 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

3 participants