Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: potential issue/race in job shutdown #103687

Closed
miretskiy opened this issue May 19, 2023 · 6 comments · Fixed by #105480
Closed

jobs: potential issue/race in job shutdown #103687

miretskiy opened this issue May 19, 2023 · 6 comments · Fixed by #105480
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-kv KV Team

Comments

@miretskiy
Copy link
Contributor

miretskiy commented May 19, 2023

Not clear if this is jobs issue; might be some other issue, possibly with sql executor.
Pretty hard, but possible (>30 minutes) to reproduce on the master (using gce worker).

./dev test --stress --race --stress-args="-p 20" pkg/ccl/changefeedccl --filter TestAlterChangefeedSetDiffOption --show-logs 2>&1 | tee /tmp/stress

It's a changefeed test; but the failures happen always when the changefeed job finished; and we are clearing out job claim
as part of job cancellation.
stress_master.log

Stress run log attached.

Jira issue: CRDB-28136

@miretskiy miretskiy added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 19, 2023
@tbg
Copy link
Member

tbg commented May 19, 2023

Relevant part:

==================
WARNING: DATA RACE
Write at 0x00c0079d2f88 by goroutine 5602:
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).splitEndTxnAndRetrySend()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:411 +0x764
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).maybeRefreshAndRetrySend()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:337 +0x5f5
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:275 +0x7bd
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:148 +0x1e6
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:190 +0x3e1
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:292 +0x245
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0xd4
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:246 +0x7cc
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send()
      github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:529 +0x9b2
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender()
      github.com/cockroachdb/cockroach/pkg/kv/db.go:1006 +0x190
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send()
      github.com/cockroachdb/cockroach/pkg/kv/txn.go:1091 +0x334
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).commit()
      github.com/cockroachdb/cockroach/pkg/kv/txn.go:707 +0x224
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Commit()
      github.com/cockroachdb/cockroach/pkg/kv/txn.go:722 +0xaf
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).commitSQLTransactionInternal()

racing with the race transport. @pavelkalinnikov could be something for KV Shared Secondary rotation.

@blathers-crl
Copy link

blathers-crl bot commented May 19, 2023

cc @cockroachdb/replication

@tbg tbg added T-kv KV Team and removed T-kv-replication labels May 19, 2023
@tbg tbg added the C-test-failure Broken test (automatically or manually discovered). label May 19, 2023
@nvanbenschoten
Copy link
Member

This appears to be fallout from #103241.

@nvanbenschoten nvanbenschoten self-assigned this May 22, 2023
@miretskiy
Copy link
Contributor Author

miretskiy commented May 22, 2023

It does appear that this race transport copies the slice, but retries on the client (which ought to be allowed) can mutate inner state.

on gce worker:
./dev test --stress --race --stress-args="-p 20" pkg/ccl/changefeedccl --filter TestTableNameCollision/kafka --show-logs 2>&1 | tee /tmp/stress
Reproduces in 10-20 mins.

@andrewbaptist
Copy link
Collaborator

I saw a similar issue here:

I tried running the test manually
./dev test --stress --race pkg/kv/kvserver/liveness -f TestGetActiveNodes
But was getting some issues running on my GCE worker. The error from that log is:

WARNING: DATA RACE
Write at 0x00c00383f958 by goroutine 115272:
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).splitEndTxnAndRetrySend()

craig bot pushed a commit that referenced this issue Jun 26, 2023
105391: colinfo: add version gate for storing pg_lsn types r=rafiss a=otan

We don't want mixed version clusters storing pg_lsn types, in case they need to rollback / older versions do not understand the type.

Informs #105130

Release note: None

105444: systemschema: stop running PostDeserializationChanges when building tables r=Xiang-Gu a=rafiss

### systemschema: add JobInfo to MakeSystemTables helper

This table was missing from the list, and the expected count of tables
was off since we weren't accounting for non-system tenant tables.

The function is only used for testing, so there was no impact.

---

### systemschema: stop running PostDeserializationChanges when building tables

We would like to remove old PostDeserializationChanges that are no
longer needed. In order to do so, we need to stop relying on them to
build system tables.

Instead, now we adjust the hard-coded system table descriptors and the
related helpers so that they create valid descriptors. This needed two
changes:
- Update the ConstraintID for check constraints.
- Update the primary index encoding so that it includes stored columns.

---

Epic: None
Release note: None

105476: kv: fix data race when updating pending txn in txnStatusCache r=arulajmani a=nvanbenschoten

Fixes #105244.

This commit avoids a data race by treating *roachpb.Transaction objects as immutable, and simply choosing the right transaction to keep in the cache when there is a choice to be made.

The behavior of this logic is tested by `TestTxnCacheUpdatesTxn`.

Release note: None

105480: kv: fix data race during retry of EndTxn after refresh r=arulajmani a=nvanbenschoten

Fixes #103687.
Fixes #103247.
Fixes #104791.

This commit avoids a data race between `splitEndTxnAndRetrySend` and `raceTransport` by avoiding a mutation of a shared `RequestUnion_EndTxn` object within an unshared `RequestUnion` object. The `raceTransport` makes an effort to copy the `BatchRequest`'s `RequestUnion` slice, but it does not copy the inner interface, so we can't play tricks to avoid a reallocation of the `RequestUnion_EndTxn`.

The commit also addresses a similar problem in
`retryTxnCommitAfterFailedParallelCommit`.

We may be able to fix this in the `raceTransport`, but doing so would require some reflection magic and this is currently failing CI, so we make the easier change.

Release note: None

105515: pgwire: fix race in TestConn r=knz a=rafiss

fixes #105410

A recent refactor introduced this race, since the context is used by two testing goroutines.

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in e29a2bc Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants