jobs: potential issue/race in job shutdown #103687
Relevant part:
racing with the race transport. @pavelkalinnikov, this could be something for the KV Shared Secondary rotation.
cc @cockroachdb/replication
This appears to be fallout from #103241.
It does appear that the race transport copies the slice, but retries on the client (which ought to be allowed) can mutate inner state.
On a GCE worker:
I saw a similar issue here:
I tried running the test manually.
105391: colinfo: add version gate for storing pg_lsn types r=rafiss a=otan

We don't want mixed version clusters storing pg_lsn types, in case they need to rollback / older versions do not understand the type.

Informs #105130

Release note: None

105444: systemschema: stop running PostDeserializationChanges when building tables r=Xiang-Gu a=rafiss

### systemschema: add JobInfo to MakeSystemTables helper

This table was missing from the list, and the expected count of tables was off since we weren't accounting for non-system tenant tables. The function is only used for testing, so there was no impact.

---

### systemschema: stop running PostDeserializationChanges when building tables

We would like to remove old PostDeserializationChanges that are no longer needed. In order to do so, we need to stop relying on them to build system tables. Instead, now we adjust the hard-coded system table descriptors and the related helpers so that they create valid descriptors. This needed two changes:
- Update the ConstraintID for check constraints.
- Update the primary index encoding so that it includes stored columns.

---

Epic: None
Release note: None

105476: kv: fix data race when updating pending txn in txnStatusCache r=arulajmani a=nvanbenschoten

Fixes #105244.

This commit avoids a data race by treating *roachpb.Transaction objects as immutable, and simply choosing the right transaction to keep in the cache when there is a choice to be made. The behavior of this logic is tested by `TestTxnCacheUpdatesTxn`.

Release note: None

105480: kv: fix data race during retry of EndTxn after refresh r=arulajmani a=nvanbenschoten

Fixes #103687.
Fixes #103247.
Fixes #104791.

This commit avoids a data race between `splitEndTxnAndRetrySend` and `raceTransport` by avoiding a mutation of a shared `RequestUnion_EndTxn` object within an unshared `RequestUnion` object.

The `raceTransport` makes an effort to copy the `BatchRequest`'s `RequestUnion` slice, but it does not copy the inner interface, so we can't play tricks to avoid a reallocation of the `RequestUnion_EndTxn`.

The commit also addresses a similar problem in `retryTxnCommitAfterFailedParallelCommit`.

We may be able to fix this in the `raceTransport`, but doing so would require some reflection magic and this is currently failing CI, so we make the easier change.

Release note: None

105515: pgwire: fix race in TestConn r=knz a=rafiss

fixes #105410

A recent refactor introduced this race, since the context is used by two testing goroutines.

Release note: None

Co-authored-by: Oliver Tan
Co-authored-by: Rafi Shamim
Co-authored-by: Nathan VanBenschoten
It's not clear whether this is a jobs issue; it might be some other issue, possibly with the SQL executor.
It is hard, but possible, to reproduce on master (it takes more than 30 minutes on a GCE worker):
./dev test --stress --race --stress-args="-p 20" pkg/ccl/changefeedccl --filter TestAlterChangefeedSetDiffOption --show-logs 2>&1 | tee /tmp/stress
It's a changefeed test, but the failures always happen when the changefeed job has finished and we are clearing out the job claim as part of job cancellation.
stress_master.log
Stress run log attached.
Jira issue: CRDB-28136