
jobs: "system.job_info does not exist" during cluster upgrade #103239

Closed
renatolabs opened this issue May 12, 2023 · 12 comments · Fixed by #108357
Labels: A-jobs, C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-jobs

Comments

renatolabs (Contributor) commented May 12, 2023:

While testing some changes to the backup-restore/mixed-version roachtest, I saw a restore fail with the following error:

pq: CreateAdoptableJobInTxn: write-job-info-delete: relation "system.job_info" does not exist

This seems to happen when a RESTORE is run while the cluster is upgrading (migrations running in the background). Since the error comes from the jobs layer, I believe the issue is unrelated to the restore logic itself.

Reproduction

#103228 contains the work-in-progress changes I was testing; the last commit in that PR is a series of changes to make the issue reproduce more quickly. Running the backup-restore/mixed-version test on that branch with a specific seed [1] (known to cause a restore to run during the upgrade) reproduces this bug in roughly 10-20% of runs, in about 15 minutes.

For convenience, see the TeamCity run on the aforementioned PR [2], where we saw 2 failures out of 10 runs.

Let me know what else I can do to help debug this.

Update: an easier way to reproduce this bug seems to be running the simpler acceptance/version-upgrade test with a seed that causes the schemachange workload to run concurrently with migrations; -8690666577594439584 is one such seed.

[1] 2167957990363226999
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=10059101&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildLog#_state=600

Jira issue: CRDB-27894

renatolabs added the C-bug, A-jobs, and T-jobs labels on May 12, 2023
renatolabs (Contributor, Author) commented:
cc @cockroachdb/jobs @cockroachdb/disaster-recovery

Also directly pinging @adityamaru since you've worked on a lot of the system.job_info-related changes recently.

renatolabs (Contributor, Author) commented:
Friendly ping @adityamaru. FYI, this also happens in CI (local cluster): #103269 (comment)

adityamaru (Contributor) commented:
Sorry for the delay on this @renatolabs; other things have bumped this down my list. I will carve out time to dig into this tomorrow.

adityamaru (Contributor) commented May 29, 2023:

I tried writing a unit test that steps through the upgrade while repeatedly calling CreateAdoptableJobWithTxn. I wasn't able to tickle the failure, so I'm going to just use the roachtest to dig further.

adityamaru (Contributor) commented:
Posting a summary of a conversation with @dt, @stevendanna, and @fqazi:

  • Cluster settings are not scoped to a txn, so it is possible to end up in a situation where a txn is reading at a historical timestamp while clusterVersion.IsActive() reports true based on a migration that committed at a later timestamp (see the sketch below). This problem is not unique to the jobs migration.

  • In this roachtest failure, however, it appears to be a write occurring at a timestamp before the commit timestamp of the migration that creates the job_info table. This is unusual: as part of the migration we bump the system.jobs descriptor version and wait for all leases on the older descriptor version to expire, which should ensure that any concurrent operation (such as a schema change or restore) writing to the system.jobs table has its txn pushed to a timestamp higher than the one at which the job_info migration committed. This points to a gap in our understanding of how leases held by a concurrent operation serialize with the migration.

The next steps are to gather more debug information here and potentially get some schema folks involved to better understand descriptor leasing.
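
To make the first bullet concrete, here is a minimal Go sketch of the race, assuming illustrative names (the helper, the info key, and the exact version-gate key are placeholders rather than the actual jobs code): the gate consults the node's in-memory cluster version, while the statement executes at the txn's own timestamp, which can predate the migration's commit timestamp.

```go
package jobs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/sql/isql"
	"github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
)

// writeJobInfoRow is an illustrative helper, not the real jobs code.
func writeJobInfoRow(ctx context.Context, txn isql.Txn, st *cluster.Settings) error {
	// The gate is evaluated against the node's current cluster version,
	// NOT against txn.KV().ReadTimestamp().
	if !st.Version.IsActive(ctx, clusterversion.V23_1CreateSystemJobInfoTable) {
		return nil // pre-migration path: write to the legacy system.jobs columns
	}
	// If the txn's timestamp precedes the migration that created
	// system.job_info, this statement fails with
	// `relation "system.job_info" does not exist`.
	_, err := txn.ExecEx(ctx, "write-job-info", txn.KV(),
		sessiondata.NodeUserSessionDataOverride,
		`INSERT INTO system.job_info (job_id, info_key, value) VALUES ($1, $2, $3)`,
		1, "legacy_payload", []byte("{}"))
	return err
}
```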

jeffswenson (Collaborator) commented:
> In this roachtest failure however it appears to be a write that is occurring at a timestamp before the commit timestamp of the migration that creates the job_info table. This is unusual, because as part of the migration we bump the system.jobs descriptor version and wait for all leases on the older descriptor version to expire. This should ensure that any concurrent operation such as a schema change or restore that is writing to the system.jobs table will have their txn bumped to a timestamp higher than the one that committed the job_info migration. This points to some gap in our understanding of how leases held by a concurrent operation serialize with the migration.

A transaction is allowed to keep its commit timestamp when the descriptor lease is dropped; it's just not allowed to advance its timestamp past the lease's expiration time. The read of the descriptor is not done within the txn, so a change to the descriptor does not generate read-refresh conflicts at commit time.

As long as the version check is not in a performance-critical code path, we could use the utility I created during the 23.1 development cycle for checking a version gate in a txn:

type VersionGuard struct { ... }

Alternatively, instead of relying exclusively on the version gate, txns could check both the version gate and the existence of the job_info descriptor before reading from or writing to the table (a rough sketch follows below).
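
A rough sketch of that alternative under the same assumptions as the earlier sketch (the helper is hypothetical; the real implementation would presumably go through the descriptor collection or the VersionGuard utility referenced above): gate on the cluster version and, inside the same txn, confirm that the job_info table actually resolves at the txn's timestamp.

```go
package jobs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/sql/isql"
	"github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
)

// jobInfoUsable is a hypothetical helper, not actual CockroachDB code. It
// reports whether system.job_info can be used by this txn: the version gate
// must be active and the table must be visible at the txn's timestamp.
func jobInfoUsable(ctx context.Context, txn isql.Txn, st *cluster.Settings) (bool, error) {
	if !st.Version.IsActive(ctx, clusterversion.V23_1CreateSystemJobInfoTable) {
		return false, nil
	}
	// The existence check runs inside the txn, so it observes the same
	// timestamp as the subsequent reads/writes of the table.
	row, err := txn.QueryRowEx(ctx, "job-info-exists", txn.KV(),
		sessiondata.NodeUserSessionDataOverride,
		`SELECT 1 FROM system.namespace WHERE name = 'job_info'`)
	if err != nil {
		return false, err
	}
	return row != nil, nil
}
```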

DrewKimball (Collaborator) commented:
Hi @adityamaru, do you have an idea of when we'll fix this one? It might be possible to reproduce it with TestJSONForwardingIndexes, since that test is flaking now.

adityamaru (Contributor) commented:
@DrewKimball I'll look into this this week! The failures on those tests, however, are of the form system-jobs-scan: column "job_type" does not exist, not system.job_info does not exist. I suspect they fall into a similar class of problems to the one described by @jeffswenson here (https://cockroachlabs.slack.com/archives/C04U1BTF8/p1689269073307619), but I'm not clear whether we're pursuing a one-size-fits-all solution at the moment.

adityamaru (Contributor) commented:
In a bid to apply a fix similar to #107570 to this test, I kicked off a 20-run reproduction; only one run failed, with a dead-node error:

ERROR: a panic has occurred!
* use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   |   GOROOT/src/runtime/panic.go:884
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
*   |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
*   |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:154
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:878
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:728
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:689
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
*   | runtime.goexit
*   |   GOROOT/src/runtime/asm_amd64.s:1594
* Wraps: (4) use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
*

I'll chase down this failure before kicking off another repro run.

adityamaru (Contributor) commented Jul 28, 2023:

Looks like #107242 saw a similar panic and the fix is being tracked in #107521.

adityamaru (Contributor) commented:
> an easier way to reproduce this bug seems to be by running the simpler acceptance/version-upgrade test using a seed that causes the schemachange workload to run concurrently with migrations. -8690666577594439584 is one such seed.

20 runs of this didn't tickle the bug. Kicking off 50 now.

adityamaru (Contributor) commented:
50 runs on master passed as well 🤔. I'm going to try on release-23.1 now.

craig bot pushed a commit that referenced this issue Aug 12, 2023
108357: jobs: fix mixed-version jobs flake r=knz a=adityamaru

Similar to #107570, this is a short-term fix for when a query executed with AS OF SYSTEM TIME picks a transaction timestamp before the job_info migration has run. In that case, parts of the jobs infrastructure will attempt to query the job_info table even though it doesn't exist at the transaction's timestamp.

As a short-term fix, when we encounter an UndefinedObject error for the job_info table, we generate a synthetic retryable error so that the txn is pushed to a higher timestamp at which the upgrade will have completed and the job_info table will be visible. The longer-term fix is being tracked in #106764.

On master I can no longer reproduce the failure in #105032, but on 23.1, with this change, I can successfully run 30 iterations of the test on a seed (-8690666577594439584) that previously saw occurrences of this flake.

Fixes: #103239
Fixes: #105032

Release note: None

108583: rangefeed: deflake `TestBudgetReleaseOnOneStreamError` r=erikgrinaker a=erikgrinaker

The test could fail with `REASON_SLOW_CONSUMER` if the registration goroutine did not drain the queue in time (1 ms). Increase the timeout.

Resolves #108555.

Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
craig bot closed this as completed in 79cc2ea on Aug 12, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 12, 2023
adityamaru added a commit that referenced this issue Aug 18, 2023
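
For reference, a rough sketch of the retry approach described in the #108357 commit message above, assuming the jobs layer surfaces the failure as a pgcode.UndefinedObject error that mentions job_info (the helper name is illustrative; this is not the exact merged code):

```go
package jobs

import (
	"context"
	"strings"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgcode"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgerror"
)

// maybeRetryOnMissingJobInfo converts "job_info does not exist" errors into a
// synthetic retryable error so the txn restarts at a higher timestamp, by
// which point the migration will have completed and the table will be visible.
func maybeRetryOnMissingJobInfo(ctx context.Context, txn *kv.Txn, err error) error {
	if err != nil &&
		pgerror.GetPGCode(err) == pgcode.UndefinedObject &&
		strings.Contains(err.Error(), "job_info") {
		// The forced retryable error pushes the txn to a new timestamp when
		// the client-side retry loop runs the txn again.
		return txn.GenerateForcedRetryableError(ctx, "system.job_info not yet visible at txn timestamp")
	}
	return err
}
```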