
jobs: "system.job_info does not exist" during cluster upgrade #103239

Closed
renatolabs opened this issue May 12, 2023 · 12 comments · Fixed by #108357
Labels: A-jobs, C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-jobs

Comments

renatolabs (Contributor) commented May 12, 2023:

While testing some changes to the backup-restore/mixed-version roachtest, I saw a restore fail with the following error:

pq: CreateAdoptableJobInTxn: write-job-info-delete: relation "system.job_info" does not exist

This seems to happen when a RESTORE is run while the cluster is upgrading (migrations running in the background). Since the error comes from the jobs layer, I believe the issue is unrelated to the restore logic itself.

Reproduction

#103228 contains the work-in-progress changes I was testing; the last commit in that PR is a series of changes to make the issue reproduce more quickly. Running the backup-restore/mixed-version test on that branch with a specific seed [1] (known to cause a restore to run during the upgrade) reproduces this bug in roughly 10-20% of runs, in about 15 minutes.

For convenience, see the TeamCity run on the aforementioned PR [2], where we saw 2 failures out of 10 runs.

Let me know what else I can do to help debug this.

Update: an easier way to reproduce this bug seems to be running the simpler acceptance/version-upgrade test with a seed that causes the schemachange workload to run concurrently with migrations; -8690666577594439584 is one such seed.

[1] 2167957990363226999
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=10059101&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildLog#_state=600

Jira issue: CRDB-27894

renatolabs added the C-bug, A-jobs, and T-jobs labels on May 12, 2023
renatolabs (Contributor, Author) commented:
cc @cockroachdb/jobs @cockroachdb/disaster-recovery

Also directly pinging @adityamaru since you've worked on a lot of the system.job_info-related changes recently.

renatolabs (Contributor, Author) commented:
Friendly ping @adityamaru. FYI, this also happens in CI (local cluster): #103269 (comment)

adityamaru (Contributor) commented:
Sorry for the delay on this @renatolabs; other things have bumped this down my list. I will carve out time to dig into this tomorrow.

adityamaru (Contributor) commented May 29, 2023:

I tried writing a unit test that steps through the upgrade while repeatedly calling CreateAdoptableJobWithTxn. I wasn't able to tickle the failure, so I'm going to just use the roachtest to dig further.

adityamaru (Contributor) commented:
Posting a summary of a conversation with @dt, @stevendanna, and @fqazi:

  • Cluster settings are not scoped to a txn, so it is possible to end up in a situation where a txn is reading at a historical timestamp while clusterVersion.IsActive() reports true based on a migration that committed at a later timestamp (see the sketch below). This problem is not unique to the jobs migration.

  • In this roachtest failure, however, it appears to be a write occurring at a timestamp before the commit timestamp of the migration that creates the job_info table. This is unusual: as part of the migration we bump the system.jobs descriptor version and wait for all leases on the older descriptor version to expire, which should ensure that any concurrent operation (such as a schema change or restore) writing to the system.jobs table has its txn pushed to a timestamp higher than the one at which the job_info migration committed. This points to a gap in our understanding of how leases held by a concurrent operation serialize with the migration.

The next steps are to gather more debug information here and potentially get some schema folks involved to better understand descriptor leasing.
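
To make the first bullet concrete, here is a minimal Go sketch of the race, assuming illustrative names (the helper, the info key, and the exact version-gate key are placeholders rather than the actual jobs code): the gate consults the node's in-memory cluster version, while the statement executes at the txn's own timestamp, which can predate the migration's commit timestamp.

```go
package jobs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/sql/isql"
	"github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
)

// writeJobInfoRow is an illustrative helper, not the real jobs code.
func writeJobInfoRow(ctx context.Context, txn isql.Txn, st *cluster.Settings) error {
	// The gate is evaluated against the node's current cluster version,
	// NOT against txn.KV().ReadTimestamp().
	if !st.Version.IsActive(ctx, clusterversion.V23_1CreateSystemJobInfoTable) {
		return nil // pre-migration path: write to the legacy system.jobs columns
	}
	// If the txn's timestamp precedes the migration that created
	// system.job_info, this statement fails with
	// `relation "system.job_info" does not exist`.
	_, err := txn.ExecEx(ctx, "write-job-info", txn.KV(),
		sessiondata.NodeUserSessionDataOverride,
		`INSERT INTO system.job_info (job_id, info_key, value) VALUES ($1, $2, $3)`,
		1, "legacy_payload", []byte("{}"))
	return err
}
```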

jeffswenson (Collaborator) commented:
> In this roachtest failure however it appears to be a write that is occurring at a timestamp before the commit timestamp of the migration that creates the job_info table. This is unusual, because as part of the migration we bump the system.jobs descriptor version and wait for all leases on the older descriptor version to expire. This should ensure that any concurrent operation such as a schema change or restore that is writing to the system.jobs table will have their txn bumped to a timestamp higher than the one that committed the job_info migration. This points to some gap in our understanding of how leases held by a concurrent operation serialize with the migration.

A transaction is allowed to keep its commit timestamp when the descriptor lease is dropped; it's just not allowed to advance its timestamp past the lease's expiration time. The read of the descriptor is not done within the txn, so a change to the descriptor does not generate read-refresh conflicts at commit time.

As long as the version check is not in a performance-critical code path, we could use the utility I created during the 23.1 development cycle for checking a version gate in a txn:

type VersionGuard struct { ... }

Alternatively, instead of relying exclusively on the version gate, txns could check both the version gate and the existence of the job_info descriptor before reading from or writing to the table (a rough sketch follows below).
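
A rough sketch of that alternative under the same assumptions as the earlier sketch (the helper is hypothetical; the real implementation would presumably go through the descriptor collection or the VersionGuard utility referenced above): gate on the cluster version and, inside the same txn, confirm that the job_info table actually resolves at the txn's timestamp.

```go
package jobs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/sql/isql"
	"github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
)

// jobInfoUsable is a hypothetical helper, not actual CockroachDB code. It
// reports whether system.job_info can be used by this txn: the version gate
// must be active and the table must be visible at the txn's timestamp.
func jobInfoUsable(ctx context.Context, txn isql.Txn, st *cluster.Settings) (bool, error) {
	if !st.Version.IsActive(ctx, clusterversion.V23_1CreateSystemJobInfoTable) {
		return false, nil
	}
	// The existence check runs inside the txn, so it observes the same
	// timestamp as the subsequent reads/writes of the table.
	row, err := txn.QueryRowEx(ctx, "job-info-exists", txn.KV(),
		sessiondata.NodeUserSessionDataOverride,
		`SELECT 1 FROM system.namespace WHERE name = 'job_info'`)
	if err != nil {
		return false, err
	}
	return row != nil, nil
}
```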

DrewKimball (Collaborator) commented:
Hi @adityamaru, do you have an idea of when we'll fix this one? It might be possible to reproduce it with TestJSONForwardingIndexes, since that test is flaking now.

adityamaru (Contributor) commented:
@DrewKimball I'll look into this this week! The failures on those tests, however, are of the form system-jobs-scan: column "job_type" does not exist, not system.job_info does not exist. I suspect they fall into a similar class of problems to the one described by @jeffswenson here (https://cockroachlabs.slack.com/archives/C04U1BTF8/p1689269073307619), but I'm not clear whether we're pursuing a one-size-fits-all solution at the moment.

adityamaru (Contributor) commented:
In a bid to apply a fix similar to #107570 to this test, I kicked off a 20-run reproduction; only one run failed, with a dead-node error:

ERROR: a panic has occurred!
* use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   |   GOROOT/src/runtime/panic.go:884
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
*   |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
*   |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:154
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:878
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:728
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:689
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
*   |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
*   | runtime.goexit
*   |   GOROOT/src/runtime/asm_amd64.s:1594
* Wraps: (4) use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
*

I'll chase down this failure before kicking off another repro run.

adityamaru (Contributor) commented Jul 28, 2023:

Looks like #107242 saw a similar panic and the fix is being tracked in #107521.

adityamaru (Contributor) commented:
> an easier way to reproduce this bug seems to be by running the simpler acceptance/version-upgrade test using a seed that causes the schemachange workload to run concurrently with migrations. -8690666577594439584 is one such seed.

20 runs of this didn't tickle the bug. Kicking off 50 now.

adityamaru (Contributor) commented:
50 runs on master passed as well 🤔. I'm going to try on release-23.1 now.

craig bot pushed a commit that referenced this issue Aug 12, 2023
108357: jobs: fix mixed-version jobs flake r=knz a=adityamaru

Similar to #107570, this is a short-term fix for when a query executed with AS OF SYSTEM TIME picks a transaction timestamp before the job_info migration has run. In that case, parts of the jobs infrastructure will attempt to query the job_info table even though it doesn't exist at the transaction's timestamp.

As a short-term fix, when we encounter an UndefinedObject error for the job_info table, we generate a synthetic retryable error so that the txn is pushed to a higher timestamp at which the upgrade will have completed and the job_info table will be visible. The longer-term fix is being tracked in #106764.

On master I can no longer reproduce the failure in #105032, but on 23.1, with this change, I can successfully run 30 iterations of the test on a seed (-8690666577594439584) that previously saw occurrences of this flake.

Fixes: #103239
Fixes: #105032

Release note: None

108583: rangefeed: deflake `TestBudgetReleaseOnOneStreamError` r=erikgrinaker a=erikgrinaker

The test could fail with `REASON_SLOW_CONSUMER` if the registration goroutine did not drain the queue in time (1 ms). Increase the timeout.

Resolves #108555.

Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
craig bot closed this as completed in 79cc2ea on Aug 12, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 12, 2023
adityamaru added a commit that referenced this issue Aug 18, 2023
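
For reference, a rough sketch of the retry approach described in the #108357 commit message above, assuming the jobs layer surfaces the failure as a pgcode.UndefinedObject error that mentions job_info (the helper name is illustrative; this is not the exact merged code):

```go
package jobs

import (
	"context"
	"strings"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgcode"
	"github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgerror"
)

// maybeRetryOnMissingJobInfo converts "job_info does not exist" errors into a
// synthetic retryable error so the txn restarts at a higher timestamp, by
// which point the migration will have completed and the table will be visible.
func maybeRetryOnMissingJobInfo(ctx context.Context, txn *kv.Txn, err error) error {
	if err != nil &&
		pgerror.GetPGCode(err) == pgcode.UndefinedObject &&
		strings.Contains(err.Error(), "job_info") {
		// The forced retryable error pushes the txn to a new timestamp when
		// the client-side retry loop runs the txn again.
		return txn.GenerateForcedRetryableError(ctx, "system.job_info not yet visible at txn timestamp")
	}
	return err
}
```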