jobs: "system.job_info does not exist" during cluster upgrade #103239
Comments
cc @cockroachdb/jobs @cockroachdb/disaster-recovery. Also directly pinging @adityamaru since you've worked on a lot of the …
Friendly ping @adityamaru. FYI, this also happens in CI (local cluster): #103269 (comment)
Sorry for the delay on this @renatolabs, other things have bumped this from my list. I will carve out time to dig into this tomorrow.
I tried writing a unit test that steps through the upgrade while repeatedly calling …
posting a summary of a conversation with @dt @stevendanna @fqazi
The next steps are to gather more debug information here and potentially get some schema folks involved to better understand descriptor leasing.
A transaction is allowed to keep its commit timestamp when the descriptor lease is dropped; it's just not allowed to advance its timestamp past the lease's expiration time. The read of the descriptor is not done within the txn, so a change to the descriptor does not generate read-refresh conflicts at commit time.
As long as the version check is not in a performance-critical code path, we could use the utility I created during the 23.1 development cycle for checking a version gate in a txn:
Alternatively, instead of relying exclusively on the version gate, txns could check for the version gate and the existence of the job_info descriptor before reading/writing to the table. A sketch of that alternative follows below.
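
For illustration only, here is a minimal Go sketch of that alternative (not the utility referenced above): gate the access on both the cluster version and the visibility of the descriptor at the txn's timestamp. `jobInfoUsable`, its `gateActive` parameter, and the use of `database/sql` are assumptions for the sketch; inside CockroachDB this would go through the internal executor and the `clusterversion` APIs instead.

```go
package jobinfo

import (
	"context"
	"database/sql"
)

// jobInfoUsable is a hypothetical helper: in addition to the cluster-version
// gate, confirm that the system.job_info descriptor is actually visible at
// the transaction's read timestamp before touching the table. The gate alone
// is not enough, because an AS OF SYSTEM TIME transaction may read at a
// timestamp that predates the migration that created the table.
func jobInfoUsable(ctx context.Context, txn *sql.Tx, gateActive bool) (bool, error) {
	if !gateActive {
		// Version gate not active yet: the table certainly doesn't exist.
		return false, nil
	}
	// Existence check evaluated at the transaction's timestamp. system.namespace
	// holds one row per descriptor name, so no row means the table isn't visible.
	var n int
	if err := txn.QueryRowContext(ctx,
		`SELECT count(*) FROM system.namespace WHERE name = 'job_info'`,
	).Scan(&n); err != nil {
		return false, err
	}
	return n > 0, nil
}
```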
Hi @adityamaru, do you have an idea for when we'll fix this one? It might be possible to reproduce with …
@DrewKimball I'll look into this this week! The failures on those tests, however, are of the form …
In a bid to apply a fix similar to #107570 to this test, I kicked off a 20-run reproduction and only one run failed, with a dead-node error:
I'll chase down this failure before kicking off another repro run.
20 runs of this don't tickle this bug. Kicking off 50 now.
50 runs on master passed as well 🤔. I'm going to try on release-23.1 now.
108357: jobs: fix mixed-version jobs flake r=knz a=adityamaru

Similar to #107570, this is a short-term fix for when a query executed with AS OF SYSTEM TIME picks a transaction timestamp before the job_info migration has run. In that case, parts of the jobs infrastructure will attempt to query the job_info table even though it doesn't exist at the transaction's timestamp. As a short-term fix, when we encounter an UndefinedObject error for the job_info table, we generate a synthetic retryable error so that the txn is pushed to a higher timestamp, at which point the upgrade will have completed and the job_info table will be visible. The longer-term fix is being tracked in #106764.

On master I can no longer reproduce the failure in #105032, but on 23.1 with this change I can successfully run 30 iterations of the test on a seed (-8690666577594439584) that previously saw occurrences of this flake.

Fixes: #103239
Fixes: #105032

Release note: None

108583: rangefeed: deflake `TestBudgetReleaseOnOneStreamError` r=erikgrinaker a=erikgrinaker

The test could fail with `REASON_SLOW_CONSUMER` if the registration goroutine did not drain the queue in time (1 ms). Increase the timeout.

Resolves #108555.

Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
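To make the shape of that fix concrete, here is a hedged Go sketch of the error translation it describes, not the actual patch: an undefined-table error that mentions `job_info` is wrapped in a sentinel the caller treats as retryable, so the txn is retried at a timestamp where the migration has completed. `errJobInfoNotYetVisible`, `maybeMakeRetryable`, and the substring matching are stand-ins; the real change inspects pg error codes and uses CockroachDB's own retryable-error machinery.

```go
package jobinfo

import (
	"errors"
	"fmt"
	"strings"
)

// errJobInfoNotYetVisible is a hypothetical sentinel standing in for the
// synthetic retryable error described above.
var errJobInfoNotYetVisible = errors.New(
	"system.job_info not visible at this timestamp; retry at a higher timestamp")

// maybeMakeRetryable converts an "undefined table" error mentioning job_info
// into the retryable sentinel so callers re-run the transaction at a later
// timestamp, by which point the upgrade migration should have completed.
// All other errors are returned unchanged.
func maybeMakeRetryable(err error) error {
	if err == nil {
		return nil
	}
	// Crude stand-in for checking pgcode.UndefinedTable / UndefinedObject.
	msg := err.Error()
	if strings.Contains(msg, "job_info") && strings.Contains(msg, "does not exist") {
		return fmt.Errorf("%w: %v", errJobInfoNotYetVisible, err)
	}
	return err
}
```

A caller's retry loop would then check `errors.Is(err, errJobInfoNotYetVisible)` and re-run the txn instead of surfacing the error.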
While testing some changes to the `backup-restore/mixed-version` roachtest, I saw a restore fail with the following error:

This seems to happen when a `RESTORE` is run while the cluster is upgrading (migrations running in the background). Since the error message happens in the job layer, I believe the issue is unrelated to the restore logic itself.

Reproduction

#103228 contains the work-in-progress changes I was testing; the last commit in that PR is a series of changes to make the issue reproduce more quickly. Running the `backup-restore/mixed-version` test on that branch with a specific seed [1] (known to cause a restore to run during the upgrade) reproduces this bug about 10-20% of the time in about 15 minutes. For convenience, see the TeamCity run on the aforementioned PR [2], where we saw 2 failures out of 10 runs.

Let me know what else I can do to help debug this.

Update: an easier way to reproduce this bug seems to be running the simpler `acceptance/version-upgrade` test using a seed that causes the schemachange workload to run concurrently with migrations; `-8690666577594439584` is one such seed.

[1] 2167957990363226999
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=10059101&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildLog#_state=600
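
For anyone poking at the underlying mechanism rather than the roachtest itself, the following is a speculative Go sketch of the failure mode described in this issue: querying `system.job_info` with `AS OF SYSTEM TIME` at a timestamp captured before the migration ran fails because the table is not visible at that timestamp. The connection string and the placeholder timestamp are assumptions; any decimal HLC timestamp that predates the upgrade would do.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Assumed connection string for a local, insecure cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Placeholder: a decimal HLC timestamp captured (e.g. via
	// cluster_logical_timestamp()) before the job_info migration ran.
	tsBeforeUpgrade := "1683900000000000000.0000000000"

	// AS OF SYSTEM TIME pins the read to that earlier timestamp; if the
	// migration had not yet run by then, the table descriptor is not visible
	// and the query fails with the "system.job_info does not exist" error
	// from this issue's title.
	var count int
	err = db.QueryRow(fmt.Sprintf(
		`SELECT count(*) FROM system.job_info AS OF SYSTEM TIME %s`, tsBeforeUpgrade,
	)).Scan(&count)
	if err != nil {
		log.Printf("query failed (expected when the timestamp predates the migration): %v", err)
		return
	}
	fmt.Println("job_info rows visible:", count)
}
```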
Jira issue: CRDB-27894