jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795

dt · 2023-11-03T23:10:40Z

Previously, in some narrow race conditions, a transaction may start at at a timestamp before a migration has run, meaning it would read a state where migration effects, such as creating tables, were not visible, however if migration concurrently executed and completed, any IsActive checks done during that txn might then return true, causing problems where a txn would expect a given table to exist, due to an IsActive check saying it had been created, but then not observe it when it read due its chosen timestamp. This has been previously described in #106764 and #109039.

In #108357 we wrapped the write to the job_info table with MaybeGenerateForcedRetryableError, which meant errors due to a table not existing were always considered retryable. This was done to mitigate this rare race condition, typically only observed in automated tests of upgrades, that interacted with the jobs tables heavily while performing upgrades and as such would fail due to jobs related code that expected the info table to exist (based on IsActive) but failed with errors stating it did not exist.

The change in #108357 however had a significant unintended consequence for executions of job info transaction that were not racing with a version upgrade, i.e. callers who were not executing under IsActive(V23_1) == true check. Several such examples of callers that are not conditional on V23_1 are the jobsprofiler.StorePlanDiagram calls in backup, restore, changefeeds and IMPORT.

When these StorePlanDiagram calls were added, since they were just there to capture nice-to-have debugging information, they were explicitly added as "fire-and-forget" calls, done on a separate goroutine and with any errors logged and discarded, with the intention that they could not negatively impact or affect the code to which they were added in any way. This meant they were able to be added without wrapping them in version gates that would ensure the table to which they'd try to log existed, as any errors would simply be ignored per their contract.

However, the subsequent addition of this retry-on-non-existent-table logic to the info table writing helper -- which is shared by both these calls, which did not mind the errors, and those that do generate and return user-visible errors, where the retry was needed, meant that this retry logic was now being hit by callers beyond those in the narrow race condition of IsActive vs txn time, including these callers that were not checking IsActive at all, and as such could be running on a cluster where the cluster version has not and perhaps will not actually advance or run migrations any time soon.

Concretely, this means that on a cluster running 23.1.11 (the first public release where #108357 was backported) where persevere_downgrade_option or anything else is blocking the version from advancing into the 23.1 migrations that create the info table, any execution planning of changefeed/backup/restore/import job planning will kick off this debug logging task, which will then enter an infinite retry loop until it hits some other, non-retryable error or the version upgrade does eventually happen. This appears to be a very tight loop, consuming significant CPU. It would appear that this would worsen each time a new job kicks off one of these logging tasks.

Jira issue: CRDB-33165

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-11-03T23:10:42Z

Hi @dt, please add branch-* labels to identify which branch(es) this release-blocker affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

blathers-crl · 2023-11-03T23:45:06Z

cc @cockroachdb/disaster-recovery

112954: ui: enable Forward button to set timewindow for latest NOW value r=koorosh a=koorosh Before, "Forward" button on Time selector component allowed to select next time window if there's enough space for full increment (ie with 10 min time window, it wasn't possible to move forward if current end time is less that Now() - 10min). It caused misalignment where Forward button became disabled even if there's some more data to display. This change handles this case and updates current time window to current time (executes the same logic as Now button). Resolves: #112847 Release note (ui change): Forward button on time selector allows to select latest available timewindow (the same as with "Now" button) Release justification: low risk, high benefit changes to existing functionality https://github.com/cockroachdb/cockroach/assets/3106437/00f50793-c327-4902-903b-868ea2000047 113827: build: add BranchReleaseSeries r=RaduBerinde a=RaduBerinde This change adds `build.BranchReleaseSeries()` which returns the major and minor in `version.txt`. This will be used to know the current release series when the latest `clusterversion` is not finalized. We also clean up the code a bit: we separate the variables that are overridden by Bazel, and we use a different variable for the testing override (to make things more clear). Epic: none Release note: None 113864: jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1 r=dt a=dt Broken into a couple commits for ease of review: 1) jobs: plumb cluster version to info table accessor This is a pure refactor to plumb a clusterversion.Handle to the info table accessor via all the call sites and wrapping structs/call trees; no behavior change, or usage of the plumbed cv, is added in this commit. 2) jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1 3) jobs: only store 23.1 debugging info after 23.1 upgrade Release note (bug fix): fixed a bug that could cause 23.1 nodes in clusters which had not finalized the 23.1 version upgrade to use excessive CPU retrying expected errors related to the incomplete upgrade state. Informs #113795. Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Radu Berinde <[email protected]> Co-authored-by: David Taylor <[email protected]>

dt · 2023-11-07T15:49:37Z

Removing release-blocker from this now that #113929 is in.

dt · 2023-11-10T00:09:57Z

We don't need to backport anything to upcoming 23.2 for this as it does not support upgrades from 22.2 so with the 23.1 backports merged I'm closing this as fixed (those backports should be in 23.1.12 later this month)

dt added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-jobs release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 3, 2023

blathers-crl bot added the T-jobs label Nov 3, 2023

dt added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.1.12-rc labels Nov 3, 2023

dt changed the title ~~jobs: infinite loop if version isn't finalized~~ jobs: infinite loop in v23.1.x when job run before cluster version is finalized Nov 3, 2023

dt changed the title ~~jobs: infinite loop in v23.1.x when job run before cluster version is finalized~~ jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized Nov 3, 2023

blathers-crl bot added the T-disaster-recovery label Nov 3, 2023

dt self-assigned this Nov 3, 2023

dt mentioned this issue Nov 6, 2023

jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1 #113864

Merged

dt mentioned this issue Nov 7, 2023

backport-23.1: jobs: only store 23.1 debugging info after 23.1 upgrade #113910

Merged

dt removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.1.12-rc labels Nov 7, 2023

benbardin added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 7, 2023

exalate-issue-sync bot removed the T-disaster-recovery label Nov 9, 2023

dt closed this as completed Nov 10, 2023

adityamaru mentioned this issue Dec 14, 2023

roachtest: schemachange/mixed-versions failed [job-rows-batch-insert: null value violates not-null constraint] #115747

Closed

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795

jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795

dt commented Nov 3, 2023 •

edited

Loading

blathers-crl bot commented Nov 3, 2023

blathers-crl bot commented Nov 3, 2023

dt commented Nov 7, 2023

dt commented Nov 10, 2023

jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795

jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795

Comments

dt commented Nov 3, 2023 • edited Loading

blathers-crl bot commented Nov 3, 2023

blathers-crl bot commented Nov 3, 2023

dt commented Nov 7, 2023

dt commented Nov 10, 2023

dt commented Nov 3, 2023 •

edited

Loading