jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795
Hi @dt, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
cc @cockroachdb/disaster-recovery
112954: ui: enable Forward button to set timewindow for latest NOW value r=koorosh a=koorosh

Previously, the "Forward" button on the Time selector component only allowed selecting the next time window if there was enough space for a full increment (i.e. with a 10-minute time window, it wasn't possible to move forward if the current end time was less than Now() - 10 min). This caused a misalignment where the Forward button became disabled even when there was more data to display. This change handles that case and updates the current time window to the current time (executing the same logic as the Now button).

Resolves: #112847

Release note (ui change): The Forward button on the time selector allows selecting the latest available time window (the same as the "Now" button).

Release justification: low-risk, high-benefit changes to existing functionality

https://github.com/cockroachdb/cockroach/assets/3106437/00f50793-c327-4902-903b-868ea2000047

113827: build: add BranchReleaseSeries r=RaduBerinde a=RaduBerinde

This change adds `build.BranchReleaseSeries()`, which returns the major and minor version in `version.txt`. This will be used to determine the current release series when the latest `clusterversion` is not finalized. We also clean up the code a bit: we separate the variables that are overridden by Bazel, and we use a different variable for the testing override (to make things clearer).

Epic: none

Release note: None

113864: jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1 r=dt a=dt

Broken into a couple of commits for ease of review:

1) jobs: plumb cluster version to info table accessor

This is a pure refactor to plumb a `clusterversion.Handle` to the info table accessor via all the call sites and wrapping structs/call trees; no behavior change, or usage of the plumbed cluster version, is added in this commit.

2) jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1

3) jobs: only store 23.1 debugging info after the 23.1 upgrade

Release note (bug fix): Fixed a bug that could cause 23.1 nodes in clusters that had not finalized the 23.1 version upgrade to use excessive CPU retrying expected errors related to the incomplete upgrade state.

Informs #113795.

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: David Taylor <[email protected]>
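The release-series lookup that `build.BranchReleaseSeries()` provides could be pictured roughly as below. This is a hedged sketch, not the actual CockroachDB implementation: `parseReleaseSeries` is a hypothetical helper name, and the only assumption is that `version.txt` holds a string like "v23.1.12".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseReleaseSeries is a hypothetical helper (not the real
// build.BranchReleaseSeries) that extracts the major and minor release
// series from a version.txt-style string such as "v23.1.12".
func parseReleaseSeries(version string) (major, minor int, err error) {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", version)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

func main() {
	// The release series for a patch release is just its major.minor pair.
	maj, min, _ := parseReleaseSeries("v23.1.12")
	fmt.Println(maj, min) // 23 1
}
```

The point of such a helper is that it depends only on the branch's `version.txt`, so it can identify the release series even when the latest `clusterversion` has not been finalized.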
Removing.
We don't need to backport anything to the upcoming 23.2 for this, as it does not support upgrades from 22.2, so with the 23.1 backports merged I'm closing this as fixed (those backports should be in 23.1.12 later this month).
Previously, in some narrow race conditions, a transaction could start at a timestamp before a migration had run, meaning it would read a state in which the migration's effects, such as newly created tables, were not visible. However, if the migration concurrently executed and completed, any `IsActive` checks done during that txn might then return true, causing problems where the txn would expect a given table to exist, because an `IsActive` check said it had been created, but then fail to observe it when reading at its chosen timestamp. This has been previously described in #106764 and #109039.

In #108357 we wrapped the write to the job_info table with
`MaybeGenerateForcedRetryableError`, which meant errors due to a table not existing were always considered retryable. This was done to mitigate this rare race condition, typically only observed in automated tests of upgrades that interacted heavily with the jobs tables while performing upgrades, and as such failed due to jobs-related code that expected the info table to exist (based on `IsActive`) but hit errors stating that it did not.

The change in #108357, however, had a significant unintended consequence for job info transactions that were not racing with a version upgrade, i.e. callers that were not executing under an `IsActive(V23_1) == true` check. Several examples of callers that are not conditional on V23_1 are the `jobsprofiler.StorePlanDiagram` calls in backup, restore, changefeeds, and IMPORT.

When these `StorePlanDiagram` calls were added, since they were only there to capture nice-to-have debugging information, they were explicitly added as "fire-and-forget" calls, done on a separate goroutine with any errors logged and discarded, with the intention that they could not negatively impact the code to which they were added in any way. This meant they could be added without version gates ensuring that the table they write to existed, as any errors would simply be ignored per their contract.

However, the retry-on-non-existent-table logic was subsequently added to the info table writing helper, which is shared by both these calls, which did not mind the errors, and callers that generate user-visible errors, for which the retry was needed. As a result, the retry logic was now being hit by callers beyond those in the narrow race between `IsActive` and the txn timestamp, including callers that were not checking `IsActive` at all, and thus could be running on a cluster where the cluster version has not advanced, and perhaps will not advance or run migrations any time soon.
Concretely, this means that on a cluster running 23.1.11 (the first public release to which #108357 was backported) where `cluster.preserve_downgrade_option` or anything else is blocking the version from advancing into the 23.1 migrations that create the info table, planning of any changefeed/backup/restore/import job will kick off this debug logging task, which then enters an infinite retry loop until it hits some other, non-retryable error or the version upgrade does eventually happen. This appears to be a very tight loop, consuming significant CPU, and it would appear to worsen each time a new job kicks off one of these logging tasks.

Jira issue: CRDB-33165