Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized #113795

Closed
dt opened this issue Nov 3, 2023 · 4 comments
Closed
Assignees
Labels
A-jobs C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-jobs

Comments

@dt
Copy link
Member

dt commented Nov 3, 2023

Previously, in some narrow race conditions, a transaction may start at at a timestamp before a migration has run, meaning it would read a state where migration effects, such as creating tables, were not visible, however if migration concurrently executed and completed, any IsActive checks done during that txn might then return true, causing problems where a txn would expect a given table to exist, due to an IsActive check saying it had been created, but then not observe it when it read due its chosen timestamp. This has been previously described in #106764 and #109039.

In #108357 we wrapped the write to the job_info table with MaybeGenerateForcedRetryableError, which meant errors due to a table not existing were always considered retryable. This was done to mitigate this rare race condition, typically only observed in automated tests of upgrades, that interacted with the jobs tables heavily while performing upgrades and as such would fail due to jobs related code that expected the info table to exist (based on IsActive) but failed with errors stating it did not exist.

The change in #108357 however had a significant unintended consequence for executions of job info transaction that were not racing with a version upgrade, i.e. callers who were not executing under IsActive(V23_1) == true check. Several such examples of callers that are not conditional on V23_1 are the jobsprofiler.StorePlanDiagram calls in backup, restore, changefeeds and IMPORT.

When these StorePlanDiagram calls were added, since they were just there to capture nice-to-have debugging information, they were explicitly added as "fire-and-forget" calls, done on a separate goroutine and with any errors logged and discarded, with the intention that they could not negatively impact or affect the code to which they were added in any way. This meant they were able to be added without wrapping them in version gates that would ensure the table to which they'd try to log existed, as any errors would simply be ignored per their contract.

However, the subsequent addition of this retry-on-non-existent-table logic to the info table writing helper -- which is shared by both these calls, which did not mind the errors, and those that do generate and return user-visible errors, where the retry was needed, meant that this retry logic was now being hit by callers beyond those in the narrow race condition of IsActive vs txn time, including these callers that were not checking IsActive at all, and as such could be running on a cluster where the cluster version has not and perhaps will not actually advance or run migrations any time soon.

Concretely, this means that on a cluster running 23.1.11 (the first public release where #108357 was backported) where persevere_downgrade_option or anything else is blocking the version from advancing into the 23.1 migrations that create the info table, any execution planning of changefeed/backup/restore/import job planning will kick off this debug logging task, which will then enter an infinite retry loop until it hits some other, non-retryable error or the version upgrade does eventually happen. This appears to be a very tight loop, consuming significant CPU. It would appear that this would worsen each time a new job kicks off one of these logging tasks.

Jira issue: CRDB-33165

@dt dt added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-jobs release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 3, 2023
Copy link

blathers-crl bot commented Nov 3, 2023

Hi @dt, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the T-jobs label Nov 3, 2023
@dt dt added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.1.12-rc labels Nov 3, 2023
@dt dt changed the title jobs: infinite loop if version isn't finalized jobs: infinite loop in v23.1.x when job run before cluster version is finalized Nov 3, 2023
@dt dt changed the title jobs: infinite loop in v23.1.x when job run before cluster version is finalized jobs: infinite loop in v23.1.x when jobs run before cluster version is finalized Nov 3, 2023
Copy link

blathers-crl bot commented Nov 3, 2023

cc @cockroachdb/disaster-recovery

@dt dt self-assigned this Nov 3, 2023
craig bot pushed a commit that referenced this issue Nov 6, 2023
112954: ui: enable Forward button to set timewindow for latest NOW value r=koorosh a=koorosh

Before, "Forward" button on Time selector component allowed to select next time window if there's enough space for full increment (ie with 10 min time window, it wasn't possible to move forward if current end time is less that Now() - 10min). It caused misalignment where Forward button became disabled even if there's some more data to display.

This change handles this case and updates current time window to current time (executes the same logic as Now button).

Resolves: #112847

Release note (ui change): Forward button on time selector allows to select latest available timewindow (the same as with "Now" button)

Release justification: low risk, high benefit changes to existing functionality


https://github.com/cockroachdb/cockroach/assets/3106437/00f50793-c327-4902-903b-868ea2000047



113827: build: add BranchReleaseSeries r=RaduBerinde a=RaduBerinde

This change adds `build.BranchReleaseSeries()` which returns the major and minor in `version.txt`. This will be used to know the current release series when the latest `clusterversion` is not finalized.

We also clean up the code a bit: we separate the variables that are overridden by Bazel, and we use a different variable for the testing override (to make things more clear).

Epic: none
Release note: None

113864:   jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1 r=dt a=dt

Broken into a couple commits for ease of review:
1) jobs: plumb cluster version to info table accessor

This is a pure refactor to plumb a clusterversion.Handle to the info table accessor
via all the call sites and wrapping structs/call trees; no behavior change, or usage
of the plumbed cv, is added in this commit.

2) jobs: only force jobs.MaybeGenerateForcedRetryableError in 23.1


3) jobs: only store 23.1 debugging info after 23.1 upgrade

Release note (bug fix): fixed a bug that could cause 23.1 nodes in clusters which had not finalized the 23.1
version upgrade to use excessive CPU retrying expected errors related to the incomplete upgrade state.

Informs #113795.


Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@dt dt removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.1.12-rc labels Nov 7, 2023
@dt
Copy link
Member Author

dt commented Nov 7, 2023

Removing release-blocker from this now that #113929 is in.

@benbardin benbardin added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 7, 2023
@dt
Copy link
Member Author

dt commented Nov 10, 2023

We don't need to backport anything to upcoming 23.2 for this as it does not support upgrades from 22.2 so with the 23.1 backports merged I'm closing this as fixed (those backports should be in 23.1.12 later this month)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-jobs C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-jobs
Projects
No open projects
Archived in project
Development

No branches or pull requests

2 participants