release-21.2: jobs, sql: Avoid jobs/scheduled jobs lock up #78583

miretskiy · 2022-03-27T21:04:00Z

Backport 2/2 commits from #78467.

/cc @cockroachdb/release

Improve scheduled jobs system stability by removing expensive metrics calculations and a redundant existence check in stats compaction jobs. Each of these can result in a table scan, possibly repeatedly.

See commits for details.

Release Justification: scheduled job system stability improvements.

blathers-crl · 2022-03-27T21:04:03Z

cockroach-teamcity · 2022-03-27T21:04:12Z


The scheduled jobs system, by default, ensures that at most one instance of a job is executing for a given schedule. It is therefore unnecessary to verify that a compaction job does not already exist when starting the stats compaction job from a schedule. Furthermore, because of how this check interacts with the scheduling system, it results in a wide scan of the system.jobs table, which causes scheduled job execution to be restarted whenever any other job modifies system.jobs.

Fixes cockroachdb#78465

Release Notes (sql): The stats compaction scheduled job no longer causes intent buildup.

Release Justification: important stability fix to ensure jobs and scheduled jobs do not lock up when running the stats compaction job.
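To illustrate why the extra existence check is redundant, here is a minimal sketch of a scheduler that already enforces one running job per schedule. It is a toy in-memory model, not the CockroachDB implementation: `toyScheduler`, `startFromSchedule`, and `finish` are hypothetical names, and refusing a second start is a simplification of whatever the real system does when a previous run is still executing.

```go
package main

// toyScheduler is a hypothetical in-memory stand-in for the scheduled jobs
// system. It tracks, per schedule ID, whether a job is currently executing.
type toyScheduler struct {
	running map[int64]bool
}

func newToyScheduler() *toyScheduler {
	return &toyScheduler{running: make(map[int64]bool)}
}

// startFromSchedule starts a job for the schedule unless one is already
// executing. The guard is a lookup keyed by schedule ID, so a job started
// this way has no need to scan a jobs table looking for a duplicate.
func (s *toyScheduler) startFromSchedule(scheduleID int64) bool {
	if s.running[scheduleID] {
		return false // simplification: the toy model just refuses to start
	}
	s.running[scheduleID] = true
	return true
}

// finish marks the schedule's job as done so the next run can start.
func (s *toyScheduler) finish(scheduleID int64) {
	delete(s.running, scheduleID)
}
```

In this model, a second `startFromSchedule(1)` call returns false until `finish(1)` runs, so any duplicate check inside the job itself would only repeat work the scheduler has already done.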

Remove the `schedules.round.schedules-ready-to-run` and `schedules.round.num-jobs-running` metrics from the job scheduler. These metrics are very expensive to compute because they require wide table scans against both `system.jobs` and `system.scheduled_jobs`. They are also unnecessary, since the underlying query can be executed directly if needed, and confusing, since they are reported per node while the number of running jobs and schedules is cluster-wide. More importantly, they increase the read set of the scheduler transaction, which can make the job scheduler query more expensive and, in turn, makes txn restarts more expensive.

Fixes cockroachdb#78447

Release Notes (enterprise): Remove the expensive, unnecessary, and never-used `schedules.round.schedules-ready-to-run` and `schedules.round.num-jobs-running` metrics from job schedulers.

Release Justification: Stability fix for scheduled job system.
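The read-set argument can be sketched with a toy model of an optimistic transaction. This is an illustration under stated assumptions, not how CockroachDB tracks reads: `toyTxn`, `read`, and `countRestarts` are hypothetical names, keys are plain strings, and any concurrent write to a key the transaction has read is assumed to force a restart.

```go
package main

// toyTxn is a hypothetical, simplified optimistic transaction that only
// remembers which keys it has read.
type toyTxn struct {
	readSet map[string]bool
}

func newToyTxn() *toyTxn {
	return &toyTxn{readSet: make(map[string]bool)}
}

// read records the given keys in the transaction's read set.
func (t *toyTxn) read(keys ...string) {
	for _, k := range keys {
		t.readSet[k] = true
	}
}

// countRestarts counts how many concurrent writes touch a key in the read
// set. In this toy model, each such write forces one restart.
func countRestarts(t *toyTxn, concurrentWrites []string) int {
	n := 0
	for _, k := range concurrentWrites {
		if t.readSet[k] {
			n++
		}
	}
	return n
}
```

In this model, a transaction that reads only `"scheduled_jobs/1"` is untouched by concurrent writes to `"jobs/2"` and `"jobs/3"`, while one that also reads every jobs key to compute a metric is restarted by both writes.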

HonoreDB

Reviewed 6 of 6 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)

miretskiy requested a review from a team March 27, 2022 21:04

Yevgeniy Miretskiy added 2 commits March 27, 2022 20:57

miretskiy force-pushed the backport21.2-78467 branch from 010b489 to fd07e3e Compare March 28, 2022 00:58

miretskiy requested review from Azhng and HonoreDB March 28, 2022 14:47

HonoreDB approved these changes Mar 28, 2022


miretskiy merged commit 48866a2 into cockroachdb:release-21.2 Mar 28, 2022
