jobs, sql: Avoid jobs/scheduled jobs lock up. #78467

miretskiy · 2022-03-25T01:53:16Z

Improve scheduled jobs system stability by removing expensive metrics calculations and a redundant existence check in stats compaction jobs. Each of these can result in a table scan, possibly a repeated one.

See commits for details.

Release Justification: scheduled job system stability improvements.

nvanbenschoten

Reviewed 3 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @shermanCRL)

-- commits, line 28 at r2:
s/aref/are/

pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 92 at r1 (raw file):

	jobRegistry *jobs.Registry,
) (jobspb.JobID, error) {
	if err := CheckExistingCompactionJob(ctx, nil /* job */, ie, txn); err != nil {

I don't understand this code enough to be able to make a determination about whether this is needed. Why is this safe to remove? Because in this path, the job scheduler is already providing mutual exclusion? If so, is that sufficient to ensure that we don't have a scheduled and non-scheduled compaction job run at the same time?

pkg/jobs/job_scheduler.go

miretskiy

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @nvanbenschoten, and @shermanCRL)

pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 92 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't understand this code enough to be able to make a determination about whether this is needed. Why is this safe to remove? Because in this path, the job scheduler is already providing mutual exclusion? If so, is that sufficient to ensure that we don't have a scheduled and non-scheduled compaction job run at the same time?

This method is only called from the scheduled job executor. It turns out it is not used otherwise.
And yes, it's safe because the scheduler ensures that only one instance runs, using the jobs table index on the "created_by" info set below.

miretskiy

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shermanCRL)

-- commits, line 28 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/aref/are/

done.

miretskiy · 2022-03-25T12:49:53Z

I don't understand this code enough to be able to make a determination about whether this is needed. Why is this safe to remove? Because in this path, the job scheduler is already providing mutual exclusion? If so, is that sufficient to ensure that we don't have a scheduled and non-scheduled compaction job run at the same time?

Going to ask @Azhng to take a look.

Azhng

Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @nvanbenschoten, and @shermanCRL)

pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 92 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

This method is only called from the scheduled job executor. It turns out it is not used otherwise.
And yes, it's safe because the scheduler ensures that only one instance runs, using the jobs table index on the "created_by" info set below.

I think this method is also called from the Resumer.

If the job scheduler only allows one instance to run, I think the call in the resumer can also be deleted.

pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 106 at r1 (raw file):

// that are either PAUSED, CANCELED, or RUNNING. If so, it returns a
// ErrConcurrentSQLStatsCompaction.
func CheckExistingCompactionJob(

I guess we can also delete this method now 😛

miretskiy · 2022-03-25T16:14:22Z

I think this method is also called from the Resumer.

If the job scheduler only allows one instance to run, I think the call in the resumer can also be deleted.

Okay.. Deleting.

Azhng

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Azhng, @miretskiy, @nvanbenschoten, and @shermanCRL)

pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 84 at r5 (raw file):

// CreateCompactionJob creates a system.jobs record if there is no other
// SQL Stats compaction job running. This is invoked by the scheduled job
// Executor.

nit: I guess this comment might need some update

miretskiy · 2022-03-25T16:23:52Z

nit: I guess this comment might need some update
Fixed.

By default, the scheduled jobs system ensures that only one instance of a job is executing for a given schedule at any time. As such, it is not necessary to verify that no compaction job exists when starting a stats compaction job from a schedule. Furthermore, because of how this check interacts with the scheduling system, it results in a wide scan of the system.jobs table, which causes scheduled job execution to be restarted if any other job modifies system.jobs.

Fixes cockroachdb#78465

Release Notes (sql): The stats compaction scheduled job no longer causes intent buildup.

Release Justification: important stability fix to ensure jobs and scheduled jobs do not lock up when running the stats compaction job.
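The mutual exclusion argument in this commit message can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the code from this PR: the `creator` type stands in for the "created_by" info on system.jobs, and `jobsTable`, `hasRunningJob`, and `maybeStart` are hypothetical names. It shows a scheduler-side check that reads only the rows created by one schedule, which is why a second, table-wide existence check inside the job adds cost without adding safety.

```go
package main

import "fmt"

// creator is a hypothetical stand-in for the "created_by" info recorded
// on each row of system.jobs.
type creator struct {
	kind string
	id   int64
}

// jobsTable is a toy model: job rows plus a secondary index on creator.
type jobsTable struct {
	running   []bool            // running[i] reports whether job i is still running
	byCreator map[creator][]int // index from creator to job positions
}

func newJobsTable() *jobsTable {
	return &jobsTable{byCreator: make(map[creator][]int)}
}

// add records a job and returns its position in the table.
func (t *jobsTable) add(c creator, running bool) int {
	t.running = append(t.running, running)
	pos := len(t.running) - 1
	t.byCreator[c] = append(t.byCreator[c], pos)
	return pos
}

// hasRunningJob consults only the rows created by c. It returns whether any
// of them is running and how many rows it had to read.
func (t *jobsTable) hasRunningJob(c creator) (found bool, rowsRead int) {
	for _, pos := range t.byCreator[c] {
		rowsRead++
		if t.running[pos] {
			return true, rowsRead
		}
	}
	return false, rowsRead
}

// maybeStart launches a job for the schedule unless one is already running.
func (t *jobsTable) maybeStart(scheduleID int64) bool {
	c := creator{kind: "schedule", id: scheduleID}
	if found, _ := t.hasRunningJob(c); found {
		return false // previous run still in flight: skip this round
	}
	t.add(c, true)
	return true
}

func main() {
	tbl := newJobsTable()
	// Unrelated jobs that a wide scan would have to read.
	for i := 0; i < 1000; i++ {
		tbl.add(creator{kind: "user", id: int64(i)}, true)
	}
	fmt.Println(tbl.maybeStart(7)) // first launch for schedule 7
	fmt.Println(tbl.maybeStart(7)) // second launch while the first is running
	_, rowsRead := tbl.hasRunningJob(creator{kind: "schedule", id: 7})
	fmt.Println(rowsRead, "of", len(tbl.running), "rows read")
}
```

In this model the first launch succeeds, the second is refused, and the check reads 1 of 1001 rows. The real scheduler runs its check as SQL inside a transaction, so the sketch only conveys the shape of the lookup.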

Remove the `schedules.round.schedules-ready-to-run` and `schedules.round.num-jobs-running` metrics from the job scheduler. These metrics are very expensive to compute because they involve wide table scans against both `system.jobs` and `system.scheduled_jobs`. They are also unnecessary, since the underlying query can be executed directly if needed, and confusing, since they are reported per node while the number of running jobs and schedules is cluster wide. More importantly, they can make the job scheduler query more expensive: they increase the read set of the scheduler transaction, which makes txn restarts more costly.

Fixes cockroachdb#78447

Release Notes (enterprise): Remove the expensive, unnecessary, and never used `schedules.round.schedules-ready-to-run` and `schedules.round.num-jobs-running` metrics from job schedulers.

Release Justification: Stability fix for scheduled job system.
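The read-set argument can be made concrete with a deliberately simplified model. This is a sketch, not CockroachDB's actual concurrency control (which works with timestamps and read refreshes): it assumes only that a transaction must restart when a concurrent write touches a key it has read. The `txn` type, `countRestarts`, and the key names are all hypothetical.

```go
package main

import "fmt"

// txn is a toy optimistic transaction that remembers every key it read.
type txn struct {
	readSet map[string]struct{}
}

func newTxn() *txn { return &txn{readSet: make(map[string]struct{})} }

// read adds the given keys to the transaction's read set.
func (tx *txn) read(keys ...string) {
	for _, k := range keys {
		tx.readSet[k] = struct{}{}
	}
}

// mustRestart reports whether a concurrent write to writtenKey overlaps
// the read set, which in this model forces a restart.
func (tx *txn) mustRestart(writtenKey string) bool {
	_, ok := tx.readSet[writtenKey]
	return ok
}

// countRestarts returns how many of the concurrent writes conflict with tx.
func countRestarts(tx *txn, writes []string) int {
	n := 0
	for _, w := range writes {
		if tx.mustRestart(w) {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical keys: one row per job, plus the schedule being executed.
	jobKeys := []string{"jobs/1", "jobs/2", "jobs/3", "jobs/4"}
	writes := []string{"jobs/2", "jobs/4"} // other jobs updating their own rows

	narrow := newTxn()
	narrow.read("schedules/7") // reads only the schedule it is executing

	wide := newTxn()
	wide.read("schedules/7")
	wide.read(jobKeys...) // also scans every job row to compute a metric

	fmt.Println("narrow:", countRestarts(narrow, writes))
	fmt.Println("wide:", countRestarts(wide, writes))
}
```

Here the narrow transaction sees no conflicts while the wide one conflicts with both writes, which is the effect the commit message attributes to computing the metrics inside the scheduler transaction.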

miretskiy · 2022-03-26T23:30:36Z

bors r+

craig · 2022-03-27T01:22:12Z

Build succeeded:

GitHub CI (Cockroach)

blathers-crl · 2022-03-27T01:22:34Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from d7aaf92 to blathers/backport-release-21.2-78467: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

miretskiy requested a review from nvanbenschoten March 25, 2022 01:53

miretskiy requested a review from a team as a code owner March 25, 2022 01:53

miretskiy requested review from shermanCRL and removed request for a team March 25, 2022 01:53

nvanbenschoten mentioned this pull request Mar 25, 2022

sql: Avoid wide scans of jobs table when starting compaction job. #78465

Closed

nvanbenschoten reviewed Mar 25, 2022

shermanCRL approved these changes Mar 25, 2022

shermanCRL reviewed Mar 25, 2022

pkg/jobs/job_scheduler.go (resolved)

miretskiy requested review from nvanbenschoten and shermanCRL March 25, 2022 11:36

miretskiy commented Mar 25, 2022

miretskiy force-pushed the scheduler branch from d7a3472 to 10c8366 Compare March 25, 2022 11:37

miretskiy requested a review from Azhng March 25, 2022 12:50

miretskiy force-pushed the scheduler branch from 10c8366 to 09603e8 Compare March 25, 2022 12:51

shermanCRL approved these changes Mar 25, 2022

Azhng reviewed Mar 25, 2022

miretskiy force-pushed the scheduler branch from 09603e8 to 250f315 Compare March 25, 2022 16:18

Azhng approved these changes Mar 25, 2022

miretskiy force-pushed the scheduler branch from 250f315 to bb5f5fb Compare March 25, 2022 16:30

Azhng approved these changes Mar 25, 2022

miretskiy force-pushed the scheduler branch 5 times, most recently from 118f65d to d16863c Compare March 26, 2022 15:59

miretskiy force-pushed the scheduler branch 2 times, most recently from 27c18b0 to d518fef Compare March 26, 2022 18:42

Yevgeniy Miretskiy added 2 commits March 26, 2022 15:53

miretskiy force-pushed the scheduler branch from d518fef to 2a5ff76 Compare March 26, 2022 19:53

miretskiy added backport-22.1.x labels Mar 26, 2022

craig bot merged commit 834eaa0 into cockroachdb:master Mar 27, 2022

blathers-crl bot mentioned this pull request Mar 27, 2022

release-22.1: jobs, sql: Avoid jobs/scheduled jobs lock up. #78565

Merged

miretskiy mentioned this pull request Mar 27, 2022

release-21.2: jobs, sql: Avoid jobs/scheduled jobs lock up #78583

Merged
