-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs, sql: Avoid jobs/scheduled jobs lock up. #78467
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @shermanCRL)
-- commits, line 28 at r2:
s/aref/are/
pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 92 at r1 (raw file):
jobRegistry *jobs.Registry, ) (jobspb.JobID, error) { if err := CheckExistingCompactionJob(ctx, nil /* job */, ie, txn); err != nil {
I don't understand this code enough to be able to make a determination about whether this is needed. Why is this safe to remove? Because in this path, the job scheduler is already providing mutual exclusion? If so, is that sufficient to ensure that we don't have a scheduled and non-scheduled compaction job run at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @nvanbenschoten, and @shermanCRL)
pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 92 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't understand this code enough to be able to make a determination about whether this is needed. Why is this safe to remove? Because in this path, the job scheduler is already providing mutual exclusion? If so, is that sufficient to ensure that we don't have a scheduled and non-scheduled compaction job run at the same time?
This method is only called from scheduled job executor. It turns out it is not used otherwise.
And yes, it's safe because scheduler ensures that only 1 instance runs by using jobs table index on "created_by" info set below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shermanCRL)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/aref/are/
done.
Going to ask @Azhng to take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @nvanbenschoten, and @shermanCRL)
pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 92 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
This method is only called from scheduled job executor. It turns out it is not used otherwise.
And yes, it's safe because scheduler ensures that only 1 instance runs by using jobs table index on "created_by" info set below.
I think this method is also called from the Resumer.
If job scheduler only allow 1 instance to run, I think the call in the resumer can also be deleted.
pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 106 at r1 (raw file):
// that are either PAUSED, CANCELED, or RUNNING. If so, it returns a // ErrConcurrentSQLStatsCompaction. func CheckExistingCompactionJob(
I guess we can also delete this method now 😛
If job scheduler only allow 1 instance to run, I think the call in the resumer can also be deleted. Okay.. Deleting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Azhng, @miretskiy, @nvanbenschoten, and @shermanCRL)
pkg/sql/sqlstats/persistedsqlstats/compaction_scheduling.go, line 84 at r5 (raw file):
// CreateCompactionJob creates a system.jobs record if there is no other // SQL Stats compaction job running. This is invoked by the scheduled job // Executor.
nit: I guess this comment might need some update
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Azhng, @miretskiy, @nvanbenschoten, and @shermanCRL)
118f65d
to
d16863c
Compare
27c18b0
to
d518fef
Compare
Scheduled jobs system, by default, ensures that there is only one instance of the job that is currently executing for the schedule. As such, it is not necessary to verify the compaction job does not exist when starting stats compaction job from schedule. Furthermore, due to the interaction of scheduling system, such checks results in wide system.jobs table scan, which causes scheduled job execution to be restarted if any other job modifies system.jobs table. Fixes cockroachdb#78465 Release Notes (sql): Stats compaction scheduled job no longer cause intent buildup. Release Justification: important stability fix to ensure jobs and scheduled jobs do not lock up when running stats compaction job.
Remove `schedules.round.schedules-ready-to-run` and `schedules.round.num-jobs-running` metrics from job scheduler. These metrics are very expensive to compute as they involve running wider table scans against both `system.jobs` and `system.scheduled_job`. In addition to being expensive to compute, these metrics are not needed since the query can be executed directly if needed, and, in addition these metrics are confusing since these metrics are per node, while the number of running jobs/schedules is cluster wide. More importantly, they can lead to job scheduler query being more expensive since they increase the read set of the scheduler transaction, thus causing txn restarts to be more expensive. Fixes cockroachdb#78447 Release Notes (enterprise): Remove expensive, unnecessary, and never used `schedules.round.schedules-ready-to-run` and `schedules.round.num-jobs-running` metrics from job schedulers. Release Justification: Stability fix for scheduled job system.
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from d7aaf92 to blathers/backport-release-21.2-78467: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Improve scheduled jobs system stability by removing expensive metrics calculations, and redundant existence check in stats compaction jobs. Each of these could/does result in a table scan, perhaps repetitively.
See commits for details.
Release Justification: scheduled job system stability improvements.