server,jobs: Better handle node drain #103709
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
This is better; I like it.
Could we also get a review from @stevendanna, who has a keener eye on the topic of channel concurrency? Thanks.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @miretskiy, and @rhu713)
pkg/jobs/registry.go
line 2021 at r1 (raw file):
close(r.drainRequested)
defer close(r.drainRequested)
Looks like you're closing the channel twice here.
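For context on why this matters: in Go, closing an already-closed channel panics at runtime, and the usual fix is exactly one deferred close. A minimal standalone sketch (the helper names here are illustrative, not code from this PR):

```go
package main

import "fmt"

// doubleClose closes a channel twice and reports whether the second
// close panicked; close of an already-closed channel always panics.
func doubleClose() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch := make(chan struct{})
	close(ch) // first close: fine, any readers unblock
	close(ch) // second close: panics with "close of closed channel"
	return false
}

// closeOnce shows the fix: close exactly once, via defer, so every
// return path closes the channel and no path closes it twice.
func closeOnce() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch := make(chan struct{})
	defer close(ch)
	return false
}

func main() {
	fmt.Println(doubleClose(), closeOnce()) // true false
}
```

The deferred close also covers early returns, which is why it is preferable to an explicit close at the end of the function.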
pkg/jobs/registry.go
line 2031 at r1 (raw file):
t.Reset(maxWait)
for numWait > 0 {
You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).
Sorry, I didn't mean to block this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @knz, @rhu713, and @stevendanna)
pkg/jobs/registry.go
line 2021 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Looks like you're closing the channel twice here.
Doh; that's what you get for late Sunday PRs. Should have defer-closed r.drainJobs instead.
pkg/jobs/registry.go
line 2031 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).
Ack.
@knz -- tftr -- okay to revert wait period to 10 seconds as part of this PR?
Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a
10s it is; tested by reverting the change that had set it to 0 in drain_test (it would become flaky if the job wait logic were as it was prior to this PR).
Force-pushed 8ee48a5 to 394c302.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)
pkg/jobs/registry.go
line 2052 at r3 (raw file):
func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++
It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested, the timer is almost finished, and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.
It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.
I think it's fine. We are shutting down -- races during that time happen all the time.
I think race detection could still trigger; but regardless, I'm not sure it's worth the effort.
@stevendanna mind taking a look?
I think jobs that care about this could immediately check whether the returned channel is closed before starting other work. This would be pretty similar to returning a bool or error, just an extra line of code or two.
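That non-blocking check can be sketched as follows, using a hypothetical `toyRegistry` stand-in for `jobs.Registry` (only the shape of `OnDrain` matches the real code; everything else is illustrative):

```go
package main

import "fmt"

// toyRegistry is a hypothetical stand-in for jobs.Registry, with just
// enough state to illustrate the pattern suggested above.
type toyRegistry struct {
	drainCh chan struct{} // closed when a drain has been requested
}

// OnDrain mirrors the shape of Registry.OnDrain: a channel that closes
// on drain, plus a cleanup func the caller invokes when done watching.
func (r *toyRegistry) OnDrain() (<-chan struct{}, func()) {
	return r.drainCh, func() {}
}

// drainAlreadyStarted registers for drain notifications and immediately
// does a non-blocking check for a drain that is already underway.
func drainAlreadyStarted(r *toyRegistry) bool {
	ch, cleanup := r.OnDrain()
	defer cleanup()
	select {
	case <-ch:
		return true // channel already closed: no full grace period left
	default:
		return false // not draining yet; proceed with normal work
	}
}

func main() {
	draining := &toyRegistry{drainCh: make(chan struct{})}
	close(draining.drainCh) // simulate a drain already in progress
	idle := &toyRegistry{drainCh: make(chan struct{})}
	fmt.Println(drainAlreadyStarted(draining), drainAlreadyStarted(idle)) // true false
}
```

A `select` with a `default` case never blocks, so a receive from an already-closed channel succeeds immediately and the open-channel case falls through, which is what makes this equivalent to returning a bool or error from OnDrain.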
Overall looks reasonable to me. Thanks!
pkg/jobs/registry.go
Outdated:
t := timeutil.NewTimer()
defer t.Stop()
t.Reset(maxWait)
No need to change this, but since the function already takes a context, the caller could set a deadline on the context rather than having a separate wait_time argument.
Ohh... I like this a lot. Making this change.
Rework job registry drain signal to terminate the drain as soon as the last job that was watching for drain signal completes its drain
Epic: CRDB-26978
Release note: None
bors r+
Build succeeded.