
server,jobs: Better handle node drain #103709

Merged 1 commit into cockroachdb:master on May 23, 2023
Conversation

@miretskiy (Contributor) commented May 22, 2023

Rework the job registry drain signal to terminate the drain as soon as the last job watching for the drain signal completes its drain.

Epic: CRDB-26978
Release note: None

@miretskiy miretskiy requested a review from knz May 22, 2023 00:40
@miretskiy miretskiy requested review from a team as code owners May 22, 2023 00:40
@miretskiy miretskiy requested review from rhu713 and jayshrivastava and removed request for a team May 22, 2023 00:40
@blathers-crl bot commented May 22, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:
This change is Reviewable

@knz (Contributor) previously requested changes May 22, 2023

This is better; I like it.
Could we also get a review from @stevendanna, who has a keener eye on the topic of channel concurrency? Thanks.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @miretskiy, and @rhu713)


pkg/jobs/registry.go line 2021 at r1 (raw file):

	close(r.drainRequested)
	defer close(r.drainRequested)

Looks like you're closing the channel twice here.


pkg/jobs/registry.go line 2031 at r1 (raw file):

	t.Reset(maxWait)

	for numWait > 0 {

You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).
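For illustration, a minimal stdlib-only sketch of what such a wait loop might look like when it also watches the quiescence channel; the function, channel, and parameter names here are assumptions for the example, not the code in this PR:

```go
package drainexample

import (
	"context"
	"time"
)

// waitForJobDrains is a sketch: it waits until numWait jobs have signaled
// completion on jobDone, or until the timeout expires, the context is
// canceled, or the server begins quiescing (e.g. the channel returned by
// stopper.ShouldQuiesce()), whichever happens first.
func waitForJobDrains(
	ctx context.Context,
	quiesce <-chan struct{},
	jobDone <-chan struct{},
	numWait int,
	maxWait time.Duration,
) {
	t := time.NewTimer(maxWait)
	defer t.Stop()
	for numWait > 0 {
		select {
		case <-jobDone:
			numWait--
		case <-t.C:
			return // gave up waiting for the remaining jobs
		case <-ctx.Done():
			return
		case <-quiesce:
			return // server is shutting down; stop waiting
		}
	}
}
```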

@knz (Contributor) left a comment

Sorry, I didn't mean to block this.

@knz knz dismissed their stale review May 22, 2023 09:28

not blocking

@knz knz requested a review from stevendanna May 22, 2023 09:29
@miretskiy miretskiy requested a review from knz May 22, 2023 11:20
@miretskiy (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @knz, @rhu713, and @stevendanna)


pkg/jobs/registry.go line 2021 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Looks like you're closing the channel twice here.

Doh; that's what you get for late Sunday PRs. Should have defer-closed r.drainJobs instead.
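A minimal sketch of the intended close ordering, assuming two registry channels named as in this thread (drainRequested signals that a drain has started; drainJobs signals that the registry has finished waiting on jobs); this is illustrative, not the actual patch:

```go
package drainexample

// registrySketch is a hypothetical stand-in for the fields discussed above.
type registrySketch struct {
	drainRequested chan struct{} // closed as soon as a drain is requested
	drainJobs      chan struct{} // closed after the registry finishes waiting on jobs
}

// drainRequestedSketch shows the corrected pattern: close drainRequested
// immediately, and defer the close of drainJobs until the wait completes.
func (r *registrySketch) drainRequestedSketch(wait func()) {
	close(r.drainRequested)
	defer close(r.drainJobs)
	wait() // wait for draining jobs (see the loop sketched earlier)
}
```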


pkg/jobs/registry.go line 2031 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).

Ack.

@miretskiy (Contributor, Author) commented:

@knz -- tftr -- okay to revert wait period to 10 seconds as part of this PR?

@knz (Contributor) commented May 22, 2023

okay to revert wait period to 10 seconds as part of this PR?

Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a cockroach start-single-node process gracefully.

@miretskiy (Contributor, Author) commented:

okay to revert wait period to 10 seconds as part of this PR?

Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a cockroach start-single-node process gracefully.

10s it is; tested by reverting the change that had set it to 0 in drain_test (it would become flaky if the job wait logic were as it was prior to this PR).

@miretskiy miretskiy force-pushed the draino branch 2 times, most recently from 8ee48a5 to 394c302 on May 22, 2023 12:07
@jayshrivastava (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)


pkg/jobs/registry.go line 2052 at r3 (raw file):

func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.

It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.
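A sketch of the fast path being suggested here (a hypothetical shape; the actual OnDrain in this PR returns a channel and a cleanup func without an error, and the field names below only follow the snippets quoted in this thread):

```go
package drainexample

import (
	"errors"
	"sync"
)

var errAlreadyDraining = errors.New("registry is already draining")

type registryWithDrainState struct {
	mu struct {
		sync.Mutex
		draining     bool
		numDrainWait int
	}
	drainRequested chan struct{}
}

// onDrainSketch illustrates the suggested fast path: refuse new waiters once
// the registry has already started draining, instead of registering them too
// late to receive the full grace period.
func (r *registryWithDrainState) onDrainSketch() (<-chan struct{}, func(), error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.draining {
		return nil, nil, errAlreadyDraining
	}
	r.mu.numDrainWait++
	cleanup := func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		// In the real code this would also notify the registry's wait loop.
		r.mu.numDrainWait--
	}
	return r.drainRequested, cleanup, nil
}
```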

@miretskiy (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)

pkg/jobs/registry.go line 2052 at r3 (raw file):

func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.

I think it's fine. We are shutting down -- races during that time happen all the time.
OnDrain is never a guarantee -- so if a job is relying on getting the full 10 seconds, it should stop doing that.
It's a best-effort affair.

It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.

I think race detection could still trigger; but regardless, I'm not sure it's worth the effort.

@miretskiy (Contributor, Author) commented:

@stevendanna mind taking a look?

@stevendanna (Collaborator) commented May 23, 2023

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry.

I think jobs that care about this could immediately check whether the returned channel is closed before starting other work. This would be pretty similar to returning a bool or error, just an extra line of code or two.
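Roughly what that consumer-side check might look like (a sketch only; the OnDrain signature follows the snippet quoted above, and the function name is illustrative):

```go
package drainexample

import "errors"

// startJobWorkSketch shows a job checking, right after registering, whether a
// drain is already in progress before it starts any other work.
func startJobWorkSketch(onDrain func() (<-chan struct{}, func())) error {
	drainCh, cleanup := onDrain()
	defer cleanup()

	select {
	case <-drainCh:
		// Drain already requested (the channel is closed); bail out early
		// rather than starting work there won't be time to finish.
		return errors.New("node is draining")
	default:
	}

	// ... proceed with normal job work, periodically watching drainCh ...
	return nil
}
```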

@stevendanna (Collaborator) left a comment

Overall looks reasonable to me. Thanks!

Comment on lines 2026 to 2028
t := timeutil.NewTimer()
defer t.Stop()
t.Reset(maxWait)
@stevendanna (Collaborator) commented:

No need to change this, but since the function already takes a context, the caller could set a deadline on the context rather than having a separate wait_time argument.
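A sketch of that alternative: the caller derives a context with a deadline, and the wait loop then only needs to select on ctx.Done() instead of managing its own timer (names are illustrative, not from the PR):

```go
package drainexample

import (
	"context"
	"time"
)

// Caller side: bound the entire drain wait with a context deadline instead of
// passing a separate wait-time argument.
func drainWithDeadline(ctx context.Context, maxWait time.Duration, drainFn func(context.Context)) {
	ctx, cancel := context.WithTimeout(ctx, maxWait)
	defer cancel()
	drainFn(ctx)
}

// Callee side: no timer needed; ctx.Done() fires either on cancellation or on
// the deadline set by the caller.
func waitForJobsSketch(ctx context.Context, jobDone <-chan struct{}, numWait int) {
	for numWait > 0 {
		select {
		case <-jobDone:
			numWait--
		case <-ctx.Done():
			return
		}
	}
}
```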

@miretskiy (Contributor, Author) replied:

Ohh... I like this a lot. Making this change.

pkg/jobs/registry.go (review thread resolved)
Rework job registry drain signal to terminate the drain
as soon as the last job that was watching for drain signal
completes its drain

Epic: CRDB-26978
Release note: None
@miretskiy (Contributor, Author) commented:

bors r+


craig bot commented May 23, 2023

Build succeeded.

@craig craig bot merged commit e3b7e0c into cockroachdb:master May 23, 2023