Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server,kvserver: wait for liveness record refresh on other nodes at the end of drain #55460

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

knz
Copy link
Contributor

@knz knz commented Oct 12, 2020

This implements solution 7 discussed in #52593, as proposed by @nvanbenschoten and @andreimatei.

And thus this PR helps with #58492 #59094.

Prior to this patch, it was possible for a node to shut down
gracefully "too quickly", before the other nodes got a chance to see
that the node has gone away.

In particular it was possible:

  • while the node was pushing leases away, it was possible for
    the other nodes to push them back (store rebalance / allocator).
    This is because the other nodes did not yet have a copy
    of the updated node descriptor marked "draining".

  • after the node had moved its leases away and stopped, it was
    possible for range caches on other nodes to continue to try to use
    replicas on the drained node.

To alleviate both issues, this commit makes the drain process wait
until the expiry deadline on the draining nodes' liveness, plus 5
seconds, before starting to transfer leases away. This way, there is
confidence during the lease transfer that the other nodes know the
draining node is, in fact, draining, and will not be considered as a
transfer target.

Additionally, an additional wait of 5 seconds is added at the
very end after all leases have transferred, so that if another
node still finds itself wanting to address a replica
on the now-drained node, it gets a chance to get a
NodeLeaseHolderError and a redirect to the new leaseholder.

@knz knz requested review from tbg and nvanbenschoten October 12, 2020 19:46
@knz knz requested a review from a team as a code owner October 12, 2020 19:46
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good. I think the second commit is quite unit testable, so we should do that, and for the first one I hope that we can put all of the sleeping and padding to the caller, and pass the work counter around with the reporter to make the code clearer.

I would like to know if we are happy with the roachtest kv/gracefuldraining as the primary way to verify this PR to be effective?

Assuming we have unit tests for the second commit, that sounds good to me. That test is quite specifically designed to catch problems here.

Reviewed 4 of 4 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 540 at r2 (raw file):

	curDiversityScore := rangeDiversityScore(existingStoreLocalities)
	for _, store := range allStores.stores {
		if isNodeValidForRoutineReplicaTransfer != nil && !isNodeValidForRoutineReplicaTransfer(ctx, store.Node.NodeID) {

These could use unit tests. We are already unit testing this method, so hopefully it's not too onerous.


pkg/kv/kvserver/node_liveness.go, line 350 at r1 (raw file):

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

You are not using this ticker?

Also, immediately when seeing this code, I was hoping that it could just return a duration, so that we can keep all the sleeping to the caller. (In particular, since the caller already has a random 5s sleep, let's have it add the 5s to this sleep here as well to the duration returned from this method).


pkg/server/drain.go, line 159 at r1 (raw file):

	reports := make(map[redact.SafeString]int)
	var mu syncutil.Mutex
	reporter := func(howMany int, what redact.SafeString) {

Rather than slapping a counter on *Server that will be hard to understand for future readers, can you make the "reporter" a struct { gracefulDrainingWork int64 /* internal */ } with a few helpful methods on it?


pkg/server/drain.go, line 256 at r1 (raw file):

		// acknowledged the draining state.
		//
		// We use gracefulDrainingWork to only perform this wait but only

Sentence is garbled.


pkg/server/drain.go, line 278 at r1 (raw file):

		// so that any stray range leases gets a chance to see a
		// NodeNotLeaseHolderError.
		log.Infof(ctx, "waiting a little more before shutting down the process")

nit: this message could sound less informal. How about

const extraSleep = 5*time.Second
log.Infof(ctx, "waiting %s for draining status to propagate across cluster")

In particular, I would avoid the reference to "shutdown" since we can drain a node without shutting it down.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/kv/kvserver/allocator_scorer.go, line 423 at r2 (raw file):

			continue
		}
		if isNodeValidForRoutineReplicaTransfer != nil && !isNodeValidForRoutineReplicaTransfer(ctx, s.Node.NodeID) {

Does it seem feasible to make isNodeValidForRoutineReplicaTransfer non-nullable? The more optional args we add to these functions, the more complex they become.


pkg/kv/kvserver/allocator_scorer.go, line 540 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

These could use unit tests. We are already unit testing this method, so hopefully it's not too onerous.

+1


pkg/kv/kvserver/store_pool.go, line 819 at r2 (raw file):

) bool {
	timeUntilStoreDead := TimeUntilStoreDead.Get(&sp.st.SV)
	now := sp.clock.Now().GoTime()

This should use PhysicalTime so that it doesn't have to synchronize with the HLC. The other uses of Now().GoTime() in this file probably should too.


pkg/server/drain.go, line 159 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Rather than slapping a counter on *Server that will be hard to understand for future readers, can you make the "reporter" a struct { gracefulDrainingWork int64 /* internal */ } with a few helpful methods on it?

I also found this hard to understand and like Tobi's suggestion. It makes it clear that the counter is scoped to a single reporter. It also allows us to avoid the atomic (I think), because this reporter doesn't need to be thread-safe. Or does it? In fact, why is there a lock here? I don't see any cases where this reporter is passed off to other goroutines.


pkg/server/drain.go, line 200 at r1 (raw file):

	// draining and don't accept to refresh their leases
	// on the liveness range.)
	if err := s.node.SetDraining(false /* drain */, reporter); err != nil {

Is this just for testing? Why would someone want to call this in a loop? The comment is helpful but doesn't fully point me in the right direction.


pkg/server/drain.go, line 257 at r1 (raw file):

		//
		// We use gracefulDrainingWork to only perform this wait but only
		// the first time drainNode() is called.

drainNode can be called multiple times? The use of this counter to make decisions puts a bad taste in my mouth. That's either because I don't understand its purpose entirely or because it shouldn't live on the Store.


pkg/server/drain.go, line 273 at r1 (raw file):

	}

	if atomic.LoadInt64(&s.gracefulDrainingWork) == 0 {

I must be misunderstanding this because the atomic load on a counter very strongly implies some async work, but I don't see any.


pkg/server/drain.go, line 277 at r1 (raw file):

		// shut down. Just before doing so however, wait a little bit more
		// so that any stray range leases gets a chance to see a
		// NodeNotLeaseHolderError.

NotLeaseHolderError

@knz knz force-pushed the 20201012-liveness branch from 5138627 to 8ae1b02 Compare October 15, 2020 10:19
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the 1st commit as suggested. Now working on the 2nd one and its unit tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/node_liveness.go, line 350 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You are not using this ticker?

Also, immediately when seeing this code, I was hoping that it could just return a duration, so that we can keep all the sleeping to the caller. (In particular, since the caller already has a random 5s sleep, let's have it add the 5s to this sleep here as well to the duration returned from this method).

Done.


pkg/server/drain.go, line 159 at r1 (raw file):
Good idea. Done.

this reporter doesn't need to be thread-safe. Or does it? In fact, why is there a lock here? I don't see any cases where this reporter is passed off to other goroutines.

I didn't want to create assumptions here about how the "deeper" layers in the KV package are going to use concurrency or not to perform the drain process.


pkg/server/drain.go, line 200 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this just for testing? Why would someone want to call this in a loop? The comment is helpful but doesn't fully point me in the right direction.

The drain process calls Drain() in a loop (both at the top level in start.go, and in the Drain RPC called by cockroach quit). This is the new graceful shutdown sequence since about December 2019.

The lack of this call to node.SetDraining() was actually a hidden bug. We did not encounter it in our testing because the drain was fast enough that the range lease on liveness was not expiring across iterations. The moment it did however, the drain call would deadlock.

The remainder of my change in this PR (adding an additional delayed) revealed that deadlock, hence this fix.


pkg/server/drain.go, line 256 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sentence is garbled.

Removed.


pkg/server/drain.go, line 257 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

drainNode can be called multiple times? The use of this counter to make decisions puts a bad taste in my mouth. That's either because I don't understand its purpose entirely or because it shouldn't live on the Store.

Fixed.


pkg/server/drain.go, line 273 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I must be misunderstanding this because the atomic load on a counter very strongly implies some async work, but I don't see any.

That wasn't necessary indeed.


pkg/server/drain.go, line 278 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: this message could sound less informal. How about

const extraSleep = 5*time.Second
log.Infof(ctx, "waiting %s for draining status to propagate across cluster")

In particular, I would avoid the reference to "shutdown" since we can drain a node without shutting it down.

Good idea. Done

@knz knz force-pushed the 20201012-liveness branch from 8ae1b02 to 1f28b64 Compare October 15, 2020 12:25
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now with tests. RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)

@knz knz force-pushed the 20201012-liveness branch from 1f28b64 to 521e590 Compare October 15, 2020 12:26
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/allocator_scorer.go, line 423 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does it seem feasible to make isNodeValidForRoutineReplicaTransfer non-nullable? The more optional args we add to these functions, the more complex they become.

Done


pkg/kv/kvserver/allocator_scorer.go, line 540 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

+1

Done.


pkg/kv/kvserver/store_pool.go, line 819 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This should use PhysicalTime so that it doesn't have to synchronize with the HLC. The other uses of Now().GoTime() in this file probably should too.

The comment in node_liveness.go says:

// NOTE: If one is interested whether the Liveness is valid currently, then the
// timestamp passed in should be the known high-water mark of all the clocks of
// the nodes in the cluster. For example, if the liveness expires at ts 100, our
// physical clock is at 90, but we know that another node's clock is at 110,
// then it's preferable (more consistent across nodes) for the liveness to be
// considered expired. For that purpose, it's better to pass in
// clock.Now().GoTime() rather than clock.PhysicalNow() - the former takes into
// consideration clock signals from other nodes, the latter doesn't.

Are you sure about PhysicalTime()?

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 200 at r1 (raw file):
Something seems fishy to me here. We're starting the draining process by undoing something that we might have previously done towards draining... It doesn't feel right. We need to call this SetDraining(false) because, otherwise, we can't transfer leases anymore (for some reason that's ultimately caused by nobody wanting to take a lease on the liveness range), right?
The way I see it, the real problem is that nobody wants to take a lease on the liveness range. That's what #55624 tries to fix, and so I hope that with that PR, this code would no longer be necessary. My thinking is that it's straight up bad that we can get into a situation where there is quorum for a range, but nobody wants to take the damn lease (or, in the single-node case, that translates to the node being alive but refusing to take the lease). A draining node would rather not take new leases, but if there's a good reason to take one, then it should take it - draining be damned. And a good reason would be the fact that there's no other node that's alive and not draining that could take that lease. This condition can be codified as having the leadership - assuming that we had a mechanism which tries to move leadership for ranges with no leases away from draining nodes onto non-draining nodes (which unfortunately I think we currently do not have because the relevant code doesn't do anything when there's no lease).
Anyway, long story short, I think we should just do #55624 and make sure that as long as there's quorum, the lower layers of the system just work regardless of draining.

The drain process calls Drain() in a loop (both at the top level in start.go, and in the Drain RPC called by cockroach quit). This is the new graceful shutdown sequence since about December 2019.

Separately, I was actually meaning to talk to you about this. It's not directly related to this PR, but I think this is good place to discuss it. Do we like this calling Server.Drain() in a loop? Each iteration starts over (and, in fact, this code that we're debating here technically makes it backtrack on what it had already done before)... The way I'm thinking Server.Drain() should work is that it should start an async process that runs forever and keeps trying to transfer leases away. If/when the higher layer has had enough, it can kill the process (in the case of a second SIGINT, or whatever). Basically, I'd ask what the point of returning from Server.Drain(), particularly returning before all the leases have been transferred, is?

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 200 at r1 (raw file):

We need to call this SetDraining(false) because, otherwise, we can't transfer leases anymore (for some reason that's ultimately caused by nobody wanting to take a lease on the liveness range), right?

I've re-read this a bunch and scratched my head to figure out how come this patch even fixes the deadlock, given that, with the patch, s.doDrain() calls s.node.SetDraining(false), but then quickly calls s.drainNode() which quickly calls s.node.SetDraining(true). I believe the answer has to be that the patch simply makes the s.nodeLiveness.SetDraining(ctx, true /* drain */, reporter) call work - which is done in the brief period between s.node.SetDraining(false) - s.node.SetDraining(true). Right?
If that's so, it seems to me that the s.node.SetDraining(false) call could be localized to s.drainNode(), which would be better. Or even better, we could elide doing the s.nodeLiveness.SetDraining(true /* drain */) call when our liveness record already has the draining bit set. Right?

But the more I look at it, the less I like it, and the more I think that the problem should be solved at a lower level.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 200 at r1 (raw file):

I'd ask what the point of returning from Server.Drain(), particularly returning before all the leases have been transferred, is?

To display a periodic progress indicator to the log file and the user of cockroach quit. This is actually a solution to a major UX shortcoming of the previous approach, which wouldn't let the operator see the difference between a well-functioning but long drain process and one that's deadlocked.

I believe the answer has to be that the patch simply makes the s.nodeLiveness.SetDraining(ctx, true /* drain */, reporter) call work - which is done in the brief period between s.node.SetDraining(false) - s.node.SetDraining(true). Right?

yes I think that's what I intended to ensure that works.

we could elide doing the s.nodeLiveness.SetDraining(true /* drain */) call when our liveness record already has the draining bit set.

That would be ideal. But how? I can't read the record if the lease on the range has expired; the store won't let me!

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding at a higher level:

But the more I look at it, the less I like it, and the more I think that the problem should be solved at a lower level.

I'll be frank: I don't care as much as you do how we solve this; I just care about solving the proximate problem (let other nodes with a cached record see a NLE) and the other problem I found while investigating (allocator / rebalancer using draining nodes as targets).

If you don't like my solution here, then just tell me what you want me to do. I don't know myself, and nothing in the code is telling me. If you don't know better than I do, then ... what exactly? Am I supposed to scratch my head and mash the keyboard randomly until a good solution occurs? I don't understand 60% of the code that I'm trying to fix here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 9 files at r3, 7 of 7 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/store_pool.go, line 819 at r2 (raw file):

Previously, knz (kena) wrote…

The comment in node_liveness.go says:

// NOTE: If one is interested whether the Liveness is valid currently, then the
// timestamp passed in should be the known high-water mark of all the clocks of
// the nodes in the cluster. For example, if the liveness expires at ts 100, our
// physical clock is at 90, but we know that another node's clock is at 110,
// then it's preferable (more consistent across nodes) for the liveness to be
// considered expired. For that purpose, it's better to pass in
// clock.Now().GoTime() rather than clock.PhysicalNow() - the former takes into
// consideration clock signals from other nodes, the latter doesn't.

Are you sure about PhysicalTime()?

TIL! Thanks for pointing that out. I guess sp.clock.Now().GoTime() is right.


pkg/kv/kvserver/store_rebalancer.go, line 530 at r4 (raw file):

						reason = redact.Sprintf(" (qps %.2f vs max %.2f)", storeDesc.Capacity.QueriesPerSecond, maxQPS)
					}
					log.VEventf(ctx, 3, "keeping r%d/%d on s%d%s", desc.RangeID, currentReplicas[i].ReplicaID, currentReplicas[i].StoreID, reason)

Does this need to be &reason for now?


pkg/server/drain.go, line 159 at r1 (raw file):

Previously, knz (kena) wrote…

Good idea. Done.

this reporter doesn't need to be thread-safe. Or does it? In fact, why is there a lock here? I don't see any cases where this reporter is passed off to other goroutines.

I didn't want to create assumptions here about how the "deeper" layers in the KV package are going to use concurrency or not to perform the drain process.

I lean towards thinking that if deeper layers in the KV package want to use concurrency, they can perform their own synchronization on top of the drainProgress. But if you don't like that, then I guess document that the type is thread-safe.


pkg/server/drain.go, line 164 at r4 (raw file):

	// We use a closure to ensure the defer above is run in all return
	// paths, including errors.

Huh? Why wouldn't the defer be run in all return paths?


pkg/server/drain.go, line 337 at r4 (raw file):

		remaining += uint64(howMany)
		descBuf.Printf("%s%s: %d", comma, what, howMany)
		comma = ", "

Does this need to be redact.SafeString(", ")?

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we talking strictly about the s.node.SetDraining(false /* drain */) call, or something else? I haven't looked in detail at the rest of the PR, everything I've been talking about was strictly about that line.
And there the suggestion is to simply drop the change and rely on #55624 instead...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 200 at r1 (raw file):

I'd ask what the point of returning from Server.Drain(), particularly returning before all the leases have been transferred, is?

To display a periodic progress indicator to the log file and the user of cockroach quit. This is actually a solution to a major UX shortcoming of the previous approach, which wouldn't let the operator see the difference between a well-functioning but long drain process and one that's deadlocked.

Sure, but you can have an async process do the transfers and report the progress on a channel to the caller. In fact I though that was the point of the progress reporting callback that I've seen passed around.

we could elide doing the s.nodeLiveness.SetDraining(true /* drain */) call when our liveness record already has the draining bit set.

That would be ideal. But how? I can't read the record if the lease on the range has expired; the store won't let me!

Wouldn't it be enough to call s.nodeLiveness.Self(), which gets its info either from gossip, or directly from the last record written if the current node was the last writer?
In any case, my first preference is to get rid of this change and rely on #55624 (assuming it proves to work, of course)

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r3, 7 of 7 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/node_liveness.go, line 355 at r3 (raw file):

		// Our liveness record does not exist yet? This is surprising,
		// but it does mean we have nothing to do here.
		log.Infof(ctx, "no liveness record on this node, no expiry to wait on")

Logging from a method that also returns an error is (at least to me) an anti-pattern. The method could be called frequently (or so we have to assume, given its generality). I would remove the log line. Besides, you could anchor this method on a liveness record and you would sidestep this question. The drain path would then to handle the case in which the record is not available, which seems cleaner.


pkg/kv/kvserver/node_liveness.go, line 358 at r3 (raw file):

		return 0, nil
	}
	if !liveness.Draining {

This is surprising logic given what this method claims to do. I think it should move to the caller (at which point there's nnot much left of this method) or go away. My preference is the latter - if someone calls this, they're draining.


pkg/server/drain.go, line 200 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd ask what the point of returning from Server.Drain(), particularly returning before all the leases have been transferred, is?

To display a periodic progress indicator to the log file and the user of cockroach quit. This is actually a solution to a major UX shortcoming of the previous approach, which wouldn't let the operator see the difference between a well-functioning but long drain process and one that's deadlocked.

Sure, but you can have an async process do the transfers and report the progress on a channel to the caller. In fact I though that was the point of the progress reporting callback that I've seen passed around.

we could elide doing the s.nodeLiveness.SetDraining(true /* drain */) call when our liveness record already has the draining bit set.

That would be ideal. But how? I can't read the record if the lease on the range has expired; the store won't let me!

Wouldn't it be enough to call s.nodeLiveness.Self(), which gets its info either from gossip, or directly from the last record written if the current node was the last writer?
In any case, my first preference is to get rid of this change and rely on #55624 (assuming it proves to work, of course)

I think we should table the discussion about an async worker goroutine for now. It has its merits but is also more engineering work that is best tackled separately if we decide that it buys us something new.

I agree though that we should get to a place where subsequent Drain invocations work without this SetDraining(false) call (or at least it's crystal clear why we can't do better right now). I haven't exactly understood why it is necessary in the first place. According to Andrei, it's to make (*NodeLiveness).SetDraining work but I don't understand why that call would have issues on an already draining node. That node's stores may not get new leases any more, but the other nodes still will, so why exactly are we deadlocking?


pkg/server/drain.go, line 164 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Huh? Why wouldn't the defer be run in all return paths?

I also don't understand this.

@tbg tbg self-requested a review October 20, 2020 11:14
@knz knz force-pushed the 20201012-liveness branch from 521e590 to 7f18576 Compare October 20, 2020 13:10
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/node_liveness.go, line 355 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Logging from a method that also returns an error is (at least to me) an anti-pattern. The method could be called frequently (or so we have to assume, given its generality). I would remove the log line. Besides, you could anchor this method on a liveness record and you would sidestep this question. The drain path would then to handle the case in which the record is not available, which seems cleaner.

The logging was a left over from a previous version of this code. Removed.


pkg/kv/kvserver/node_liveness.go, line 358 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is surprising logic given what this method claims to do. I think it should move to the caller (at which point there's nnot much left of this method) or go away. My preference is the latter - if someone calls this, they're draining.

This was meant as an assertion. Clarified with errors.AssertionFailedf.


pkg/kv/kvserver/store_rebalancer.go, line 530 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does this need to be &reason for now?

nah the indirection is only needed for redact.StringBuilder. RedactableString is fine.


pkg/server/drain.go, line 159 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I lean towards thinking that if deeper layers in the KV package want to use concurrency, they can perform their own synchronization on top of the drainProgress. But if you don't like that, then I guess document that the type is thread-safe.

Done.


pkg/server/drain.go, line 200 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think we should table the discussion about an async worker goroutine for now. It has its merits but is also more engineering work that is best tackled separately if we decide that it buys us something new.

I agree though that we should get to a place where subsequent Drain invocations work without this SetDraining(false) call (or at least it's crystal clear why we can't do better right now). I haven't exactly understood why it is necessary in the first place. According to Andrei, it's to make (*NodeLiveness).SetDraining work but I don't understand why that call would have issues on an already draining node. That node's stores may not get new leases any more, but the other nodes still will, so why exactly are we deadlocking?

I'm going to try this again without and report with the symptoms.


pkg/server/drain.go, line 164 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I also don't understand this.

This was over-engineered. I was hoping to simplify the return statements. Removed.


pkg/server/drain.go, line 337 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does this need to be redact.SafeString(", ")?

no. The type is decided above already.

@knz knz force-pushed the 20201012-liveness branch from 7f18576 to 2070b6e Compare October 20, 2020 13:11
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 200 at r1 (raw file):

That node's stores may not get new leases any more, but the other nodes still will, so why exactly are we deadlocking?

I'm assuming that Rafa ran into trouble when draining a single-node clusters.

@tbg tbg requested review from andreimatei and tbg October 28, 2020 11:55
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 219 at r5 (raw file):

Previously, knz (kena) wrote…

We don't eat that in CI, right?

You mean TestServer? No: (*TestServer).Stop() does not call Drain().
acceptance tests and roachtests would though.

I'm also not sure this is really needed. We are already waiting an average of 4-5s (at minimum) for the liveness record expiration just above this.

The wait above is a minimum: it's the minimum time necessary to ensure the other nodes will reload liveness.

However after that delay has elapsed, it may take a little more time for these nodes to actually complete the reload from KV (due to disk access, perhaps some network roundtrips etc). So to get confidence they have completed their liveness record reload, we wait a little bit more.

I'm OK with waiting less than 5 seconds, but it sounds to me that 1s is a bit on the low side.

Isn't it the maximum? If we've waited that long, the previous liveness record has "definitely" expired, so either the nodes have the draining one or none (so they're also going to stay away from it).
We mark the record as draining and it propagates out via gossip, so "usually" one gossip propagation delay later everyone knows about the draining status. That should be much faster than 5s.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 196 at r5 (raw file):

Previously, knz (kena) wrote…

Here's my reasoning, and I'd appreciate if you could weigh in one way or another.

The reasoning is that GetAvailableNodeCount() is best effort as it relies on gossip. If there's a blip in gossip, we may be seeing just 1 node where 3 or more are in fact available and serving traffic.

I wanted to be conservative in this first wait so that any real-node-but-hidden-behind-gossip-failure get a chance to reload their liveness record.

This first wait is super important because without it the lease shedding simply breaks.

The two other waits below are less important: they make the drain faster and make other 'observer' nodes discover the lease change via NLE faster too.

For a true single-node cluster, do we even get toWait > 0? There shouldn't be any work triggered, right?

I generally agree with you that we want a robust way to detect single-node clusters, and if toWait == 0 is implied by single-node, that would be ideal. I do feel that having a true single-node cluster wait here is not good enough, given how that impacts the demo/start-single-node experience. It's a papercut, and one annoying enough to avoid.


pkg/server/drain.go, line 230 at r5 (raw file):

Previously, knz (kena) wrote…

I did mean >1. Good catch.

Can you still clarify, in comments, whether the node is itself guaranteed to be part of this set? After all, it may or may not have been draining when we entered this method, so I assume GetAvailableNodeCount counts draining nodes? Would be good to comment on that on the first usage in this method.

@tbg tbg self-requested a review October 28, 2020 12:27
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/server/drain.go, line 196 at r5 (raw file):
toWait is the amount of time remaining on the node descriptor. It can certanly be >0 for a single node cluster.

I do feel that having a true single-node cluster wait here is not good enough, given how that impacts the demo/start-single-node experience.

I don't understand this sentence. Can you rephrase?

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/server/drain.go, line 196 at r5 (raw file):

Previously, knz (kena) wrote…

toWait is the amount of time remaining on the node descriptor. It can certanly be >0 for a single node cluster.

I do feel that having a true single-node cluster wait here is not good enough, given how that impacts the demo/start-single-node experience.

I don't understand this sentence. Can you rephrase?

That we should find heuristics that are good enough so that ./cockroach start-single-node never inserts artificial sleeps during drain.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/server/drain.go, line 219 at r5 (raw file):

So to get confidence they have completed their liveness record reload, we wait a little bit more.

But we don't need other nodes to have completed reloading the liveness record, do we? We only need them to stop transferring leases based on the old one - which I assume is related to still using the pre-draining one and stops once that has expired (maybe modulo the clock offset, I don't know how the allocator takes that into consideration).


pkg/server/drain.go, line 170 at r6 (raw file):

		// The problem that this wait solves is the following. When other
		// nodes already have replicas for the same ranges as this node,
		// these other nodes may attempt to transfer leases to this node

Instead of this sleep, have you considered making a draining node refuse lease transfers (i.e. TransferLeaseRequests)? There's a bit of an explosion of sleeps which seem to me to be increasingly hard to explain, so I think alternatives that don't need reasoning about when other nodes find our different things are attractive.


pkg/server/drain.go, line 192 at r6 (raw file):

		if toWait > 0 {
			log.Infof(ctx, "waiting %s for the liveness record to expire", toWait)

Can we not do this waiting, and the one before it, in parallel with transferring leases (and thus amortize them)?


pkg/server/drain.go, line 201 at r6 (raw file):

			// If we believe there are other nodes, we also wait 5 seconds
			// past the expiration to give ample time for these nodes to
			// re-load their copy of this node's descriptor, prior to us

do you really wanna say "node descriptor" here? Does the descriptor change during draining?


pkg/server/drain.go, line 249 at r6 (raw file):

			"waiting %s so that final requests to this node from rest of cluster can be redirected",
			extraDrainWait)
		time.Sleep(extraDrainWait)

I think it's confusing that this extraDrainWait for two different sleeps. If we keep both sleeps, I think we should use a different constant.


pkg/server/drain.go, line 302 at r6 (raw file):

}

// report registers some drain work to the drainProgress tracker.

nit: I think the word "register" is confusing because it implies a lifetime and a de-registration. You can simply say
// report some draining work ...

@tbg tbg requested a review from andreimatei November 3, 2020 14:58
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)


pkg/server/drain.go, line 219 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

So to get confidence they have completed their liveness record reload, we wait a little bit more.

But we don't need other nodes to have completed reloading the liveness record, do we? We only need them to stop transferring leases based on the old one - which I assume is related to still using the pre-draining one and stops once that has expired (maybe modulo the clock offset, I don't know how the allocator takes that into consideration).

I agree that the argument about the liveness record doesn't really make sense here. We're already sleeping above until the record is expired, so nobody will use it any more (let's ignore clock offsets - the sleep above is copious). In fact, the node won't be used any more once the draining flag is discovered, which is one replication + one gossip delay only, so mostly significantly faster than the expiration above.

The extra 5s may be useful to get DistSender caches invalidated on the other nodes. But, honestly, I'm not sure sure. Given that Raphael claims to have seen gracefuldraining fail even with these changes, I am tempted to keep this in for now, find out why gracefuldraining fails with this code, fix that, and then reconsider this extra sleep. My hope is that it can be removed.


pkg/server/drain.go, line 170 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Instead of this sleep, have you considered making a draining node refuse lease transfers (i.e. TransferLeaseRequests)? There's a bit of an explosion of sleeps which seem to me to be increasingly hard to explain, so I think alternatives that don't need reasoning about when other nodes find our different things are attractive.

How does a node refuse a lease transfer? Last I checked, there was no such thing and I'm 👎 on incurring this additional complexity now.


pkg/server/drain.go, line 201 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

do you really wanna say "node descriptor" here? Does the descriptor change during draining?

I think "liveness record" is the one of interest here.

@knz knz force-pushed the 20201012-liveness branch from 3189210 to fa2ea3e Compare November 12, 2020 17:58
craig bot pushed a commit that referenced this pull request Nov 12, 2020
55808: kvserver: skip non-live nodes when considering candidates for transfers r=tbg a=knz

(This PR is forked off #55460 to simplify the discussion. I believe there's no discussion left here? Maybe I can merge it directly?)

Fixes #55440.

Prior to this patch, 3 components could attempt to transfer a replica
to a node currently being drained:

- the store rebalancer, which rebalances replicas based on disk
  usage and QPS.
- the allocator, to place new replicas.
- the allocator, to rebalance replicas depending on load.

This commit introduces a consideration for node liveness when building
the list of candidates, to detect whether a target node is
acceptable. Any node that is not LIVE according to its liveness status
is not considered for a transfer.

Release note (bug fix): In some cases CockroachDB would attempt to
transfer ranges to nodes in the process of being decommissioned or
being shut down; this could cause disruption the moment the node
did actually terminate. This bug has been fixed. It had been
introduced some time before v2.0.

56334: kvserver: use messages on NotLeaseholderErrors everywhere r=andreimatei a=andreimatei

NLHE permits custom messages in it, but the field was rarely used. This
patch makes every instance where we instantiate the error provide a
message, since this error comes from a wide variety of conditions.

Release note: None

56345: opt: increase cost for table descriptor fetch during virtual scan r=rytaft a=rytaft

This commit bumps the cost of each virtual scan to `25*randIOCostFactor`
from its previous value of `10*randIOCostFactor`. This new value threads
the needle so that a lookup join will still be chosen if the predicate
is very selective, but the plan for the PGJDBC query identified in #55140
no longer includes lookup joins.

Fixes #55140

Release note (performance improvement): Adjusted the cost model in
the optimizer so that the optimizer is less likely to plan a lookup
join into a virtual table. Performing a lookup join into a virtual
table is expensive, so this change will generally result in better
performance for queries involving joins with virtual tables.

56525: bazel: Move third party repositories to c-deps/REPOSITORIES.bzl r=otan a=alan-mas

bazel: Move third party repositories to c-deps/REPOSITORIES.bzl

This is one of the Bazel re-factoring that we are working on
and it is about to move third party repositories out of root WORKSPACE. 
fixes #56053

Best practices is to separate external dependencies and it also
hides the repo WORKSPACE from being used by other directories.

We are creating a new .bzl file inside c-deps with all the external dependencies
and then load then inside our root WORKSPACE

Release note: None

56589: sql: resolve error due to drop table after schema change in same txn r=ajwerner a=jayshrivastava

Previously, if a drop table statement was executed in a transaction
following other schema changes to the table in the same transaction,
an error would occur. This error was due to the drop table statement
marking previous jobs as succeeded and then proceeding to modify them.
This change ensures that drop table statement will delete all existing
jobs from the job cache so that it does not interfere with previous jobs.

Release note (sql change): A table can successfully be dropped in
a transaction following other schema changes to the table in the
same transaction.

This resolves one of the issues in #56235

56597: colflow: fix recent misuse of two slices in the flow setup r=yuzefovich a=yuzefovich

We've recently added the reusing of metadataSourcesQueue and toClose
slices in order to reduce some allocations. However, the components that
are using those slices don't make a deep copy, and as a result, we
introduced a bug in which we were breaking the current contract. This
commit fixes the issue by going back to the old method (with slight
difference in that we currently delay any allocations unlike previously
when we allocated a slice with capacity of 1).

Release note: None (no release with this bug)

56598: tree: introduce concept of "default" collation r=rafiss a=otan

Resolves #54989


Release note (sql change): Introduced a pg_collation of "default".
Strings now return the "default" collation OID in the pg_attribute
table (this was previously en_US). The "default" collation is also
visible on the pg_collation virtual table.



56602: roachpb: remove various `(gogoproto.equal)` options r=nvanbenschoten a=tbg

The first commit explains why some cleanup was necessary,
the others are the result of spending a little extra time
cleaning up "unnecessarily".

There are plenty of Equals left to clean up, but the returns
were diminishing. The important part is that when additions
to the KV API are made, nobody will be forced to add the
`equal` option any more.

- roachpb: don't generate Equal() on Error
- roachpb: remove more `Equal` methods
- roachpb: remove (gogoproto.equal) from api.proto
- roachpb: mostly remove (gogoproto.equal) from data.proto
- roachpb: remove Value.Equal
- kvserverpb: remove Equal from ReplicatedEvalResult


Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Alanmas <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@knz knz force-pushed the 20201012-liveness branch from fa2ea3e to 8ad85fd Compare January 4, 2021 15:27
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 196 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That we should find heuristics that are good enough so that ./cockroach start-single-node never inserts artificial sleeps during drain.

This is the purpose of the call to GetAvailableNodeCount() in this method. Is there anything that's needed?

Also, let's reframe this conversation in the context of #58417: if/when we have proper support for whole-cluster shutdowns, we shouldn't are that the non-single-node mechanism waits extra when run on a 1-node cluster.


pkg/server/drain.go, line 219 at r5 (raw file):
So recall this has been explained in the following comment:

Quoted 5 lines of code…
			// This wait is not necessary for correctness; it is merely an
			// optimization: it reduces the probability that another node
			// hasn't seen the expiration yet and tries to transfer a
			// lease back to this draining node during the lease drain
			// below.

I added this wait because I was seeing that the process to shed leases away was sometimes working "backwards" at the beginning, i.e. there exists a short amount of time where the other nodes have not picked up the liveness expiration yet, and are still transferring leases.

However, this does not persist, and the draining eventually succeeds. Also, there's the separate NLE wait at the end below to serve as final "blocker" for any stray requests.

So this wait here is not needed for correctness; it's merely a "performance" optimization. I'm willing to consider lowering it a bit further if so desired.


pkg/server/drain.go, line 230 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you still clarify, in comments, whether the node is itself guaranteed to be part of this set? After all, it may or may not have been draining when we entered this method, so I assume GetAvailableNodeCount counts draining nodes? Would be good to comment on that on the first usage in this method.

In the version of the code that you've looked at, GetAvailableNodeCount counts draining nodes for which the liveness record has not expired yet. Given that the draining node is merely marking its liveness as draining, but by the time we've reached this point, it's still heartbeating it, then it's never going to see it as effectively expired.

So the count includes the node itself. I had previously mistakenly misunderstood this. Fixed.


pkg/server/drain.go, line 192 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Can we not do this waiting, and the one before it, in parallel with transferring leases (and thus amortize them)?

Nope we can't; explained in comment: we need to wait for the liveness record to be reloaded in other nodes otherwise the leases will be transferred back.


pkg/server/drain.go, line 201 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think "liveness record" is the one of interest here.

Done.


pkg/server/drain.go, line 249 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it's confusing that this extraDrainWait for two different sleeps. If we keep both sleeps, I think we should use a different constant.

Done.


pkg/server/drain.go, line 302 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: I think the word "register" is confusing because it implies a lifetime and a de-registration. You can simply say
// report some draining work ...

Done.

@knz
Copy link
Contributor Author

knz commented Jan 4, 2021

NB: this PR as-is passes with the gracefuldraining roachtest, after #58419 goes in.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r7.
Dismissed @andreimatei from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)


pkg/server/drain.go, line 196 at r5 (raw file):

Previously, knz (kena) wrote…

This is the purpose of the call to GetAvailableNodeCount() in this method. Is there anything that's needed?

Also, let's reframe this conversation in the context of #58417: if/when we have proper support for whole-cluster shutdowns, we shouldn't are that the non-single-node mechanism waits extra when run on a 1-node cluster.

New day, new suggestion:

  1. replace availableNodes with something that includes decommissioning nodes (i.e. it is all of the live nodes, since decommissioning nodes get kicked out of the cluster there shouldn't be much of a reason to include them any more but if you want to exclude decommissioned ones explicitly)
  2. if availableNodes == 1, don't do any waiting (not for the liveness record and no extra wait)

pkg/server/drain.go, line 219 at r5 (raw file):

Previously, knz (kena) wrote…

So recall this has been explained in the following comment:

			// This wait is not necessary for correctness; it is merely an
			// optimization: it reduces the probability that another node
			// hasn't seen the expiration yet and tries to transfer a
			// lease back to this draining node during the lease drain
			// below.

I added this wait because I was seeing that the process to shed leases away was sometimes working "backwards" at the beginning, i.e. there exists a short amount of time where the other nodes have not picked up the liveness expiration yet, and are still transferring leases.

However, this does not persist, and the draining eventually succeeds. Also, there's the separate NLE wait at the end below to serve as final "blocker" for any stray requests.

So this wait here is not needed for correctness; it's merely a "performance" optimization. I'm willing to consider lowering it a bit further if so desired.

I think 5s is excessive given that we just spent ~4.5s on average just sitting around with toWait, during basically all of which all nodes in the cluster already agreed within a few network latencies that the node is now draining. I'm fine with 2s though ultimately I don't expect it to be necessary at all, unless toWait ends up being very small. In that case we would have to do something, so a blanket 2s is probably good enough to cover all of our bases without adding excessive delays.


pkg/server/drain.go, line 230 at r5 (raw file):

Previously, knz (kena) wrote…

In the version of the code that you've looked at, GetAvailableNodeCount counts draining nodes for which the liveness record has not expired yet. Given that the draining node is merely marking its liveness as draining, but by the time we've reached this point, it's still heartbeating it, then it's never going to see it as effectively expired.

So the count includes the node itself. I had previously mistakenly misunderstood this. Fixed.

As written, there would be weirdness in a two node cluster with one node decommissioning (unusual, I know), but with my comment further up this would go away since we don't care about the decommissioning status any more.

Though maybe there's a similar problem when draining a node which has trouble marking itself as live? We could erroneously avoid going into this conditional, which seems fine.


pkg/server/drain.go, line 163 at r7 (raw file):

	// record has not expired. This includes the current (draining) node,
	// because marking the liveness record as draining does not prevent
	// it from heartbeating.

but what if the current node is also decommissioning? It's legitimate to want to drain a decommissioning node, I suppose. Wouldn't be an issue after my comment below.


pkg/server/drain.go, line 202 at r7 (raw file):

		}

		if availableNodes > 2 {

We do want to include decommissioning nodes here (and everywhere) though, since they can still hold replicas. This issue would go away with my comment higher up in this file.

@knz knz force-pushed the 20201012-liveness branch 2 times, most recently from c58953f to f854492 Compare January 21, 2021 16:05
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new unit test for this feature. It was a bit convoluted to design, but I think the overall structure works. Can you have a look and comment?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @tbg)


pkg/server/drain.go, line 196 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

New day, new suggestion:

  1. replace availableNodes with something that includes decommissioning nodes (i.e. it is all of the live nodes, since decommissioning nodes get kicked out of the cluster there shouldn't be much of a reason to include them any more but if you want to exclude decommissioned ones explicitly)
  2. if availableNodes == 1, don't do any waiting (not for the liveness record and no extra wait)

Done.


pkg/server/drain.go, line 219 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think 5s is excessive given that we just spent ~4.5s on average just sitting around with toWait, during basically all of which all nodes in the cluster already agreed within a few network latencies that the node is now draining. I'm fine with 2s though ultimately I don't expect it to be necessary at all, unless toWait ends up being very small. In that case we would have to do something, so a blanket 2s is probably good enough to cover all of our bases without adding excessive delays.

Done.


pkg/server/drain.go, line 230 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

As written, there would be weirdness in a two node cluster with one node decommissioning (unusual, I know), but with my comment further up this would go away since we don't care about the decommissioning status any more.

Though maybe there's a similar problem when draining a node which has trouble marking itself as live? We could erroneously avoid going into this conditional, which seems fine.

Done.


pkg/server/drain.go, line 163 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

but what if the current node is also decommissioning? It's legitimate to want to drain a decommissioning node, I suppose. Wouldn't be an issue after my comment below.

Done.


pkg/server/drain.go, line 202 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We do want to include decommissioning nodes here (and everywhere) though, since they can still hold replicas. This issue would go away with my comment higher up in this file.

Done.

…he end of drain

Prior to this patch, it was possible for a node to shut down
gracefully "too quickly", before the other nodes got a chance to see
that the node has gone away.

In particular it was possible:

- while the node was pushing leases away, it was possible for the
  other nodes with replicas on shared ranges to push them back (store
  rebalance / allocator). This is because the other nodes did not yet
  have a copy of the updated node descriptor marked "draining".

- after the node had moved its leases away and stopped, it was
  possible for range caches on other nodes to continue to try to use
  replicas on the drained node.

To alleviate both issues, this commit makes a server wait until the
expiry deadline on the draining node's liveness. This prevents other
nodes from considering the draining node as a candidate and "push
back" the leases to it during the drain. An additional wait of 5
seconds is added at the very end after all leases have transferred, so
that if another node still finds itself wanting to address a replica
on the now-drained node, it gets a chance to get a
NodeLeaseHolderError and a redirect to the new leaseholder. This is
expected to be the most effective change in this commit.

Additionally, this commit adds 2 seconds after waiting on liveness
expiry, before starting to transfer leases away. This way, there is
confidence during the lease transfer that the other nodes know the
draining node is, in fact, draining, and will not be considered as a
transfer target. This is an optimization.

Release note: None
@knz knz force-pushed the 20201012-liveness branch from f854492 to 81b90e5 Compare January 21, 2021 16:52
@knz
Copy link
Contributor Author

knz commented Jan 21, 2021

@tbg I am a bit unhappy to see my new test fail with

drain_test.go:272: wait-replication: zone config not applied properly: {2,3,4} / lease at n4

Why is the zone config constraint not applied properly?

@knz
Copy link
Contributor Author

knz commented Feb 3, 2021

@tbg suggests during meeting:

  • try manual replication
  • try placing the lease manually using "transfer lease" API

In any case avoid zone configs and allocator.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@knz knz removed the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@knz knz marked this pull request as draft May 6, 2021 10:51
@nvanbenschoten nvanbenschoten removed their request for review September 27, 2022 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants