
kvserver: enqueue replicas into lease queue when IO overloaded #119776

Merged — 1 commit, Mar 22, 2024

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Feb 29, 2024

Only the last commit should be reviewed.

Previously, when the allocator lease IO overload threshold enforcement
was set to shed, leases could take up to 10 minutes to drain from an
IO overloaded store. This was because shedding was paced by the replica
scanner, which by default takes up to 10 minutes to complete a full pass.

This commit introduces (1) proactive gossip of the store capacity when
the IO overload score both increases over the last gossiped value and
exceeds an absolute threshold (0.2), and (2) enqueueing of leaseholder
replicas into the lease queue upon receiving a gossip update, when the
update is for the local store and the store is IO overloaded.

To prevent wasted work, the mass enqueue on overload occurs at most
once every kv.allocator.min_io_overload_lease_shed_interval, which
defaults to 30s.

The combination of (1) proactive gossip on an IO overload increase and
(2) enqueueing into the lease queue when the shed conditions are met
results in leases shedding from an IO overloaded node in under a second.

Part of: #118866
Release note: None
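
As a rough, self-contained sketch of the mechanism described above — the rate-limited mass enqueue triggered by a gossip update — the following Go toy may help. All names here (`store`, `onGossipUpdate`, `shedInterval`) are illustrative stand-ins, not the actual kvserver code:

```go
package main

import (
	"fmt"
	"time"
)

// store is a toy stand-in for kvserver.Store, holding just enough state to
// demonstrate the rate-limited "mass enqueue on IO overload" pattern.
type store struct {
	lastShed     time.Time
	shedInterval time.Duration // analogous to kv.allocator.min_io_overload_lease_shed_interval
	leases       []int         // stand-in for leaseholder replicas
	queue        []int         // stand-in for the lease queue
}

// onGossipUpdate mirrors the PR's behavior: when a gossip update reports the
// local store as IO overloaded, enqueue all leaseholders into the lease
// queue, at most once per shedInterval.
func (s *store) onGossipUpdate(now time.Time, localStore, ioOverloaded bool) {
	if !localStore || !ioOverloaded {
		return
	}
	if now.Sub(s.lastShed) < s.shedInterval {
		return // still within the shed interval; skip the mass enqueue
	}
	s.lastShed = now
	s.queue = append(s.queue, s.leases...)
	fmt.Printf("enqueued %d leaseholder replicas into the lease queue\n", len(s.leases))
}

func main() {
	s := &store{shedInterval: 30 * time.Second, leases: []int{1, 2, 3}}
	start := time.Now()
	s.onGossipUpdate(start, true, true)                     // enqueues all 3 leases
	s.onGossipUpdate(start.Add(10*time.Second), true, true) // rate-limited: no-op
	s.onGossipUpdate(start.Add(40*time.Second), true, true) // interval elapsed: enqueues again
}
```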

blathers-crl bot commented Feb 29, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented

This change is Reviewable

@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch 3 times, most recently from 41190c1 to 7a25702 Compare March 4, 2024 20:43
@kvoli kvoli self-assigned this Mar 4, 2024
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch from 7a25702 to 0c766db Compare March 4, 2024 21:24
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch 3 times, most recently from 3276462 to 9df2170 Compare March 5, 2024 16:09
@kvoli kvoli marked this pull request as ready for review March 6, 2024 15:58
@kvoli kvoli requested a review from a team as a code owner March 6, 2024 15:58
@kvoli kvoli requested review from a team March 6, 2024 15:58
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch 2 times, most recently from ffed777 to 7758c59 Compare March 11, 2024 19:19
@kvoli kvoli requested a review from a team as a code owner March 11, 2024 19:19
@kvoli kvoli requested review from srosenberg and renatolabs and removed request for a team March 11, 2024 19:19
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch 2 times, most recently from 441f018 to 03d28d9 Compare March 11, 2024 21:48
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch from 03d28d9 to 45e8776 Compare March 12, 2024 21:08
@kvoli kvoli requested a review from a team as a code owner March 12, 2024 21:08
@kvoli kvoli requested a review from a team March 12, 2024 21:08
@kvoli kvoli requested review from a team as code owners March 12, 2024 21:08
@kvoli kvoli requested review from a team as code owners March 12, 2024 21:08
@kvoli kvoli requested review from abarganier and dt and removed request for a team, srosenberg, renatolabs, abarganier and dt March 12, 2024 21:08
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch from 45e8776 to 16679ea Compare March 12, 2024 21:10
@kvoli kvoli requested a review from andrewbaptist March 13, 2024 19:55
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch 4 times, most recently from d30ca28 to b904814 Compare March 18, 2024 14:18
@andrewbaptist andrewbaptist (Collaborator) left a comment
:lgtm:

Let's chat briefly about the changes in store_gossip, but otherwise this looks OK.

Reviewed 4 of 7 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/store.go line 2603 at r1 (raw file):

	s.ioThreshold.Unlock()

	ioThresholdMax := *ioThreshold

nit: use proto.Clone to make a deep clone here instead of just dereferencing. This works fine while the struct is shallow, but if it gains reference fields in the future this code would silently break. proto.Clone would also catch any uncloneable fields.
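
For context on why a plain dereference can bite later, here is a minimal, self-contained Go illustration of the shallow-copy hazard (plain structs only; the actual fix uses proto.Clone on the proto message, and the `threshold` type below is purely hypothetical):

```go
package main

import "fmt"

// threshold stands in for a proto message that is flat today but could
// later grow a reference field such as a slice.
type threshold struct {
	score   float64
	history []float64 // a slice field is shared, not copied, by a shallow copy
}

func main() {
	orig := &threshold{score: 0.5, history: []float64{0.1, 0.3}}

	shallow := *orig // copies the slice header, not the backing array
	shallow.history[0] = 0.9
	fmt.Println(orig.history[0]) // 0.9: mutating the "copy" changed the original

	// A deep clone (what proto.Clone provides for proto messages) avoids
	// the aliasing by copying the backing array as well.
	deep := threshold{score: orig.score, history: append([]float64(nil), orig.history...)}
	deep.history[0] = 0.1
	fmt.Println(orig.history[0]) // still 0.9: the original is unaffected
}
```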


pkg/kv/kvserver/store.go line 2620 at r1 (raw file):

func (s *Store) existingLeaseCheckIOOverload(ctx context.Context) bool {
	storeList, _, _ := s.cfg.StorePool.GetStoreList(storepool.StoreFilterNone)
	storeDescriptor, ok := s.cfg.StorePool.GetStoreDescriptor(s.StoreID())

nit: It is a little ugly that you are pulling this out of the StorePool rather than using the StoreDescriptor that is passed into the CapacityChangeFn. The only part of the local StoreDescriptor used in ExistingLeaseCheck is the StoreCapacity, and you already have that. This might work, but it seems unfortunate not to pass it in directly and change ExistingLeaseCheck to take only a StoreCapacity.


pkg/kv/kvserver/store.go line 2676 at r1 (raw file):

		if err := s.stopper.RunTask(ctx, "io-overload: shed leases", func(ctx context.Context) {
			newStoreReplicaVisitor(s).Visit(func(repl *Replica) bool {
				s.leaseQueue.maybeAdd(ctx, repl, repl.Clock().NowAsClockTimestamp())

It would be good if there were some sort of stat tracking leases shed this way, or the log message could be updated to list the number of leases about to be shed.


pkg/kv/kvserver/store.go line 2681 at r1 (raw file):

		}); err != nil {
			log.KvDistribution.Infof(ctx,
				"unable to shed leases due to IO overload: %v", err)

nit: The only time this error ever occurs is during shutdown. If that is the case, you may want to make it clear that it didn't run because of shutdown, and that there is no need to reset the timer.


pkg/kv/kvserver/store_gossip.go line 65 at r1 (raw file):

	// gossipMinMaxIOOverloadScore is the minimum IO overload score
	// required to trigger gossip.
	gossipMinMaxIOOverloadScore = 0.2

I think I'm missing why we need to do this and why we are using this hardcoded value rather than the LeaseIOOverloadThreshold. Is this threshold and eager gossiping required at all? I see the advantage of pushing rather than waiting, but this seems like a strange way to do it. Doesn't this have a problem where, if the score ramps up slowly over a few requests, we won't trigger? Maybe we can chat briefly.

@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch from b904814 to 35f7d5c Compare March 22, 2024 14:47
@kvoli kvoli (Collaborator, Author) left a comment
TYFTR

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @kvoli)


pkg/kv/kvserver/store.go line 2603 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: use proto.Clone to make a deep clone here instead of just dereferencing. This works fine while the struct is shallow, but if it gains reference fields in the future this code would silently break. proto.Clone would also catch any uncloneable fields.

Updated.


pkg/kv/kvserver/store.go line 2620 at r1 (raw file):

It is a little ugly that you are pulling this out of the StorePool rather than using the StoreDescriptor that is passed into the CapacityChangeFn.

This function is also used in CanTransferLease, without which lease transfers would be throttled at min_lease_transfer_interval. It's unfortunate that it's pulled out of the store pool, but the store list needs to be pulled regardless for the mean check.

The only part of the local StoreDescriptor used in ExistingLeaseCheck is the StoreCapacity, and you already have that.

The StoreID is also used in ExistingLeaseCheck and the other lease checks for logging.


pkg/kv/kvserver/store.go line 2676 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

It would be good if there were some sort of stat tracking leases shed this way, or the log message could be updated to list the number of leases about to be shed.

Good idea. I updated the log message to mention the number of leases being shed.


pkg/kv/kvserver/store.go line 2681 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: The only time this error ever occurs is during shutdown. If that is the case, you may want to make it clear that it didn't run because of shutdown, and that there is no need to reset the timer.

I've gotten rid of this and updated the comment.


pkg/kv/kvserver/store_gossip.go line 65 at r1 (raw file):

Is this threshold and eager gossiping required at all?

It's required for two reasons:

  1. If we didn't gossip immediately and just shed the leases, they wouldn't actually shed (because the allocator uses the storepool value), and even if they did shed, they would quickly come back to the store because other stores are unaware of the overload.
  2. The gossip mechanism triggers the lease shedding on the recipient side, which is useful because of the aforementioned requirement for other stores to know ahead of time via gossip.

Doesn't this have a problem where, if the score ramps up slowly over a few requests, we won't trigger?

It doesn't have this problem: the condition to eagerly gossip is an IO threshold max score that both exceeds the absolute threshold AND is higher than the previously gossiped score. There's no relative (%) delta involved, so a slow ramp still triggers once the threshold is crossed.

[...] why we are using this hardcoded value rather than the LeaseIOOverloadThreshold

To avoid including cluster settings in the struct. I've updated this to use LeaseIOOverloadThreshold.
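
To make the trigger condition concrete, here is a tiny runnable Go sketch of the check described above (the function name and parameters are illustrative, not the actual code):

```go
package main

import "fmt"

// shouldEagerlyGossip is a toy version of the eager-gossip trigger: gossip
// the store capacity when the IO overload max score both exceeds the
// absolute threshold (LeaseIOOverloadThreshold in the final code) and is
// higher than the previously gossiped score. There is no relative (%)
// delta, so a slow ramp still triggers once it crosses the threshold.
func shouldEagerlyGossip(score, lastGossiped, threshold float64) bool {
	return score > threshold && score > lastGossiped
}

func main() {
	fmt.Println(shouldEagerlyGossip(0.15, 0.10, 0.2)) // false: below the threshold
	fmt.Println(shouldEagerlyGossip(0.25, 0.10, 0.2)) // true: over threshold and increased
	fmt.Println(shouldEagerlyGossip(0.25, 0.30, 0.2)) // false: not an increase over last gossip
}
```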

Previously, when the allocator lease IO overload threshold enforcement
was set to `shed`, leases could take up to 10 minutes to drain from an
IO overloaded store. This was because shedding was paced by the replica
scanner, which by default takes up to 10 minutes to complete a full pass.

This commit introduces (1) proactive gossip of the store capacity when
the IO overload score both increases over the last gossiped value and
exceeds an absolute threshold (LeaseIOOverloadThreshold), and (2)
enqueueing of leaseholder replicas into the lease queue upon receiving a
gossip update, when the update is for the local store and the store is
IO overloaded.

To prevent wasted work, the mass enqueue on overload occurs at most
once every `kv.allocator.min_io_overload_lease_shed_interval`, which
defaults to 30s.

The combination of (1) proactive gossip on an IO overload increase and
(2) enqueueing into the lease queue when the shed conditions are met
results in leases shedding from an IO overloaded node in under a second.

Part of: cockroachdb#118866
Release note: None
@kvoli kvoli force-pushed the 240216.mass-enqueue-leases branch from 35f7d5c to 888bc90 Compare March 22, 2024 15:32
@kvoli kvoli (Collaborator, Author) commented Mar 22, 2024

bors r=andrewbaptist

@craig craig bot (Contributor) commented Mar 22, 2024

@craig craig bot merged commit 8d3b129 into cockroachdb:master Mar 22, 2024
20 of 22 checks passed