kvserver: disk stall prevents lease transfer #81100
81133: liveness: improve disk probes during node liveness updates r=tbg a=erikgrinaker

When `NodeLiveness` updates the liveness record (e.g. during heartbeats), it first does a noop sync write to all disks. This ensures that a node with a stalled disk will fail to maintain liveness and lose its leases. However, this sync write could block indefinitely, and would not respect the caller's context, which could cause the caller to stall rather than time out. This in turn could lead to stalls higher up in the stack, in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to respect the caller's context. The write operation itself will not (can not) respect the context, and may thus leak a goroutine. However, concurrent sync writes will coalesce onto an in-flight write. Additionally, this runs the sync writes in parallel across all disks, since we can now trivially do so. This may be advantageous on nodes with many stores, to avoid spurious heartbeat failures under load.

Touches #81100.

Release note (bug fix): Disk write probes during node liveness heartbeats will no longer get stuck on stalled disks, instead returning an error once the operation times out. Additionally, disk probes now run in parallel on nodes with multiple stores.

81134: kvserver: don't return lease when failing to acquire r=nvanbenschoten,tbg a=erikgrinaker

**kvserver: improve lease acquisition in `Store.ManuallyEnqueue()`**

Manually enqueuing a range through a queue may sometimes require acquiring the lease first, since it can only be run on the leaseholder. Previously this was done through `getLeaseForGossip`, which was meant for acquiring the gossip lease and would hide certain errors from the queue. This patch instead uses `redirectOnOrAcquireLease` to acquire the lease, and returns an error if the local replica failed to acquire the lease (even if there is a valid leaseholder elsewhere).

Release note: None

**kvserver: don't return lease when failing to acquire**

This patch does not return the previous, expired lease when a replica fails to acquire it for itself. In particular, this prevents the calling DistSender from continually retrying the old leaseholder if it fails to reacquire the lease (e.g. due to a stalled disk).

Touches #81100.

Release note: None

81416: roachtest: do not fail costfuzz test on perturbed error r=msirek a=michae2

The costfuzz test repeatedly executes a randomly generated query twice, once with a normal plan and once with a perturbed-costs plan. We were considering a successful first execution followed by an errored second execution a test failure. But there are certain queries that legitimately return errors nondeterministically depending on plan. For example:

```sql
CREATE TABLE t (a INT PRIMARY KEY);
CREATE TABLE u (b INT PRIMARY KEY);
INSERT INTO t VALUES (0);
SELECT * FROM t INNER HASH JOIN u ON a = b + 1 WHERE 1 / a = 1;
SELECT * FROM u INNER HASH JOIN t ON a = b + 1 WHERE 1 / a = 1;
```

The first plan will successfully return 0 rows because the hash join can short-circuit without evaluating 1 / a. The second plan will return a divide-by-zero error because 1 / a is evaluated while building the hash table. This is not a bug. Internal errors are a little more interesting than normal SQL-level errors, so we still consider this a test failure on internal errors.

Fixes: #81032

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
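For illustration, the context-respecting probe described in #81133 amounts to running the sync write in its own goroutine and letting callers wait on either its completion or their context, with concurrent callers coalescing onto the in-flight write. A minimal sketch of that pattern follows; the names (`Prober`, `Probe`, `syncWrite`) are hypothetical and this is not the actual CockroachDB code:

```go
package diskprobe

import (
	"context"
	"sync"
)

// probeResult tracks a single in-flight disk probe.
type probeResult struct {
	done chan struct{} // closed once the sync write has returned
	err  error
}

// Prober coalesces concurrent probes onto one in-flight sync write, which
// runs in its own goroutine so that callers can respect their contexts.
type Prober struct {
	mu       sync.Mutex
	inFlight *probeResult
}

// Probe starts a sync write (or joins one already in flight) and returns
// when the write completes or the caller's context is done. The write
// itself cannot be interrupted, so it may leak a goroutine on a stalled
// disk, but the caller no longer stalls with it.
func (p *Prober) Probe(ctx context.Context, syncWrite func() error) error {
	p.mu.Lock()
	res := p.inFlight
	if res == nil {
		res = &probeResult{done: make(chan struct{})}
		p.inFlight = res
		go func() {
			res.err = syncWrite() // may block forever on a stalled disk
			p.mu.Lock()
			p.inFlight = nil
			p.mu.Unlock()
			close(res.done)
		}()
	}
	p.mu.Unlock()

	select {
	case <-res.done:
		return res.err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

The goroutine can still leak on a permanently stalled disk, but callers now fail fast once their context expires, which is what the heartbeat path needs.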
Short-term fixes have been submitted in:
These improve the situation somewhat, in that a single non-system range will typically recover within ~20 seconds. However, if a query touches multiple affected ranges then the ~20s wait is sequential by range, and can easily add up to several minutes. Additionally, system ranges may not recover at all because the RPC caller may time out before the NLHE is returned, preventing anyone else from acquiring the lease (#80713). Also, in-flight requests will block indefinitely until we implement replica circuit breaker support for this scenario (#77366).

I think a full solution here may require rethinking our lease acquisition/transfer protocol such that it isn't driven by client requests and dependent on active cooperation of the previous (stalled) leaseholder. We'll need to allocate some resources for this, so I'm going to leave this here for now.
81136: kvserver: add timeout for lease acquisitions r=tbg a=erikgrinaker

**kvserver: deflake `TestLeasePreferencesDuringOutage`**

This patch deflakes `TestLeasePreferencesDuringOutage` by allowing nodes to acquire a lease even when they are not the Raft leader.

Release note: None

**kvserver: add timeout for lease acquisitions**

This patch adds a timeout for lease acquisitions. It is set to twice the Raft election timeout (6 seconds total), since it may need to hold a Raft election and repropose the lease acquisition command, each of which may take up to one election timeout.

Without this timeout, it's possible for a lease acquisition to stall indefinitely (e.g. in the case of a stalled disk). This prevents a `NotLeaseHolderError` from being returned to the client DistSender, which in turn prevents it from trying other replicas that could acquire the lease instead. This can cause a lease to remain invalid forever.

Touches #81100.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g. with a stalled disk) could prevent other nodes from acquiring its leases, effectively stalling these ranges until the node was shut down or recovered.

Co-authored-by: Erik Grinaker <[email protected]>
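The timeout added in #81136 can be thought of as wrapping the acquisition in a context deadline of twice the Raft election timeout. A rough sketch under that assumption; the names here are hypothetical and not the actual implementation:

```go
package leasetimeout

import (
	"context"
	"errors"
	"time"
)

// raftElectionTimeout is assumed to be 3 seconds here, matching the "twice
// the Raft election timeout (6 seconds total)" described above.
const raftElectionTimeout = 3 * time.Second

var errLeaseAcquisitionTimeout = errors.New("lease acquisition timed out")

// acquireLeaseWithTimeout bounds a lease acquisition at twice the election
// timeout: one election plus one reproposal of the lease command. If the
// acquisition stalls (e.g. because the liveness heartbeat is stuck on a
// stalled disk), the caller gets an error instead of hanging, and can
// return a NotLeaseHolderError so the DistSender tries another replica.
func acquireLeaseWithTimeout(ctx context.Context, acquire func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 2*raftElectionTimeout)
	defer cancel()
	if err := acquire(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return errLeaseAcquisitionTimeout
		}
		return err
	}
	return nil
}
```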
94240: roachtest: add `disk-stall` variant of `failover` tests r=erikgrinaker a=erikgrinaker

**roachtest: reorganize failer interface**

This patch reorganizes the `failer` interface with `Setup`, `Ready`, and `Cleanup` hooks, providing it with access to the monitor. It also passes the test and cluster references via the constructor.

Epic: none
Release note: None

**roachtest: add `disk-stall` variant of `failover` tests**

This patch adds `disk-stall` variants of the `failover` tests, which benchmark the pMax unavailability when a leaseholder experiences a persistent disk stall.

Resolves #94231.
Touches #81100.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
This will be fixed by #118943.
A node experiencing a persistent disk stall will fail to maintain liveness, since we do a sync disk write before each heartbeat:
cockroach/pkg/kv/kvserver/liveness/liveness.go
Lines 1258 to 1264 in 0b7ba56
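(The embedded snippet isn't reproduced here.) Conceptually, the probe is a no-op sync write against every store on the heartbeat's critical path, roughly like the sketch below; the `Engine` interface and names are hypothetical, not the actual code:

```go
package livenessprobe

// Engine is a stand-in for a storage engine handle, one per store.
type Engine interface {
	// WriteSync durably syncs a write to disk before returning. On a
	// stalled disk this call blocks indefinitely.
	WriteSync(key, value []byte) error
}

// probeDisks sketches the pre-heartbeat disk probe: a no-op sync write
// against every store, on the heartbeat's critical path, so that a node
// with a stalled disk fails to maintain liveness and loses its leases.
// Nothing here can be interrupted once a write has started.
func probeDisks(engines []Engine) error {
	for _, eng := range engines {
		if err := eng.WriteSync([]byte("liveness-probe"), nil); err != nil {
			return err
		}
	}
	return nil
}
```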
This will in turn cause it to lose all leases. However, it will also prevent anyone else from acquiring the lease, causing all relevant ranges to get stuck until the disk recovers or the node is shut down (e.g. by the Pebble disk stall detector, which panics the node).
Notes from elsewhere below.
Reproduction Details
I'm intentionally trying to reduce the failure modes to the simplest case here -- we can look into more complicated ones when we've picked the low-hanging fruit.
n6 runs `workload`, n5 is the faulty node.

Instructions:
1. Create a 6-node cluster and stage it with necessary binaries:
2. Wrap the SSDs in a delayable device, with 0 delay initially:
3. Start the cluster and configure it:
4. Start a KV workload from n6 against n1-4:
5. Introduce an IO delay of 100s per operation on n5 (immediately takes effect, but command hangs for a while):

The workload throughput will now have dropped to ~0.
When n5 stalls, all other nodes have it cached as the leaseholder, and will send RPCs to it first. These nodes rely on n5 to return a `NotLeaseHolderError` to inform them that the lease is invalid, at which point they will try a different node. When this other node receives the RPC, it will detect the invalid lease, acquire a new lease, then return a successful response to the caller, who updates its lease cache.

What happens here is that n5 never returns a `NotLeaseHolderError`, which prevents the rest from happening -- including another node claiming the lease. The lease thus remains invalid, and the workload stalls.

So why doesn't n5 return a `NotLeaseHolderError`? It notices that its lease is invalid and tries to obtain it for itself first. However, while doing so it first sends a heartbeat to make sure it's live before acquiring the lease. Recall that sending a heartbeat performs a sync write (which is what made it lose the lease in the first place), and because the disk is stalled, this operation also stalls. Forever.

cockroach/pkg/kv/kvserver/replica_range_lease.go
Lines 373 to 377 in 5ffaa9e
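(Snippet not reproduced.) The shape of the problem is roughly the following sketch, with hypothetical names: the replica synchronously heartbeats its own liveness record before proposing the lease request, and that heartbeat is where the stalled disk bites.

```go
package leaserequest

import "context"

// NodeLiveness is a stand-in for the node liveness subsystem.
type NodeLiveness interface {
	// Heartbeat updates this node's liveness record, performing a sync
	// disk write first.
	Heartbeat(ctx context.Context) error
}

// requestLease sketches the lease acquisition path: the replica heartbeats
// its liveness record synchronously before proposing the lease request. On
// a stalled disk the heartbeat never returns, so no NotLeaseHolderError is
// ever sent back to the caller and nobody else claims the lease.
func requestLease(ctx context.Context, liveness NodeLiveness) error {
	// Blocks on the same sync disk write that made the lease expire.
	if err := liveness.Heartbeat(ctx); err != nil {
		return err // would surface as a NotLeaseHolderError to the client
	}
	// ...propose the RequestLease command through Raft (not shown)...
	return nil
}
```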
If I add a simple 5-second timeout for lease acquisitions, this operation fails and returns a `NotLeaseHolderError`, allowing a different node to claim the lease. You may wonder how context cancellation can work when we're blocked on a sync disk write. The answer is that we're not actually blocked on the sync write itself: we're blocked on a semaphore waiting for a previous heartbeat to complete, and that heartbeat is blocked on the sync write:

cockroach/pkg/kv/kvserver/liveness/liveness.go
Lines 890 to 897 in 0b7ba56
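(Snippet not reproduced.) The reason a context timeout helps is that callers block on a context-aware semaphore acquire rather than on the write itself, roughly like this sketch with hypothetical names:

```go
package heartbeatsem

import "context"

// heartbeatSem is a one-slot semaphore guarding heartbeats. A heartbeat
// that is stuck in a sync disk write holds the slot, so later callers
// block here, where their context can still cancel them.
var heartbeatSem = make(chan struct{}, 1)

// heartbeatWithSem acquires the semaphore (respecting the context) and
// then runs the heartbeat, which performs the sync disk write.
func heartbeatWithSem(ctx context.Context, doHeartbeat func() error) error {
	select {
	case heartbeatSem <- struct{}{}: // acquire
	case <-ctx.Done():
		return ctx.Err() // this is what a lease acquisition timeout hits
	}
	defer func() { <-heartbeatSem }() // release
	return doHeartbeat()
}
```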
With this timeout in place, things are much better:
To completely fix this, such that clients don't get stuck on Raft application (or waiting on their latches), we'll need circuit breaker support (#74712). However, even without that, this is a much better state of affairs where clients with timeouts will recover within a reasonable time.
I'm not sure a timeout is necessarily the best approach here, so I'm going to look into a few options, but this is definitely tractable in the short term and likely backportable in some form.
Jira issue: CRDB-15145
gz#13737
Epic CRDB-19227