stability: client hangs after one node hits disk errors #7882
Comments
Please notice that subsequent requests go through fine if retried after waiting for a few tens of seconds, but the original request blocks forever. So this might have something to do with how the client handles connections and NOT a server-side problem. Updating the title to reflect this.
A lot of changes have been made in lease handling (which is likely the cause of the tens-of-seconds wait before success), but handling of disk-full errors deserves further attention in 1.1.
@dianasaur323 We might want to schedule time to look at this in 1.2. Enhancements in this area are now beyond reach for 1.1 given the current point in the cycle.
thanks for the heads up, @petermattis. Do you think this merits a t-shirt size in Airtable, or should I just allocate time for this to be addressed as part of bug fixing?
@dianasaur323 This is more than just a bug fix, as there are many code paths to inspect and a bit of design work to be done. At least an M-size project. Probably an L.
Great, thanks @petermattis. I added a placeholder in Airtable. That being said, this squarely falls into @awoods187's space, correct @nstewart?
That's correct.
kk, just assigned him in Airtable as well.
In addressing this, we should also look at the case in which the log directory is on a separate partition and fills up/becomes otherwise unavailable. See #7646.
Some more discussion in #8473.
I think we should not die when we run out of disk space. Instead, we should stop writing to disk and allow users to add more disk. Thoughts?
@awoods187 That behavior is clearly desirable, though it is much more difficult to achieve than it is to express. To give some sense of the difficulty, removing a replica from a node involves writing a small amount of data to disk to indicate that the replica was removed before the replica's data is actually deleted. In other words, we need to write to disk in order to free up space on disk.
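As a rough illustration of that chicken-and-egg problem, here is a minimal Go sketch (not CockroachDB's actual code; the directory layout, file names, and function are hypothetical) of why removing a replica consumes a little disk space before it can free any:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeReplica sketches the ordering constraint described above: a small
// removal record must be persisted before the replica's data is deleted,
// so reclaiming space requires at least one successful write.
func removeReplica(storeDir string, rangeID int) error {
	// Step 1: durably record that the replica is being removed. On a full
	// disk this write can fail, which is exactly the chicken-and-egg
	// problem: no space can be reclaimed without writing first.
	marker := filepath.Join(storeDir, fmt.Sprintf("removal-%d", rangeID))
	if err := os.WriteFile(marker, []byte("removed"), 0o644); err != nil {
		return fmt.Errorf("cannot record removal (disk full?): %w", err)
	}

	// Step 2: only after the record is persisted is it safe to delete the
	// replica's data and actually free disk space.
	return os.RemoveAll(filepath.Join(storeDir, fmt.Sprintf("replica-%d", rangeID)))
}

func main() {
	if err := removeReplica("data2", 42); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```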
32978: storage: exit process when disks are stalled r=petermattis a=tbg

We've had two recent incidents in which we saw clusters with disks stalled on a subset of nodes in the cluster. This is a fairly treacherous failure mode since:

- the symptoms are nondescript: from the UI it often looks like a Raft problem, logspy will freeze up as well, and so you waste some time until you end up looking at the goroutine dump and notice the writes stuck in syscall
- the node is in some semi-live state that borders the byzantine and can cause further trouble for the part of the cluster that isn't affected (we have some mitigations against this in place but not enough, and need to improve our defense mechanisms)
- it's sudden and often can't be gleaned from the logs (since everything is fine and then nothing ever completes so no "alertable" metrics are emitted)

This commit introduces a simple mechanism that periodically checks for these conditions (both on the engines and logging) and invokes a fatal error if necessary. The accompanying roachtest exercises both a data and a logging disk stall.

Fixes #7882.
Fixes #32736.
Touches #7646.

Release note (bug fix): CockroachDB will error with a fatal exit when data or logging partitions become unresponsive. Previously, the process would remain running, though in an unresponsive state.

Co-authored-by: Tobias Schottdorf <[email protected]>
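For illustration, here is a minimal Go sketch of the kind of periodic check the commit message describes (the probe file name, thresholds, and use of log.Fatalf are assumptions for this sketch, not the PR's actual implementation):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// checkDiskLiveness periodically writes and syncs a tiny probe file under
// dir and terminates the process if the write does not complete within
// maxStall. A stalled disk is detected by a synchronous probe rather than
// inferred from secondary symptoms.
func checkDiskLiveness(dir string, interval, maxStall time.Duration) {
	probe := filepath.Join(dir, "disk-stall-probe")
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		done := make(chan error, 1)
		go func() {
			f, err := os.Create(probe)
			if err != nil {
				done <- err
				return
			}
			_, err = f.WriteString("ok")
			if err == nil {
				err = f.Sync() // force the write through to the device
			}
			f.Close()
			done <- err
		}()

		select {
		case err := <-done:
			if err != nil {
				log.Fatalf("disk probe in %s failed: %v", dir, err)
			}
		case <-time.After(maxStall):
			// The write never returned: treat the disk as stalled and exit,
			// so the rest of the cluster is not held up by this node.
			log.Fatalf("disk stall detected in %s: probe write did not complete within %s", dir, maxStall)
		}
	}
}

func main() {
	go checkDiskLiveness("data0", 10*time.Second, 30*time.Second)
	select {} // stand-in for the rest of the server
}
```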
I start a 3-node cluster using a small script like this:
cockroach start --store=data0 --log-dir=log0 &
cockroach start --store=data1 --log-dir=log1 --port=26258 --http-port=8081 --join=localhost:26257 --join=localhost:26259 &
cockroach start --store=data2 --log-dir=log2 --port=26259 --http-port=8082 --join=localhost:26257 --join=localhost:26258 &
sleep 5
At this point, I see that the cluster is set up correctly and I can start inserting and reading data. So far so good.
Now, during this startup, node2 (the node with data2 as its store) hits a disk-full error (ENOSPC) or an EIO error when trying to append to an SSTable file and fails to start. At this point, the expectation is that the cluster continues to accept client connections and make progress with transactions (as a majority of nodes is still available). This is my client code:
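(The client snippet did not survive in this copy of the issue. As a stand-in, here is a hypothetical Go reconstruction of the flow described below, using database/sql with the lib/pq driver; the connection string, database, and table names are assumptions.)

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Connect to node0, the node that did not hit any disk error.
	// Database and table names here are placeholders.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/testdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The connection itself succeeds; it is this Exec (the "execute
	// statement" referred to below) that blocks when node2 has crashed.
	if _, err := db.Exec("INSERT INTO kv (k, v) VALUES (1, 'hello')"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("insert completed")
}
```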
I see that the client successfully connects to node0 but then hangs (in the execute statement) for at least 30 seconds, after which I abort the thread running the above code.
I am not sure what is going wrong here. Shouldn't the client be able to talk to node0 or node1 irrespective of node2's failure? I am not sure who the leader for this table was before node2 crashed, but irrespective of who the old leader was, shouldn't the cluster automatically elect a new leader (if node2 was the old leader) and continue to make progress in any case?
Moreover, this is not a one-off case. This happens when one node hits a disk error (EIO, EDQUOT, ENOSPC) at various points in time while accessing different files, such as MANIFEST, SST, and dbtmp files.
Expectation: whatever error a single node encounters, as long as a majority is available, the cluster should continue to make progress (agreeing upon a new leader if the failed node was the old leader).
Observed: when one node hits a disk error, sometimes (depending on the error type and the file being accessed when the error happened) the cluster becomes unusable: clients block after connecting to one of the remaining nodes that did not encounter any errors.
I have attached the server logs. I would be happy to provide any further details.
logs.tar.gz