stability: client hangs after one node hits disk errors #7882
Comments
Please notice that subsequent requests go through fine if retried after waiting for a few tens of seconds, but the original request blocks forever. So this might have something to do with how the client handles connections and NOT a server-side problem. Updating the title to reflect this.
A lot of changes have been made in lease handling (which is likely the cause of the tens-of-seconds wait before success), but handling of disk-full errors deserves further attention in 1.1.
@dianasaur323 We might want to schedule time to look at this in 1.2. Enhancements in this area are now beyond reach for 1.1 given the current point in the cycle.
thanks for the heads up, @petermattis. Do you think this merits a t-shirt size in Airtable, or should I just allocate time for this to be addressed as part of bug fixing?
@dianasaur323 This is more than just a bug fix, as there are many code paths to inspect and a bit of design work to be done. At least an M-size project. Probably an L.
Great, thanks @petermattis. I added a placeholder in Airtable. That being said, this squarely falls into @awoods187's space, correct @nstewart?
That's correct.
kk, just assigned him in Airtable as well.
In addressing this, we should also look at the case in which the log directory is on a separate partition and fills up/becomes otherwise unavailable. See #7646.
Some more discussion in #8473.
I think we should not die when we run out of disk space. Instead, we should stop writing to disk and allow users to add more disk. Thoughts?
@awoods187 That behavior is clearly desirable, though it is much more difficult to achieve than it is to express. To give some sense of the difficulty, removing a replica from a node involves writing a small amount of data to disk to indicate that the replica was removed before the replica's data is actually deleted. In other words, we need to write to disk in order to free up space on disk.
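As a rough illustration of that chicken-and-egg problem, here is a minimal Go sketch (not CockroachDB's actual code; the directory layout, file names, and function are hypothetical) of why removing a replica consumes a little disk space before it can free any:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeReplica sketches the ordering constraint described above: a small
// removal record must be persisted before the replica's data is deleted,
// so reclaiming space requires at least one successful write.
func removeReplica(storeDir string, rangeID int) error {
	// Step 1: durably record that the replica is being removed. On a full
	// disk this write can fail, which is exactly the chicken-and-egg
	// problem: no space can be reclaimed without writing first.
	marker := filepath.Join(storeDir, fmt.Sprintf("removal-%d", rangeID))
	if err := os.WriteFile(marker, []byte("removed"), 0o644); err != nil {
		return fmt.Errorf("cannot record removal (disk full?): %w", err)
	}

	// Step 2: only after the record is persisted is it safe to delete the
	// replica's data and actually free disk space.
	return os.RemoveAll(filepath.Join(storeDir, fmt.Sprintf("replica-%d", rangeID)))
}

func main() {
	if err := removeReplica("data2", 42); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```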
32978: storage: exit process when disks are stalled r=petermattis a=tbg

We've had two recent incidents in which we saw clusters with disks stalled on a subset of nodes in the cluster. This is a fairly treacherous failure mode since:

- the symptoms are nondescript: from the UI it often looks like a Raft problem, logspy will freeze up as well, and so you waste some time until you end up looking at the goroutine dump and notice the writes stuck in syscall
- the node is in some semi-live state that borders the byzantine and can cause further trouble for the part of the cluster that isn't affected (we have some mitigations against this in place but not enough, and need to improve our defense mechanisms)
- it's sudden and often can't be gleaned from the logs (since everything is fine and then nothing ever completes so no "alertable" metrics are emitted)

This commit introduces a simple mechanism that periodically checks for these conditions (both on the engines and logging) and invokes a fatal error if necessary. The accompanying roachtest exercises both a data and a logging disk stall.

Fixes #7882.
Fixes #32736.
Touches #7646.

Release note (bug fix): CockroachDB will error with a fatal exit when data or logging partitions become unresponsive. Previously, the process would remain running, though in an unresponsive state.

Co-authored-by: Tobias Schottdorf <[email protected]>
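For illustration, here is a minimal Go sketch of the kind of periodic check the commit message describes (the probe file name, thresholds, and use of log.Fatalf are assumptions for this sketch, not the PR's actual implementation):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// checkDiskLiveness periodically writes and syncs a tiny probe file under
// dir and terminates the process if the write does not complete within
// maxStall. A stalled disk is detected by a synchronous probe rather than
// inferred from secondary symptoms.
func checkDiskLiveness(dir string, interval, maxStall time.Duration) {
	probe := filepath.Join(dir, "disk-stall-probe")
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		done := make(chan error, 1)
		go func() {
			f, err := os.Create(probe)
			if err != nil {
				done <- err
				return
			}
			_, err = f.WriteString("ok")
			if err == nil {
				err = f.Sync() // force the write through to the device
			}
			f.Close()
			done <- err
		}()

		select {
		case err := <-done:
			if err != nil {
				log.Fatalf("disk probe in %s failed: %v", dir, err)
			}
		case <-time.After(maxStall):
			// The write never returned: treat the disk as stalled and exit,
			// so the rest of the cluster is not held up by this node.
			log.Fatalf("disk stall detected in %s: probe write did not complete within %s", dir, maxStall)
		}
	}
}

func main() {
	go checkDiskLiveness("data0", 10*time.Second, 30*time.Second)
	select {} // stand-in for the rest of the server
}
```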
I start a 3-node cluster using a small script like this:
cockroach start --store=data0 --log-dir=log0 &
cockroach start --store=data1 --log-dir=log1 --port=26258 --http-port=8081 --join=localhost:26257 --join=localhost:26259 &
cockroach start --store=data2 --log-dir=log2 --port=26259 --http-port=8082 --join=localhost:26257 --join=localhost:26258 &
sleep 5
At this point, I see that the cluster is set up correctly and I can start inserting and reading data. So far so good.
Now, during this startup, node2 (the node with data2 as its store) hits a disk-full error (ENOSPC) or an EIO error when trying to append to an SSTable file and fails to start. At this point, the expectation is that the cluster continues to accept client connections and make progress with transactions (as a majority of nodes is still available). This is my client code:
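(The client snippet did not survive in this copy of the issue. As a stand-in, here is a hypothetical Go reconstruction of the flow described below, using database/sql with the lib/pq driver; the connection string, database, and table names are assumptions.)

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Connect to node0, the node that did not hit any disk error.
	// Database and table names here are placeholders.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/testdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The connection itself succeeds; it is this Exec (the "execute
	// statement" referred to below) that blocks when node2 has crashed.
	if _, err := db.Exec("INSERT INTO kv (k, v) VALUES (1, 'hello')"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("insert completed")
}
```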
I see that the client successfully connects to node0 but then hangs (in the execute statement) for at least 30 seconds, after which I abort the thread running the above code.
I am not sure what is going wrong here. Shouldn't the client be able to talk to node0 or node1 irrespective of node2's failure? I am not sure who the leader for this table was before node2 crashed, but irrespective of who the old leader was, shouldn't the cluster automatically elect a new leader (if node2 was the old leader) and continue to make progress in any case?
Moreover, this is not a one-off case. This happens when one node hits a disk error (EIO, EDQUOT, ENOSPC) at various points in time while accessing different files, such as MANIFEST, SST, and dbtmp files.
Expectation: whatever error a single node encounters, as long as a majority is available, the cluster should continue to make progress (agreeing upon a new leader if the failed node was the old leader).
Observed: when one node hits a disk error, sometimes (depending on the error type and the file being accessed when the error happened) the cluster becomes unusable: clients block after connecting to one of the remaining nodes that did not encounter any errors.
I have attached the server logs. I would be happy to provide any further details.
logs.tar.gz