
stability: client hangs after one node hits disk errors #7882

Closed
ramanala opened this issue Jul 18, 2016 · 14 comments · Fixed by #32978
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Milestone

Comments

@ramanala

I start a 3-node cluster using a small script like this:

cockroach start --store=data0 --log-dir=log0 &
cockroach start --store=data1 --log-dir=log1 --port=26258 --http-port=8081 --join=localhost:26257 --join=localhost:26259 &
cockroach start --store=data2 --log-dir=log2 --port=26259 --http-port=8082 --join=localhost:26257 --join=localhost:26258 &
sleep 5

At this point, I see that the cluster is set up correctly and I can start inserting data and reading it back. So far so good.

Now, during this startup, node2 (the node with data2 as its store) hits a disk-full error (ENOSPC) or an EIO error when trying to append to an SSTable file and fails to start. At this point, the expectation is that the cluster continues to accept client connections and make progress with transactions (since a majority of nodes is available). This is my client code:

import time

import psycopg2

server_ports = ["26257", "26258", "26259"]
for port in server_ports:
    try:
        # Connect to each node in turn and read the table back.
        conn = psycopg2.connect(host="localhost", port=port, database="mydb",
                                user="xxx", connect_timeout=10)
        conn.set_session(autocommit=True)
        cur = conn.cursor()
        cur.execute("SELECT * FROM mytable;")
        record = cur.fetchall()
        print(record)
        cur.close()
        conn.close()
    except Exception as e:
        print('Exception: ' + str(e))
        time.sleep(3)

I see that the client successfully connects to node0 but then hangs (in the execute statement) for at least 30 seconds, after which I abort the thread running the code above.

I am not sure what is going wrong here. Shouldn't the client be able to talk to node0 or node1 irrespective of node2's failure? I don't know which node was the leader for this table before node2 crashed, but shouldn't the cluster automatically elect a new leader (if node2 was the old leader) and continue to make progress in any case?

Moreover, this is not a one-off case. This happens when one node hits a disk error (EIO, EDQUOT, ENOSPC) at various points in time while accessing different files, such as MANIFEST, sst, and dbtmp files.

Expectation: whatever error a single node encounters, as long as a majority of nodes is available, the cluster should continue to make progress (agreeing on a new leader if the failed node was the old leader).

Observed: when one node hits a disk error, sometimes (depending on the error type and the file being accessed when the error happened) the cluster becomes unusable: clients block after connecting to one of the remaining nodes that did not encounter any errors.

I have attached the server logs. I would be happy to provide any further details.

logs.tar.gz
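
Aside, not part of the repro above: to keep the client from blocking indefinitely inside execute, a statement-level timeout can be set in addition to connect_timeout. A minimal sketch, assuming the server honors the statement_timeout session variable (PostgreSQL does, and newer CockroachDB versions do as well); the helper name and timeout value are made up:

import psycopg2

def read_with_timeout(port, timeout_ms=10000):
    # statement_timeout is passed through libpq's options parameter; the
    # server cancels the query after timeout_ms instead of letting the
    # client block forever inside execute().
    conn = psycopg2.connect(
        host="localhost", port=port, database="mydb", user="xxx",
        connect_timeout=10,
        options="-c statement_timeout={}".format(timeout_ms),
    )
    try:
        conn.set_session(autocommit=True)
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM mytable;")
            return cur.fetchall()
    finally:
        conn.close()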

@petermattis petermattis changed the title Cluster unavailable after one node hits disk errors stability: cluster unavailable after one node hits disk errors Jul 18, 2016
@ramanala
Author

ramanala commented Jul 18, 2016

Please note that subsequent requests go through fine if retried after waiting for a few tens of seconds. But the original request blocks forever. So this might be an issue with how the client handles connections and NOT a server-side problem. Updating the title to reflect this.
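
For illustration only (the helper below is hypothetical, and the pass/backoff values are arbitrary), a retry loop along these lines matches that behavior: each node is tried in turn, and failed attempts are retried after a pause rather than waiting on the original request forever. Note that connect_timeout only bounds the connection attempt, so a per-statement bound (as sketched earlier) is still needed to abandon a query that hangs after connecting.

import time

import psycopg2

def read_any_node(ports=("26257", "26258", "26259"), passes=5, backoff=10):
    for i in range(passes):
        for port in ports:
            try:
                conn = psycopg2.connect(host="localhost", port=port,
                                        database="mydb", user="xxx",
                                        connect_timeout=10)
                conn.set_session(autocommit=True)
                cur = conn.cursor()
                cur.execute("SELECT * FROM mytable;")
                rows = cur.fetchall()
                cur.close()
                conn.close()
                return rows
            except Exception as e:
                print('pass %d, port %s: %s' % (i, port, e))
        # No node answered on this pass; back off before sweeping again.
        time.sleep(backoff)
    raise RuntimeError("no node answered after %d passes" % passes)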

@ramanala ramanala changed the title stability: cluster unavailable after one node hits disk errors stability: client hangs after one node hits disk errors Jul 18, 2016
@petermattis petermattis added this to the 1.0 milestone Feb 22, 2017
@petermattis
Collaborator

A lot of changes have been made in lease handling (which is likely the cause of the tens-of-seconds wait before success), but handling of disk-full errors deserves further attention in 1.1.

@petermattis petermattis modified the milestones: 1.1, 1.0 Apr 20, 2017
@petermattis
Collaborator

@dianasaur323 We might want to schedule time to look at this in 1.2. Enhancements in this area are now beyond reach for 1.1 given the current point in the cycle.

@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 23, 2017
@dianasaur323 dianasaur323 self-assigned this Sep 18, 2017
@dianasaur323
Contributor

thanks for the heads up, @petermattis. Do you think this merits a t-shirt size in Airtable, or should I just allocate time for this to be addressed as part of bugfixing?

@petermattis
Collaborator

@dianasaur323 This is more than just a bug fix, as there are many code paths to inspect and a bit of design work to be done. At least an M-size project. Probably an L.

@dianasaur323
Contributor

Great, thanks @petermattis. I added in a placeholder in Airtable. That being said, this squarely falls into @awoods187's space, correct @nstewart?

@nstewart
Contributor

That's correct.

@dianasaur323
Contributor

kk, just assigned him in Airtable as well.

@petermattis petermattis modified the milestones: 2.0, 2.1 Feb 5, 2018
@knz knz added O-community Originated from the community C-investigation Further steps needed to qualify. C-label will change. and removed O-community-questions labels Apr 24, 2018
@tbg
Member

tbg commented Oct 11, 2018

In addressing this, we should also look at the case in which the log directory is on a separate partition and fills up/becomes otherwise unavailable. See #7646

@tbg
Member

tbg commented Oct 12, 2018

Some more discussion in #8473.

@awoods187
Contributor

I think we should not die when we run out of disk space. Instead, we should stop writing to disk and allow users to add more disk. Thoughts?

@petermattis
Collaborator

@awoods187 That behavior is clearly desirable, though it is much more difficult to achieve than it is to express. To give some sense of the difficulty, removing a replica from a node involves writing a small amount of data to disk to indicate that the replica was removed before the replica's data is actually deleted. In other words, we need to write to disk in order to free up space on disk.
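
One operational mitigation in this direction (not proposed in this thread, just a sketch) is to pre-allocate a ballast file on each store's filesystem while space is still plentiful; deleting it during an ENOSPC emergency restores enough headroom for the small writes that cleanup itself needs. The path and size below are made up, and os.posix_fallocate is not available on every platform:

import os

BALLAST = "data2/ballast"  # hypothetical path on the store's filesystem

def create_ballast(size_bytes=1 << 30):
    # Reserve real blocks up front (1 GiB here) so that deleting the file
    # later genuinely frees space rather than just unlinking a sparse file.
    fd = os.open(BALLAST, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)

def release_ballast():
    # In an ENOSPC emergency, drop the ballast to regain headroom for the
    # writes that cleanup requires (e.g. recording a replica removal).
    if os.path.exists(BALLAST):
        os.remove(BALLAST)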

@tbg tbg assigned tbg and unassigned m-schneider Dec 10, 2018
tbg added a commit to tbg/cockroach that referenced this issue Dec 10, 2018
We've had two recent incidents in which we saw clusters with disks
stalled on a subset of nodes in the cluster. This is a fairly
treacherous failure mode since

- the symptoms are nondescript: from the UI it often looks like a Raft
problem, logspy will freeze up as well, and so you waste some time until
you end up looking at the goroutine dump and notice the writes stuck in
syscall
- the node is in some semi-live state that borders the byzantine and
can cause further trouble for the part of the cluster that isn't
affected (we have some mitigations against this in place but not
enough, and need to improve our defense mechanisms).
- it's sudden and often can't be gleaned from the logs (since everything
is fine and then nothing ever completes so no "alertable" metrics are
emitted).

This commit introduces a simple mechanism that periodically checks for
these conditions (both on the engines and logging) and invokes a fatal
error if necessary.

The accompanying roachtest exercises both a data and a logging disk
stall.

Fixes cockroachdb#7882.
Fixes cockroachdb#32736.

Touches cockroachdb#7646.

Release note (bug fix): CockroachDB will error with a fatal exit when
data or logging partitions become unresponsive. Previously, the process
would remain running, though in an unresponsive state.
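
The gist of that mechanism can be illustrated with a rough Python sketch (the real implementation is Go code inside CockroachDB; the probe path, interval, and threshold below are invented): periodically write and fsync a tiny file on the monitored partition, and terminate the process if the write does not complete within a deadline, rather than letting the node limp along half-alive.

import os
import sys
import threading
import time

PROBE = "data0/_disk_stall_probe"   # hypothetical probe file on the data partition
INTERVAL_SECS = 10                  # how often to probe
MAX_STALL_SECS = 30                 # how long a probe may take before we give up

def probe_once():
    # A small synced write; on a healthy disk this returns almost instantly.
    with open(PROBE, "wb") as f:
        f.write(b"ok")
        f.flush()
        os.fsync(f.fileno())

def watchdog():
    while True:
        done = threading.Event()
        t = threading.Thread(target=lambda: (probe_once(), done.set()))
        t.daemon = True
        t.start()
        if not done.wait(MAX_STALL_SECS):
            # The write is stuck in a syscall: crash loudly rather than stay
            # in a semi-live, hard-to-diagnose state.
            sys.stderr.write("disk stall detected; exiting\n")
            os._exit(1)
        time.sleep(INTERVAL_SECS)
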
tbg added a commit to tbg/cockroach that referenced this issue Jan 3, 2019
craig bot pushed a commit that referenced this issue Jan 3, 2019
32978: storage: exit process when disks are stalled r=petermattis a=tbg

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig craig bot closed this as completed in #32978 Jan 3, 2019
8 participants