
storage: exit process when disks are stalled #32978

Merged
merged 1 commit into cockroachdb:master on Jan 3, 2019

Conversation

tbg
Member

@tbg tbg commented Dec 10, 2018

We've had two recent incidents in which we saw clusters with disks
stalled on a subset of nodes in the cluster. This is a fairly
treacherous failure mode since

  • the symptoms are nondescript: from the UI it often looks like a Raft
    problem, logspy will freeze up as well, and so you waste some time until
    you end up looking at the goroutine dump and notice the writes stuck in
    syscall
  • the node is in some semi-live state that borders on the byzantine and
    can cause further trouble for the part of the cluster that isn't
    affected (we have some mitigations against this in place, but not
    enough, and need to improve our defense mechanisms).
  • it's sudden and often can't be gleaned from the logs (since everything
    is fine and then nothing ever completes so no "alertable" metrics are
    emitted).

This commit introduces a simple mechanism that periodically checks for
these conditions (both on the engines and logging) and invokes a fatal
error if necessary.

The accompanying roachtest exercises both a data and a logging disk
stall.

Fixes #7882.
Fixes #32736.

Touches #7646.

Release note (bug fix): CockroachDB now exits with a fatal error when
data or logging partitions become unresponsive. Previously, the process
would remain running in an unresponsive state.
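
For readers unfamiliar with the approach, here is a minimal, self-contained Go sketch of the general idea described above. This is not the PR's actual code: the names (`stallCheckInterval`, `maxSyncDuration`, `checkDir`) and the use of a plain temp-file sync in place of the storage engine are assumptions for illustration. The actual change probes the engines and the logging sink; the watchdog shape is what matters here.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

const (
	stallCheckInterval = 5 * time.Second  // assumed: how often to probe each store directory
	maxSyncDuration    = 10 * time.Second // assumed: how long a probe may take before giving up
)

// checkDir writes and syncs a small file in dir. If the write or sync stalls
// for longer than maxSyncDuration, the watchdog timer fires and the process exits.
func checkDir(dir string) error {
	t := time.AfterFunc(maxSyncDuration, func() {
		// A stalled data disk can also stall the logging disk, so don't rely
		// on a log line making it out; print to stderr and exit.
		fmt.Fprintf(os.Stderr, "disk stall detected in %s, exiting\n", dir)
		os.Exit(1)
	})
	defer t.Stop()

	f, err := os.Create(filepath.Join(dir, ".health-check"))
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.WriteString("x"); err != nil {
		return err
	}
	return f.Sync() // the call that blocks when the disk is stalled
}

// monitorDisks probes every store directory on a fixed interval.
func monitorDisks(dirs []string) {
	ticker := time.NewTicker(stallCheckInterval)
	defer ticker.Stop()
	for range ticker.C {
		for _, dir := range dirs {
			if err := checkDir(dir); err != nil {
				log.Printf("health check on %s failed: %v", dir, err)
			}
		}
	}
}

func main() {
	stores := []string{os.TempDir()} // stand-in for the node's store directories
	go monitorDisks(stores)
	time.Sleep(2 * stallCheckInterval) // let a couple of probes run, then end the demo
}
```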

@tbg tbg requested a review from a team December 10, 2018 15:26
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg requested review from petermattis and knz December 10, 2018 15:27
for _, eng := range engines {
func() {
t := time.AfterFunc(maxDuration, func() {
log.Shout(ctx, log.Severity_FATAL, fmt.Sprintf(
Contributor

here and below: this is likely not to work when crash reporting is enabled. In that case, a FATAL severity will cause things to happen (including disk access) before the call to os.Exit.

So, either:

  • ensure the tests for this run with crash reporting enabled, or
  • modify the code further so that there is a guaranteed call to os.Exit even if the log.Shout never terminates, for example (see the sketch below):

time.AfterFunc(maxDuration, func ...)
time.AfterFunc(maxDuration + 1*time.Second, func() { os.Exit(...) })
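
A minimal, runnable sketch of that suggestion (not the PR's code; `maxDuration` and `withStallWatchdog` are invented names): the second timer guarantees the process exits even if emitting the fatal message itself blocks on the stalled disk. The key point is that os.Exit does no logging or deferred work, so it cannot itself be blocked by the disk.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const maxDuration = 10 * time.Second // assumed threshold for the monitored operation

// withStallWatchdog runs op and reports a stall if it takes longer than
// maxDuration. A second timer guarantees that the process exits even if
// emitting the report itself blocks on the stalled disk.
func withStallWatchdog(op func() error) error {
	report := time.AfterFunc(maxDuration, func() {
		fmt.Fprintln(os.Stderr, "operation exceeded deadline; disk may be stalled")
	})
	defer report.Stop()

	exit := time.AfterFunc(maxDuration+time.Second, func() {
		os.Exit(2) // last resort: don't depend on the message above making it out
	})
	defer exit.Stop()

	return op()
}

func main() {
	// Stand-in for a small synced write against the data directory.
	err := withStallWatchdog(func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```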

Member Author

Ah, I had reviewed log.Fatal and saw that it had a case for this and assumed things were handled already. I'll make sure to adopt your suggestion.

Collaborator

@petermattis petermattis left a comment

:lgtm: modulo the problem with LogData not actually doing anything.



pkg/server/server_engine_health.go, line 41 at r1 (raw file):

				t.Read = true
				t.Reset(engineHealthCheckInterval)
				assertLoggingHealth(ctx, engineHealthCheckInterval)

Did you intend to use engineHealthCheckInterval for assertLoggingHealth? Seems ok, though perhaps that should be renamed as diskHealthCheckInterval.


pkg/server/server_engine_health.go, line 82 at r1 (raw file):

	defer timer.Stop()

	log.Flush()

Doesn't the logging system already have a periodic flush? I believe it just doesn't have the detection of a flush taking an excessively long period of time.

Contributor

@bdarnell bdarnell left a comment

LGTM, but this reminds me of the server.clock.persist_upper_bound_interval setting. One of the main reasons we didn't enable that by default (see #25109) is the performance impact of additional disk writes. If we're going to write to disk frequently anyway, maybe we should just add a timeout to it and turn that on by default. (and reduce the number of goroutines performing synced writes to disk).

Reviewed 6 of 6 files at r1.


pkg/server/server_engine_health.go, line 29 at r1 (raw file):

)

var engineHealthCheckInterval = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_HEALTH_CHECK_INTERVAL", 5*time.Second)

Why an env var instead of a cluster setting?


pkg/server/server_engine_health.go, line 82 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Doesn't the logging system already have a periodic flush? I believe it just doesn't have the detection of a flush taking an excessively long period of time.

Yeah, the log package has a goroutine flushing every 30s. This one will be more expensive (or at least more frequent) at 5s. I'm not sure the cost is a problem, but it doesn't make sense having multiple goroutines performing synchronous log flushes so I'd just add a timeout to the existing log flusher goroutine.
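
A rough sketch of that idea, assuming the standard library in place of CockroachDB's log package (the 30s cadence comes from the comment above; `flushDeadline` and the function names are invented placeholders): the existing flush loop keeps its schedule, but each flush is wrapped in a watchdog timer that exits the process if the flush never returns, so no extra goroutine doing synced writes is needed.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	flushInterval = 30 * time.Second // the existing flush cadence mentioned above
	flushDeadline = 10 * time.Second // assumed: how long one flush may take before the disk is treated as stalled
)

// flushLogs stands in for the log package's synchronous flush-and-sync.
func flushLogs() {
	_ = os.Stderr.Sync()
}

// flushWithWatchdog performs one flush and exits the process if it does not
// return within flushDeadline.
func flushWithWatchdog() {
	t := time.AfterFunc(flushDeadline, func() {
		fmt.Fprintln(os.Stderr, "log flush appears stalled; exiting")
		os.Exit(2)
	})
	defer t.Stop()
	flushLogs()
}

// flushDaemon is how the existing periodic loop would pick up the watchdog.
func flushDaemon() {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()
	for range ticker.C {
		flushWithWatchdog()
	}
}

func main() {
	go flushDaemon()
	// A single direct call is enough to demonstrate the watchdog in this demo.
	flushWithWatchdog()
}
```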

@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 09f848a to f93a0c3 Compare December 13, 2018 13:16
Member Author

@tbg tbg left a comment

> LGTM, but this reminds me of the server.clock.persist_upper_bound_interval setting.

Yeah, I can see that, but just jamming this logic in there and turning it on by default is more than I have bandwidth for right now. I agree in principle, though.

Dismissed @knz from a discussion.


pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why an env var instead of a cluster setting?

It didn't seem like something that's reasonable to change. If anything, you want an escape hatch for disabling it, which this is. It gets trickier to make this a cluster setting now that we're doing something similar in the log package.


pkg/server/server_engine_health.go, line 41 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you intend to use engineHealthCheckInterval for assertLoggingHealth? Seems ok, though perhaps that should be renamed as diskHealthCheckInterval.

Moot now that logging is doing its own thing.


pkg/server/server_engine_health.go, line 82 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, the log package has a goroutine flushing every 30s. This one will be more expensive (or at least more frequent) at 5s. I'm not sure the cost is a problem, but it doesn't make sense having multiple goroutines performing synchronous log flushes so I'd just add a timeout to the existing log flusher goroutine.

Moved the logging logic to the periodic flusher.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 13, 2018
Committing a batch that only contained LogData was accidentally turned
into a no-op since the batch was erroneously considered empty.

The change (in the preceding commit) makes cockroachdb#24591 not a no-op for the
first time since that PR was merged.

This bug was discovered during the following commit (see cockroachdb#32978) when
the roachtest introduced therein would not detect a disk stall (assuming
the logging partition wasn't also affected).

Closes cockroachdb#32986.

Release note: None
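
To make the bug described above concrete, here is a toy sketch (not the actual engine code; the `Batch` type and its fields are invented for illustration): an emptiness check that ignores LogData entries silently turns a LogData-only commit into a no-op, while a check that also counts LogData commits the batch as intended.

```go
package main

import "fmt"

// Batch is a toy write batch that records both ordinary mutations and
// LogData entries (payloads written to the WAL but not to the store itself).
type Batch struct {
	mutations int
	logData   int
}

func (b *Batch) Put(key, value string) { b.mutations++ }
func (b *Batch) LogData(data []byte)   { b.logData++ }

// emptyBuggy ignores LogData, so a batch holding only LogData looks empty
// and committing it silently becomes a no-op (the bug described above).
func (b *Batch) emptyBuggy() bool { return b.mutations == 0 }

// empty treats LogData as content that must reach the WAL (the fix).
func (b *Batch) empty() bool { return b.mutations == 0 && b.logData == 0 }

// Commit applies the batch unless the provided emptiness check says there is
// nothing to do.
func (b *Batch) Commit(isEmpty func() bool) {
	if isEmpty() {
		fmt.Println("skipping commit: batch considered empty")
		return
	}
	fmt.Println("committing batch to the WAL")
}

func main() {
	b := &Batch{}
	b.LogData([]byte("marker"))
	b.Commit(b.emptyBuggy) // skipping commit: batch considered empty
	b.Commit(b.empty)      // committing batch to the WAL
}
```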
@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from f93a0c3 to f1be469 Compare December 13, 2018 15:50
Contributor

@bdarnell bdarnell left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It didn't seem like anything that's reasonable to change. If anything you want an escape hatch for disabling it, which this is. It gets trickier to do a cluster setting now that we're doing something similar in the log package.

But the log package isn't using this env var now. The log flush loop doesn't really bother me because it becomes a no-op if no log messages have been written (or at least it should), so really it's just deferred I/O from logging and an idle cluster won't do anything there. This one is generating pure new load so there's more reason to want to control it. I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

Collaborator

@petermattis petermattis left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

But the log package isn't using this env var now. The log flush loop doesn't really bother me because it becomes a no-op if no log messages have been written (or at least it should), so really it's just deferred I/O from logging and an idle cluster won't do anything there. This one is generating pure new load so there's more reason to want to control it. I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

FYI, a single disk write can be delayed for a significant period of time if the OS is writing a lot of dirty data. I'll point you to the private issue about this. It might be worthwhile tolerating longer write delays.


pkg/storage/node_liveness.go, line 779 at r1 (raw file):

				errors.Wrapf(err, "unable to update node liveness")
			}
		}

I'm curious why you removed this. Don't we still want to check that the store can perform a write quickly before updating liveness? Perhaps you should incorporate the engine health check here. That would also get rid of the env variable which Ben dislikes.
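
Purely to illustrate this suggestion (it is not what the PR ends up doing; `heartbeatLiveness`, `syncEngine`, and `maxEngineSyncDuration` are invented names): gate each liveness heartbeat on a small synced engine write, so a node with a stalled disk stops renewing its liveness record and other nodes can take over its leases.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxEngineSyncDuration = 5 * time.Second // assumed threshold

// syncEngine stands in for a small synced write against the store's engine.
func syncEngine() error {
	time.Sleep(10 * time.Millisecond)
	return nil
}

// heartbeatLiveness renews the liveness record only if the local engine can
// complete a synced write within the deadline. A node with a stalled disk
// stops heartbeating, its record expires, and its leases can move elsewhere.
func heartbeatLiveness() error {
	done := make(chan error, 1)
	go func() { done <- syncEngine() }()
	select {
	case err := <-done:
		if err != nil {
			return err
		}
	case <-time.After(maxEngineSyncDuration):
		return errors.New("engine sync timed out; skipping liveness heartbeat")
	}
	fmt.Println("liveness record renewed")
	return nil
}

func main() {
	if err := heartbeatLiveness(); err != nil {
		fmt.Println(err)
	}
}
```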

tbg added a commit to tbg/cockroach that referenced this pull request Dec 14, 2018
Member Author

@tbg tbg left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

> I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

Ok, I removed it. I don't want to make it a cluster setting since it seems we're going to want to fold this elsewhere in the future. 🤞 that we never have to disable it.

> It might be worthwhile tolerating longer write delays.

I'm not sure. We handle gray failure poorly. I'm generally wary of inserting process termination into the system, but if a node regularly (or just once) got 10s stalls this would send ripples through the cluster that we would then spend significant resources investigating. At that point I'd rather know (and I had made this an env var to have an escape hatch in case this is really wanted or needed at that point in time, though Ben made me remove it). Perhaps logging can take the place of a panic, though it becomes difficult to assure that the log messages make it out. I'd personally prefer to leave this to someone else. I'm already less satisfied with the compromise than with what I had initially, so I'm more than happy to leave further adjustments to you and/or Ben (who may also not be on the same page).


pkg/storage/node_liveness.go, line 779 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm curious why you removed this. Don't we still want to check that the store can perform a write quickly before updating liveness? Perhaps you should incorporate the engine health check here. That would also get rid of the env variable which Ben dislikes.

I removed this because it was obsolete and Ben was concerned about syncing unnecessarily frequently. I conceptually dislike intertwining code like that, but I suppose it could be justified. I think you and Ben should take a second pass at this after it merges if you feel strongly enough, the main contribution here is the roachtest anyway.

Member Author

@tbg tbg left a comment

TFTR!
PS @petermattis do you want to take this over? You seem to have concerns and I'm looking to get this one over with.


Collaborator

@petermattis petermattis left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

Ok, I removed it. I don't want to make it a cluster setting since it seems we're going to want to fold this elsewhere in the future. 🤞 that we never have to disable it.

It might be worthwhile tolerating longer write delays.

I'm not sure. We handle gray failure poorly. I'm generally wary of inserting process termination into the system, but if a node regularly (or just once) got 10s stalls this would send ripples through the cluster that we would then spend significant resources investigating. At that point I'd rather know (and I had made this an env var to have an escape hatch in case this is really wanted or needed at that point in time, though Ben made me remove it). Perhaps logging can take the place of a panic, though it becomes difficult to assure that the log messages make it out. I'd personally prefer to leave this to someone else. I'm already less satisfied with the compromise than with what I had initially, so I'm more than happy to leave further adjustments to you and/or Ben (who may also not be on the same page).

Ok. Let's go with this for now. We'll know it is in there and can keep an eye on whether this check is overly sensitive to external disturbances on the machine.


pkg/storage/node_liveness.go, line 779 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I removed this because it was obsolete and Ben was concerned about syncing unnecessarily frequently. I conceptually dislike intertwining code like that, but I suppose it could be justified. I think you and Ben should take a second pass at this after it merges if you feel strongly enough, the main contribution here is the roachtest anyway.

Ack. Thinking about this more, I think this is fine. Node liveness might think a node is alive for a few more seconds when it can't actually write to disk, but that should be ok.

@tbg tbg force-pushed the fix/fatal-on-slow-disk branch 3 times, most recently from 57e545f to 1d664c4 Compare January 2, 2019 20:42
@tbg
Member Author

tbg commented Jan 2, 2019

Minor fixes and made sure the roachtest actually works. I needed to reintroduce env vars because the roachtest needs them (I could futz with charybde but not worth it at the moment).

bors r=petermattis

@craig
Contributor

craig bot commented Jan 2, 2019

Build failed

@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 1d664c4 to 08c5994 Compare January 3, 2019 09:39
@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 08c5994 to 5b0dd65 Compare January 3, 2019 12:44
@tbg tbg requested a review from a team as a code owner January 3, 2019 12:44
@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 5b0dd65 to d591a49 Compare January 3, 2019 13:49
@tbg
Member Author

tbg commented Jan 3, 2019

Got the acceptance tests passing (h/t @knz)

bors r=petermattis

craig bot pushed a commit that referenced this pull request Jan 3, 2019
32978: storage: exit process when disks are stalled r=petermattis a=tbg


Co-authored-by: Tobias Schottdorf <[email protected]>
@craig
Contributor

craig bot commented Jan 3, 2019

Build succeeded

@craig craig bot merged commit d591a49 into cockroachdb:master Jan 3, 2019
@tbg tbg deleted the fix/fatal-on-slow-disk branch January 3, 2019 14:43
irfansharif pushed a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in cockroachdb#24591 (+ cockroachdb#33122). This was undone in cockroachdb#32978
(which introduced a stricter version of a similar check). cockroachdb#32978 was
later disabled by default in cockroachdb#36484, leaving us without the protections
first introduced in cockroachdb#24591. This PR re-adds the logic from cockroachdb#24591.

Release note: None.
craig bot pushed a commit that referenced this pull request Oct 21, 2019
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy

Ensure the cluster is fully upgraded when running import.

Release note: ensure cluster fully upgraded when running import.

41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state around the zone that a visited
range is in and reuse that state if they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and tell
that to the visitors.

Without this patch, generating the reports was pinning a core for
minutes for a cluster with partitioning and 200k ranges. The profiles
showed that it's all zone config unmarshalling.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.

41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif

Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in #24591 (+ #33122). This was undone in #32978
(which introduced a stricter version of a similar check). #32978 was
later disabled by default in #36484, leaving us without the protections
first introduced in #24591. This PR re-adds the logic from #24591.

Part of #41683.

Release note: None.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
arulajmani pushed a commit to arulajmani/cockroach that referenced this pull request Oct 29, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Apr 7, 2020
This pattern is bad:

```go
func NewServer() *Server {
  s := &Server{}
  s.a = newA()
  s.b = newB(s.a, s.c)
  s.c = newC()
  return s
}
```

Note how `newB` gets called with `s.c == nil`. Creating a partially
populated `Server` struct early is not a good idea; there is no
protection against using fields before they are populated.

Instead, use this pattern:

```go
func NewServer() *Server {
  a := newA()
  c := newC()
  b := newB(a, c)
  return &Server{a: a, b: b, c: c}
}
```

Note how the bug from the first example would be caught at compile time
because we can't use something before it's defined.

In fact, carrying out this refactor revealed a bug: we were always
passing a `nil` engines slice to `NewNodeLiveness`, in effect disabling
the disk stall protection in cockroachdb#32978.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Apr 8, 2020
Successfully merging this pull request may close these issues:

  • Frozen disk leads to unavailability
  • stability: client hangs after one node hits disk errors