
storage: exit process when disks are stalled #32978

Merged
merged 1 commit into cockroachdb:master on Jan 3, 2019

Conversation

tbg
Member

@tbg tbg commented Dec 10, 2018

We've had two recent incidents in which we saw clusters with disks
stalled on a subset of nodes in the cluster. This is a fairly
treacherous failure mode since

  • the symptoms are nondescript: from the UI it often looks like a Raft
    problem, logspy will freeze up as well, and so you waste some time until
    you end up looking at the goroutine dump and notice the writes stuck in
    syscall
  • the node is in some semi-live state that borders on the byzantine and
    can cause further trouble for the part of the cluster that isn't
    affected (we have some mitigations against this in place, but not
    enough, and need to improve our defense mechanisms).
  • it's sudden and often can't be gleaned from the logs (since everything
    is fine and then nothing ever completes so no "alertable" metrics are
    emitted).

This commit introduces a simple mechanism that periodically checks for
these conditions (both on the engines and logging) and invokes a fatal
error if necessary.

The accompanying roachtest exercises both a data and a logging disk
stall.

Fixes #7882.
Fixes #32736.

Touches #7646.

Release note (bug fix): CockroachDB now exits with a fatal error when
data or logging partitions become unresponsive. Previously, the process
would remain running in an unresponsive state.
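
For readers unfamiliar with the approach, here is a minimal, self-contained Go sketch of the general idea described above. This is not the PR's actual code: the names (`stallCheckInterval`, `maxSyncDuration`, `checkDir`) and the use of a plain temp-file sync in place of the storage engine are assumptions for illustration. The actual change probes the engines and the logging sink; the watchdog shape is what matters here.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

const (
	stallCheckInterval = 5 * time.Second  // assumed: how often to probe each store directory
	maxSyncDuration    = 10 * time.Second // assumed: how long a probe may take before giving up
)

// checkDir writes and syncs a small file in dir. If the write or sync stalls
// for longer than maxSyncDuration, the watchdog timer fires and the process exits.
func checkDir(dir string) error {
	t := time.AfterFunc(maxSyncDuration, func() {
		// A stalled data disk can also stall the logging disk, so don't rely
		// on a log line making it out; print to stderr and exit.
		fmt.Fprintf(os.Stderr, "disk stall detected in %s, exiting\n", dir)
		os.Exit(1)
	})
	defer t.Stop()

	f, err := os.Create(filepath.Join(dir, ".health-check"))
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.WriteString("x"); err != nil {
		return err
	}
	return f.Sync() // the call that blocks when the disk is stalled
}

// monitorDisks probes every store directory on a fixed interval.
func monitorDisks(dirs []string) {
	ticker := time.NewTicker(stallCheckInterval)
	defer ticker.Stop()
	for range ticker.C {
		for _, dir := range dirs {
			if err := checkDir(dir); err != nil {
				log.Printf("health check on %s failed: %v", dir, err)
			}
		}
	}
}

func main() {
	stores := []string{os.TempDir()} // stand-in for the node's store directories
	go monitorDisks(stores)
	time.Sleep(2 * stallCheckInterval) // let a couple of probes run, then end the demo
}
```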

@tbg tbg requested a review from a team December 10, 2018 15:26
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg requested review from petermattis and knz December 10, 2018 15:27
for _, eng := range engines {
func() {
t := time.AfterFunc(maxDuration, func() {
log.Shout(ctx, log.Severity_FATAL, fmt.Sprintf(
Contributor

here and below: this is likely not to work when crash reporting is enabled. In that case, a FATAL severity will cause things to happen (including disk access) before the call to os.Exit.

So, either:

  • ensure the tests for this run with crash reporting enabled, or
  • modify the code further so that there is a guaranteed call to os.Exit even if the log.Shout never terminates, for example (see the sketch below):

time.AfterFunc(maxDuration, func ...)
time.AfterFunc(maxDuration + 1*time.Second, func() { os.Exit(...) })
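
A minimal, runnable sketch of that suggestion (not the PR's code; `maxDuration` and `withStallWatchdog` are invented names): the second timer guarantees the process exits even if emitting the fatal message itself blocks on the stalled disk. The key point is that os.Exit does no logging or deferred work, so it cannot itself be blocked by the disk.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const maxDuration = 10 * time.Second // assumed threshold for the monitored operation

// withStallWatchdog runs op and reports a stall if it takes longer than
// maxDuration. A second timer guarantees that the process exits even if
// emitting the report itself blocks on the stalled disk.
func withStallWatchdog(op func() error) error {
	report := time.AfterFunc(maxDuration, func() {
		fmt.Fprintln(os.Stderr, "operation exceeded deadline; disk may be stalled")
	})
	defer report.Stop()

	exit := time.AfterFunc(maxDuration+time.Second, func() {
		os.Exit(2) // last resort: don't depend on the message above making it out
	})
	defer exit.Stop()

	return op()
}

func main() {
	// Stand-in for a small synced write against the data directory.
	err := withStallWatchdog(func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```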

Member Author

Ah, I had reviewed log.Fatal and saw that it had a case for this and assumed things were handled already. I'll make sure to adopt your suggestion.

Collaborator

@petermattis petermattis left a comment

:lgtm: modulo the problem with LogData not actually doing anything.



pkg/server/server_engine_health.go, line 41 at r1 (raw file):

				t.Read = true
				t.Reset(engineHealthCheckInterval)
				assertLoggingHealth(ctx, engineHealthCheckInterval)

Did you intend to use engineHealthCheckInterval for assertLoggingHealth? Seems ok, though perhaps that should be renamed as diskHealthCheckInterval.


pkg/server/server_engine_health.go, line 82 at r1 (raw file):

	defer timer.Stop()

	log.Flush()

Doesn't the logging system already have a periodic flush? I believe it just doesn't have the detection of a flush taking an excessively long period of time.

Contributor

@bdarnell bdarnell left a comment

LGTM, but this reminds me of the server.clock.persist_upper_bound_interval setting. One of the main reasons we didn't enable that by default (see #25109) is the performance impact of additional disk writes. If we're going to write to disk frequently anyway, maybe we should just add a timeout to it and turn that on by default. (and reduce the number of goroutines performing synced writes to disk).

Reviewed 6 of 6 files at r1.


pkg/server/server_engine_health.go, line 29 at r1 (raw file):

)

var engineHealthCheckInterval = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_HEALTH_CHECK_INTERVAL", 5*time.Second)

Why an env var instead of a cluster setting?


pkg/server/server_engine_health.go, line 82 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Doesn't the logging system already have a periodic flush? I believe it just doesn't have the detection of a flush taking an excessively long period of time.

Yeah, the log package has a goroutine flushing every 30s. This one will be more expensive (or at least more frequent) at 5s. I'm not sure the cost is a problem, but it doesn't make sense having multiple goroutines performing synchronous log flushes so I'd just add a timeout to the existing log flusher goroutine.
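
A rough sketch of that idea, assuming the standard library in place of CockroachDB's log package (the 30s cadence comes from the comment above; `flushDeadline` and the function names are invented placeholders): the existing flush loop keeps its schedule, but each flush is wrapped in a watchdog timer that exits the process if the flush never returns, so no extra goroutine doing synced writes is needed.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	flushInterval = 30 * time.Second // the existing flush cadence mentioned above
	flushDeadline = 10 * time.Second // assumed: how long one flush may take before the disk is treated as stalled
)

// flushLogs stands in for the log package's synchronous flush-and-sync.
func flushLogs() {
	_ = os.Stderr.Sync()
}

// flushWithWatchdog performs one flush and exits the process if it does not
// return within flushDeadline.
func flushWithWatchdog() {
	t := time.AfterFunc(flushDeadline, func() {
		fmt.Fprintln(os.Stderr, "log flush appears stalled; exiting")
		os.Exit(2)
	})
	defer t.Stop()
	flushLogs()
}

// flushDaemon is how the existing periodic loop would pick up the watchdog.
func flushDaemon() {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()
	for range ticker.C {
		flushWithWatchdog()
	}
}

func main() {
	go flushDaemon()
	// A single direct call is enough to demonstrate the watchdog in this demo.
	flushWithWatchdog()
}
```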

@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 09f848a to f93a0c3 Compare December 13, 2018 13:16
Member Author

@tbg tbg left a comment

> LGTM, but this reminds me of the server.clock.persist_upper_bound_interval setting.

Yeah, I can see that, but just jamming this logic in there and turning it on by default is more than I have bandwidth for right now. I agree in principle, though.

Dismissed @knz from a discussion.


pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why an env var instead of a cluster setting?

It didn't seem like something that's reasonable to change. If anything, you want an escape hatch for disabling it, which this is. It gets trickier to make this a cluster setting now that we're doing something similar in the log package.


pkg/server/server_engine_health.go, line 41 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you intend to use engineHealthCheckInterval for assertLoggingHealth? Seems ok, though perhaps that should be renamed as diskHealthCheckInterval.

Moot now that logging is doing its own thing.


pkg/server/server_engine_health.go, line 82 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, the log package has a goroutine flushing every 30s. This one will be more expensive (or at least more frequent) at 5s. I'm not sure the cost is a problem, but it doesn't make sense having multiple goroutines performing synchronous log flushes so I'd just add a timeout to the existing log flusher goroutine.

Moved the logging logic to the periodic flusher.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 13, 2018
Committing a batch that only contained LogData was accidentally turned
into a no-op since the batch was erroneously considered empty.

The change (in the preceding commit) makes cockroachdb#24591 not a no-op for the
first time since that PR was merged.

This bug was discovered during the following commit (see cockroachdb#32978) when
the roachtest introduced therein would not detect a disk stall (assuming
the logging partition wasn't also affected).

Closes cockroachdb#32986.

Release note: None
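
To make the bug described above concrete, here is a toy sketch (not the actual engine code; the `Batch` type and its fields are invented for illustration): an emptiness check that ignores LogData entries silently turns a LogData-only commit into a no-op, while a check that also counts LogData commits the batch as intended.

```go
package main

import "fmt"

// Batch is a toy write batch that records both ordinary mutations and
// LogData entries (payloads written to the WAL but not to the store itself).
type Batch struct {
	mutations int
	logData   int
}

func (b *Batch) Put(key, value string) { b.mutations++ }
func (b *Batch) LogData(data []byte)   { b.logData++ }

// emptyBuggy ignores LogData, so a batch holding only LogData looks empty
// and committing it silently becomes a no-op (the bug described above).
func (b *Batch) emptyBuggy() bool { return b.mutations == 0 }

// empty treats LogData as content that must reach the WAL (the fix).
func (b *Batch) empty() bool { return b.mutations == 0 && b.logData == 0 }

// Commit applies the batch unless the provided emptiness check says there is
// nothing to do.
func (b *Batch) Commit(isEmpty func() bool) {
	if isEmpty() {
		fmt.Println("skipping commit: batch considered empty")
		return
	}
	fmt.Println("committing batch to the WAL")
}

func main() {
	b := &Batch{}
	b.LogData([]byte("marker"))
	b.Commit(b.emptyBuggy) // skipping commit: batch considered empty
	b.Commit(b.empty)      // committing batch to the WAL
}
```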
@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from f93a0c3 to f1be469 Compare December 13, 2018 15:50
Contributor

@bdarnell bdarnell left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It didn't seem like anything that's reasonable to change. If anything you want an escape hatch for disabling it, which this is. It gets trickier to do a cluster setting now that we're doing something similar in the log package.

But the log package isn't using this env var now. The log flush loop doesn't really bother me because it becomes a no-op if no log messages have been written (or at least it should), so really it's just deferred I/O from logging and an idle cluster won't do anything there. This one is generating pure new load so there's more reason to want to control it. I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

Collaborator

@petermattis petermattis left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

But the log package isn't using this env var now. The log flush loop doesn't really bother me because it becomes a no-op if no log messages have been written (or at least it should), so really it's just deferred I/O from logging and an idle cluster won't do anything there. This one is generating pure new load so there's more reason to want to control it. I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

FYI, a single disk write can be delayed for a significant period of time if the OS is writing a lot of dirty data. I'll point you to the private issue about this. It might be worthwhile tolerating longer write delays.


pkg/storage/node_liveness.go, line 779 at r1 (raw file):

				errors.Wrapf(err, "unable to update node liveness")
			}
		}

I'm curious why you removed this. Don't we still want to check that the store can perform a write quickly before updating liveness? Perhaps you should incorporate the engine health check here. That would also get rid of the env variable which Ben dislikes.
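
Purely to illustrate this suggestion (it is not what the PR ends up doing; `heartbeatLiveness`, `syncEngine`, and `maxEngineSyncDuration` are invented names): gate each liveness heartbeat on a small synced engine write, so a node with a stalled disk stops renewing its liveness record and other nodes can take over its leases.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxEngineSyncDuration = 5 * time.Second // assumed threshold

// syncEngine stands in for a small synced write against the store's engine.
func syncEngine() error {
	time.Sleep(10 * time.Millisecond)
	return nil
}

// heartbeatLiveness renews the liveness record only if the local engine can
// complete a synced write within the deadline. A node with a stalled disk
// stops heartbeating, its record expires, and its leases can move elsewhere.
func heartbeatLiveness() error {
	done := make(chan error, 1)
	go func() { done <- syncEngine() }()
	select {
	case err := <-done:
		if err != nil {
			return err
		}
	case <-time.After(maxEngineSyncDuration):
		return errors.New("engine sync timed out; skipping liveness heartbeat")
	}
	fmt.Println("liveness record renewed")
	return nil
}

func main() {
	if err := heartbeatLiveness(); err != nil {
		fmt.Println(err)
	}
}
```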

tbg added a commit to tbg/cockroach that referenced this pull request Dec 14, 2018
Member Author

@tbg tbg left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

> I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

Ok, I removed it. I don't want to make it a cluster setting since it seems we're going to want to fold this elsewhere in the future. 🤞 that we never have to disable it.

> It might be worthwhile tolerating longer write delays.

I'm not sure. We handle gray failure poorly. I'm generally wary of inserting process termination into the system, but if a node regularly (or just once) got 10s stalls this would send ripples through the cluster that we would then spend significant resources investigating. At that point I'd rather know (and I had made this an env var to have an escape hatch in case this is really wanted or needed at that point in time, though Ben made me remove it). Perhaps logging can take the place of a panic, though it becomes difficult to assure that the log messages make it out. I'd personally prefer to leave this to someone else. I'm already less satisfied with the compromise than with what I had initially, so I'm more than happy to leave further adjustments to you and/or Ben (who may also not be on the same page).


pkg/storage/node_liveness.go, line 779 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm curious why you removed this. Don't we still want to check that the store can perform a write quickly before updating liveness? Perhaps you should incorporate the engine health check here. That would also get rid of the env variable which Ben dislikes.

I removed this because it was obsolete and Ben was concerned about syncing unnecessarily frequently. I conceptually dislike intertwining code like that, but I suppose it could be justified. I think you and Ben should take a second pass at this after it merges if you feel strongly enough, the main contribution here is the roachtest anyway.

Member Author

@tbg tbg left a comment

TFTR!
PS @petermattis do you want to take this over? You seem to have concerns and I'm looking to get this one over with.


Collaborator

@petermattis petermattis left a comment



pkg/server/server_engine_health.go, line 29 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.

Ok, I removed it. I don't want to make it a cluster setting since it seems we're going to want to fold this elsewhere in the future. 🤞 that we never have to disable it.

It might be worthwhile tolerating longer write delays.

I'm not sure. We handle gray failure poorly. I'm generally wary of inserting process termination into the system, but if a node regularly (or just once) got 10s stalls this would send ripples through the cluster that we would then spend significant resources investigating. At that point I'd rather know (and I had made this an env var to have an escape hatch in case this is really wanted or needed at that point in time, though Ben made me remove it). Perhaps logging can take the place of a panic, though it becomes difficult to assure that the log messages make it out. I'd personally prefer to leave this to someone else. I'm already less satisfied with the compromise than with what I had initially, so I'm more than happy to leave further adjustments to you and/or Ben (who may also not be on the same page).

Ok. Let's go with this for now. We'll know it is in there and can keep an eye on whether this check is overly sensitive to external disturbances on the machine.


pkg/storage/node_liveness.go, line 779 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I removed this because it was obsolete and Ben was concerned about syncing unnecessarily frequently. I conceptually dislike intertwining code like that, but I suppose it could be justified. I think you and Ben should take a second pass at this after it merges if you feel strongly enough, the main contribution here is the roachtest anyway.

Ack. Thinking about this more, I think this is fine. Node liveness might think a node is alive for a few more seconds when it can't actually write to disk, but that should be ok.

@tbg tbg force-pushed the fix/fatal-on-slow-disk branch 3 times, most recently from 57e545f to 1d664c4 Compare January 2, 2019 20:42
@tbg
Member Author

tbg commented Jan 2, 2019

Minor fixes and made sure the roachtest actually works. I needed to reintroduce env vars because the roachtest needs them (I could futz with charybde but not worth it at the moment).

bors r=petermattis

@craig
Contributor

craig bot commented Jan 2, 2019

Build failed

@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 1d664c4 to 08c5994 Compare January 3, 2019 09:39
@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 08c5994 to 5b0dd65 Compare January 3, 2019 12:44
@tbg tbg requested a review from a team as a code owner January 3, 2019 12:44
@tbg tbg force-pushed the fix/fatal-on-slow-disk branch from 5b0dd65 to d591a49 Compare January 3, 2019 13:49
@tbg
Member Author

tbg commented Jan 3, 2019

Got the acceptance tests passing (h/t @knz)

bors r=petermattis

craig bot pushed a commit that referenced this pull request Jan 3, 2019
32978: storage: exit process when disks are stalled r=petermattis a=tbg


Co-authored-by: Tobias Schottdorf <[email protected]>
@craig
Contributor

craig bot commented Jan 3, 2019

Build succeeded

@craig craig bot merged commit d591a49 into cockroachdb:master Jan 3, 2019
@tbg tbg deleted the fix/fatal-on-slow-disk branch January 3, 2019 14:43
irfansharif pushed a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in cockroachdb#24591 (+ cockroachdb#33122). This was undone in cockroachdb#32978
(which introduced a stricter version of a similar check). cockroachdb#32978 was
later disabled by default in cockroachdb#36484, leaving us without the protections
first introduced in cockroachdb#24591. This PR re-adds the logic from cockroachdb#24591.

Release note: None.
craig bot pushed a commit that referenced this pull request Oct 21, 2019
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy

Ensure the cluster is fully upgraded when running import.

Release note: ensure cluster fully upgraded when running import.

41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state around the zone that a visited
range is in and reuse that state if they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and tell
that to the visitors.

Without this patch, generating the reports was pinning a core for
minutes for a cluster with partitioning and 200k ranges. The profiles
showed that it's all zone config unmarshalling.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.

41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif

Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in #24591 (+ #33122). This was undone in #32978
(which introduced a stricter version of a similar check). #32978 was
later disabled by default in #36484, leaving us without the protections
first introduced in #24591. This PR re-adds the logic from #24591.

Part of #41683.

Release note: None.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 21, 2019
arulajmani pushed a commit to arulajmani/cockroach that referenced this pull request Oct 29, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Apr 7, 2020
This pattern is bad:

```go
func NewServer() *Server {
  s := &Server{}
  s.a = newA()
  s.b = newB(s.a, s.c)
  s.c = newC()
  return s
}
```

Note how `newB` gets called with `s.c == nil`. Creating a partially
populated `Server` struct early is not a good idea; there is no
protection against using fields before they are populated.

Instead, use this pattern:

```go
func NewServer() *Server {
  a := newA()
  c := newC()
  b := newB(a, c)
  return &Server{a: a, b: b, c: c}
}
```

Note how the bug from the first example would be caught at compile time
because we can't use something before it's defined.

In fact, carrying out this refactor revealed a bug: we were always
passing a `nil` engines slice to `NewNodeLiveness`, in effect disabling
the disk stall protection in cockroachdb#32978.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Apr 8, 2020
Successfully merging this pull request may close these issues:

  • Frozen disk leads to unavailability
  • stability: client hangs after one node hits disk errors