storage: exit process when disks are stalled #32978
Conversation
pkg/server/server_engine_health.go (outdated diff)

for _, eng := range engines {
	func() {
		t := time.AfterFunc(maxDuration, func() {
			log.Shout(ctx, log.Severity_FATAL, fmt.Sprintf(
here and below: this is likely not to work when crash reporting is enabled. In that case, a FATAL severity will cause things to happen (including disk access) before the call to os.Exit.
So, either:
- ensure the tests for this are run with crash reporting enabled, or
- modify the code further so that there is a guaranteed call to os.Exit even if the log.Shout never terminates. For example:
time.AfterFunc(maxDuration, func ...)
time.AfterFunc(maxDuration + 1*time.Second, func() { os.Exit(...) })
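For illustration, a minimal self-contained sketch of the suggested two-timer shape (the checkDisk helper, the probe write, and the exit code are hypothetical; the actual PR shouts a FATAL over the store engines rather than using the standard library logger):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// checkDisk runs one synced write via writeFn and terminates the process if
// it does not complete within maxDuration. A second, slightly later timer
// guarantees os.Exit even if the fatal logging path itself blocks on the
// stalled disk (e.g. because crash reporting writes to it).
func checkDisk(name string, maxDuration time.Duration, writeFn func() error) {
	fatalTimer := time.AfterFunc(maxDuration, func() {
		// This may block if the log directory shares the stalled disk.
		log.Fatalf("disk stall detected: no write to %s completed in %s", name, maxDuration)
	})
	defer fatalTimer.Stop()

	// Backstop, as suggested above: exit unconditionally shortly after.
	exitTimer := time.AfterFunc(maxDuration+1*time.Second, func() {
		os.Exit(255)
	})
	defer exitTimer.Stop()

	if err := writeFn(); err != nil {
		fmt.Fprintf(os.Stderr, "disk probe for %s failed: %v\n", name, err)
	}
}

func main() {
	dir := os.TempDir() // stand-in for a store directory
	checkDisk(dir, 10*time.Second, func() error {
		f, err := os.CreateTemp(dir, "probe-*")
		if err != nil {
			return err
		}
		defer os.Remove(f.Name())
		defer f.Close()
		if _, err := f.Write([]byte("probe")); err != nil {
			return err
		}
		return f.Sync() // the fsync is what actually exercises the disk
	})
}
```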
Ah, I had reviewed log.Fatal and saw that it had a case for this and assumed things were handled already. I'll make sure to adopt your suggestion.
modulo the problem with LogData not actually doing anything.
pkg/server/server_engine_health.go, line 41 at r1 (raw file):
t.Read = true
t.Reset(engineHealthCheckInterval)
assertLoggingHealth(ctx, engineHealthCheckInterval)

Did you intend to use engineHealthCheckInterval for assertLoggingHealth? Seems ok, though perhaps that should be renamed as diskHealthCheckInterval.
pkg/server/server_engine_health.go, line 82 at r1 (raw file):
defer timer.Stop()
log.Flush()
Doesn't the logging system already have a periodic flush? I believe it just doesn't have the detection of a flush taking an excessively long period of time.
LGTM, but this reminds me of the server.clock.persist_upper_bound_interval setting. One of the main reasons we didn't enable that by default (see #25109) is the performance impact of additional disk writes. If we're going to write to disk frequently anyway, maybe we should just add a timeout to it and turn that on by default (and reduce the number of goroutines performing synced writes to disk).
pkg/server/server_engine_health.go, line 29 at r1 (raw file):
)

var engineHealthCheckInterval = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_HEALTH_CHECK_INTERVAL", 5*time.Second)
Why an env var instead of a cluster setting?
pkg/server/server_engine_health.go, line 82 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Doesn't the logging system already have a periodic flush? I believe it just doesn't have the detection of a flush taking an excessively long period of time.
Yeah, the log package has a goroutine flushing every 30s. This one will be more expensive (or at least more frequent) at 5s. I'm not sure the cost is a problem, but it doesn't make sense having multiple goroutines performing synchronous log flushes so I'd just add a timeout to the existing log flusher goroutine.
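A sketch of what folding the stall detection into a single periodic flusher could look like (flushDaemon, the interval parameters, and the exit code are illustrative assumptions, not the log package's actual code):

```go
package logflush

import (
	"os"
	"time"
)

// flushDaemon periodically flushes/syncs the log files via flush. If a single
// flush takes longer than maxDuration, the logging disk is presumed stalled
// and the process exits; logging the failure would be pointless since the log
// disk is the one that is stuck.
func flushDaemon(interval, maxDuration time.Duration, flush func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		done := make(chan struct{})
		go func() {
			flush() // may block indefinitely on a stalled disk
			close(done)
		}()
		select {
		case <-done:
		case <-time.After(maxDuration):
			os.Exit(255)
		}
	}
}
```

The key point is that the flush itself runs in its own goroutine, so the watchdog can fire even when the flush never returns.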
LGTM, but this reminds me of the server.clock.persist_upper_bound_interval setting.
Yeah, I can see that, but just jamming this logic in there and making it on by default is more than I have bandwidth for right now. I agree in principle though.
Dismissed @knz from a discussion.
pkg/server/server_engine_health.go, line 29 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Why an env var instead of a cluster setting?
It didn't seem like anything that's reasonable to change. If anything you want an escape hatch for disabling it, which this is. It gets trickier to do a cluster setting now that we're doing something similar in the log package.
pkg/server/server_engine_health.go, line 41 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Did you intend to use engineHealthCheckInterval for assertLoggingHealth? Seems ok, though perhaps that should be renamed as diskHealthCheckInterval.
Moot now that logging is doing its own thing.
pkg/server/server_engine_health.go, line 82 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yeah, the log package has a goroutine flushing every 30s. This one will be more expensive (or at least more frequent) at 5s. I'm not sure the cost is a problem, but it doesn't make sense having multiple goroutines performing synchronous log flushes so I'd just add a timeout to the existing log flusher goroutine.
Moved the logging logic to the periodic flusher.
Committing a batch that only contained LogData was accidentally turned into a no-op since the batch was erroneously considered empty. The change (in the preceding commit) makes cockroachdb#24591 not a no-op for the first time since that PR was merged. This bug was discovered during the following commit (see cockroachdb#32978) when the roachtest introduced therein would not detect a disk stall (assuming the logging partition wasn't also affected). Closes cockroachdb#32986. Release note: None
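Schematically, the bug fixed in the preceding commit looks like an emptiness check that only counts key/value mutations, which a LogData-only batch leaves at zero (this is an illustrative sketch, not the actual engine code):

```go
package enginesketch

// Schematic batch type, for illustration only.
type batch struct {
	count int    // number of key/value mutations in the batch
	repr  []byte // encoded batch contents, including any LogData records
}

func (b *batch) apply(sync bool) error { return nil } // real write elided

// Buggy shape: a batch carrying only a LogData record has count == 0, so it
// was considered empty and Commit returned without writing (or syncing).
func (b *batch) commitBuggy(sync bool) error {
	if b.count == 0 {
		return nil
	}
	return b.apply(sync)
}

// Fixed shape: only skip the write when there is truly nothing to apply.
func (b *batch) commitFixed(sync bool) error {
	if b.count == 0 && len(b.repr) == 0 {
		return nil
	}
	return b.apply(sync)
}
```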
pkg/server/server_engine_health.go, line 29 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It didn't seem like anything that's reasonable to change. If anything you want an escape hatch for disabling it, which this is. It gets trickier to do a cluster setting now that we're doing something similar in the log package.
But the log package isn't using this env var now. The log flush loop doesn't really bother me because it becomes a no-op if no log messages have been written (or at least it should), so really it's just deferred I/O from logging and an idle cluster won't do anything there. This one is generating pure new load so there's more reason to want to control it. I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.
pkg/server/server_engine_health.go, line 29 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
But the log package isn't using this env var now. The log flush loop doesn't really bother me because it becomes a no-op if no log messages have been written (or at least it should), so really it's just deferred I/O from logging and an idle cluster won't do anything there. This one is generating pure new load so there's more reason to want to control it. I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.
FYI, a single disk write can be delayed for a significant period of time if the OS is writing a lot of dirty data. I'll point you to the private issue about this. It might be worthwhile tolerating longer write delays.
pkg/storage/node_liveness.go, line 779 at r1 (raw file):
errors.Wrapf(err, "unable to update node liveness")
	}
}
I'm curious why you removed this. Don't we still want to check that the store can perform a write quickly before updating liveness? Perhaps you should incorporate the engine health check here. That would also get rid of the env variable which Ben dislikes.
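For reference, a hedged sketch of the shape of check being asked about, mirroring the idea behind #24591 rather than the actual node_liveness.go code (the function parameters and the 4s timeout are assumptions):

```go
package livenesssketch

import (
	"context"
	"time"

	"github.com/pkg/errors"
)

// heartbeatLiveness verifies the local store can still complete a small
// synced write before updating the liveness record. If the write can't finish
// within the timeout, the heartbeat fails so other nodes can take over this
// node's leases.
func heartbeatLiveness(
	ctx context.Context,
	syncedWrite func(context.Context) error, // hypothetical: write+sync a dummy key locally
	sendHeartbeat func(context.Context) error, // hypothetical: the actual liveness update
) error {
	checkCtx, cancel := context.WithTimeout(ctx, 4*time.Second)
	defer cancel()
	if err := syncedWrite(checkCtx); err != nil {
		return errors.Wrapf(err, "unable to update node liveness")
	}
	return sendHeartbeat(ctx)
}
```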
pkg/server/server_engine_health.go, line 29 at r1 (raw file):
I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.
Ok, I removed it. I don't want to make it a cluster setting since it seems we're going to want to fold this elsewhere in the future. 🤞 that we never have to disable it.
It might be worthwhile tolerating longer write delays.
I'm not sure. We handle gray failure poorly. I'm generally wary of inserting process termination into the system, but if a node regularly (or just once) got 10s stalls this would send ripples through the cluster that we would then spend significant resources investigating. At that point I'd rather know (and I had made this an env var to have an escape hatch in case this is really wanted or needed at that point in time, though Ben made me remove it). Perhaps logging can take the place of a panic, though it becomes difficult to assure that the log messages make it out. I'd personally prefer to leave this to someone else. I'm already less satisfied with the compromise than with what I had initially, so I'm more than happy to leave further adjustments to you and/or Ben (who may also not be on the same page).
pkg/storage/node_liveness.go, line 779 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm curious why you removed this. Don't we still want to check that the store can perform a write quickly before updating liveness? Perhaps you should incorporate the engine health check here. That would also get rid of the env variable which Ben dislikes.
I removed this because it was obsolete and Ben was concerned about syncing unnecessarily frequently. I conceptually dislike intertwining code like that, but I suppose it could be justified. I think you and Ben should take a second pass at this after it merges if you feel strongly enough, the main contribution here is the roachtest anyway.
TFTR!
PS @petermattis do you want to take this over? You seem to have concerns and I'm looking to get this one over with.
pkg/server/server_engine_health.go, line 29 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd slightly prefer not to introduce an env var here - either make it a cluster setting or leave it unconfigurable.
Ok, I removed it. I don't want to make it a cluster setting since it seems we're going to want to fold this elsewhere in the future. 🤞 that we never have to disable it.
It might be worthwhile tolerating longer write delays.
I'm not sure. We handle gray failure poorly. I'm generally wary of inserting process termination into the system, but if a node regularly (or just once) got 10s stalls this would send ripples through the cluster that we would then spend significant resources investigating. At that point I'd rather know (and I had made this an env var to have an escape hatch in case this is really wanted or needed at that point in time, though Ben made me remove it). Perhaps logging can take the place of a panic, though it becomes difficult to assure that the log messages make it out. I'd personally prefer to leave this to someone else. I'm already less satisfied with the compromise than with what I had initially, so I'm more than happy to leave further adjustments to you and/or Ben (who may also not be on the same page).
Ok. Let's go with this for now. We'll know it is in there and can keep an eye on whether this check is overly sensitive to external disturbances on the machine.
pkg/storage/node_liveness.go, line 779 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I removed this because it was obsolete and Ben was concerned about syncing unnecessarily frequently. I conceptually dislike intertwining code like that, but I suppose it could be justified. I think you and Ben should take a second pass at this after it merges if you feel strongly enough, the main contribution here is the roachtest anyway.
Ack. Thinking about this more, I think this is fine. Node liveness might think a node is alive for a few more seconds when it can't actually write to disk, but that should be ok.
Minor fixes and made sure the roachtest actually works. I needed to reintroduce env vars because the roachtest needs them (I could futz with charybde but not worth it at the moment).

bors r=petermattis
Build failed
Got the acceptance tests passing (h/t @knz).

bors r=petermattis
32978: storage: exit process when disks are stalled r=petermattis a=tbg

We've had two recent incidents in which we saw clusters with disks stalled on a subset of nodes in the cluster. This is a fairly treacherous failure mode since

- the symptoms are nondescript: from the UI it often looks like a Raft problem, logspy will freeze up as well, and so you waste some time until you end up looking at the goroutine dump and notice the writes stuck in syscall
- the node is in some semi-live state that borders the byzantine and can cause further trouble for the part of the cluster that isn't affected (we have some mitigations against this in place but not enough, and need to improve our defense mechanisms).
- it's sudden and often can't be gleaned from the logs (since everything is fine and then nothing ever completes so no "alertable" metrics are emitted).

This commit introduces a simple mechanism that periodically checks for these conditions (both on the engines and logging) and invokes a fatal error if necessary.

The accompanying roachtest exercises both a data and a logging disk stall.

Fixes #7882.
Fixes #32736.
Touches #7646.

Release note (bug fix): CockroachDB will error with a fatal exit when data or logging partitions become unresponsive. Previously, the process would remain running, though in an unresponsive state.

Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded
Previously a disk stall could allow a node to continue heartbeating its liveness record and prevent other nodes from taking over its leases, despite being completely unresponsive. This was first addressed in cockroachdb#24591 (+ cockroachdb#33122). This was undone in cockroachdb#32978 (which introduced a stricter version of a similar check). cockroachdb#32978 was later disabled by default in cockroachdb#36484, leaving us without the protections first introduced in cockroachdb#24591. This PR re-adds the logic from cockroachdb#24591. Release note: None.
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy

Ensure the cluster is fully upgraded when running import.

Release note: ensure cluster fully upgraded when running import.

41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei

This patch is expected to speed up reports generation considerably by not unmarshalling zone config protos for every range. Instead, the report-generating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors.

Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling.

Fixes #41609

Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.

41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif

Previously a disk stall could allow a node to continue heartbeating its liveness record and prevent other nodes from taking over its leases, despite being completely unresponsive. This was first addressed in #24591 (+ #33122). This was undone in #32978 (which introduced a stricter version of a similar check). #32978 was later disabled by default in #36484, leaving us without the protections first introduced in #24591. This PR re-adds the logic from #24591.

Part of #41683.

Release note: None.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
This pattern is bad:

```go
func NewServer() *Server {
	s.a = newA()
	s.b = newB(s.a, s.c)
	s.c = newC()
	return s
}
```

Note how `newB` gets called with `s.c == nil`. Creating a partially populated `Server` struct early is not a good idea; there is no protection against using fields before they are populated. Instead, use this pattern:

```go
func NewServer() *Server {
	a := newA()
	c := newC()
	b := newB(a, c)
	return &Server{a: a, b: b, c: c}
}
```

Note how the bug from the first example would be caught at compile time because we can't use something before it's defined.

In fact, carrying out this refactor revealed a bug: we were always passing a `nil` engines slice to `NewNodeLiveness`, in effect disabling the disk stall protection in cockroachdb#32978.

Release note: None
We've had two recent incidents in which we saw clusters with disks
stalled on a subset of nodes in the cluster. This is a fairly
treacherous failure mode since

- the symptoms are nondescript: from the UI it often looks like a Raft
  problem, logspy will freeze up as well, and so you waste some time until
  you end up looking at the goroutine dump and notice the writes stuck in
  syscall
- the node is in some semi-live state that borders the byzantine and
  can cause further trouble for the part of the cluster that isn't
  affected (we have some mitigations against this in place but not
  enough, and need to improve our defense mechanisms).
- it's sudden and often can't be gleaned from the logs (since everything
  is fine and then nothing ever completes so no "alertable" metrics are
  emitted).
This commit introduces a simple mechanism that periodically checks for
these conditions (both on the engines and logging) and invokes a fatal
error if necessary.
The accompanying roachtest exercises both a data and a logging disk
stall.
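As a rough companion sketch of the mechanism described above (package, function names, probe file, and the wiring are all illustrative assumptions, not the code added in this PR):

```go
package stalldetect

import (
	"os"
	"path/filepath"
	"time"
)

// Start launches a background loop that probes every store directory each
// interval. checkOneDisk is expected to apply its own timeout and terminate
// the process on a stall (e.g. the two-timer pattern discussed in the review
// thread above).
func Start(dirs []string, interval time.Duration, checkOneDisk func(dir string)) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			for _, dir := range dirs {
				go checkOneDisk(dir)
			}
		}
	}()
}

// syncProbe is one possible payload for checkOneDisk: write and fsync a small
// marker file in dir, returning any error to the caller's timeout logic.
func syncProbe(dir string) error {
	f, err := os.Create(filepath.Join(dir, "disk-health-probe"))
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write([]byte("ok")); err != nil {
		return err
	}
	return f.Sync()
}
```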
Fixes #7882.
Fixes #32736.
Touches #7646.
Release note (bug fix): CockroachDB will error with a fatal exit when
data or logging partitions become unresponsive. Previously, the process
would remain running, though in an unresponsive state.