user cluster has three wedged timeseries ranges #17524
Comments
@cuongdo for triage.
@nvanbenschoten please take a look at this.
Yeah, I'll take a look.
I've put in a milestone for this so that it doesn't get lost in the large unknown of no milestones! Feel free to move around and adjust as necessary.
The cluster's logs and debug pages revealed a few interesting things about the state of @HeikoOnnebrink's cluster. The first is that Range 1 and Range 2 were both cycling through leadership terms repeatedly. The leader of each range was bouncing back and forth between two nodes, and a leadership election was kicking off every 10 seconds. The term numbers for the ranges were around 180,000 when I was looking, meaning that at the current rate, the election cycle must have been spinning for over a week. This seemed like a pretty serious issue, but @tschottdorf mentioned that this may be because the cluster had no load on it, and because this part of the keyspace uses expiration-based leases.

It was also clear that the command queue was clogged up and that no progress was being made for critical parts of the keyspace. Some of these stuck commands were writes to the NodeLiveness keys. These writes being stuck would explain why the Admin UI showed all nodes as down even though they were accessible through the debug pages.
Finally, the goroutine dump showed a few channel selects that had been blocked for days. The anomalous chan receive was actually the one @tschottdorf has linked to above. I'm not positive this was the only issue with @HeikoOnnebrink's test cluster, but fixing the deadlock is a step in the right direction. We discussed that once this fix gets merged he'll pull down master and perform a rolling upgrade, at which point we can continue to monitor the cluster for any other issues.
Maybe fixes cockroachdb#17524. When a `timeutil.Timer` is read from, it needs to set `timer.Read = true` before calling `timer.Reset` or it risks deadlocking itself. A deadlock in the `quotaPool` was possible because it missed this step. Once the `quotaPool` deadlocked, it would block the `CommandQueue` and wreak havoc on a cluster.
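For reference, a minimal sketch of the usage pattern the fix enforces (assuming the `timeutil.Timer` API as described above; the surrounding helper is hypothetical): after receiving from `timer.C`, the caller must set `timer.Read = true` before the next `Reset`, otherwise `Reset` can block forever while trying to drain a channel that has already been read.

```go
package example

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// pollLoop is a hypothetical helper demonstrating the required pattern:
// mark the timer as read after receiving from its channel so that the
// next Reset does not try to drain an already-empty channel.
func pollLoop(stop <-chan struct{}, interval time.Duration, work func()) {
	var timer timeutil.Timer
	defer timer.Stop()
	for {
		timer.Reset(interval)
		select {
		case <-timer.C:
			// Skipping this line is the bug described above: the next
			// Reset would block on an empty channel.
			timer.Read = true
			work()
		case <-stop:
			return
		}
	}
}
```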
Thanks for looking into this @nvanbenschoten and @tschottdorf! Are there metrics/graphs or warnings that can avoid this type of issue in the future? For example, maybe a new metric for the age of the oldest entry in the command queue? Raft election counts?
We have metrics for slow requests in distsender, raft proposals, the command queue, and lease acquisition. We also have metrics for Raft elections. Were any of these showing something interesting?
The command-queue logging that I pasted above was pretty interesting, as it clearly demonstrated that a command was stuck. Raft leader election logging was also interesting, though I didn't realize that we already have metrics for it. Perhaps we could create a tool/debug endpoint to parse a goroutine dump and pick out any surprising results. It wouldn't give any insight into livelock conditions, but it could certainly help with more tame deadlock conditions like the one we had here.
integrating a version of |
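As a rough sketch of what such a tool could look like (hypothetical code, not an existing CockroachDB endpoint): the Go runtime already annotates goroutines that have been blocked for a minute or more with their wait time, so flagging suspiciously old ones is mostly a matter of parsing the dump headers.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches goroutine dump headers such as:
//   goroutine 123 [chan receive, 1445 minutes]:
// The wait duration is only printed once a goroutine has been blocked
// for at least a minute.
var header = regexp.MustCompile(`^goroutine (\d+) \[([^,\]]+)(?:, (\d+) minutes)?\]:$`)

func main() {
	const thresholdMinutes = 60 // flag anything blocked for over an hour

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := header.FindStringSubmatch(scanner.Text())
		if m == nil || m[3] == "" {
			continue
		}
		minutes, _ := strconv.Atoi(m[3])
		if minutes >= thresholdMinutes {
			fmt.Printf("goroutine %s blocked in %q for %d minutes\n", m[1], m[2], minutes)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "error reading dump:", err)
	}
}
```

Feeding it a dump (e.g. `go run scan.go < goroutine_dump.txt`) would have flagged the multi-day chan receive immediately; as noted above, it can't detect livelocks.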
Did you mean to close this, btw? Should probably reopen until the cluster is in at least a few days' worth of good state.
TIL "Maybe fixes" still closes the corresponding issue. |
"It's basically impossible that this fixes #17524". 😄 |
Another round of debugging revealed more issues with the cluster. It still looks like we're having issues with NodeLiveness, and there are some clues as to why. On the cluster's node 6, we see that 20 goroutines are blocked while acquiring a semaphore in
Sounds like
The debug page shows the leader and current leaseholder of the range in question. After talking to @tschottdorf, it sounds like this retry loop is actually expected until the node's Raft log is able to catch up, so something must be going wrong with Raft. Unfortunately, we're unable to bump the vmodule because it got put behind a cluster setting, so it's tricky to get insight into exactly what's going on in Raft. The idea now is to revert this vmodule change and restart node 3 with the new binary.
While investigating cockroachdb#17524, we found that a NodeLiveness request was stuck, which blocked the NodeLiveness semaphore. This prevented any liveness updates from being sent by the source of the stuck liveness update. While introducing a timeout here on the liveness loop won't fix the issue we saw, it should prevent stuck liveness updates in the future from preventing a node from ever updating its liveness record, which makes everything harder to deal with.
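The shape of that change, roughly (a sketch with assumed names — `heartbeat`, `interval`, `timeout` — not the actual NodeLiveness code): each liveness update gets its own deadline, so a request that wedges in the command queue eventually errors out instead of holding the loop (and its semaphore) forever.

```go
package example

import (
	"context"
	"fmt"
	"time"
)

// heartbeatLoop is a schematic liveness heartbeat loop. Bounding each
// attempt with a timeout means a single stuck request cannot block all
// future liveness updates from this node.
func heartbeatLoop(ctx context.Context, interval, timeout time.Duration,
	heartbeat func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			hbCtx, cancel := context.WithTimeout(ctx, timeout)
			if err := heartbeat(hbCtx); err != nil {
				// Log and retry on the next tick rather than waiting
				// indefinitely on a wedged request.
				fmt.Println("liveness heartbeat failed:", err)
			}
			cancel()
		case <-ctx.Done():
			return
		}
	}
}
```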
A screenshot in cockroachdb#17524 (comment) made me think we might have an off-by-one error here (because it showed the range starting with NodeLivenessKeyMax using an expiration-based lease), but it was a false alarm caused by a bug in the range debug page (cockroachdb#17843). In any case, it's good to test here.
After a restart to the new binary, with @tschottdorf's fancy new LogSpy™ tool we could see that the leader was trying to catch up the follower.
The leader would then log:
This loop continued indefinitely. This was strange because it meant that the follower had a Raft entry at the
`cockroach/pkg/storage/replica_raftstorage.go`, lines 235 to 248 in `45396f8`
The first check is notable. It uses an optimization to cache the
@tschottdorf theorized that this could be the case we were seeing. I'm looking into writing up a test case to further validate this hypothesis.
We clear |
But then we replace it with the previous
Ah, that's a very good point.
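To make the exchange above concrete, a schematic sketch of the suspected failure mode (invented names, not the actual `replica_raftstorage.go` code): a cached last-entry term that is cleared when the log changes but then overwritten with the stale pre-change value will keep answering term queries with outdated data, which would keep the leader and follower disagreeing as seen in the logs.

```go
package example

// raftLog is a toy stand-in for the real storage layer, used only to
// illustrate the stale-cache pattern discussed above.
type raftLog struct {
	lastIndex uint64
	lastTerm  uint64            // cached term of the entry at lastIndex
	entries   map[uint64]uint64 // index -> term (stand-in for disk)
}

const invalidLastTerm = 0

// term returns the term of the entry at index i, preferring the cache.
func (l *raftLog) term(i uint64) uint64 {
	// Fast path: avoid a storage read for the last entry. If lastTerm
	// is stale, this silently returns the wrong term.
	if i == l.lastIndex && l.lastTerm != invalidLastTerm {
		return l.lastTerm
	}
	return l.entries[i]
}

// applySnapshot replaces the log. The bug pattern: the cache is cleared
// but then restored from the pre-snapshot value instead of being set to
// the snapshot's term (or left invalid so the next term() call reloads it).
func (l *raftLog) applySnapshot(snapIndex, snapTerm uint64) {
	prevTerm := l.lastTerm
	l.lastTerm = invalidLastTerm // clear...
	l.entries = map[uint64]uint64{snapIndex: snapTerm}
	l.lastIndex = snapIndex
	l.lastTerm = prevTerm // ...but then put back the previous (stale) value
	// Correct: l.lastTerm = snapTerm
}
```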
Addresses the current issue in cockroachdb#17524. I'll open an issue to properly test this, but that might take some time. For now this seems like a clear fix that we should get in sooner rather than later.
If restarting n9 allowed the cluster to catch up quickly, why didn't it recover when upgraded to
It did recover for a while after upgrading to
Interesting. I wonder why this cluster is getting into this state on a regular basis but we haven't (AFAIK) seen it elsewhere.
If this is the cached
The problem is actually occurring again on the PM cluster after the upgrade. I'm seeing failed snapshot logs again, and there are several "No Lease" and "Underreplicated" ranges.
Let's move this to #18324. |
@nvanbenschoten we're done here, correct?
Yeah, let's close this for now. We can reopen if the problem persists or create a new issue if something else comes up from @HeikoOnnebrink.
Forked from #17478 (comment).
We have access to this cluster via a Cisco WebEx session (you'll need a chrome plugin) at a meeting link that can be activated by Gitter user @HeikoOnnebrink (he'll supervise while you are connected).
It's a 9-node cluster running inside Docker on CoreOS. I've only looked more closely at r104 (see Archive.zip below), which stopped working on 7/31 and then saw some more activity on 8/3. The problematic member here is node1. In grepping the logs, I saw that node1 briefly got the lease on 7/31. The next activity is on 8/3, when it receives 3-4 snapshots containing almost no log entries.
There are two other ranges that are perhaps not comparable. In particular, one of the two has a 120 MB Raft log.
I have no bandwidth to investigate this further. It's fairly tedious due to the remote connection and the fact that this is a 9-node cluster. Still, we should follow through and gather what we can. I tried to enable lower-level Raft logging via access to /debug/vmodule/raft=8, but somehow it didn't work. That plus a grep for `r104/` should turn up something.

Inlined my initial investigation comment below: