user cluster has three wedged timeseries ranges #17524

Closed · tbg opened this issue Aug 8, 2017 · 27 comments · Fixed by #17796

tbg (Member) commented Aug 8, 2017

Forked from #17478 (comment).

We have access to this cluster via a Cisco WebEx session (you'll need a Chrome plugin) at a meeting link that can be activated by Gitter user @HeikoOnnebrink (he'll supervise while you are connected).

It's a 9-node cluster running inside Docker on CoreOS. I've only looked more closely at r104 (see Archive.zip below), which stopped working on 7/31 and then saw some more activity on 8/3. The problematic member here is node1. In grepping the logs, I saw that node1 briefly got the lease on 7/31. The next activity is on 8/3, when it received 3-4 snapshots containing almost no log entries.

There are two other wedged ranges that may not be comparable. In particular, one of the two has a 120 MB Raft log.

I have no bandwidth to investigate this further. It's fairly tedious due to the remote connection and the fact that this is a 9-node cluster. Still, we should follow through and gather what we can. I tried to enable lower-level Raft logging via /debug/vmodule/raft=8, but somehow it didn't work. That, plus grepping for r104/, should turn up something.

Inlined my initial investigation comment below:

I should also mention that the reason node 5 is unwilling to become Raft leader is that it doesn't have an entry for itself in its own progress map (prs map[uint64]*Progress), which makes it un-"promotable". This seems pretty surprising, but I'm not familiar with the expectations for it.

I thought the Progress is only populated on the Raft leader. Which piece of code are you talking about?
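
For context, here is a simplified, self-contained sketch of the promotability check being described, modeled on etcd/raft's prs map. It is illustrative only, not the actual library code:

```go
package main

import "fmt"

// Progress stands in for the per-replica state etcd/raft keeps; the details
// don't matter here, only membership in the map does.
type Progress struct{}

// promotable mirrors the check referenced above (approximate): a node may
// only campaign for leadership if its own ID appears in its progress map.
func promotable(id uint64, prs map[uint64]*Progress) bool {
	_, ok := prs[id]
	return ok
}

func main() {
	prs := map[uint64]*Progress{1: {}, 2: {}, 3: {}}
	fmt.Println(promotable(2, prs)) // true
	fmt.Println(promotable(5, prs)) // false: the node 5 situation above
}
```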

I'm currently looking at a user's cluster that has also lost its timeseries ranges. Data attached.

There's definitely a problem with the quota pool on that cluster. @irfansharif, any thoughts on the below? It might be an artifact of the ranges being horked, which the attached range pages can hopefully illustrate.
Archive.zip

goroutine 35976949 [chan receive, 6999 minutes]:
github.com/cockroachdb/cockroach/pkg/util/timeutil.(*Timer).Reset(0xc422777740, 0xdf8475800)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/timeutil/timer.go:89 +0x9f
github.com/cockroachdb/cockroach/pkg/storage.(*quotaPool).acquire(0xc420e88c30, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x23b, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/quota_pool.go:198 +0x69a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeAcquireProposalQuota(0xc420441180, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x23b, 0x8, 0x14d7740974383c7f)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:899 +0xd9
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose(0xc420441180, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x14d76b9159239640, 0x0, 0x0, 0x0, 0x700000007, 0xc, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2817 +0x69c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc4204
tbg (Member, Author) commented Aug 8, 2017

@cuongdo for triage.

cuongdo assigned nvanbenschoten and unassigned cuongdo Aug 8, 2017
cuongdo (Contributor) commented Aug 8, 2017

@nvanbenschoten please take a look at this

nvanbenschoten (Member) commented Aug 8, 2017

Yeah I'll take a look.

dianasaur323 added this to the 1.1 milestone Aug 18, 2017
dianasaur323 (Contributor) commented

I've put in a milestone for this so that it doesn't get lost in the large unknown of no milestones! Feel free to move around and adjust as necessary.

nvanbenschoten (Member) commented

The cluster's logs and debug pages revealed a few interesting things about the state of @HeikoOnnebrink's cluster. The first is that Range 1 and Range 2 were both cycling through leadership terms repeatedly. The leader of each range was bouncing back and forth between two nodes, and a leadership election was kicking off every 10 seconds. The term numbers for the ranges were around 180,000 when I was looking, meaning that at the current rate, the election cycle must have been spinning for over a week. This seemed like a pretty serious issue, but @tschottdorf mentioned that this may be because the cluster had no load on it, and because this part of the keyspace uses expiration-based leases.
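
As a rough sanity check on that estimate (assuming on the order of one term bump per 10-second election cycle):

$$
180{,}000\ \text{terms} \times 10\ \tfrac{\text{s}}{\text{term}} \approx 1.8\times 10^{6}\ \text{s} \approx 500\ \text{h} \approx 3\ \text{weeks},
$$

which is indeed well over a week.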

It was also clear that the command queue was clogged up and that no progress was being made for critical parts of the keyspace. Some of these stuck commands were writes to the NodeLiveness keys. These writes being stuck would explain why the Admin UI showed all nodes as down even though they were accessible through the debug pages.

W170821 04:46:13.255476 50510552 storage/replica.go:2053  [n3,s3,r2/35:/System/{-tsd}] have been waiting 1m0s for dependencies: cmds:
[{global:<nil> local:0xc42ba025a0} {global:0xc425744c00 local:0xc42ba02780}]
global:
  12509 [/System/NodeLiveness/1)
  12510 [/System/NodeLiveness/1)
  12538 [/System/NodeLiveness/1)
  12539 [/System/NodeLiveness/1)
  12541 [/System/NodeLiveness/1)
  12542 [/System/NodeLiveness/1)
  12550 [/System/NodeLiveness/1)
  12551 [/System/NodeLiveness/1)
  12553 [/System/NodeLiveness/1)
  12554 [/System/NodeLiveness/1)
  ...remaining 1570 writes omitted
local:  38011 readonly [/Local/RangeID/2/r/AbortCache/"00120c67-fdc1-4587-abf1-f94404df0aac")
  32702 readonly [/Local/RangeID/2/r/AbortCache/"00215e7b-d4f3-449c-98d1-41dca801cd81")
  35723 readonly [/Local/RangeID/2/r/AbortCache/"009f4ab1-8c79-4dd1-8ddf-da321ce77463")
  36766 readonly [/Local/RangeID/2/r/AbortCache/"00ed3553-c482-4a9f-986c-90f1176637b1")
  33064 readonly [/Local/RangeID/2/r/AbortCache/"00fca254-7dbb-4c34-8b59-7fc78b960a50")
  33774 readonly [/Local/RangeID/2/r/AbortCache/"010ddfe7-b549-4d35-b7d4-5ee34fd596cf")
  35402 readonly [/Local/RangeID/2/r/AbortCache/"017afca7-9c7c-4b88-ad5f-2f79a64ab630")
  35215 readonly [/Local/RangeID/2/r/AbortCache/"01a17c4f-7c42-4084-88c1-57d087ff6256")
  34904 readonly [/Local/RangeID/2/r/AbortCache/"02106367-a8d3-456c-8833-49552d83e174")
  38422 readonly [/Local/RangeID/2/r/AbortCache/"02296af8-879d-4d03-88ae-3d319a9fa5ad")
  ...remaining 2360 reads omitted
  38016 [/Local/RangeID/2/r/AbortCache/"00120c67-fdc1-4587-abf1-f94404df0aac")
  32707 [/Local/RangeID/2/r/AbortCache/"00215e7b-d4f3-449c-98d1-41dca801cd81")
  35728 [/Local/RangeID/2/r/AbortCache/"009f4ab1-8c79-4dd1-8ddf-da321ce77463")
  36771 [/Local/RangeID/2/r/AbortCache/"00ed3553-c482-4a9f-986c-90f1176637b1")
  33069 [/Local/RangeID/2/r/AbortCache/"00fca254-7dbb-4c34-8b59-7fc78b960a50")
  33779 [/Local/RangeID/2/r/AbortCache/"010ddfe7-b549-4d35-b7d4-5ee34fd596cf")
  35407 [/Local/RangeID/2/r/AbortCache/"017afca7-9c7c-4b88-ad5f-2f79a64ab630")
  35220 [/Local/RangeID/2/r/AbortCache/"01a17c4f-7c42-4084-88c1-57d087ff6256")
  34909 [/Local/RangeID/2/r/AbortCache/"02106367-a8d3-456c-8833-49552d83e174")
  38427 [/Local/RangeID/2/r/AbortCache/"02296af8-879d-4d03-88ae-3d319a9fa5ad")
  ...remaining 2362 writes omitted

Finally, the goroutine dump showed a few channel selects that had been blocked for days. The anomalous chan receive was actually the one @tschottdorf linked to above. There, the quotaPool was deadlocked on a call to timeutil.Timer.Reset. This deadlock meant that a write command was stuck in the CommandQueue, blocking all other commands while ignoring context cancellations. The root of the deadlock is that Go's timer implementation is notoriously racy, so we've built race-free infrastructure around it in the form of timeutil.Timer. Unfortunately, to avoid the race, a bit of logic needs to be pushed out to each client of timeutil.Timer: whenever the timer channel is read from, the client must set timer.Read = true. Since this logic is fragile, I'm going to fix this bug and add a linter to ensure that we never introduce deadlocks like this again.
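
For reference, a minimal sketch of the intended usage pattern (pollEvery is a hypothetical helper; the important line is setting timer.Read = true after every receive from timer.C):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// pollEvery shows the timeutil.Timer contract: every receive from timer.C
// must be followed by timer.Read = true before the next Reset, or Reset can
// end up waiting to drain an expiration that was already consumed and the
// loop deadlocks (as the quotaPool did here).
func pollEvery(ctx context.Context, interval time.Duration, tick func()) error {
	var timer timeutil.Timer
	defer timer.Stop()
	for {
		timer.Reset(interval)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-timer.C:
			timer.Read = true // the line the quotaPool was missing
			tick()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_ = pollEvery(ctx, time.Second, func() { fmt.Println("tick") })
}
```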

I'm not positive this was the only issue with @HeikoOnnebrink's test cluster, but fixing the deadlock is a step in the right direction. We discussed that once this fix gets merged he'll pull down master and perform a rolling upgrade, at which point we can continue to monitor the cluster for any other issues.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 22, 2017
Maybe fixes cockroachdb#17524.

When a `timeutil.Timer` is read from, it needs to set `timer.Read = true`
before calling `timer.Reset` or it risks deadlocking itself. A deadlock
in the `quotaPool` was possible because it missed this step. Once the
`quotaPool` deadlocked, it would block the `CommandQueue` and wreak havoc
on a cluster.
cuongdo (Contributor) commented Aug 22, 2017

Thanks for looking into this @nvanbenschoten and @tschottdorf! Are there metrics/graphs or warnings that can avoid this type of issue in the future? For example, maybe a new metric for the age of the oldest entry in the command queue? Raft election counts?

petermattis (Collaborator) commented

We have metrics for slow requests in the DistSender, Raft proposals, the command queue, and lease acquisition. We also have metrics for Raft elections. Were any of these showing something interesting?

nvanbenschoten (Member) commented

The command-queue logging that I pasted above was pretty interesting on its own, as it clearly demonstrated that a command was stuck. The Raft leader election logging was also interesting, though I didn't realize that we already have metrics for it.

Perhaps we could create a tool/debug endpoint to parse a goroutine dump and pick out any surprising results. It wouldn't give any insight into livelock conditions, but could certainly help with more tame deadlock conditions like we had here.

tbg (Member, Author) commented Aug 22, 2017

Integrating a version of panicparse + grep (with the regex applying to each individual stack trace) could be helpful.
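
A rough sketch of what that could look like as a standalone filter (a hypothetical stdlib-only tool, not something that exists today): split the dump into individual stacks and keep only the ones matching a regex.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// stackgrep reads a goroutine dump on stdin and prints only the stacks whose
// text matches the regex given as the first argument, e.g.
//   stackgrep 'quotaPool' < goroutines.txt
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: stackgrep <regex> < goroutine-dump")
		os.Exit(1)
	}
	re := regexp.MustCompile(os.Args[1])
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 1<<20), 1<<20)
	var stack []string
	flush := func() {
		if len(stack) > 0 && re.MatchString(strings.Join(stack, "\n")) {
			fmt.Println(strings.Join(stack, "\n") + "\n")
		}
		stack = stack[:0]
	}
	for scanner.Scan() {
		if line := scanner.Text(); line == "" {
			flush() // goroutine stacks are separated by blank lines
		} else {
			stack = append(stack, line)
		}
	}
	flush()
}
```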

tbg (Member, Author) commented Aug 22, 2017

Did you mean to close this, btw? We should probably reopen it until the cluster has been in a good state for at least a few days.

tbg reopened this Aug 22, 2017
nvanbenschoten (Member) commented

TIL "Maybe fixes" still closes the corresponding issue.

tbg (Member, Author) commented Aug 22, 2017

"It's basically impossible that this fixes #17524". 😄

nvanbenschoten (Member) commented

Another round of debugging revealed more issues with the cluster. It still looks like we're having issues with NodeLiveness, and there are some clues as to why.

On the cluster's node 6, we see that 20 goroutines are blocked while acquiring a semaphore in NodeLiveness.heartbeatInternal. A 21st goroutine holds the semaphore and has proceeded through updateLiveness. We can see that about a day ago the heartbeat began by sending a request to node 8. It quickly received a NotLeaseHolderError, so it redirected the request to node 3 instead. On node 3 we can see the request, and it's stuck in an endless loop of retry proposal: reasonTicks traces.


2017/08/23 09:59:35.532214 | 118314.860080 | replica send
-- | -- | --
09:59:35.532224 | .    10 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] read-write path
09:59:35.532271 | .    47 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] applied timestamp cache
09:59:35.532398 | .   127 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] evaluated request
09:59:35.532405 | .     6 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] acquired {raft,replica}mu
09:59:36.931649 | 1.399245 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] retry proposal 08573d9b0c84c280: reasonTicks
...
10:04:48.961689 | .    61 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] applied timestamp cache
-- | -- | --
10:04:48.961840 | .   151 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] evaluated request
18:46:09.331841 | .     6 | ... (185649 events discarded)
18:46:12.331673 | 2.999832 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] retry proposal 049e200940058276: reasonTicks
18:46:12.331733 | .    60 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] applied timestamp cache
18:46:12.331852 | .   119 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] evaluated request
18:46:12.331860 | .     7 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] acquired {raft,replica}mu
18:46:15.331672 | 2.999813 | ... [n3,s3,r1611/3:/System/NodeLiveness{-Max}] retry proposal 600a3f5e82898f9f: reasonTicks
...

petermattis (Collaborator) commented

Sounds like r1611 is wedged in some way. What does the range debug page say about it?

nvanbenschoten (Member) commented

The debug page shows that the leader and current leaseholder of r1611 is node 9. However, node 3 believes that node 8 was the last leaseholder and that its lease has expired, because it hasn't seen an applied lease since yesterday. Because of this and because the trace doesn't include a mention of the command queue, it looks like it's a lease request that's stuck.

[screenshot: range debug page for r1611, captured 2017-08-24 at 4:26 PM]

After talking to @tschottdorf it sounds like this retry loop is actually expected until the node's Raft log is able to catch up, so something must be going wrong with Raft. Unfortunately, we're unable to bump the vmodule because it got put behind a cluster setting, so it's tricky getting insight into exactly what's going on in Raft. The idea now is to revert this vmodule change and restart node 3 with the new binary.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 30, 2017
While investigating cockroachdb#17524, we found that a NodeLiveness request was stuck, which
blocked the NodeLiveness semaphore. This prevented any liveness updates from
being sent by the source of the stuck liveness update. While introducing a
timeout here on the liveness loop won't fix the issue we saw, it should prevent
stuck liveness updates in the future from preventing a node from ever
updating its liveness record, which makes everything harder to deal with.
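
A minimal sketch of the shape of that change, assuming a plain context timeout around each heartbeat attempt (hypothetical names; the actual patch may differ):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heartbeatWithTimeout is a hypothetical wrapper: bound each liveness
// heartbeat by a timeout so a single stuck proposal can't hold the heartbeat
// semaphore, and thereby block all future liveness updates from this node,
// forever.
func heartbeatWithTimeout(
	ctx context.Context, timeout time.Duration, heartbeat func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return heartbeat(ctx)
}

func main() {
	err := heartbeatWithTimeout(context.Background(), 4500*time.Millisecond,
		func(ctx context.Context) error {
			select {
			case <-time.After(10 * time.Second): // simulate a stuck heartbeat
				return nil
			case <-ctx.Done():
				return ctx.Err()
			}
		})
	fmt.Println(err) // context deadline exceeded, instead of blocking forever
}
```
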
bdarnell added a commit to bdarnell/cockroach that referenced this issue Sep 6, 2017
A screenshot in
cockroachdb#17524 (comment)
made me think we might have an off-by-one error here (because it
showed the range starting with NodeLivenessKeyMax using an
expiration-based lease), but it was a false alarm caused by a bug in
the range debug page (cockroachdb#17843). In any case, it's good to test here.
nvanbenschoten (Member) commented Sep 6, 2017

After restarting onto master, @HeikoOnnebrink was still seeing issues with r1611 (the node liveness range). The range was unavailable, with n4 and n8 bouncing the lease back and forth between each other. n9 was behind and couldn't seem to catch up. Interestingly, n9 still had a number of commands in its CommandQueue.

With @tschottdorf's fancy new LogSpy™ tool, we could see that the leader was trying to catch up n9, but failing. n9 would receive a msgApp and reject it with the message:

[logterm: 34503, index: 874105] rejected msgApp [logterm: 34544, index: 874105]

The leader would then log:

received msgApp rejection(lastIndex: 874105) from 2 for index 874105

This loop continued indefinitely. This was strange because it meant that the follower had a Raft entry at the index in the msgApp, but that the logterm for this index differed from the logterm in the msgApp. This scenario lined up with this test case in etcd/raft. This shouldn't happen, so it was an indication that something was wrong.

n9 had a committed index of 874105, meaning that when it determined the logterm for this index, it would bypass the unstable log and hit Storage.Term. Our implementation of Storage.Term is:

```go
// Term implements the raft.Storage interface.
func (r *replicaRaftStorage) Term(i uint64) (uint64, error) {
	if r.mu.lastIndex == i && r.mu.lastTerm != 0 {
		return r.mu.lastTerm, nil
	}
	// Try to retrieve the term for the desired entry from the entry cache.
	if term, ok := r.store.raftEntryCache.getTerm(r.RangeID, i); ok {
		return term, nil
	}
	readonly := r.store.Engine().NewReadOnly()
	defer readonly.Close()
	ctx := r.AnnotateCtx(context.TODO())
	return term(ctx, readonly, r.RangeID, r.store.raftEntryCache, i)
}
```

The first check is notable. It uses an optimization to cache the lastTerm and return it if the desired index is our lastIndex. If this somehow got out of sync, we would expect to see the retry loop we saw before, because we would ignore both the raftEntryCache and the entries in RocksDB when determining term(lastIndex). But how would this get out of sync?

r.mu.lastTerm is primarily updated in handleRaftReadyRaftMuLocked. Surprisingly, it doesn't look like we touch it during snapshot application. If a snapshot included new entries at a different term, a call to Term on the last entry added could still return the previous lastTerm.

@tschottdorf theorized that this could be the case we were seeing, because n9's committed index is only 3 entries higher than its truncated index (874102). This implies that it was given a snapshot right after a truncation and that it got stuck here. That would explain things nicely. To test this theory, we restarted n9 to flush its in-memory state. On reboot, r1611 quickly caught up and the range was unwedged.
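
To make the suspected failure mode concrete, here is a toy model of the fast path (not the actual code; the numbers are taken from the traces above):

```go
package main

import "fmt"

// termCache is a toy model of the fast path in replicaRaftStorage.Term:
// if the requested index equals the cached lastIndex, return the cached
// lastTerm without consulting the entry cache or RocksDB.
type termCache struct {
	lastIndex, lastTerm uint64
	stored              map[uint64]uint64 // index -> term, stands in for RocksDB
}

func (c *termCache) Term(i uint64) uint64 {
	if i == c.lastIndex && c.lastTerm != 0 {
		return c.lastTerm // fast path
	}
	return c.stored[i]
}

func main() {
	c := &termCache{
		lastIndex: 874105,
		lastTerm:  34503,
		stored:    map[uint64]uint64{874105: 34503},
	}

	// A snapshot rewrites the tail of the log: index 874105 now has term
	// 34544 on disk, but suppose the cached lastTerm is never refreshed.
	c.stored[874105] = 34544

	// The fast path keeps answering with the stale term, so the follower
	// keeps rejecting msgApps whose logterm is 34544.
	fmt.Println(c.Term(874105)) // prints 34503, not 34544
}
```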

I'm looking into writing up a test case to further validate this hypothesis.

petermattis (Collaborator) commented

We clear r.mu.lastTerm in Replica.applySnapshot, which disables its usage. I just double-checked and verified that we're updating r.mu.lastTerm whenever we update r.mu.lastIndex. Nothing seems to be missing, though perhaps I missed something.

nvanbenschoten (Member) commented Sep 7, 2017

But then we replace it with the previous lastTerm again, since we never update the local variable in handleRaftReadyRaftMuLocked!

petermattis (Collaborator) commented

Ah, that's a very good point.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 7, 2017
Addresses the current issue in cockroachdb#17524.

I'll open an issue to properly test this, but that might take some time.
For now this seems like a clear fix that we should get in sooner rather than later.
bdarnell (Contributor) commented Sep 7, 2017

If restarting n9 allowed the cluster to catch up quickly, why didn't it recover when upgraded to master? Is it getting back into this state after each restart?

nvanbenschoten (Member) commented

It did recover for a while after upgrading to master. I believe it took about 8 hours to get wedged. I'm not sure it's fallen back into this state yet, but I wouldn't be surprised.

bdarnell (Contributor) commented Sep 7, 2017

Interesting. I wonder why this cluster is getting into this state on a regular basis but we haven't (AFAIK) seen it elsewhere.

petermattis (Collaborator) commented

If this is the cached-lastTerm bug, it would require a Raft-initiated snapshot. Perhaps we're not generating those very often.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 7, 2017
dianasaur323 (Contributor) commented

The problem is actually occurring again on the PM cluster after the upgrade. I'm seeing failed snapshot logs again, and there are several "No Lease" and "Underreplicated" ranges.

tbg (Member, Author) commented Sep 10, 2017

Let's move this to #18324.

tbg (Member, Author) commented Sep 13, 2017

@nvanbenschoten we're done here, correct?

nvanbenschoten (Member) commented

Yeah, let's close this for now. We can reopen it if the problem persists, or create a new issue if something else comes up on @HeikoOnnebrink's cluster.
