5-node cluster unavailable for an hour after one server reboots #20114
Thanks for sharing, @gpaul. Would you be willing to share the full log files with me? If you don't want to share them publicly, you can always send them to me at [email protected]. This isn't ringing any bells for me on first glance, and being able to piece more of the info together should help. I assume this is the only time you've hit this problem?
Hi Alex, I believe the same customer ran into this situation twice. Neither we nor they have been able to reproduce it since. I'll email you the full logs. Thanks for taking a look at this.
Thanks for sending the logs. I haven't made any conclusions yet, but am going to jot down some thoughts as I go.
I have a few questions for you that might help narrow things down:
Thanks for the deep dive @a-robinson. I'll try and get the requested information.
The reboot nuked the logs from the server we restarted. I suspect it may have seen a somewhat higher request rate given that it was the mesos leader, but not much.
All queries hung.
It is very unlikely that any connection changed - there is a single client per cockroachdb instance and nothing external had any effect.
There is very little data in the database. On the order of ~100-1,000 records in total.
We may be able to get such metrics next week once everyone is back from the holidays.
It really looks a lot like #19165 - could you provide me with the steps to reproduce that issue? I'm going to need to provide a root cause analysis of this issue, and a likely culprit combined with a reproduction thereof would go a long way.
I was mostly curious about after the restart - was it getting a similar number of requests? One of the main symptoms of #19165 was that the restarted node could serve requests but all of the others would hang.
I'm still not convinced that it's the same problem - getting more metrics from before/during/after the incident would be really helpful. But if you want to reproduce that, the steps are:
cc @m-schneider @tschottdorf in case they have any more to say based on their looking into #19165.
OK, those logs I have. The restarted instance could definitely not serve any requests while the others were unavailable.
I did some spelunking of the logs. During the time period the node was down […]. After […], like @a-robinson, I noticed that […]
@gpaul provided logs and […]. Similar to the previous incident, a node was taken down and restarted after ~5m. At […], rebalancing of replicas away from […]. Very shortly after […]. Of interest to the user are the ranges corresponding to their tables. The leases for […]
The logs from all nodes indicate lots of slow node liveness heartbeats until […]. Interestingly, the last successful request in the user's logs (from their application) is at […].
It is silly that we're printing out the liveness record in 2 different formats, but what that is saying is that […]. Node liveness seems to be near the root of the issue. But why is node liveness having so many problems? Stepping away to grab some eats. Will continue debugging this.
It looks like node 1's application was the only one able to keep making progress until 08:39:37; two stopped at 08:38:01 and the other (that we have logs for) stopped at 08:39:21. Node 1 is also the only node where I see a slowdown before the stall; all others went from serving successful low-latency responses to hanging with no slowdown. They did all serve some very slow requests while n3 was down before it restarted, though (grep for […]).
The slow requests around […]?
Yeah, although it's weird they're 14s-18s rather than, say, 10s.
Looking at stack traces from […]:
Another large chunk is in a related piece of code: […]
The mass of goroutines with stacks indicating they are trying to acquire or release table leases points to the lease table having a problem. The lease table is a single range ([…]).
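For anyone retracing this kind of analysis: the goroutine dumps come from the node's debug endpoint. Below is a minimal sketch, assuming an insecure node with its HTTP port on localhost:8080 and the standard Go pprof handlers served under /debug/pprof/; the "lease" substring match is just a crude way to bucket the stacks and is not how the triage above was actually done.

```go
// Sketch: pull a full goroutine dump from a node's debug endpoint and count
// how many stacks mention table leases. Assumes an insecure node with its
// HTTP port on localhost:8080; the "lease" substring match is illustrative.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// debug=2 returns one full stack trace per goroutine.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	total, leaseRelated := 0, 0
	// Goroutine stacks are separated by blank lines in the debug=2 output.
	for _, stack := range strings.Split(string(body), "\n\n") {
		if strings.TrimSpace(stack) == "" {
			continue
		}
		total++
		if strings.Contains(stack, "lease") {
			leaseRelated++
		}
	}
	fmt.Printf("%d goroutines total, %d mention %q\n", total, leaseRelated, "lease")
}
```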
Grep'ing for […]
Strange that both […]. The timing of the Raft snapshot is also right around when the cluster recovered.
Almost exactly an hour and a half after n3 came back up, n1 got partially "unstuck", kicking off tons of activity related to r6:
There's also some of these messages mixed in:
Is it possible there was some sort of schema change ongoing around the time of the incident?
That would explain the table lease activity. In steady state we don't do any KV operations for table leases.
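For readers less familiar with table leases: in rough terms, a schema change bumps the table descriptor version, and every node holding a lease on that table then has to acquire a lease at the new version and release the old one, which is exactly the kind of KV traffic described above. A minimal sketch of the sort of statement that would trigger this; the database, table, and column names are hypothetical, and the connection string assumes an insecure local node.

```go
// Sketch: the sort of statement that would generate table-lease KV traffic.
// A DDL statement bumps the table descriptor version, and nodes then acquire
// a lease at the new version (and drop the old one). Table and column names
// are hypothetical; the connection string assumes an insecure local node.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Any schema change will do; this one is purely illustrative.
	if _, err := db.Exec(`ALTER TABLE mytable ADD COLUMN note STRING`); err != nil {
		log.Fatal(err)
	}
	log.Println("schema change committed; nodes will re-lease the descriptor")
}
```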
Also, a lot of the keys being touched are quite old. Why would we have transactions anchored on keys like […]?
That's the lease table primary key.
Node 3 was the leaseholder for r6 when it went down -- it had been the leaseholder since Nov 14. Node 4 took over the lease at 8:33:30.465. The lease history is a little odd, though. It has 3 consecutive entries in the range lease history, all with the same epoch and start time, with the last 2 even having the same proposed time. Maybe it's just a bug in the lease history tracking code, but that seems odd:
Ooh, the raft state is interesting. Not only did we have a leader-not-leaseholder situation, with node 3 being the raft leader while node 4 was the leaseholder, but node 4's replica (replica 3) was in state "paused".
That could certainly prevent progress. Time to figure out what "paused" means...
I believe […]
n4 was stuck in an older term, in which it thought that n1 was the leader:
It has the proper hardstate, though; it just hasn't caught up applying commands. Did it perhaps get stuck? A deadlock, maybe? Checking now.
A follower is […]
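For readers wondering what a "paused" follower means in practice: roughly, the raft leader tracks per-follower progress and stops sending a follower new entries while an earlier probe or append is still unacknowledged. Below is a toy model of that flow-control idea only; it is not the etcd/raft or CockroachDB implementation.

```go
// Toy model of raft leader-side flow control, to illustrate what a "paused"
// follower means. Simplified illustration only; not the etcd/raft or
// CockroachDB implementation.
package main

import "fmt"

// progress is the leader's view of one follower.
type progress struct {
	next   uint64 // next log index to send
	paused bool   // true while we wait for a response to an outstanding send
}

// maybeSend sends the next entry to the follower unless it is paused.
func (p *progress) maybeSend() bool {
	if p.paused {
		// No new entries go out; if the follower never responds (or the
		// response is lost), it stays paused and falls further behind until
		// something (e.g. a snapshot or an unreachability report) resets it.
		return false
	}
	fmt.Printf("sending entry %d\n", p.next)
	p.paused = true // wait for an ack before sending more
	return true
}

// onAck is called when the follower acknowledges up to index idx.
func (p *progress) onAck(idx uint64) {
	p.next = idx + 1
	p.paused = false
}

func main() {
	f := &progress{next: 10}
	f.maybeSend() // sends entry 10, then pauses
	f.maybeSend() // paused: nothing sent
	f.onAck(10)   // ack unpauses the follower
	f.maybeSend() // sends entry 11
}
```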
Yes! It's good to have confirmation of something that could have kicked off all the lease-related requests.
It still may not be enough, but you should probably switch from local-SSD to HDD to increase the odds.
That doesn't appear to have been necessary. I let 92k table leases accumulate and then caused them to be dropped by setting a cluster setting. Throughput almost immediately dropped to 0 and the number of goroutines in the system jumped to 240k (from a baseline of 1.2k). Throughput has remained at 0 for 15min and the cluster shows no signs of recovering. One node was killed due to memory growth. I'm not noticing any slow heartbeat stuff in the logs, but they are filled with […]
If a cluster is running v1.0.6 and has built up a large amount of stale leases, is it safe to perform a rolling upgrade to v1.1.3+ without triggering an outage? If so, what will happen to the ~100k stale leases? If I understand correctly those correspond to physical records in a […]
Yes, the table leases correspond to physical records in the […]
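For anyone wanting to check how many of these records a cluster has built up, something along these lines should work. It assumes the records in question live in the system.lease table and that an insecure node is listening locally; adjust the connection string for a real deployment.

```go
// Sketch: count table-lease records, assuming they live in system.lease and
// that an insecure node is listening on localhost:26257.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var total int
	// If the table has an expiration column (as later comments suggest),
	// grouping by it would also give a rough age breakdown.
	if err := db.QueryRow(`SELECT count(*) FROM system.lease`).Scan(&total); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("system.lease rows: %d\n", total)
}
```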
Would it make sense to just delete old leases? The […]
Is your suggestion to manually delete the old leases? That will likely run afoul of internal invariants that only the node creating a lease should delete it. If your suggestion is that Cockroach should delete the old leases, well, that's exactly the bug. It should be deleting these old leases, except that a bug in 1.0.x prevented the old leases from being deleted in a fairly common scenario. Setting a cluster setting triggers an additional code path which does cause the old leases to be deleted, but there is no throttling on that deletion and Cockroach spawns a goroutine per deletion. This creates a massive spike in traffic and overwhelms parts of the system, leading to the freeze-up.
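To make the failure mode concrete, here is a hedged sketch of the difference being described: one goroutine per deletion versus the same work pushed through a bounded worker pool. The lease struct and deleteLease function are hypothetical stand-ins, not the actual CockroachDB code path.

```go
// Sketch of the failure mode described above: deleting ~100k stale leases by
// spawning one goroutine per deletion floods the cluster, while a bounded
// worker pool spreads the same work out. deleteLease is a stand-in for
// whatever KV/SQL work a real deletion does.
package main

import (
	"fmt"
	"sync"
	"time"
)

type lease struct{ descID, version, nodeID int }

// deleteLease is a hypothetical stand-in for the real deletion work.
func deleteLease(l lease) {
	time.Sleep(time.Millisecond) // pretend this is a KV round trip
}

// deleteAllUnthrottled mirrors the problematic behavior: one goroutine per
// lease, so ~100k stale leases means ~100k concurrent deletions in flight.
func deleteAllUnthrottled(leases []lease) {
	var wg sync.WaitGroup
	for _, l := range leases {
		wg.Add(1)
		go func(l lease) {
			defer wg.Done()
			deleteLease(l)
		}(l)
	}
	wg.Wait()
}

// deleteAllThrottled does the same work but caps concurrency with a semaphore,
// so the cluster only ever sees `limit` deletions in flight at once.
func deleteAllThrottled(leases []lease, limit int) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, l := range leases {
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` deletions are in flight
		go func(l lease) {
			defer wg.Done()
			defer func() { <-sem }()
			deleteLease(l)
		}(l)
	}
	wg.Wait()
}

func main() {
	leases := make([]lease, 1000)
	deleteAllThrottled(leases, 16)
	fmt.Println("done")
}
```

Capping in-flight work (or batching the deletes into a handful of transactions) keeps the spike bounded no matter how many stale leases have accumulated.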
That was what I had in mind, yes. It makes sense that this could lead to issues, but I was hoping it wouldn't. I have several clusters running for a long time that will likely run into this issue, too. I'd really like to prevent them from experiencing a major outage.
I think you should be able to perform a rolling restart of the Cockroach nodes, but let me confer with others before endorsing that action.
OK - are you referring to a rolling upgrade or a rolling restart of the affected instances? Would you mind elaborating on why that might solve the issue when you determine that it will? To summarize, I have several clusters that have no choice but to perform triggering actions (these are baked into releases that cannot be skipped) and if there is any way to clean out the stale leases as a pre-upgrade step and thereby avoid triggering the outage then that would be really, really good.
Considering that we're looking at multi-hour outages, even some set of steps that necessitates stopping the cockroachdb instances or causing a much shorter outage would help.
I'm referring to a rolling restart of the affected instances. When a rolling restart is performed, the "leaked" leases on the restarted node will no longer be candidates for deletion (they will not be picked up by the restarted node). This effectively causes the leases to remain in the […]. You'll want to make sure that the rolling restart is just the cockroach nodes. You would definitely not want to accidentally trigger deletion of the leases (e.g. by setting a cluster setting).
This is great news, thank you. I'm happy to perform a rolling restart of the affected cockroachdb instances before upgrading.
We can later investigate deleting the "leaked" leases from the […].
I have a test cluster I can test this on - thanks, doing so now.
I can confirm that a rolling restart, followed by a trigger action, does not cause an outage.
Fantastic. Can you also confirm that there are still a lot of leases in […]?
That's less than I would expect and all of them are set to expire in the last day or two. An exact duplicate of this cluster had 22k before I triggered an outage. Hmm - I wish I'd checked this number before performing the rolling restart and subsequent trigger action. I'll have to wait a few days for these clusters to build up some stale leases again before I'm able to confirm beyond a doubt that it works.
I might have to take you up on that offer.
Yeah, that number of leases isn't enough to cause a problem. I'll see about doing a test locally. It probably won't happen until tomorrow or Thursday.
I borrowed a long running cluster from a colleague and determined that before:
And after restarting and performing a trigger action:
It appears to work perfectly, thanks.
Closing, thanks for the wonderful support everyone!
Is this a question, feature request, or bug report?
BUG REPORT
Please supply the header (i.e. the first few lines) of your most recent
log file for each node in your cluster. On most unix-based systems
running with defaults, this boils down to the output of
grep -F '[config]' cockroach-data/logs/cockroach.log
When log files are not available, supply the output of
cockroach version
and all flags/environment variables passed to
cockroach start
instead.
There are 5 nodes in the cluster; all 5 share the following cmdline and version:
A cluster of 5 cockroachdb nodes running v1.0.6 was running correctly.
We restarted one of the servers.
After a few minutes we noticed that our queries got slower quickly and then the cluster became unavailable for ~1h10min.
The cluster managed to heal itself after an hour and queries started working again.
There is a single database in the cluster and it contains very little data (~1000 records).
There is one client per cockroachdb instance, each of which performs ~1 req/s.
The cluster remains available for queries.
The server is rebooted at 2017-11-09 14:09:31.
Queries on other nodes begin slowing down rapidly after ~5 mins:
At this point, all client queries stall completely, not a single request is served for over an hour.
The first request is served, at normal speed.
At this time the following logs are observed across all 4 remaining cockroachdb instances:
Followed by many hundreds more of these BeginTransaction logs until around 14:20:44:
The restarted server has booted at this point and its cockroachdb instance has rejoined the cluster (gossip status is now (ok, 5 nodes)) and we start seeing lots of EndTransaction logs. This leads into another giant batch of logs about failed heartbeats:
And then it all goes back to normal by the last line below: