perf: gossip thrashing on 64-node cluster #17610

Closed
petermattis opened this issue Aug 12, 2017 · 7 comments

@petermattis
Collaborator

[Screenshot: screen shot 2017-08-11 at 8.39.32 PM]

The cluster was created at 21:04. At 23:41 we start seeing a lot of gossip thrashing, combined with odd client disconnections (perhaps unrelated). I have the logs from the nodes saved.

@petermattis petermattis added this to the 1.1 milestone Aug 12, 2017
@a-robinson
Contributor

Ugh, I thought we had dealt with this. Mind posting some of the logs here or sharing them on Drive? I'm curious whether there were a lot of grpc EnhanceYourCalm errors messing with the connections while this was happening. They seem to be popping up a lot lately.

@a-robinson
Contributor

a-robinson commented Aug 12, 2017

Getting back to this, it looks possible that the gossip thrashing is a symptom of some other problem rather than the cause -- the number of client connections doubled and QPS dropped to 0 a minute or two before gossip showed any signs of trouble, so it'd be nice to understand what triggered the broader problem before spending too much time thinking about gossip.

However, from the logs it does appear that gossip's internal workings certainly weren't helping once things got kicked off. There are definitely a couple of ridiculously easy improvements that shouldn't cost too much.

  1. We're pretty restrictive about how many gossip connections we allow each node to have. At all sizes up through 81 nodes, we only allow each node 3 connections; we allow 4 connections for up to 256 nodes, 5 for up to 625 nodes, and in general k connections for up to k^4 nodes (see the sketch after this list). Given how good we are about using high watermarks to avoid sending redundant data, would it hurt much to allow more? It'd certainly help avoid thrashing like this, so I think it's worth trying.
  2. When we reject an incoming client because we already have too many incoming gossip connections, we only ever refer it to one of our immediate client peers. This means that if node x is distant from a lot of other nodes (due to a narrow path to it or for whatever other reason), it's going to reject a lot of connections, which means its clients will reject a lot of connections, and so on. It seems as though being less restrictive about who we forward refused clients to would help. For example, even being conservative and just forwarding them to any node that's up to 2 hops away (rather than 1) would avoid a lot of refused connections (see the second sketch below). It's possible, though, that these rejected connections aren't really that expensive and don't need to be optimized.
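
A minimal sketch of the connection-limit rule from point 1, assuming the limit is simply the smallest k with k^4 >= cluster size, with a floor of 3 (the function name and floor here are illustrative, not the actual gossip package API):

```go
package main

import "fmt"

// maxGossipConns returns the smallest k such that k^4 >= nodeCount,
// with a floor of 3, matching the numbers above: 3 connections up to
// 81 nodes, 4 up to 256, 5 up to 625, and so on.
func maxGossipConns(nodeCount int) int {
	k := 3
	for k*k*k*k < nodeCount {
		k++
	}
	return k
}

func main() {
	for _, n := range []int{64, 81, 82, 256, 257, 625} {
		fmt.Printf("%d nodes -> %d connections\n", n, maxGossipConns(n))
	}
}
```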

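And a rough sketch of the broader forwarding idea from point 2, assuming we know each node's hop distance in our local gossip view (the map and helper here are hypothetical, not the real gossip server types): instead of redirecting a refused client only to an immediate peer at hop 1, we'd pick among all nodes within 2 hops.

```go
package gossipsketch

// forwardCandidates is a hypothetical helper: given hop distances from
// this node's gossip view, it returns the node IDs a refused client
// could be redirected to, i.e. everything within maxHops (2 in the
// proposal above) rather than only the immediate peers at hop 1.
func forwardCandidates(hopDist map[int]int, maxHops int) []int {
	var candidates []int
	for nodeID, hops := range hopDist {
		if hops >= 1 && hops <= maxHops {
			candidates = append(candidates, nodeID)
		}
	}
	return candidates
}
```
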
@a-robinson
Contributor

As an alternative to either of the above ideas, we could change the logic that currently culls connections every minute; I'm not sure how beneficial it is in practice. I'm curious whether @spencerkimball has any thoughts left over from when he wrote this stuff.

@spencerkimball
Member

I think idea #1 is going to end up helping a decent amount. We can be a bit less strict in terms of how many connections we allow.

@a-robinson
Contributor

@petermattis is this running on sky now? I wouldn't expect anything more than #17633 to be needed, but I could be wrong.

@petermattis
Collaborator Author

Yep, it is running on sky as of a few minutes ago.

@a-robinson
Contributor

Closing for 1.1 since sky's gossip graphs look much better from 8/17-8/18 than from its previous deployment.
