perf: gossip thrashing on 64-node cluster #17610
Comments
Ugh, I thought we had dealt with this. Mind posting some of the logs here or sharing on Drive? I'm curious whether there were a lot of grpc …
Getting back to this, it looks possible that the gossip thrashing is a symptom of some other problem rather than the cause -- the number of client connections doubled and the QPS dropped to 0 a minute or two before gossip showed any signs of problems, so it'd be worth understanding what caused the broader problem before spending too much time thinking about gossip. That said, the logs suggest gossip's internal workings weren't helping once things got kicked off. There are a couple of very easy improvements that shouldn't cost too much.
As an alternative to either of the ideas above, we could change the logic that currently culls connections every minute. I'm not sure how beneficial it is in practice. I'm curious if @spencerkimball has any thoughts left over from when he wrote this stuff.
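For concreteness, here's a minimal Go sketch of the kind of periodic culling being discussed: on each tick, drop the client connection that has contributed the fewest infos so a different peer can be dialed instead. The type names, the selection criterion, and the interval handling are illustrative assumptions, not CockroachDB's actual gossip code.

```go
package main

import (
	"fmt"
	"time"
)

// conn stands in for a gossip client connection; all identifiers here are hypothetical.
type conn struct {
	addr          string
	infosReceived int
}

// cullLeastUseful drops the connection that has delivered the fewest infos,
// freeing a slot so a different peer can be dialed on the next round.
func cullLeastUseful(conns []conn) []conn {
	if len(conns) == 0 {
		return conns
	}
	worst := 0
	for i, c := range conns {
		if c.infosReceived < conns[worst].infosReceived {
			worst = i
		}
	}
	fmt.Printf("culling gossip connection to %s\n", conns[worst].addr)
	return append(conns[:worst], conns[worst+1:]...)
}

func main() {
	conns := []conn{{"n1", 40}, {"n2", 5}, {"n3", 17}}
	// The real loop would fire once a minute; a short interval is used here so
	// the example finishes immediately.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	<-ticker.C
	conns = cullLeastUseful(conns)
	fmt.Println("remaining:", conns)
}
```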
I think idea #1 is going to end up helping a decent amount. We can be a bit less strict in terms of how many connections we allow.
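As a rough illustration of what "less strict" could mean, the sketch below caps gossip connections with a bound that grows with cluster size (cube root of the node count, with a floor), so larger clusters are allowed more connections per node. The formula and identifiers are assumptions for discussion, not what #17633 actually implements.

```go
package main

import (
	"fmt"
	"math"
)

// minPeers is a hypothetical floor on the number of gossip connections.
const minPeers = 3

// maxPeers lets the allowed connection count grow with cluster size, here as
// the cube root of the node count, which keeps per-node fan-out small while
// still covering the cluster in a few gossip hops.
func maxPeers(nodeCount int) int {
	p := int(math.Ceil(math.Cbrt(float64(nodeCount))))
	if p < minPeers {
		return minPeers
	}
	return p
}

func main() {
	for _, n := range []int{3, 16, 64, 128} {
		fmt.Printf("nodes=%d -> maxPeers=%d\n", n, maxPeers(n))
	}
}
```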
@petermattis is this running on sky now? I wouldn't expect anything more than #17633 to be needed, but I could be wrong.
Yep, it is running on sky.
Closing for 1.1 since sky's gossip graphs look much better from 8/17-8/18 than from its previous deployment.
The cluster was created at 21:04. At 23:41 we start seeing a lot of gossip thrashing, combined with weird client disconnections (perhaps unrelated). I have the logs from the nodes saved.