VIP unbound after more than 10s when previous leader network recovered #266
Comments
My initial investigation suggests that this issue may be related to the etcd clientv3 keepalive parameters. The default values set in vip-manager are 5 seconds for DialKeepAliveTime and 5 seconds for DialKeepAliveTimeout. So when the network goes down and recovers, the etcd client may take up to 10 seconds to re-establish its connection to the etcd endpoints. Could we decrease the default values so that the connection is re-established more quickly? For example, 2 seconds for DialKeepAliveTime and 1 second for DialKeepAliveTimeout. Looking forward to your opinion.
Well, that will certainly affect unstable connections, meaning vip-manager will try to remove the VIP more frequently. But I'm OK with such aggressive settings. Would you mind creating a pull request?
Thanks for your confirmation. I'm not sure about the optimal timer values, so I will monitor the effect after this change.
When the previous leader's network went down, Patroni failed over to a new leader and vip-manager bound the VIP to the new leader. But when the previous leader's network recovered, the VIP was still bound on the previous leader with desired state true, and it was only unbound after more than 10s, once the desired state became false. During this window, before the VIP is unbound on the previous leader, new connections to the VIP may reach the previous leader, which is no longer the leader.
Is there any way to eliminate the VIP conflict when the previous leader's network recovers? Thanks.