VIP unbound after more than 10s when previous leader network recovered #266

XiuhuaRuan · 2024-10-24T07:45:58Z

When the previous leader's network went down, Patroni failed over to a new leader and vip-manager bound the VIP to the new leader. But when the previous leader's network recovered, the VIP was still bound on the previous leader with desired state true, and was only unbound more than 10s later with desired state false. During this window, before the VIP is unbound on the previous leader, new connections to the VIP may reach the previous leader, which is no longer the leader.
Is there any way to eliminate the VIP conflict when the previous leader's network recovers? Thanks.

XiuhuaRuan · 2024-10-28T06:35:32Z

After an initial investigation, I think this issue may be related to the etcd clientv3 keepalive parameters. The defaults set in vip-manager are 5 seconds for DialKeepAliveTime and 5 seconds for DialKeepAliveTimeout. So when the network goes down and recovers, the etcd client may take up to 10 seconds to re-establish its connection to the etcd endpoints. Could we decrease the default values so the connection is re-established more quickly? For example, 2 seconds for DialKeepAliveTime and 1 second for DialKeepAliveTimeout. Looking forward to your opinion.

	DialKeepAliveTimeout: 5 * time.Second,
	DialKeepAliveTime:    5 * time.Second,

pashagolub · 2024-10-28T15:27:19Z

Well, that will certainly affect unstable connections, meaning vip-manager will try to remove the VIP more frequently. But I'm OK with such aggressive settings. Would you mind creating a pull request?

XiuhuaRuan · 2024-10-31T03:09:19Z

Thanks for your confirmation. I'm not sure what the optimal timer values are; I will monitor whether this change has any effect.

pashagolub self-assigned this Oct 28, 2024

pashagolub added the enhancement label Oct 28, 2024

pashagolub added this to vip-manager Oct 28, 2024

github-project-automation bot moved this to To do in vip-manager Oct 28, 2024

pashagolub added a commit that referenced this issue Oct 30, 2024

[-] decrease dial keep-alive timeouts, closes #266

afa37a2

pashagolub linked a pull request Oct 30, 2024 that will close this issue

[-] decrease dial keep-alive timeouts for etcd, closes #266 #267

Merged

pashagolub closed this as completed in #267 Oct 30, 2024

pashagolub added a commit that referenced this issue Oct 30, 2024

[-] decrease dial keep-alive timeouts for etcd, closes #266 (#267)

e714054

github-project-automation bot moved this from To do to Done in vip-manager Oct 30, 2024
