-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix loadbalancer reentrant rlock #10511
Fix loadbalancer reentrant rlock #10511
Conversation
Signed-off-by: Brad Davidson <[email protected]>
Awesome work, this seemed like a pain to track down. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #10511 +/- ##
==========================================
- Coverage 49.47% 43.33% -6.15%
==========================================
Files 179 179
Lines 14924 14936 +12
==========================================
- Hits 7384 6472 -912
- Misses 6161 7267 +1106
+ Partials 1379 1197 -182
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
We shouldn't be replacing the configured server address on agents. Doing so breaks the agent's ability to fall back to the fixed registration endpoint when all servers are down, since we replaced it with the first discovered apiserver address. The fixed registration endpoint will be restored as default when the service is restarted, but this is not the correct behavior. This should have only been done on etcd-only nodes that start up using their local supervisor, but need to switch to a control-plane node as soon as one is available. Signed-off-by: Brad Davidson <[email protected]>
5bb1f72
to
b669642
Compare
I should have caught `[]string{cfg.NodeIP}[0]` and `[]string{envInfo.NodeIP.String()}[0]` in code review... Signed-off-by: Brad Davidson <[email protected]>
This should give us more detail on how long dials take before failing, so that we can perhaps better tune the retry loop in the future. Signed-off-by: Brad Davidson <[email protected]>
b669642
to
0c02a65
Compare
Proposed Changes
When I added health-checks in Add health-check support to loadbalancer #9757 and promoted the loadbalancer mutex to a rwmutex, I also added a readlock call to dialContext since it is accessing the servers list. As dialContext calls nextServer, which also takes a readlock, this can now deadlock if another goroutine attempts to acquire a writelock while dialContext is in its loop. dialContext will hold a readlock, but the goroutine attempting to acquire a writelock will cause nextServer's readlock to block, which prevents the outer readlock from ever being released.
nextServer is only ever called from dialContext, so it doesn't need to take another lock. I should have removed the lock from nextServer when I added it to dialContext.
The configured fixed registration address was being replaced with the first control-plane node address, which prevented the fixed registration address from being used if all discovered control-plane addresses are unavailable.
The loadbalancer does not properly bind to the ipv6 loopback when the node has an ipv6 primary node ip, as the comma-separated flag value containing both IPs cannot be parsed as a valid ipv6 address. Found when attempting to reproduce this issue on a node with an ipv6 primary node ip (ie
--node-ip=fd7c:53a5:aef5::242:ac11:8,172.17.0.8
)This should give us more detail on how long dials take before failing, so that we can perhaps better tune the loadbalancer fail over loop in the future.
Types of Changes
bugfix
Verification
Since this is a locking issue, it requires specific timing to reproduce. The best way I have found to reproduce it requires taking the server an agent is connected to off the network, so that attempts to connect to it time out. Since a lock is held while connecting, this makes the issue more likely to trigger
This is easier to reproduce on rke2, but both distros should be affected
netstat -na | grep 6443
ip link set dev eth0 down
(or whatever interface that node is using)Testing
I will add some tests to the loadbalancer in the August release cycle to prevent things like this from happening again.
pkg/agent/loadbalancer
#10505Linked Issues
User-Facing Change
Further Comments