
Fix loadbalancer reentrant rlock #10511

Merged (4 commits, Jul 15, 2024)

Conversation

@brandond (Member) commented Jul 12, 2024

Proposed Changes

  • Fix reentrant rlock in loadbalancer.dialContext
    When I added health checks in Add health-check support to loadbalancer #9757 and promoted the loadbalancer mutex to a rwmutex, I also added a readlock call to dialContext, since it accesses the servers list. dialContext calls nextServer, which also takes a readlock, so this can now deadlock if another goroutine attempts to acquire a writelock while dialContext is in its loop: dialContext holds a readlock, the pending writelock causes nextServer's readlock to block, and the outer readlock is therefore never released (see the sketch after this list).
    nextServer is only ever called from dialContext, so it doesn't need to take another lock. I should have removed the lock from nextServer when I added it to dialContext.
  • Fix agents removing configured supervisor address
    The configured fixed registration address was being replaced with the first control-plane node address, which prevented the fixed registration address from being used if all discovered control-plane addresses are unavailable.
  • Fix IPv6 primary node-ip handling
    The loadbalancer does not properly bind to the ipv6 loopback when the node has an ipv6 primary node ip, as the comma-separated flag value containing both IPs cannot be parsed as a valid ipv6 address. Found when attempting to reproduce this issue on a node with an ipv6 primary node ip (i.e. --node-ip=fd7c:53a5:aef5::242:ac11:8,172.17.0.8).
  • Add dial duration to debug error message
    This should give us more detail on how long dials take before failing, so that we can perhaps better tune the loadbalancer failover loop in the future.
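
For illustration, here is a minimal, self-contained Go sketch of the reentrant-rlock deadlock described in the first bullet. The loadBalancer, dialContext, nextServer, and setServers names mirror the shape of the loadbalancer code, but the bodies are simplified stand-ins rather than the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Simplified stand-in for the loadbalancer; only the locking shape matters here.
type loadBalancer struct {
	mutex   sync.RWMutex
	servers []string
}

// nextServer takes its own read lock -- the reentrant acquisition this PR removes.
func (lb *loadBalancer) nextServer() string {
	lb.mutex.RLock() // blocks once a writer is queued behind dialContext's read lock
	defer lb.mutex.RUnlock()
	return lb.servers[0]
}

// dialContext holds a read lock around its retry loop while accessing the servers list.
func (lb *loadBalancer) dialContext() {
	lb.mutex.RLock()
	defer lb.mutex.RUnlock()
	time.Sleep(100 * time.Millisecond) // give the writer time to queue up
	fmt.Println("dialing", lb.nextServer())
}

// setServers simulates another goroutine updating the server list.
func (lb *loadBalancer) setServers(servers []string) {
	lb.mutex.Lock() // queues behind dialContext's read lock
	defer lb.mutex.Unlock()
	lb.servers = servers
}

func main() {
	lb := &loadBalancer{servers: []string{"172.17.0.8:6443"}}
	go lb.dialContext()
	time.Sleep(10 * time.Millisecond)
	go lb.setServers([]string{"172.17.0.9:6443"})
	time.Sleep(time.Second)
	// A pending Lock() prevents new RLock() calls from succeeding, so the
	// nested RLock inside nextServer blocks and dialContext never releases
	// its outer read lock: both goroutines are now stuck.
	fmt.Println("dialContext and setServers are deadlocked; main exits anyway")
}
```

Because Go's sync.RWMutex blocks new read locks once a writer is waiting, the nested readlock in nextServer never returns, so the outer readlock held by dialContext is never released; removing the inner lock, as this PR does, breaks the cycle.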

Types of Changes

bugfix

Verification

Since this is a locking issue, it requires specific timing to reproduce. The best way I have found to reproduce it requires taking the server an agent is connected to off the network, so that attempts to connect to it time out. Since a lock is held while connecting, this makes the issue more likely to trigger.

This is easier to reproduce on rke2, but both distros should be affected.

  1. Create a cluster with 3 servers and 1 agent
  2. Identify the server that the agent is connected to: netstat -na | grep 6443
  3. Disconnect the network on that server: ip link set dev eth0 down (or whatever interface that node is using)
  4. Note that the agent's connection to that server times out, and the agent switches over to a new server - but the failed server is never removed from the server list:
    Jul 12 07:51:31 systemd-node-4 rke2[2189]: time="2024-07-12T07:51:31Z" level=error msg="Remotedialer proxy error; reconnecting..." error="read tcp 172.17.0.11:51850->172.17.0.8:9345: i/o timeout" url="wss://172.17.0.8:9345/v1-rke2/connect"
    Jul 12 07:51:32 systemd-node-4 rke2[2189]: time="2024-07-12T07:51:32Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:9345/v1-rke2/connect"
    Jul 12 07:51:32 systemd-node-4 rke2[2189]: time="2024-07-12T07:51:32Z" level=debug msg="Failed over to new server for load balancer rke2-api-server-agent-load-balancer: 172.17.0.8:6443 -> 172.17.0.9:6443"
    
    If the issue is NOT reproduced, you will see the failed server removed from the load balancer:
    Jul 12 07:45:42 systemd-node-4 rke2[2189]: time="2024-07-12T07:45:42Z" level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: 172.17.0.8:6443"
    Jul 12 07:45:42 systemd-node-4 rke2[2189]: time="2024-07-12T07:45:42Z" level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [172.17.0.10:6443 172.17.0.9:6443] [default: 172.17.0.100:6443]"
    
  5. Note that attempts to use the agent loadbalancer to reach the apiserver time out:
    root@systemd-node-4:/# export KUBECONFIG=/var/lib/rancher/rke2/agent/kubelet.kubeconfig
    root@systemd-node-4:/# kubectl get node -o wide
    E0712 07:58:06.786096   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
    E0712 07:58:16.787510   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
    E0712 07:58:26.788272   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
    E0712 07:58:36.789182   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
    E0712 07:58:46.790039   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
    Unable to connect to the server: net/http: TLS handshake timeout

Testing

I will add some tests to the loadbalancer in the August release cycle to prevent things like this from happening again.

Linked Issues

User-Facing Change

Fixed an issue that could cause the agent loadbalancer to deadlock when the currently in-use server goes down.

Further Comments

@brandond requested a review from a team as a code owner on July 12, 2024 17:28
@dereknola (Member) previously approved these changes Jul 12, 2024 and commented:

Awesome work, this seemed like a pain to track down.

@vitorsavian previously approved these changes Jul 12, 2024

codecov bot commented Jul 12, 2024

Codecov Report

Attention: Patch coverage is 39.28571% with 17 lines in your changes missing coverage. Please review.

Project coverage is 43.33%. Comparing base (58ab259) to head (0c02a65).
Report is 1 commit behind head on master.

Files Patch % Lines
pkg/agent/run.go 11.11% 8 Missing ⚠️
pkg/agent/tunnel/tunnel.go 61.53% 4 Missing and 1 partial ⚠️
pkg/agent/config/config.go 0.00% 2 Missing ⚠️
pkg/daemons/agent/agent_linux.go 0.00% 0 Missing and 2 partials ⚠️

❗ There is a different number of reports uploaded between BASE (58ab259) and HEAD (0c02a65).

HEAD has 1 fewer upload than BASE:
Flag       BASE (58ab259)   HEAD (0c02a65)
e2etests   7                6
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10511      +/-   ##
==========================================
- Coverage   49.47%   43.33%   -6.15%     
==========================================
  Files         179      179              
  Lines       14924    14936      +12     
==========================================
- Hits         7384     6472     -912     
- Misses       6161     7267    +1106     
+ Partials     1379     1197     -182     
Flag Coverage Δ
e2etests 36.24% <39.28%> (-10.09%) ⬇️
inttests 19.72% <7.14%> (+<0.01%) ⬆️
unittests 13.35% <7.14%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


We shouldn't be replacing the configured server address on agents. Doing
so breaks the agent's ability to fall back to the fixed registration
endpoint when all servers are down, since we replaced it with the first
discovered apiserver address. The fixed registration endpoint will be
restored as default when the service is restarted, but this is not the
correct behavior. This should have only been done on etcd-only nodes
that start up using their local supervisor, but need to switch to a
control-plane node as soon as one is available.

Signed-off-by: Brad Davidson <[email protected]>
@brandond dismissed stale reviews from vitorsavian and dereknola via b669642 on July 14, 2024 00:04
@brandond force-pushed the fix-loadbalancer-reentrant-rlock branch from 5bb1f72 to b669642 on July 14, 2024 00:04
I should have caught `[]string{cfg.NodeIP}[0]` and `[]string{envInfo.NodeIP.String()}[0]` in code review...

Signed-off-by: Brad Davidson <[email protected]>
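
The slice-wrapping pattern called out above is a no-op: wrapping the comma-separated --node-ip value in a one-element slice and indexing element 0 just returns the original string, which then fails to parse as an address. A rough sketch of the difference (not the actual K3s code; splitting on the comma is shown only as one possible approach):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	// Dual-stack --node-ip value with an IPv6 primary address,
	// matching the example from the PR description.
	nodeIP := "fd7c:53a5:aef5::242:ac11:8,172.17.0.8"

	// The pattern caught in review: element 0 is still the full
	// comma-separated string, which does not parse as an IP.
	broken := []string{nodeIP}[0]
	fmt.Println(broken, "->", net.ParseIP(broken)) // prints <nil>

	// Taking the primary (first) address before parsing works.
	primary := strings.Split(nodeIP, ",")[0]
	fmt.Println(primary, "->", net.ParseIP(primary)) // prints the IPv6 address
}
```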
This should give us more detail on how long dials take before failing, so that we can perhaps better tune the retry loop in the future.

Signed-off-by: Brad Davidson <[email protected]>
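
As a hedged sketch of what timing a dial and including the duration in a debug error message can look like (the dialWithTiming name, message format, and use of logrus here are illustrative assumptions, not the exact change in this PR):

```go
package main

import (
	"context"
	"net"
	"time"

	"github.com/sirupsen/logrus"
)

// dialWithTiming is a hypothetical helper: it records how long a dial ran
// before failing and includes that duration in the debug error message.
func dialWithTiming(ctx context.Context, address string) (net.Conn, error) {
	var dialer net.Dialer
	start := time.Now()
	conn, err := dialer.DialContext(ctx, "tcp", address)
	if err != nil {
		logrus.Debugf("Dial error from load balancer after %s: %v", time.Since(start), err)
		return nil, err
	}
	return conn, nil
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// 192.0.2.1 is a TEST-NET address, so this dial is expected to time out
	// and exercise the timed debug message.
	if _, err := dialWithTiming(ctx, "192.0.2.1:6443"); err != nil {
		logrus.Infof("dial failed: %v", err)
	}
}
```

Knowing whether a failed dial returned immediately or ran up against its timeout is the detail that should help tune the failover behavior later.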
@brandond force-pushed the fix-loadbalancer-reentrant-rlock branch from b669642 to 0c02a65 on July 14, 2024 00:57