Fix loadbalancer reentrant rlock #10511

brandond · 2024-07-12T17:28:59Z

Proposed Changes

Fix reentrant rlock in loadbalancer.dialContext
When I added health checks in #9757 (Add health-check support to loadbalancer) and promoted the loadbalancer mutex to an RWMutex, I also added a read lock to dialContext, since it accesses the servers list. dialContext calls nextServer, which also takes a read lock, so this can deadlock if another goroutine attempts to acquire the write lock while dialContext is in its loop: dialContext holds the outer read lock, the pending writer waits for that lock to be released, and Go's sync.RWMutex blocks new read locks while a writer is waiting, so nextServer's read lock blocks and the outer read lock is never released.
nextServer is only ever called from dialContext, so it doesn't need to take another lock. I should have removed the lock from nextServer when I added it to dialContext.
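
The locking pattern described above can be sketched as follows. This is a minimal illustration, not the k3s code: the `balancer` type, `serverAfter`, `dial`, `setServers`, and `errDialTimeout` are hypothetical names chosen for the example. The point it tries to show is that the outer function takes the read lock once and the helper takes none.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDialTimeout simulates a dial that timed out; it is illustrative only.
var errDialTimeout = errors.New("simulated dial timeout")

// balancer is a hypothetical stand-in for the loadbalancer; the field and
// method names are assumptions, not the k3s implementation.
type balancer struct {
	mu      sync.RWMutex
	servers []string
}

// serverAfter returns the server following idx, wrapping around.
// The caller must already hold b.mu. It deliberately takes no lock of its
// own: sync.RWMutex read locks are not reentrant, and a goroutine waiting in
// Lock() blocks any new RLock() call, including a nested one.
func (b *balancer) serverAfter(idx int) (string, int) {
	next := (idx + 1) % len(b.servers)
	return b.servers[next], next
}

// dial holds one read lock for the whole loop and tries each server once.
func (b *balancer) dial(try func(addr string) error) (string, error) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	idx := -1
	for i := 0; i < len(b.servers); i++ {
		var addr string
		addr, idx = b.serverAfter(idx)
		if err := try(addr); err == nil {
			return addr, nil
		}
	}
	return "", errors.New("all servers failed")
}

// setServers is the writer; it can proceed as soon as dial returns.
func (b *balancer) setServers(servers []string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.servers = servers
}

func main() {
	b := &balancer{servers: []string{"172.17.0.8:6443", "172.17.0.9:6443"}}
	addr, err := b.dial(func(addr string) error {
		if addr == "172.17.0.8:6443" {
			return errDialTimeout // pretend the first server is off the network
		}
		return nil
	})
	fmt.Println(addr, err)
	b.setServers([]string{"172.17.0.9:6443"})
}
```

If `serverAfter` called `b.mu.RLock()` itself, a concurrent `setServers` arriving mid-loop would queue behind the outer read lock and, in turn, block the nested read lock, which is the deadlock described above.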
Fix agents removing configured supervisor address
The configured fixed registration address was being replaced with the first control-plane node address, which prevented the fixed registration address from being used when all discovered control-plane addresses were unavailable.
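
A rough sketch of the intended behavior, under assumed names: the `endpoints` type and its `update` and `candidates` methods are hypothetical and only illustrate keeping the configured address separate from the discovered list so that it stays available as a last resort.

```go
package main

import "fmt"

// endpoints is a hypothetical holder for the addresses an agent may dial.
type endpoints struct {
	defaultAddr string   // configured fixed registration address; never overwritten
	discovered  []string // control-plane addresses learned at runtime
}

// update replaces only the discovered list and leaves defaultAddr untouched.
func (e *endpoints) update(addrs []string) {
	e.discovered = append([]string(nil), addrs...)
}

// candidates returns the dial order: discovered servers first, then the
// configured address as a fallback if it is not already in the list.
func (e *endpoints) candidates() []string {
	out := append([]string(nil), e.discovered...)
	for _, a := range out {
		if a == e.defaultAddr {
			return out
		}
	}
	return append(out, e.defaultAddr)
}

func main() {
	e := &endpoints{defaultAddr: "172.17.0.100:6443"}
	e.update([]string{"172.17.0.10:6443", "172.17.0.9:6443"})
	fmt.Println(e.candidates())
	e.update(nil) // all discovered servers gone
	fmt.Println(e.candidates())
}
```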
Fix IPv6 primary node-ip handling
The loadbalancer does not properly bind to the IPv6 loopback when the node has an IPv6 primary node IP, as the comma-separated flag value containing both IPs cannot be parsed as a valid IPv6 address. Found when attempting to reproduce this issue on a node with an IPv6 primary node IP (i.e. --node-ip=fd7c:53a5:aef5::242:ac11:8,172.17.0.8).
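
To illustrate the parsing problem, here is a small sketch using only the standard library. The helper names `primaryNodeIP` and `loopbackFor` are assumptions made for this example and do not come from the k3s code base.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// primaryNodeIP returns the first entry of a comma-separated node-ip value,
// or nil if that entry is not a valid IP address.
func primaryNodeIP(flagValue string) net.IP {
	first := strings.TrimSpace(strings.Split(flagValue, ",")[0])
	return net.ParseIP(first)
}

// loopbackFor picks the loopback address matching the primary IP's family.
func loopbackFor(ip net.IP) string {
	if ip != nil && ip.To4() == nil {
		return "::1"
	}
	return "127.0.0.1"
}

func main() {
	value := "fd7c:53a5:aef5::242:ac11:8,172.17.0.8"
	// Parsing the raw flag value fails, since it is two addresses, not one.
	fmt.Println(net.ParseIP(value) == nil)
	// Parsing only the first entry identifies the IPv6 primary address.
	fmt.Println(loopbackFor(primaryNodeIP(value)))
}
```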
Add dial duration to debug error message
This should give us more detail on how long dials take before failing, so that we can perhaps better tune the loadbalancer failover loop in the future.
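
One way to capture the dial duration is sketched below. The wrapper `timedDial`, the simulated `slowFailingDial`, and the message format are assumptions for illustration; the actual debug message in the loadbalancer may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialFunc matches the signature of (*net.Dialer).DialContext.
type dialFunc func(ctx context.Context, network, address string) (net.Conn, error)

// fakeDelay is how long the simulated dial takes before failing.
const fakeDelay = 20 * time.Millisecond

// slowFailingDial is a hypothetical dialer that always fails after fakeDelay.
func slowFailingDial(ctx context.Context, network, address string) (net.Conn, error) {
	time.Sleep(fakeDelay)
	return nil, errors.New("i/o timeout")
}

// timedDial calls dial and, on failure, wraps the error with the elapsed time.
func timedDial(ctx context.Context, dial dialFunc, address string) (net.Conn, time.Duration, error) {
	start := time.Now()
	conn, err := dial(ctx, "tcp", address)
	elapsed := time.Since(start)
	if err != nil {
		return nil, elapsed, fmt.Errorf("dial error for %s after %s: %w", address, elapsed, err)
	}
	return conn, elapsed, nil
}

// runExample dials once with the failing dialer and returns what was measured.
func runExample() (time.Duration, error) {
	_, elapsed, err := timedDial(context.Background(), slowFailingDial, "172.17.0.8:6443")
	return elapsed, err
}

func main() {
	elapsed, err := runExample()
	fmt.Println(elapsed >= fakeDelay, err)
}
```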

Types of Changes

bugfix

Verification

Since this is a locking issue, it requires specific timing to reproduce. The best way I have found to reproduce it is to take the server that an agent is connected to off the network, so that attempts to connect to it time out. Since a lock is held while connecting, this makes the issue more likely to trigger.

This is easier to reproduce on rke2, but both distros should be affected.

Create a cluster with 3 servers and 1 agent
Identify the server that the agent is connected to: `netstat -na | grep 6443`
Disconnect the network on that server: `ip link set dev eth0 down` (or whatever interface that node is using)

Note that the agent's connection to that server times out, and the agent switches over to a new server - but the failed server is never removed from the server list:

Jul 12 07:51:31 systemd-node-4 rke2[2189]: time="2024-07-12T07:51:31Z" level=error msg="Remotedialer proxy error; reconnecting..." error="read tcp 172.17.0.11:51850->172.17.0.8:9345: i/o timeout" url="wss://172.17.0.8:9345/v1-rke2/connect"
Jul 12 07:51:32 systemd-node-4 rke2[2189]: time="2024-07-12T07:51:32Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:9345/v1-rke2/connect"
Jul 12 07:51:32 systemd-node-4 rke2[2189]: time="2024-07-12T07:51:32Z" level=debug msg="Failed over to new server for load balancer rke2-api-server-agent-load-balancer: 172.17.0.8:6443 -> 172.17.0.9:6443"

If the issue is NOT reproduced, you will see the failed server removed from the load balancer:

Jul 12 07:45:42 systemd-node-4 rke2[2189]: time="2024-07-12T07:45:42Z" level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: 172.17.0.8:6443"
Jul 12 07:45:42 systemd-node-4 rke2[2189]: time="2024-07-12T07:45:42Z" level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [172.17.0.10:6443 172.17.0.9:6443] [default: 172.17.0.100:6443]"

Note that attempts to use the agent loadbalancer to reach the apiserver time out:

root@systemd-node-4:/# export KUBECONFIG=/var/lib/rancher/rke2/agent/kubelet.kubeconfig
root@systemd-node-4:/# kubectl get node -o wide
E0712 07:58:06.786096   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
E0712 07:58:16.787510   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
E0712 07:58:26.788272   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
E0712 07:58:36.789182   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
E0712 07:58:46.790039   12756 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout

Testing

I will add some tests to the loadbalancer in the August release cycle to prevent things like this from happening again.

Add tests to pkg/agent/loadbalancer #10505

Linked Issues

Agent loadbalancer may deadlock when servers are removed #10506

User-Facing Change

Fixed an issue that could cause the agent loadbalancer to deadlock when the currently in-use server goes down.

Further Comments

Signed-off-by: Brad Davidson <[email protected]>

dereknola · 2024-07-12T17:35:15Z

Awesome work, this seemed like a pain to track down.

pkg/agent/run.go

codecov · 2024-07-12T19:33:24Z

Codecov Report

Attention: Patch coverage is 39.28571% with 17 lines in your changes missing coverage. Please review.

Project coverage is 43.33%. Comparing base (58ab259) to head (0c02a65).
Report is 1 commits behind head on master.

Files	Patch %	Lines
pkg/agent/run.go	11.11%	8 Missing ⚠️
pkg/agent/tunnel/tunnel.go	61.53%	4 Missing and 1 partial ⚠️
pkg/agent/config/config.go	0.00%	2 Missing ⚠️
pkg/daemons/agent/agent_linux.go	0.00%	0 Missing and 2 partials ⚠️

❗ There is a different number of reports uploaded between BASE (58ab259) and HEAD (0c02a65): HEAD has 1 upload less than BASE.

Flag	BASE (58ab259)	HEAD (0c02a65)
e2etests	7	6

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10511      +/-   ##
==========================================
- Coverage   49.47%   43.33%   -6.15%     
==========================================
  Files         179      179              
  Lines       14924    14936      +12     
==========================================
- Hits         7384     6472     -912     
- Misses       6161     7267    +1106     
+ Partials     1379     1197     -182

Flag	Coverage Δ
e2etests	`36.24% <39.28%> (-10.09%)`	⬇️
inttests	`19.72% <7.14%> (+<0.01%)`	⬆️
unittests	`13.35% <7.14%> (-0.02%)`	⬇️


We shouldn't be replacing the configured server address on agents. Doing so breaks the agent's ability to fall back to the fixed registration endpoint when all servers are down, since we replaced it with the first discovered apiserver address. The fixed registration endpoint will be restored as default when the service is restarted, but this is not the correct behavior. This should have only been done on etcd-only nodes that start up using their local supervisor, but need to switch to a control-plane node as soon as one is available.

Signed-off-by: Brad Davidson <[email protected]>

Fix reentrant rlock in loadbalancer.dialContext

e217bd6

Signed-off-by: Brad Davidson <[email protected]>

brandond requested a review from a team as a code owner July 12, 2024 17:28

dereknola previously approved these changes Jul 12, 2024

brandond mentioned this pull request Jul 12, 2024

Agent loadbalancer may deadlock when servers are removed rancher/rke2#6208

Closed

vitorsavian previously approved these changes Jul 12, 2024

brandond commented on pkg/agent/run.go Jul 12, 2024 (resolved)

brandond dismissed stale reviews from vitorsavian and dereknola via b669642 July 14, 2024 00:04

brandond force-pushed the fix-loadbalancer-reentrant-rlock branch from 5bb1f72 to b669642 Compare July 14, 2024 00:04

brandond added 2 commits July 14, 2024 00:56

Fix IPv6 primary node-ip handling

6e87ff3

I should have caught `[]string{cfg.NodeIP}[0]` and `[]string{envInfo.NodeIP.String()}[0]` in code review...

Signed-off-by: Brad Davidson <[email protected]>

Add dial duration to debug error message

0c02a65

This should give us more detail on how long dials take before failing, so that we can perhaps better tune the retry loop in the future.

Signed-off-by: Brad Davidson <[email protected]>

brandond force-pushed the fix-loadbalancer-reentrant-rlock branch from b669642 to 0c02a65 Compare July 14, 2024 00:57

dereknola approved these changes Jul 14, 2024

This was referenced Jul 15, 2024

[release-1.27] Backports for 2024-07 release cycle #10500

Merged

[release-1.28] Backports for 2024-07 release cycle #10499

Merged

[release-1.29] Backports for 2024-07 release cycle #10498

Merged

[release-1.30] Backports for 2024-07 release cycle #10497

Merged

vitorsavian approved these changes Jul 15, 2024

brandond merged commit cb6bf74 into k3s-io:master Jul 15, 2024
29 checks passed

brandond mentioned this pull request Jul 15, 2024

ipv6 documentation rancher/rke2-docs#235

Open

cguertin14 mentioned this pull request Sep 3, 2024

new release: k3s update from v1.30.4+k3s1 to v1.31.0+k3s1 cguertin14/k3s-ansible-ha#41

Merged

starbops mentioned this pull request Sep 3, 2024

[BUG] v1.3.1 -> v1.3.2-rc2 upgrade fail on 2 nodes clusters with customized default storage class and VM harvester/harvester#6432

Closed
