Restart entire node on tunnel collapse #8102
Conversation
Before merge: needs tests, buy-in from owners/experts, delete debug.
The change seems reasonable and I like how you took a somewhat generic approach, so it may address problems we didn't foresee. I have a slight concern that something like this could cause nodes to exist in a perpetual restart cycle, but I'm probably being paranoid - seems better to merge this than not.
Requesting changes mainly due to test coverage.
@fspmarshall @knisbet What do you two think about this approach?
If we have to restart the entire node (or close to it), then this is a good solution in my eyes, though I wonder whether that is strictly necessary. If avoiding that would involve significant effort, though, this seems like a generic, passable "should work well enough" solution.
I think the only part I'm a bit curious about is how different supervisors would react. I'm not sure how the Teleport container is set up, but it might not be able to gracefully restart when the parent pid exits if it's running as pid 1. But as long as that happens, it should be fine. I suspect most supervisors Teleport runs under should be able to handle a consistently restarting process; if the grace periods are long enough, things like systemd and Kubernetes will back off. Although my guess is the Teleport team would be far more familiar with which supervisors are commonly used with Teleport.
Agree with @codingllama's comments, otherwise looks good to me
* Dynamically resolve reverse tunnel address (#9958) The reverse tunnel address is currently a static string that is retrieved from config and passed around for the duration of a service's lifetime. When `tunnel_public_address` is changed on the proxy and the proxy is then restarted, all established reverse tunnels over the old address will fail indefinitely. As a means to get around this, #8102 introduced a mechanism that causes nodes to restart if their connection to the auth server has been down for a period of time. While this did allow the nodes to pick up the new address after they restarted, it was meant to be a stopgap until a more robust solution could be applied. Instead of using a static address, the reverse tunnel address is now resolved via a `reversetunnel.Resolver`. Anywhere that previously relied on the static proxy address will now fetch the actual reverse tunnel address via the webclient by using the Resolver. In addition, this builds on the refactoring done in #4290 to further simplify the reversetunnel package. Since we no longer track multiple proxies, all the leftover bits that did so have been removed to accommodate using a dynamic reverse tunnel address.
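Conceptually, the follow-up swaps the cached string for a resolver callback that is invoked on each (re)connect. Below is a rough Go sketch of that shape; the function names, endpoint handling, and response field are assumptions for illustration, not the real `reversetunnel.Resolver` API.

```go
// Illustrative sketch of a "resolver"-style lookup for the reverse tunnel
// address. Names and the ping response shape are assumptions, not the real
// reversetunnel.Resolver API.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Resolver returns the current reverse tunnel address each time it is called,
// instead of a value cached once at startup.
type Resolver func(ctx context.Context) (string, error)

// webclientResolver builds a Resolver that asks the proxy's web endpoint for
// its advertised tunnel address (the JSON field name below is hypothetical).
func webclientResolver(proxyWebAddr string) Resolver {
	client := &http.Client{Timeout: 5 * time.Second}
	return func(ctx context.Context) (string, error) {
		req, err := http.NewRequestWithContext(ctx,
			http.MethodGet, "https://"+proxyWebAddr+"/webapi/ping", nil)
		if err != nil {
			return "", err
		}
		resp, err := client.Do(req)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()

		var ping struct {
			// Hypothetical field carrying the proxy's tunnel_public_address.
			TunnelPublicAddr string `json:"tunnel_public_addr"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&ping); err != nil {
			return "", err
		}
		return ping.TunnelPublicAddr, nil
	}
}

func main() {
	resolve := webclientResolver("example.com:3080")
	addr, err := resolve(context.Background())
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	// Dial the reverse tunnel using the freshly resolved address.
	fmt.Println("dialing reverse tunnel at", addr)
}
```

The point of the shape is that every dial asks the proxy for its currently advertised tunnel address, so a changed `tunnel_public_address` is picked up without restarting the node.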
Imagine you have a cluster with a node connected in via a tunnel through a proxy at `example.com` on port `3024`. Now imagine you change the proxy config so that `tunnel_public_address` is `example.com:4024`, and you either restart the proxy or reload the proxy config with a `SIGHUP`.

...and the node doesn't reconnect to the proxy, because even though the `auth_server` address hasn't changed, the node has cached the old `tunnel_public_address` and keeps trying to connect to that. You can always manually restart the node to have it reconnect, but that would be a pain if you have thousands of nodes.
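For reference, `tunnel_public_address` lives under the proxy's `proxy_service` section in `teleport.yaml`; a trimmed, illustrative snippet of the changed setting (unrelated required fields omitted) looks roughly like:

```yaml
# teleport.yaml on the proxy (illustrative; unrelated fields omitted)
proxy_service:
  enabled: yes
  # Previously example.com:3024; nodes that cached the old value keep
  # dialing it after the proxy restarts or reloads on SIGHUP.
  tunnel_public_address: example.com:4024
```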
To avoid having to manually restart all nodes, this change implements a check for connection failures to the auth server and restarts the node if there are multiple connection failures in a given period of time. The check as implemented piggybacks on the node's "common.rotate" service, which can already restart the node in certain circumstances, and uses the success of the periodic rotation sync as a proxy for the health of the node's connection to the auth server.
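A minimal sketch of the failure-window idea, assuming made-up names, thresholds, and a placeholder sync helper rather than the actual Teleport implementation: the periodic rotation sync feeds a monitor that requests a restart once failures have persisted past a grace period.

```go
// Sketch of the "restart after sustained connection failures" idea.
// All names and thresholds here are illustrative, not Teleport's actual code.
package main

import (
	"log"
	"time"
)

// connectionMonitor tracks how long the periodic sync against the auth
// server has been failing and signals a restart once the outage exceeds
// a grace period.
type connectionMonitor struct {
	gracePeriod  time.Duration
	firstFailure time.Time // zero value means "currently healthy"
}

// reportSync records the result of one rotation-sync attempt and returns
// true when the node should restart itself.
func (m *connectionMonitor) reportSync(err error) (restart bool) {
	if err == nil {
		m.firstFailure = time.Time{} // healthy again, reset the window
		return false
	}
	if m.firstFailure.IsZero() {
		m.firstFailure = time.Now()
		return false
	}
	return time.Since(m.firstFailure) > m.gracePeriod
}

func main() {
	m := &connectionMonitor{gracePeriod: 10 * time.Minute}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		err := syncRotationState() // stand-in for the periodic sync against auth
		if m.reportSync(err) {
			log.Println("auth connection down past grace period; restarting node")
			// In the real change this triggers a full process restart so the
			// node re-reads the tunnel address; here we just stop the loop.
			return
		}
	}
}

// syncRotationState is a placeholder for the node's periodic CA rotation
// sync, whose success is used as a health signal for the auth connection.
func syncRotationState() error { return nil }
```

Resetting the window on the first successful sync keeps a node with a merely flaky connection from restarting; only a sustained outage, such as the stale `tunnel_public_address` case described above, trips the restart.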