
Restart entire node on tunnel collapse #8102

Merged (25 commits) on Nov 17, 2021
Conversation

tcsc (Contributor) commented Aug 31, 2021

Imagine you have a cluster with a node connected in via a tunnel through a proxy, example.com, on port 3024.

Now imagine you change the proxy config so that tunnel_public_address is example.com:4024. You either restart the proxy or reload the proxy config with a SIGHUP.

...and the node doesn't reconnect to the proxy, because even though the auth_server address hasn't changed, the node has cached the old tunnel_public_address and keeps trying to connect to it.

You can always manually restart the node to have it reconnect, but that would be a pain if you have thousands of nodes.

To avoid having to manually restart all nodes, this change implements a check for connection failures to the auth server, and restarts the node if there are multiple connection failures in a given period of time. The check as implemented piggybacks on the node's "common.rotate" service, which can already restart the node in certain circumstances, and uses the success of the periodic rotation sync as a proxy for the health of the node's connection to the auth server. A sketch of the idea follows below.
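As a rough illustration of the failure-counting idea (a minimal sketch only; the real code lives in lib/utils/timed_counter.go and the agent pool, and the thresholds here, maxFailures and the five-minute window, are invented for the example):

```go
package main

import (
	"fmt"
	"time"
)

// timedCounter counts events and forgets those older than a fixed window.
// This is only a conceptual sketch of the approach described above; the
// actual implementation in lib/utils/timed_counter.go may differ.
type timedCounter struct {
	window time.Duration
	events []time.Time
}

// Increment records a failure at time now and returns how many failures
// are still inside the window.
func (c *timedCounter) Increment(now time.Time) int {
	c.events = append(c.events, now)
	return c.Count(now)
}

// Count drops expired entries and returns the remaining count.
func (c *timedCounter) Count(now time.Time) int {
	cutoff := now.Add(-c.window)
	kept := c.events[:0]
	for _, t := range c.events {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	c.events = kept
	return len(c.events)
}

func main() {
	// Hypothetical thresholds: treat the auth connection as dead if the
	// periodic rotation sync fails three times within five minutes.
	const maxFailures = 3
	counter := &timedCounter{window: 5 * time.Minute}

	onRotationSyncFailure := func(now time.Time) {
		if counter.Increment(now) >= maxFailures {
			fmt.Println("connection to auth server looks dead; restarting node")
			// In the real change this triggers a full process restart,
			// which discards the cached tunnel_public_address.
		}
	}

	start := time.Now()
	onRotationSyncFailure(start)
	onRotationSyncFailure(start.Add(1 * time.Minute))
	onRotationSyncFailure(start.Add(2 * time.Minute)) // third failure within the window
}
```

The point of restarting the whole node rather than just reconnecting is that the restart throws away the cached tunnel_public_address, so the node rediscovers the proxy's current tunnel address on the way back up.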

tcsc (Contributor, Author) commented Aug 31, 2021

Before merge: needs tests, buy-in from owners/experts, delete debug printf()s.

tcsc linked an issue Aug 31, 2021 that may be closed by this pull request
tcsc mentioned this pull request Sep 13, 2021
tcsc marked this pull request as ready for review September 13, 2021 00:46
codingllama (Contributor) left a comment

The change seems reasonable and I like how you took a somewhat generic approach, so it may address problems we didn't foresee. I have a slight concern that something like this could cause nodes to exist in a perpetual restart cycle, but I'm probably being paranoid - seems better to merge this than not.

Requesting changes mainly due to test coverage.

Review comments on: lib/reversetunnel/agentpool.go, lib/utils/timed_counter.go, lib/service/connect.go
tcsc marked this pull request as draft September 22, 2021 02:10
russjones (Contributor) commented:

@fspmarshall @knisbet What do you two think about this approach?

xacrimon (Contributor) commented:

> @fspmarshall @knisbet What do you two think about this approach?

If we have to restart the entire node (or close to it), then this is a good solution in my eyes, though I wonder whether that is completely necessary. If avoiding a full restart would involve significant effort, though, this seems like a passable, generic "should work well enough" solution.

knisbet (Contributor) commented Sep 29, 2021

> @fspmarshall @knisbet What do you two think about this approach?

I think the only part I'm a bit curious about is how different supervisors would react. I'm not sure how the teleport container is set up, but it might not be able to gracefully restart when the parent pid exits if it's running as pid 1. As long as that restart happens, though, it should be fine.

I suspect most supervisors teleport runs under should be able to handle a consistently restarting process; if the grace periods are long enough, things like systemd and kubernetes will back off. My guess is that the Teleport team is far more familiar with which supervisors are commonly used with Teleport, though.

nklaassen (Contributor) left a comment

Agree with @codingllama's comments, otherwise looks good to me

Review comments on: lib/utils/timed_counter.go, lib/service/connect.go
rosstimothy added a commit that referenced this pull request Nov 24, 2021
Move cache and resourceWatcher watchers from a 10s retry to a jittered backoff retry up to ~2min. Replace the
reconnectToAuthService interval with a retry to add jitter and backoff there as well for when a node restarts due to
changes introduced in #8102.

Fixes #6889.
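For reference, the "jittered backoff retry up to ~2min" might look roughly like the sketch below (illustrative only; Teleport has its own retry and jitter utilities, and only the 10s base and the ~2 minute cap come from the commit message):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextDelay sketches an exponential backoff with full jitter, starting at
// 10s and capped around two minutes. Teleport's actual retry helpers are
// different; this only shows the general shape of the behaviour.
func nextDelay(attempt int) time.Duration {
	base := 10 * time.Second
	maxDelay := 2 * time.Minute

	d := base << attempt // 10s, 20s, 40s, ...
	if d <= 0 || d > maxDelay {
		d = maxDelay
	}
	// Full jitter: pick a random delay in [0, d) so that thousands of
	// restarting nodes do not reconnect to the auth server in lockstep.
	return time.Duration(rand.Int63n(int64(d)))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, nextDelay(attempt))
	}
}
```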
webvictim mentioned this pull request Mar 4, 2022
rosstimothy added a commit that referenced this pull request Mar 21, 2022
* Dynamically resolve reverse tunnel address (#9958)

The reverse tunnel address is currently a static string that is
retrieved from config and passed around for the duration of a
service's lifetime. When the `tunnel_public_address` is changed
on the proxy and the proxy is then restarted, all established
reverse tunnels over the old address will fail indefinitely.
As a means to get around this, #8102 introduced a mechanism
that would cause nodes to restart if their connection to the
auth server was down for a period of time. While this did
allow the nodes to pick up the new address after they
restarted, it was meant to be a stopgap until a more robust
solution could be applied.

Instead of using a static address, the reverse tunnel address
is now resolved via a `reversetunnel.Resolver`. Anywhere that
previously relied on the static proxy address will now fetch
the actual reverse tunnel address via the webclient by using
the Resolver. In addition, this builds on the refactoring done
in #4290 to further simplify the reversetunnel package. Since
we no longer track multiple proxies, all the leftover bits
that did so have been removed to accommodate using a dynamic
reverse tunnel address.
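
To make that concrete, a resolver along these lines could be sketched as follows (purely illustrative; the actual reversetunnel.Resolver and the webclient call it wraps are defined in the Teleport codebase and differ in detail, and cachingResolver, the TTL, and the hard-coded address are assumptions for the example):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// resolver returns the current reverse tunnel address. Conceptual stand-in
// for reversetunnel.Resolver; the real type is defined differently.
type resolver func() (string, error)

// cachingResolver wraps a resolver so callers do not hit the proxy's web
// endpoint on every dial. The cached value expires after ttl, so a changed
// tunnel_public_address is eventually picked up without a node restart.
func cachingResolver(r resolver, ttl time.Duration) resolver {
	var (
		mu      sync.Mutex
		addr    string
		fetched time.Time
	)
	return func() (string, error) {
		mu.Lock()
		defer mu.Unlock()
		if addr != "" && time.Since(fetched) < ttl {
			return addr, nil
		}
		a, err := r()
		if err != nil {
			return "", err
		}
		addr, fetched = a, time.Now()
		return addr, nil
	}
}

func main() {
	// Hypothetical backing resolver; in Teleport this would ask the proxy's
	// webclient for the currently advertised tunnel address.
	backing := func() (string, error) { return "example.com:4024", nil }

	resolve := cachingResolver(backing, time.Minute)
	addr, err := resolve()
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println("dialing reverse tunnel at", addr)
}
```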
Development

Successfully merging this pull request may close these issues.

Feature Request: Handle changes to tunnel_public_addr
6 participants