
Dynamically resolve reverse tunnel address #9958

Merged
merged 4 commits into master from tross/tunnel_resolver on Feb 3, 2022

Conversation

@rosstimothy (Contributor) commented Jan 26, 2022

The reverse tunnel address is currently a static string that is
retrieved from config and passed around for the duration of a
service's lifetime. When the tunnel_public_address is changed
on the proxy and the proxy is then restarted, all established
reverse tunnels over the old address will fail indefinitely.
As a means to get around this, #8102 introduced a mechanism
that would cause nodes to restart if their connection to the
auth server was down for a period of time. While this did
allow the nodes to pick up the new address after they
restarted, it was meant to be a stopgap until a more robust
solution could be applied.

Instead of using a static address, the reverse tunnel address
is now resolved via a reversetunnel.Resolver. Anywhere that
previously relied on the static proxy address will now fetch
the actual reverse tunnel address via the webclient by using
the Resolver. In addition, this builds on the refactoring done
in #4290 to further simplify the reversetunnel package. Since
we no longer track multiple proxies, all the leftover bits
that did so have been removed to accommodate using a dynamic
reverse tunnel address.
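
For reference, the Resolver contract introduced here is just a
function type (quoted verbatim from the diff later in this review).
A webclient-backed implementation might look roughly like the
sketch below; fetchTunnelAddrFromWebClient is a hypothetical
stand-in for the actual webclient lookup the PR performs, not a
real helper.

// Resolver looks up reverse tunnel addresses
type Resolver func() (*utils.NetAddr, error)

// WebClientResolver returns a Resolver that queries the proxy's
// webclient endpoint on every call, so a changed
// tunnel_public_address is picked up without restarting the node.
func WebClientResolver(proxyAddr string) Resolver {
    return func() (*utils.NetAddr, error) {
        // Hypothetical helper standing in for the real webclient call.
        tunnelAddr, err := fetchTunnelAddrFromWebClient(proxyAddr)
        if err != nil {
            return nil, err
        }
        // utils.ParseAddr converts the "host:port" string into a
        // *utils.NetAddr.
        return utils.ParseAddr(tunnelAddr)
    }
}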

@rosstimothy rosstimothy force-pushed the tross/tunnel_resolver branch 2 times, most recently from 59f1017 to fa00cc9 on January 26, 2022 19:23
@rosstimothy rosstimothy force-pushed the tross/tunnel_resolver branch 2 times, most recently from fabafda to df3a741 on January 27, 2022 14:45
@rosstimothy rosstimothy changed the title from "use tunnel resolver to resolve proxy tunnel instead of static string" to "Dynamically resolve reverse tunnel address" on January 27, 2022
@rosstimothy rosstimothy force-pushed the tross/tunnel_resolver branch 2 times, most recently from 204300c to b3625c0 on January 27, 2022 21:48
@rosstimothy rosstimothy marked this pull request as ready for review February 1, 2022 14:05
@github-actions github-actions bot added the tctl (Teleport admin tool) label Feb 1, 2022
@github-actions github-actions bot requested a review from quinqu February 1, 2022 14:06
@rosstimothy rosstimothy added the robustness (Resistance to crashes and reliability) label Feb 1, 2022
@russjones russjones requested review from Joerger and removed request for quinqu February 2, 2022 18:07
@Joerger (Contributor) left a comment

LGTM, just some minor comments

lib/reversetunnel/agentpool.go (outdated; resolved)
lib/reversetunnel/rc_manager_test.go (outdated; resolved)
lib/reversetunnel/resolver.go (outdated; resolved)
if t.sets.expire(cutoff) > 0 {
    count := len(t.sets.proxies)
    if count < 1 {
        count = 1
    }
    t.wp.Set(addr, uint64(count))
}
Contributor:

Why do we set count to 1 here? Is it obvious and I'm missing it, or could we add a comment?

rosstimothy (Author):

My educated guess is that setting the workpool target to anything < 1 would cause it to close the underlying workgroup, which would then need to be recreated when the target becomes >= 1 again. Setting the target to 1 instead resets the workgroup without deleting it.

@fspmarshall it looks like it has been like this since you added the tracker; can you please correct me if I am wrong here?

Contributor:

The count is the number of proxies we expect to discover, based on the heartbeats we've seen. We don't start off knowing about any proxies, since it is the first proxy we connect to that tells us about its peers. Therefore, if we don't know about any proxies yet, we just try to find at least one, and then wait for it to tell us the real number.

lib/service/service.go (outdated; resolved)
* Dynamically resolve reverse tunnel address
- rename ResolveViaWebClient to WebClientResolver
- add singleProcessModeResolver to TeleportProcess
- make AgentPool.filterAndClose return nil if there are no matches
- rename Pool.groups to Pool.group
- remove Lease from TrackExpected
@rosstimothy rosstimothy force-pushed the tross/tunnel_resolver branch from b3625c0 to 6356177 on February 2, 2022 21:28
lib/reversetunnel/agent.go (outdated; resolved)
lib/reversetunnel/agentpool.go (outdated; resolved)
lib/reversetunnel/rc_manager.go (outdated; resolved)
Comment on lines +26 to +27
// Resolver looks up reverse tunnel addresses
type Resolver func() (*utils.NetAddr, error)
Contributor:

I'm a bit concerned about the increased network load if proxies are restarted in very large clusters. Might create another thundering herd problem if we reload the address on every call to the resolver. I'm thinking it might be good if the resolver had some basic ttl-based caching capabilities, so that if it were called many times over a small period (say 3-5 seconds), only one network call is actually made.

We already use lib/cache/fncache.go to do the same thing to reduce thundering herd issues when the cache is unhealthy. That helper doesn't rely on anything in lib/cache, so we could easily move it somewhere under utils, and use it to provide similar functionality here. Ex:

func CachingResolver(resolver Resolver) Resolver {
    // Entries expire after 3 seconds, so a burst of lookups within
    // that window results in a single call to the wrapped resolver.
    cache := utils.NewFnCache(3 * time.Second)
    return func() (*utils.NetAddr, error) {
        a, err := cache.Get(context.TODO(), "resolver", func() (interface{}, error) {
            addr, err := resolver()
            return addr, err
        })
        if err != nil {
            return nil, err
        }
        // The cache stores interface{} values; assert back to the
        // concrete address type.
        return a.(*utils.NetAddr), nil
    }
}
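
A hypothetical call site for the wrapper above (baseResolver stands in for whatever Resolver the process already built):

resolver := CachingResolver(baseResolver)

// Repeated calls within the 3-second TTL are served from the cache,
// so only the first triggers a webclient lookup.
addr, err := resolver()
if err != nil {
    // handle the lookup failure
}
_ = addr // use the freshly resolved reverse tunnel address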

@rosstimothy rosstimothy enabled auto-merge (squash) February 3, 2022 16:19
@rosstimothy rosstimothy merged commit 6cb1371 into master Feb 3, 2022
@rosstimothy rosstimothy deleted the tross/tunnel_resolver branch February 3, 2022 16:24
rosstimothy added a commit that referenced this pull request Feb 4, 2022
* Dynamically resolve reverse tunnel address

(cherry picked from commit 6cb1371)
@webvictim webvictim mentioned this pull request Mar 4, 2022
@bernardjkim (Contributor):

Hi @rosstimothy, do we plan on backporting this to v7?

In certain situations, v7 agents will fail to reconnect after a proxy config change. More details here: https://github.com/gravitational/cloud/issues/1441#issuecomment-1067391237. If we don't plan on backporting this change to v7, I can create a new issue to patch up the current implementation.

rosstimothy added a commit that referenced this pull request Mar 16, 2022
* Dynamically resolve reverse tunnel address

rosstimothy added a commit that referenced this pull request Mar 21, 2022
* Dynamically resolve reverse tunnel address (#9958)

Labels: robustness (Resistance to crashes and reliability), tctl (Teleport admin tool)
5 participants