Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a prototype for a proxying SSH server that implements concepts expressed in readme #1

Merged
merged 2 commits into from
Mar 18, 2015

Conversation

klizhentas
Copy link
Contributor

Purpose

To validate the idea of Medusa design (see README for details on the design) we need to implement a SSH server that supports features requested in #2

  • Dynamic Etcd backend
  • Agent forwarding
  • Structured logging
  • Multiplexing and auto discovery
  • CA authority and key management

Implementation

Note: PR is not ready for review yet, created mostly for viewing/commenting purposes.

@klizhentas klizhentas force-pushed the alexander/proto branch 3 times, most recently from d7be396 to 5fe7f2e Compare March 5, 2015 19:06
klizhentas added a commit that referenced this pull request Mar 18, 2015
Implement a prototype for a proxying SSH server that implements concepts expressed in readme
@klizhentas klizhentas merged commit c4815e6 into master Mar 18, 2015
@klizhentas klizhentas deleted the alexander/proto branch January 21, 2016 17:58
brumfb pushed a commit to brumfb/teleport that referenced this pull request Dec 14, 2016
[WIP] iterate over oidc providers
@jicowan jicowan mentioned this pull request May 25, 2019
awly pushed a commit that referenced this pull request Jul 17, 2020
When running `tsh ssh foo@bar cmd` we end up dialing `bar` twice - once to
(maybe) start port forwarding and a second time to execute `cmd`.
Instead, reuse the first connection to run `cmd` and only fall back to
re-dialing if we're matching multiple nodes by label.

This gives ~20-30% speedup for non-interactive commands (useful for
tools like ansible):

```
> hyperfine 'tsh ssh localhost true' '~/src/teleport/build/tsh ssh localhost true'
Benchmark #1: tsh ssh localhost true
  Time (mean ± σ):      65.5 ms ±   5.0 ms    [User: 12.9 ms, System: 6.1 ms]
  Range (min … max):    57.0 ms …  74.2 ms    41 runs

Benchmark #2: ~/src/teleport/build/tsh ssh localhost true
  Time (mean ± σ):      51.7 ms ±   3.2 ms    [User: 9.0 ms, System: 5.0 ms]
  Range (min … max):    48.5 ms …  68.5 ms    57 runs

Summary
  '~/src/teleport/build/tsh ssh localhost true' ran
    1.27 ± 0.12 times faster than 'tsh ssh localhost true'
```
awly pushed a commit that referenced this pull request Jul 20, 2020
When running `tsh ssh foo@bar cmd` we end up dialing `bar` twice - once to
(maybe) start port forwarding and a second time to execute `cmd`.
Instead, reuse the first connection to run `cmd` and only fall back to
re-dialing if we're matching multiple nodes by label.

This gives ~20-30% speedup for non-interactive commands (useful for
tools like ansible):

```
> hyperfine 'tsh ssh localhost true' '~/src/teleport/build/tsh ssh localhost true'
Benchmark #1: tsh ssh localhost true
  Time (mean ± σ):      65.5 ms ±   5.0 ms    [User: 12.9 ms, System: 6.1 ms]
  Range (min … max):    57.0 ms …  74.2 ms    41 runs

Benchmark #2: ~/src/teleport/build/tsh ssh localhost true
  Time (mean ± σ):      51.7 ms ±   3.2 ms    [User: 9.0 ms, System: 5.0 ms]
  Range (min … max):    48.5 ms …  68.5 ms    57 runs

Summary
  '~/src/teleport/build/tsh ssh localhost true' ran
    1.27 ± 0.12 times faster than 'tsh ssh localhost true'
```
awly pushed a commit that referenced this pull request Feb 18, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this pull request Feb 22, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this pull request Feb 26, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this pull request Mar 3, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this pull request Mar 5, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this pull request Mar 10, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this pull request Mar 10, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
awly pushed a commit that referenced this pull request Mar 25, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
awly pushed a commit that referenced this pull request Mar 29, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
awly pushed a commit that referenced this pull request Mar 29, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
hatched pushed a commit to hatched/teleport-merge that referenced this pull request Nov 30, 2022
xacrimon added a commit that referenced this pull request Dec 5, 2022
@r0mant r0mant mentioned this pull request Apr 13, 2023
jentfoo added a commit that referenced this pull request Apr 14, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.
jentfoo added a commit that referenced this pull request Apr 18, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.
jentfoo added a commit that referenced this pull request Apr 25, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.
jentfoo added a commit that referenced this pull request Apr 27, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.
jentfoo added a commit that referenced this pull request May 1, 2023
* Rate limit all unauthenticated HTTP endpoints

This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.

* Add additional `High` Rate Limiting

This new `High` rate limit is designed for endpoints which are only CPU bound (and thus don't have as significant of DoS risks).
Initially this was motivated for `ping` and `find` due to the concern that these endpoints are used unauthenticated at login, and potential NAT's may result in very high rates from single egress IP's.
In my testing on my laptop, all of these endpoints can easily get 640/req/sec on a single core within a VM.  Setting the maximum of 480 burst and 120 continuous should both ensure that no single source utilizes all the CPU, as well as build in additional safety margins while providing a layer of protection.

* Fix for missing error check
nick-inkeep pushed a commit to nick-inkeep/teleport-docs that referenced this pull request Jun 20, 2023
…te-docs

Move docs files from the next repo
@r0mant r0mant mentioned this pull request Aug 28, 2023
zmb3 added a commit that referenced this pull request Jan 6, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
@r0mant r0mant mentioned this pull request Jan 13, 2024
@zmb3 zmb3 mentioned this pull request Feb 13, 2024
zmb3 added a commit that referenced this pull request Oct 28, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
github-merge-queue bot pushed a commit that referenced this pull request Oct 28, 2024
* Always attempt desktop discovery, even if LDAP is not ready

If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

If LDAP-based discovery is not enabled then we may go long periods
of time without trying to use the LDAP connection, which prevents
us from detecting disconnects (and restoring the connection) in a
timely manner.

When discovery is disabled, perform a read every 5 minutes and
reconnect if we detect a connection problem.

* Address review comments
github-actions bot pushed a commit that referenced this pull request Oct 28, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
github-actions bot pushed a commit that referenced this pull request Oct 28, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2024
* Always attempt desktop discovery, even if LDAP is not ready

If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

If LDAP-based discovery is not enabled then we may go long periods
of time without trying to use the LDAP connection, which prevents
us from detecting disconnects (and restoring the connection) in a
timely manner.

When discovery is disabled, perform a read every 5 minutes and
reconnect if we detect a connection problem.

* Address review comments

* Fix some LDAP connection bugs

In #36281 we made some improvements to the LDAP reconnect behavior.
These changes considered the case where we had a connection to the
LDAP server but then got disconnected. They did not consider the case
where we never succesfully established a connection at all.
github-merge-queue bot pushed a commit that referenced this pull request Nov 18, 2024
* Always attempt desktop discovery, even if LDAP is not ready

If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

If LDAP-based discovery is not enabled then we may go long periods
of time without trying to use the LDAP connection, which prevents
us from detecting disconnects (and restoring the connection) in a
timely manner.

When discovery is disabled, perform a read every 5 minutes and
reconnect if we detect a connection problem.

* Address review comments

* Fix some LDAP connection bugs

In #36281 we made some improvements to the LDAP reconnect behavior.
These changes considered the case where we had a connection to the
LDAP server but then got disconnected. They did not consider the case
where we never succesfully established a connection at all.

* Fix typo

---------

Co-authored-by: Gus Luxton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant