Implement a prototype for a proxying SSH server that implements concepts expressed in readme #1

klizhentas · 2015-03-03T00:48:40Z

Purpose

To validate the Medusa design (see the README for details), we need to implement an SSH server that supports the features requested in #2:

Dynamic Etcd backend
Agent forwarding
Structured logging
Multiplexing and auto discovery
CA authority and key management

Implementation

Note: PR is not ready for review yet, created mostly for viewing/commenting purposes.


When running `tsh ssh foo@bar cmd` we end up dialing `bar` twice - once to (maybe) start port forwarding and a second time to execute `cmd`. Instead, reuse the first connection to run `cmd` and only fall back to re-dialing if we're matching multiple nodes by label.

This gives ~20-30% speedup for non-interactive commands (useful for tools like ansible):

```
> hyperfine 'tsh ssh localhost true' '~/src/teleport/build/tsh ssh localhost true'
Benchmark #1: tsh ssh localhost true
  Time (mean ± σ):      65.5 ms ±   5.0 ms    [User: 12.9 ms, System: 6.1 ms]
  Range (min … max):    57.0 ms …  74.2 ms    41 runs

Benchmark #2: ~/src/teleport/build/tsh ssh localhost true
  Time (mean ± σ):      51.7 ms ±   3.2 ms    [User: 9.0 ms, System: 5.0 ms]
  Range (min … max):    48.5 ms …  68.5 ms    57 runs

Summary
  '~/src/teleport/build/tsh ssh localhost true' ran
    1.27 ± 0.12 times faster than 'tsh ssh localhost true'
```
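The connection reuse described above can be illustrated with a minimal Go sketch. It is not the actual `tsh` code: `conn`, `dialer`, `execOn`, and `runCommand` are hypothetical names, and the dial counter exists only to make the saved dial visible.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for an established SSH client connection.
type conn struct{ addr string }

// dialer counts dials so the effect of connection reuse is observable.
type dialer struct{ dials int }

func (d *dialer) dial(addr string) (*conn, error) {
	if addr == "" {
		return nil, errors.New("empty address")
	}
	d.dials++
	return &conn{addr: addr}, nil
}

// execOn pretends to run cmd over an existing connection.
func execOn(c *conn, cmd string) string {
	return c.addr + ": " + cmd
}

// runCommand dials the first matched node once. With a single match, that
// connection is reused for the command. With several matches (for example,
// nodes selected by label) it falls back to dialing every node again.
func runCommand(d *dialer, nodes []string, cmd string) ([]string, error) {
	if len(nodes) == 0 {
		return nil, errors.New("no matching nodes")
	}
	first, err := d.dial(nodes[0])
	if err != nil {
		return nil, err
	}
	// Port forwarding, if requested, would be set up on `first` here.
	if len(nodes) == 1 {
		return []string{execOn(first, cmd)}, nil
	}
	out := make([]string, 0, len(nodes))
	for _, n := range nodes {
		c, err := d.dial(n)
		if err != nil {
			return nil, err
		}
		out = append(out, execOn(c, cmd))
	}
	return out, nil
}

func main() {
	d := &dialer{}
	out, err := runCommand(d, []string{"bar"}, "true")
	fmt.Println(out, err, "dials:", d.dials)
}
```

In this sketch the single-node path costs one dial instead of two, which is where the saving for short non-interactive commands would come from.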

* mfa: per-session MFA certs for SSH and Kubernetes

  This is client-side support for requesting single-use certs with an MFA check.

  The client doesn't know whether it needs an MFA check when accessing a resource; this is decided during an RBAC check on the server. So a client will always try to get a single-use cert, and the server will respond with NotNeeded if MFA is not required.

  This is an extra round-trip for every session which causes ~20% slowdown in SSH logins:

  ```
  $ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
  Benchmark #1: /tmp/tsh-old ssh talos date
    Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
    Range (min … max):    48.4 ms …  54.1 ms    59 runs

  Benchmark #2: /tmp/tsh-new ssh talos date
    Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
    Range (min … max):    59.0 ms …  69.7 ms    50 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Summary
    '/tmp/tsh-old ssh talos date' ran
      1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
  ```

  A few other internal changes:

  - client.LocalKeyAgent will now always have a non-nil LocalKeyStore. Previously, it would be nil (e.g. in a web UI handler or when using an identity file), which easily causes panics. I added a noLocalKeyStore type instead that returns errors from all methods.
  - requesting a user cert with a TTL < 1min will now succeed and return a 1min cert instead of failing

* Capture access approvals on MFA-issued certs
* Address review feedback
* Address review feedback
* mfa: accept unknown nodes during short-term MFA cert creation

  An unknown node could be an OpenSSH node set up via https://goteleport.com/teleport/docs/openssh-teleport/

  In this case, we shouldn't prevent the user from connecting. There's a small risk of authz bypass - an attacker might know a different name/IP for a registered node which Teleport doesn't know about. But a Teleport node will still check RBAC and reject the connection.

* Validate username against unmapped user identity

  IssueUserCertsWithMFA is called on the leaf auth server in case of trusted clusters. Username in the request object will be that of the original unmapped caller.

* mfa: add IsMFARequired RPC

  This RPC is run before every connection to check whether MFA is required. If a connection is against the leaf cluster, this request is forwarded from root to leaf for evaluation.

* Fix integration tests
* Correctly treat "Username" as login name in IsMFARequired

  Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA
* Address review feedback
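The "always ask, fall back on NotNeeded" flow and the one-minute TTL floor from the first commit can be sketched roughly as follows. This is an illustration under assumptions, not Teleport's API: `errMFANotNeeded`, `server`, `issueSingleUseCert`, `certFor`, and `clampTTL` are hypothetical names, and certificates are plain strings.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errMFANotNeeded is a hypothetical sentinel for the server's NotNeeded reply.
var errMFANotNeeded = errors.New("MFA not needed")

// minCertTTL is the one-minute floor described in the commit message.
const minCertTTL = time.Minute

// clampTTL raises any TTL under one minute to one minute instead of failing.
func clampTTL(ttl time.Duration) time.Duration {
	if ttl < minCertTTL {
		return minCertTTL
	}
	return ttl
}

// server stands in for the auth server. Whether a target needs MFA is decided
// server-side (in the real system, during an RBAC check).
type server struct{ mfaTargets map[string]bool }

func (s *server) issueSingleUseCert(target string) (string, error) {
	if !s.mfaTargets[target] {
		return "", errMFANotNeeded
	}
	return "single-use-cert:" + target, nil
}

// certFor always tries to obtain a single-use cert first, because the client
// cannot know in advance whether MFA is required. On NotNeeded it falls back
// to the regular cert it already holds.
func certFor(s *server, target, regularCert string) (string, error) {
	cert, err := s.issueSingleUseCert(target)
	switch {
	case err == nil:
		return cert, nil
	case errors.Is(err, errMFANotNeeded):
		return regularCert, nil
	default:
		return "", err
	}
}

func main() {
	s := &server{mfaTargets: map[string]bool{"prod": true}}
	for _, target := range []string{"prod", "dev"} {
		cert, err := certFor(s, target, "regular-cert")
		fmt.Println(target, cert, err)
	}
	fmt.Println(clampTTL(10 * time.Second))
}
```

The extra call to `issueSingleUseCert` on every session corresponds to the additional round-trip measured in the benchmark above.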

* Rate limit all unauthenticated HTTP endpoints

  This commit is an extension to what was done in #172 and is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

  Rather than audit endpoints and choose which endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

  The primary concern in this type of change would be if our rate limit becomes too aggressive for general use. Two strategies were considered to make sure this does not have an impact:

  1. Adjust the rate limiter so the rate limit becomes endpoint specific. This would avoid the need to consider how activity on one endpoint affects another.
  2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

  This commit chooses option #2. While #1 has advantages, particularly as endpoints and new use cases are added, #2 provides the strictest and safest rate limits.

  Our rate limits were configured to:

  - period: 1 min
  - avg rate: 10
  - burst rate: 20

  In order to build a safety buffer with option #2 those allowed rates were doubled.

  Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added. This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.

* Add additional `High` Rate Limiting

  This new `High` rate limit is designed for endpoints which are only CPU bound (and thus don't have as significant a DoS risk). Initially this was motivated by `ping` and `find`, due to the concern that these endpoints are used unauthenticated at login, and potential NATs may result in very high rates from single egress IPs.

  In my testing on my laptop, all of these endpoints can easily get 640 req/sec on a single core within a VM. Setting the maximum of 480 burst and 120 continuous should both ensure that no single source utilizes all the CPU, as well as build in additional safety margins while providing a layer of protection.

* Fix for missing error check
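To make the numbers above concrete, here is a small token-bucket sketch in Go using only the standard library. It is an assumption-laden illustration rather than Teleport's limiter: `limits`, `bucket`, `allow`, and `countAllowed` are hypothetical names, and the authenticated bypass is modelled as a simple boolean.

```go
package main

import (
	"fmt"
	"time"
)

// limits mirrors the settings quoted above: an average rate per period plus a burst.
type limits struct {
	period  time.Duration
	average int
	burst   int
}

// defaultLimits holds the previously configured values (1 min, avg 10, burst 20).
var defaultLimits = limits{period: time.Minute, average: 10, burst: 20}

// doubled returns the limits with both allowed rates doubled (option #2's safety buffer).
func (l limits) doubled() limits {
	return limits{period: l.period, average: l.average * 2, burst: l.burst * 2}
}

// bucket is a minimal token bucket: it starts full at `burst` tokens and
// refills at `average` tokens per `period`.
type bucket struct {
	lim    limits
	tokens float64
	last   time.Time
}

func newBucket(l limits, now time.Time) *bucket {
	return &bucket{lim: l, tokens: float64(l.burst), last: now}
}

// allow reports whether a request at time now may proceed. Authenticated
// requests skip the limiter entirely, as described for endpoints like `ping`.
func (b *bucket) allow(now time.Time, authenticated bool) bool {
	if authenticated {
		return true
	}
	elapsed := now.Sub(b.last)
	b.last = now
	b.tokens += float64(b.lim.average) * elapsed.Seconds() / b.lim.period.Seconds()
	if max := float64(b.lim.burst); b.tokens > max {
		b.tokens = max
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// countAllowed sends n requests at the same instant and counts how many pass.
func countAllowed(l limits, n int, authenticated bool) int {
	now := time.Now()
	b := newBucket(l, now)
	allowed := 0
	for i := 0; i < n; i++ {
		if b.allow(now, authenticated) {
			allowed++
		}
	}
	return allowed
}

func main() {
	fmt.Println("default:", countAllowed(defaultLimits, 100, false))
	fmt.Println("doubled:", countAllowed(defaultLimits.doubled(), 100, false))
	fmt.Println("authenticated:", countAllowed(defaultLimits, 100, true))
}
```

A single shared bucket like this is what makes option #2 simple: activity on one endpoint consumes tokens that another endpoint would have used, so the limits have to be generous enough for that interaction to be harmless.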

* Always attempt desktop discovery, even if LDAP is not ready

  If Teleport loses its connection to the LDAP server, it will attempt to initiate a new connection when:

  1. The user tries to connect to a desktop and Teleport fails to obtain the user's SID.
  2. The periodic desktop discovery routine attempts to search LDAP for desktops.

  In some circumstances, #2 never gets the chance to apply, since discovery is skipped when LDAP is not ready. Additionally, if LDAP is not ready, then you can't connect to a desktop, so #1 can't happen either, which means Teleport won't connect again until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

  If LDAP-based discovery is not enabled then we may go long periods of time without trying to use the LDAP connection, which prevents us from detecting disconnects (and restoring the connection) in a timely manner.

  When discovery is disabled, perform a read every 5 minutes and reconnect if we detect a connection problem.

* Address review comments
* Fix some LDAP connection bugs

  In #36281 we made some improvements to the LDAP reconnect behavior. These changes considered the case where we had a connection to the LDAP server but then got disconnected. They did not consider the case where we never successfully established a connection at all.

* Fix typo

---------

Co-authored-by: Gus Luxton <[email protected]>
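The periodic read-and-reconnect behavior can be sketched as below. This is a simplified model, not the Windows desktop service code: `ldapClient`, `healthCheck`, and `run` are hypothetical names, and the `serverUp` field simulates server reachability. The sketch covers both cases mentioned above, a dropped connection and a connection that was never established.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ldapClient is a hypothetical stand-in for an LDAP connection wrapper.
type ldapClient struct {
	serverUp   bool // simulates whether the LDAP server is reachable
	connected  bool
	reconnects int
}

// read performs a cheap query and fails if the connection is unusable.
func (c *ldapClient) read() error {
	if !c.connected || !c.serverUp {
		c.connected = false
		return errors.New("ldap: connection unavailable")
	}
	return nil
}

// connect (re)establishes the connection.
func (c *ldapClient) connect() error {
	if !c.serverUp {
		return errors.New("ldap: server unreachable")
	}
	c.connected = true
	c.reconnects++
	return nil
}

// healthCheck is one iteration of the periodic check: try a read and, if it
// fails for any reason (dropped or never connected), attempt to connect.
func healthCheck(c *ldapClient) error {
	if err := c.read(); err == nil {
		return nil
	}
	return c.connect()
}

// run calls healthCheck on every tick until ctx is cancelled. With discovery
// disabled, the interval described above would be 5 minutes.
func run(ctx context.Context, interval time.Duration, c *ldapClient) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := healthCheck(c); err != nil {
				fmt.Println("ldap health check failed:", err)
			}
		}
	}
}

func main() {
	// Short interval and timeout so the example finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	c := &ldapClient{serverUp: true} // never connected at startup
	run(ctx, 10*time.Millisecond, c)
	fmt.Println("connected:", c.connected)
}
```

Keeping the check independent of how the connection was lost is what lets one loop handle both the disconnect case and the never-connected case.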

klizhentas force-pushed the alexander/proto branch 3 times, most recently from d7be396 to 5fe7f2e Compare March 5, 2015 19:06

Initial working prototype

0c425d3

klizhentas force-pushed the alexander/proto branch from 8f8adfb to 0c425d3 Compare March 14, 2015 21:36

Vendor-in dependencies

55de39d

klizhentas added a commit that referenced this pull request Mar 18, 2015

Merge pull request #1 from gravitational/alexander/proto

c4815e6

Implement a prototype for a proxying SSH server that implements concepts expressed in readme

klizhentas merged commit c4815e6 into master Mar 18, 2015

klizhentas deleted the alexander/proto branch January 21, 2016 17:58

brumfb pushed a commit to brumfb/teleport that referenced this pull request Dec 14, 2016

Merge pull request gravitational#1 from tehsis/oidc

7f75364

[WIP] iterate over oidc providers

jicowan mentioned this pull request May 25, 2019

Handshake error #2739

Closed

webvictim mentioned this pull request Apr 27, 2021

Allow fqdn as cluster name in aws terraform #6548

Closed

gzigzigzeo mentioned this pull request Nov 14, 2022

Cloud Build workflows moved to Github Actions #18211

Closed

4 tasks

hatched pushed a commit to hatched/teleport-merge that referenced this pull request Nov 30, 2022

Docs (gravitational#1)

0b321ca

xacrimon added a commit that referenced this pull request Dec 5, 2022

Server Access Intregration #1

c1ece90

r0mant mentioned this pull request Apr 13, 2023

Teleport 13 Test Plan #24576

Closed

zmb3 mentioned this pull request May 10, 2023

Assist - Execution web endpoint #25955

Merged

nick-inkeep pushed a commit to nick-inkeep/teleport-docs that referenced this pull request Jun 20, 2023

Merge pull request gravitational#1 from gravitational/iadramelk/migra…

9321440

…te-docs Move docs files from the next repo

zmb3 mentioned this pull request Aug 18, 2023

Moderated sessions should gracefully fail in OSS #30621

Closed

r0mant mentioned this pull request Aug 28, 2023

Teleport 14 Test Plan #31122

Closed

gecube mentioned this pull request Dec 4, 2023

Provide a way to install a teleport on dedicated bare metal node of k8s #35331

Open

r0mant mentioned this pull request Jan 13, 2024

Teleport 15 Test Plan #36663

Closed

strideynet mentioned this pull request Jan 24, 2024

Improve test plan instructions for moderated sessions #37167

Merged

zmb3 mentioned this pull request Feb 13, 2024

tsh status panic #38188

Closed

greedy52 mentioned this pull request Aug 19, 2024

AWS audit event shown in Web UI is not accurate #45603

Open
