Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlproxyccl: rework sqlproxy connection throttler #71412

Merged
merged 1 commit into from
Oct 13, 2021

Conversation

jeffswenson
Copy link
Collaborator

This change switches the sqlproxy connection throttling logic back to
exponential backoff. The tokenbucket approach was introduced by
PR #69041. There are a few behavior differences between this and the
original exponential backoff implementation.

  1. The throttling logic is maintained per (ip, tenant) instead of per
    (ip). Some platform as a service provides share a single outbound ip
    address between multiple clients. These users would occasionaly see
    throttling caused by a second user sharing their IP.
  2. The throttling logic was triggered before there was an authentication
    failure. It takes ~100ms-1000ms to authenticate with the tenant
    process. Any requests that arrived after the first request, but
    before it was processed, would trigger the throttle. Now, we only
    trigger the throttle in response to an explict authorization error.

Release note: None

@jeffswenson jeffswenson requested review from a team as code owners October 11, 2021 18:07
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@kpatron-cockroachlabs kpatron-cockroachlabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 17 of 17 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp)

@jeffswenson jeffswenson force-pushed the jeffswenson-throttle-v3 branch from 4e3bd3f to 55ba27e Compare October 11, 2021 22:20
@andy-kimball
Copy link
Contributor

@darinpp, can you take a look as well?

Copy link
Contributor

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson and @kpatron-cockroachlabs)


pkg/ccl/sqlproxyccl/authentication.go, line 35 at r2 (raw file):

			return newErrorf(codeBackendReadFailed, "unable to receive message from backend: %v", err)
		}

I don't think we should be reporting throttle errors as such. This gives an attacker information that we don't need to give back. Reporting authentication error when there is one but also when they are throttled prevents them from knowing that password they tried is incorrect. This also eliminates the need to have throttleHook in authenticate so this code is not needed.


pkg/ccl/sqlproxyccl/proxy_handler.go, line 266 at r2 (raw file):

	throttleTime, err := handler.throttleService.LoginCheck(throttleTags)
	if err != nil {
		log.Errorf(ctx, "throttler refused connection: %v", err.Error())

This should be authentication error. Reporting that they are throttled let's an attacked know that the password may be correct and to try later. Making all throttler errors looks like authentication errors, hides the actual length of the throttle period and prevents the attackers to get reliable feedback on incorrect passwords.


pkg/ccl/sqlproxyccl/proxy_handler.go, line 402 at r2 (raw file):

	// Perform user authentication.
	if err := authenticate(conn, crdbConn, func(status throttler.AttemptStatus) *pgproto3.ErrorResponse {

If all errors are authenticate errors then we don't need this hook.


pkg/ccl/sqlproxyccl/throttler/local.go, line 61 at r2 (raw file):

		maxCacheSize: 1e6, /* 1 million */
		baseDelay:    time.Second,
		maxDelay:     time.Hour,

I don't think one hours works. If I run workload init tpcc for example with 36 connections but I make a mistake in the password. That will activate the throttler and it will make me wait an hour. We should probably have more like 10,15 or 30 seconds. Another issue with one hour (or any long period in general) is if the connections don't get spread equally across sqlproxies. Imagine that 16 connections get spread over 3 proxies as 4,5,7. So one proxy will block for 16 sec while another will block for 2+min. So if I try again my workload in 20 sec, I will get 1/3 of my connections fine and 2/3 throttled. There isn't much we can do about it now but that is one more reason why we shouldn't have a very high max delay as this can lead to bad experience due to a small mistake.


pkg/ccl/sqlproxyccl/throttler/local.go, line 103 at r2 (raw file):

func (s *localService) ReportAttempt(
	connection ConnectionTags, throttleTime time.Time, status AttemptStatus,

If the only two status choices are OK and Credentials it may be simpler to not have the status and instead have ReportSuccess and ReportFailure methods.


pkg/ccl/sqlproxyccl/throttler/throttle.go, line 35 at r2 (raw file):

func (l *throttle) triggerThrottle(now time.Time, maxBackoff time.Duration) {
	l.nextTime = now.Add(l.nextBackoff)
	l.nextBackoff *= 2

Do we ever reset that to initialBackoff? If I run workload init tpcc with say 48 connections and get locked for one hour and then in one hour I try cockroach sql making a mistake in my password - am I going to be locked for one more hour? My preference would be to keep it as is (not reset it) but have something like 10 sec maximum backoff.

Copy link
Collaborator Author

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @kpatron-cockroachlabs)


pkg/ccl/sqlproxyccl/authentication.go, line 35 at r2 (raw file):

Previously, darinpp wrote…

I don't think we should be reporting throttle errors as such. This gives an attacker information that we don't need to give back. Reporting authentication error when there is one but also when they are throttled prevents them from knowing that password they tried is incorrect. This also eliminates the need to have throttleHook in authenticate so this code is not needed.

The throttle hook is here to prevent leaking authentication state. authenticate does some frame level proxying back to the client. If we didn't check the throttle here, it would forward an AuthenticationOK or ErrorResponse to the client.


pkg/ccl/sqlproxyccl/proxy_handler.go, line 266 at r2 (raw file):

error. Reporting that they are throttled let's an attacked know that the password may be correct and to try later. Making all th

This is not a behavior change. It is the same throttle error we returned in previous iterations of the throttle.

I think it would be a bad user experience to report these as authentication errors. If a user misconfigures their password, then fixes it, we should show an error message indicating we didn't actually check their password.

Even if we returned an authentication error here, it would still be obvious to an attacker that we didn't actually check the credentials. The sql server responds with a AuthenticationCleartextPassword message prompting the client to send its password. Throttling before we try to dial the backend means we would send an authentication error before the client has sent us its credentials.


pkg/ccl/sqlproxyccl/proxy_handler.go, line 402 at r2 (raw file):

Previously, darinpp wrote…

If all errors are authenticate errors then we don't need this hook.

The hook would still be necessary to consult throttle state and suppress a potential AuthenticationOK response from the sql server.


pkg/ccl/sqlproxyccl/throttler/local.go, line 61 at r2 (raw file):

suppress
The exponential backoff doesn't trigger if the request was throttled. Only attempts let past the throttle trigger an exponential backoff. So the total time period of authentication errors is going to be at most double the time your script was retrying with a bad password.


pkg/ccl/sqlproxyccl/throttler/local.go, line 103 at r2 (raw file):

Previously, darinpp wrote…

If the only two status choices are OK and Credentials it may be simpler to not have the status and instead have ReportSuccess and ReportFailure methods.

I tried splitting them. They share all but one line of code and having a function handle both status allows us to pass a single throttle hook function to authenticate.


pkg/ccl/sqlproxyccl/throttler/throttle.go, line 35 at r2 (raw file):

Previously, darinpp wrote…

Do we ever reset that to initialBackoff? If I run workload init tpcc with say 48 connections and get locked for one hour and then in one hour I try cockroach sql making a mistake in my password - am I going to be locked for one more hour? My preference would be to keep it as is (not reset it) but have something like 10 sec maximum backoff.

I chose a 1 hour backoff because that is the value used by the original throttle logic. In general you won't get locked out for an hour unless you leave the script retrying for at least 30 minutes.

Copy link
Contributor

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson and @kpatron-cockroachlabs)


pkg/ccl/sqlproxyccl/authentication.go, line 35 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

The throttle hook is here to prevent leaking authentication state. authenticate does some frame level proxying back to the client. If we didn't check the throttle here, it would forward an AuthenticationOK or ErrorResponse to the client.

OK. Sounds good. Can you still merge this switch into the on below by moving the fe.Send(backendMsg) into the switch?


pkg/ccl/sqlproxyccl/proxy_handler.go, line 266 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

error. Reporting that they are throttled let's an attacked know that the password may be correct and to try later. Making all th

This is not a behavior change. It is the same throttle error we returned in previous iterations of the throttle.

I think it would be a bad user experience to report these as authentication errors. If a user misconfigures their password, then fixes it, we should show an error message indicating we didn't actually check their password.

Even if we returned an authentication error here, it would still be obvious to an attacker that we didn't actually check the credentials. The sql server responds with a AuthenticationCleartextPassword message prompting the client to send its password. Throttling before we try to dial the backend means we would send an authentication error before the client has sent us its credentials.

Sounds good.


pkg/ccl/sqlproxyccl/proxy_handler.go, line 402 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

The hook would still be necessary to consult throttle state and suppress a potential AuthenticationOK response from the sql server.

OK


pkg/ccl/sqlproxyccl/throttler/local.go, line 61 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

suppress
The exponential backoff doesn't trigger if the request was throttled. Only attempts let past the throttle trigger an exponential backoff. So the total time period of authentication errors is going to be at most double the time your script was retrying with a bad password.

I see. So a quick burst of connection requests with invalid passwords won't cause a problem but a persistent stream of invalid password connections over time will push the backoff time up.


pkg/ccl/sqlproxyccl/throttler/throttle.go, line 35 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I chose a 1 hour backoff because that is the value used by the original throttle logic. In general you won't get locked out for an hour unless you leave the script retrying for at least 30 minutes.

I see. It may be helpful to add a line to the log when the backoff gets to the one hour mark for a given tenant/IP. It should be very rare but may be a good signal for security.

@jeffswenson jeffswenson force-pushed the jeffswenson-throttle-v3 branch 3 times, most recently from 3ec03d5 to 0c3a32b Compare October 12, 2021 20:28
Copy link
Collaborator Author

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @jeffswenson, and @kpatron-cockroachlabs)


pkg/ccl/sqlproxyccl/authentication.go, line 35 at r2 (raw file):

Previously, darinpp wrote…

OK. Sounds good. Can you still merge this switch into the on below by moving the fe.Send(backendMsg) into the switch?

Done. PTAL the refactor was non-trivial. I made one small behavior change. If there is an unexpected message type during authentication, we used to forward the message, then immediately close the connection. Now we close the connection without forwarding the unexpected message.


pkg/ccl/sqlproxyccl/throttler/throttle.go, line 35 at r2 (raw file):

Previously, darinpp wrote…

I see. It may be helpful to add a line to the log when the backoff gets to the one hour mark for a given tenant/IP. It should be very rare but may be a good signal for security.

Done.

@jeffswenson jeffswenson force-pushed the jeffswenson-throttle-v3 branch from 0c3a32b to 0cd0cc1 Compare October 12, 2021 21:18
This change switches the sqlproxy connection throttling logic back to
exponential backoff. The tokenbucket approach was introduced by
PR cockroachdb#69041. There are a few behavior differences between this and the
original exponential backoff implementation.

1. The throttling logic is maintained per (ip, tenant) instead of per
   (ip). Some platform as a service provides share a single outbound ip
   address between multiple clients. These users would occasionaly see
   throttling caused by a second user sharing their IP.
2. The throttling logic was triggered before there was an authentication
   failure. It takes ~100ms-1000ms to authenticate with the tenant
   process.  Any requests that arrived after the first request, but
   before it was processed, would trigger the throttle. Now, we only
   trigger the throttle in response to an explict authorization error.

Release note: None
@jeffswenson jeffswenson force-pushed the jeffswenson-throttle-v3 branch from 0cd0cc1 to 2559b4b Compare October 13, 2021 14:16
Copy link
Contributor

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @jeffswenson, and @kpatron-cockroachlabs)

@jeffswenson
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 13, 2021

Build failed:

@jeffswenson
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 13, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sqlproxyccl: rework throttle to track ip,tenant pairs
5 participants