
sqlproxyccl: Add support for limiting non root connections #99056

Conversation

@alyshanjahani-crl (Collaborator)

This commit adds a CLI flag --limit-non-root-conns to the sqlproxy. When set to true, it enables the sqlproxy to limit the number of non-root connections to tenants. This is achieved by modifying the Directory interface to supply an additional field, MaxNonRootConnections, when getting tenants, and by adding a watcher for tenants. The sqlproxy can then check whether the current connection exceeds the tenant's MaxNonRootConnections limit and close it with a user-friendly error.

This functionality is required for Serverless.
Part of: https://cockroachlabs.atlassian.net/browse/CC-9288

Release note: None
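
For illustration, the check described above amounts to something like the following sketch. All names here are hypothetical (the PR itself wires the check in through a hook in the authentication flow):

package sketch

import "fmt"

// Tenant mirrors the directory's view of a tenant (hypothetical shape).
type Tenant struct {
	ClusterName     string
	MaxNonRootConns int32 // -1 means no limit
}

// checkNonRootConnLimit returns an error when admitting a new non-root
// connection would exceed the tenant's limit.
func checkNonRootConnLimit(t Tenant, user string, activeNonRoot int32) error {
	if user == "root" || t.MaxNonRootConns == -1 {
		return nil
	}
	if activeNonRoot >= t.MaxNonRootConns {
		return fmt.Errorf("cluster %s is limited to %d non-root connections",
			t.ClusterName, t.MaxNonRootConns)
	}
	return nil
}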

@alyshanjahani-crl requested review from a team as code owners on March 20, 2023 at 20:15
blathers-crl bot commented Mar 20, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented

This change is Reviewable

@alyshanjahani-crl (Collaborator, Author) left a comment

It is still TBD whether this is the route we want to go w.r.t. limiting the number of connections for serverless clusters that have hit their resource limits.

I'm putting this PR up mainly to get feedback on this implementation, and so others can see what a potential implementation in the sqlproxy might look like.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@jaylim-crl (Collaborator) left a comment

Overall approach LGTM. The only concern I have is that we could end up adding all the modified tenants into memory even when we don't need them.

It is still TBD whether this is the route we want to go w.r.t limiting the number of connections for serverless clusters that have hit their resource limits.

What other alternatives do you have?

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl)


pkg/ccl/sqlproxyccl/authentication.go line 148 at r1 (raw file):

		// serve queries.
		case *pgproto3.ReadyForQuery:
			limitError := limitHook(crdbConn)

nit: it would be nice to include the reasoning behind putting the limit hook here (before sending ReadyForQuery) vs. after it.


pkg/ccl/sqlproxyccl/proxy_handler.go line 561 at r1 (raw file):

		ctx,
		pgServerFeConn,
		io.Discard,

nit: I think we can reuse &errWriter{} here. That way, if the server writes something to the client that's unexpected, we can error out instead of letting the connection go through silently.
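
A minimal sketch of that idea, assuming errWriter is defined along these lines inside the package (the package's actual errWriter may differ; imports such as fmt elided):

// errWriter fails on any write, so unexpected server output surfaces
// as an error instead of being silently discarded.
type errWriter struct{}

func (errWriter) Write(p []byte) (int, error) {
	return 0, fmt.Errorf("unexpected write to client: %q", p)
}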


pkg/ccl/sqlproxyccl/tenant/directory.proto line 97 at r1 (raw file):

}

message Tenant {

nit: add comments describing the Tenant message; the same goes for all the other fields.


pkg/ccl/sqlproxyccl/tenant/directory.proto line 143 at r1 (raw file):

  // WatchPods is first called, it returns notifications for all existing pods.
  rpc WatchPods(WatchPodsRequest) returns (stream WatchPodsResponse);
  rpc WatchTenants(WatchTenantsRequest) returns (stream WatchTenantsResponse);

nit: comments are missing


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 247 at r1 (raw file):

func (d *directoryCache) LookupTenant(ctx context.Context,

nit: this function is exported and needs a comment. See LookupTenantPods.


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 253 at r1 (raw file):

	if err != nil {
		return Tenant{}, err
	}

Two questions:

  1. Can LookupTenant take in only a tenantID? Given that we have already validated the clusterName earlier (in order to connect to the cluster), I don't see a reason why we need to match on the cluster name again here.
  2. If (1) is valid, I think we can just call d.getEntry here with allowCreate=false (see the sketch after the code quote).

Code quote:

	tenantID roachpb.TenantID, clusterName string) (Tenant, error) {
	// Ensure that a directory entry has been created for this tenant.
	entry, err := d.ensureDirectoryEntry(ctx, tenantID, clusterName)
	if err != nil {
		return Tenant{}, err
	}
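
A sketch of what (1) and (2) might look like, assuming getEntry returns a nil entry when allowCreate is false and the tenant is not cached (ToProto is the helper suggested in the next comment; this reuses the surrounding package's imports):

func (d *directoryCache) LookupTenant(
	ctx context.Context, tenantID roachpb.TenantID,
) (Tenant, error) {
	entry, err := d.getEntry(ctx, tenantID, false /* allowCreate */)
	if err != nil {
		return Tenant{}, err
	}
	if entry == nil {
		return Tenant{}, errors.New("tenant not found")
	}
	return entry.ToProto(), nil
}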

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 255 at r1 (raw file):

	}

	return Tenant{

I'd recommend adding a ToProto() on the tenantEntry that returns this struct.
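
Something along these lines, assuming tenantEntry carries these fields (the getMaxNonRootConnections accessor is hypothetical):

// ToProto converts the cache entry into the generated Tenant message.
func (e *tenantEntry) ToProto() Tenant {
	return Tenant{
		TenantID:        e.TenantID.ToUint64(),
		ClusterName:     e.ClusterName,
		MaxNonRootConns: e.getMaxNonRootConnections(),
	}
}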


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 486 at r1 (raw file):

// sends modified events. Deleted events are of no interest since a deleted tenant would
// have no backend for the sqlproxy to connect to. Created events are of no interest since
// a tenant entry will get initialized in getEntry if it does not exist.

Regardless of implementation, this code should be able to handle all cases.


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 529 at r1 (raw file):

			// Update the directory entry for the tenant with the latest
			// information about this tenant.
			d.updateTenantEntryWithTenant(ctx, resp.Tenant)

I think we should check resp.Tenant != nil here.


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 584 at r1 (raw file):

func (d *directoryCache) updateTenantEntryWithTenant(ctx context.Context, tenant *Tenant) {
	// Ensure that a directory entry exists for this tenant.
	entry, err := d.getEntry(ctx, roachpb.MustMakeTenantID(tenant.TenantID), true /* allowCreate */)

If an entry does not exist in the directory cache, should we just ignore? Given that this sends modified events, there's a possibility where all the tenants in the host get modified, and we don't want to be creating entries for all of them here, especially when once created, they are not GC'ed. I'd imagine this should only care about tenants that are already in the cache. What do you think?
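
A sketch of that behavior, assuming getEntry returns a nil entry when allowCreate is false and the tenant is not cached:

entry, err := d.getEntry(ctx, roachpb.MustMakeTenantID(tenant.TenantID), false /* allowCreate */)
if err != nil || entry == nil {
	return // tenant is not in the cache; nothing to update
}
entry.SetMaxNonRootConnections(tenant.MaxNonRootConns)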


pkg/ccl/sqlproxyccl/tenant/entry.go line 96 at r1 (raw file):

		// we can take ClusterName from that.
		e.ClusterName = tenantResp.ClusterName
		e.SetMaxNonRootConnections(tenantResp.Tenant.MaxNonRootConns)

What if tenantResp.Tenant is nil?


pkg/ccl/sqlproxyccl/tenant/entry.go line 146 at r1 (raw file):

}

func (e *tenantEntry) SetMaxNonRootConnections(numNonRoot int32) {

nit: these exported methods need comments.


pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go line 231 at r1 (raw file):

}

func (s *TestDirectoryServer) WatchTenants(

nit: comments


pkg/ccl/sqlproxyccl/tenantdirsvr/test_simple_directory_svr.go line 82 at r1 (raw file):

// WatchPods is a no-op for the simple directory.
//
// WatchPods implements the tenant.DirectoryServer interface.

s/WatchPods/WatchTenants

@@ -162,11 +167,20 @@ const throttledErrorHint string = `Connection throttling is triggered by repeate
sure the username and password are correct.
`

const resourceLimitedErrorHint string = `Connection limiting is triggered by insufficient resource limits. Make
A Collaborator left a comment

We should ensure the error message is as useful as possible. Ideally this would say something like: cluster is throttled to one connection: <cluster_name> is out of storage.

If we want to keep the sqlproxy part generic, which makes sense to me, then we should let the tenant directory pick the error message.

// indicate that only root connections can be made to the tenant.
// This field can be updated over time, so a lock must be obtained before
// accessing it.
MaxNonRootConnections struct {
A Collaborator left a comment

nit: I think we should merge MaxNonRootConnections and pods into a single mutex-protected structure. I prefer coarse-grained locking to fine-grained locking. Fine-grained locking seems nice on the surface, but it makes it easier to introduce race conditions and deadlocks when the code is modified in the future.

mu struct {
	syncutil.Mutex
	maxActiveConnections int32
	pods                 []*Pod
}

// WatchTenants allows callers to monitor for tenant update events.
//
// WatchTenants implements the tenant.DirectoryServer interface.
func (d *TestStaticDirectoryServer) WatchTenants(
A Collaborator left a comment

comment: I wish we could simplify the sqlproxy -> tenantdir API. In an ideal world there would be two functions:

  1. StartTenant(tenant): starts a pod for the tenant if none are ready.
  2. WatchTenants() -> ([]Tenant, func(delta Tenant)): list the tenants and wait for changes. The entire tenant is updated any time the crdb tenant changes or one of its pods changes.

I don't think we need to rework this now, but it is something to consider as the API gets more complicated.
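
As a rough Go sketch of that shape (names and signatures hypothetical; assumes the surrounding package's imports):

type Directory interface {
	// StartTenant starts a pod for the tenant if none are ready.
	StartTenant(ctx context.Context, id roachpb.TenantID) error
	// WatchTenants returns the current tenants and invokes onChange
	// whenever a tenant or one of its pods changes.
	WatchTenants(ctx context.Context, onChange func(delta Tenant)) ([]Tenant, error)
}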

@@ -106,6 +108,9 @@ type ProxyOptions struct {
ThrottleBaseDelay time.Duration
// DisableConnectionRebalancing disables connection rebalancing for tenants.
DisableConnectionRebalancing bool
// EnableLimitNonRootConns enables the limiting of non-root connections for
// tenants that have that limit set on them.
EnableLimitNonRootConns bool
A Collaborator left a comment

nit: consider generalizing the name of this feature to something like enable-active-connection-limit. We currently exclude the root user, but we may want to exclude more users in the future. For example, we may automatically create a cockroach-cloud-ui user, and we would also want to exclude it from the connection limits.

}

// If the tenant has a limit specified, count the non-root connections.
if tenantResp.MaxNonRootConns != -1 {
JeffSwenson (Collaborator) left a comment

What do you think of making MaxNonRootConns nullable instead of using -1 as a sentinel value?
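
For illustration, on the Go side a nullable limit would look roughly like this, with nil standing in for "no limit" instead of -1 (hypothetical shape):

package sketch

// Tenant is a hypothetical shape: nil means the tenant has no limit.
type Tenant struct {
	MaxNonRootConns *int32
}

// connLimit reports the tenant's limit, if any.
func connLimit(t Tenant) (int32, bool) {
	if t.MaxNonRootConns == nil {
		return 0, false
	}
	return *t.MaxNonRootConns, true
}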

@alyshanjahani-crl (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @jeffswenson)


pkg/ccl/sqlproxyccl/proxy_handler.go line 533 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

What do you think of making MaxNonRootConns nullable instead of using -1 as a sentinel value?

LookupTenant returns the generated protobuf type Tenant, and protobuf fields can't be nullable AFAIK.

I suppose I could create a different type, but I don't think that's worth it.


pkg/ccl/sqlproxyccl/proxy_handler.go line 561 at r1 (raw file):

if the server writes something to the client that's unexpected, we can error out instead of letting the connection go through silently.

I'm not following. If the server sends something unexpected, wouldn't the subsequent steps in counting non-root connections (expectDataRow, expectCommandComplete, expectReadyForQuery) fail, and we'd just send an internal server error back to the client?

Why would the connection go through?


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 253 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

Two questions:

  1. Can LookupTenant only take in a tenantID? Given that we have already validated the clusterName earlier (in order to connect to the cluster), I don't see a reason why we need to match on cluster name again here.
  2. If (1) is valid, I think we can just call d.getEntry here with allowCreate=false.

Yes; however, I don't want to bake in the assumption that LookupTenant will always be called after LookupTenantPods.


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 255 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I'd recommend adding a ToProto() on the tenantEntry that returns this struct.

Done.


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 529 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I think we should check resp.Tenant != nil here.

Done in updateTenantEntryWithTenant.


pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 584 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

If an entry does not exist in the directory cache, should we just ignore? Given that this sends modified events, there's a possibility where all the tenants in the host get modified, and we don't want to be creating entries for all of them here, especially when once created, they are not GC'ed. I'd imagine this should only care about tenants that are already in the cache. What do you think?

Yes, changed to set allowCreate to false.


pkg/ccl/sqlproxyccl/tenant/entry.go line 96 at r1 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

What if tenantResp.Tenant is nil?

Oh, I didn't realize that it could be... I thought protobuf fields always defaulted to zero values if not present.

@alyshanjahani-crl (Collaborator, Author) left a comment

What other alternatives do you have?

I discussed this with @rafiss a bit and came across a cluster setting that limits connections per gateway/node (#76401).

Another approach would be to use a similar cluster setting; see #100200.

In that approach, on the Cockroach Cloud side we would limit the max SQL pods for a tenant and use that new cluster setting to limit the non-root connections per pod.
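
For reference, the per-gateway limit from #76401 is a cluster setting roughly along these lines (sketched from memory using the settings package; see that PR for the exact definition):

// A negative value means no limit; root connections are exempt.
var maxConnsPerGateway = settings.RegisterIntSetting(
	settings.TenantWritable,
	"server.max_connections_per_gateway",
	"maximum number of non-root SQL connections allowed per gateway",
	-1,
)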

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @jeffswenson)

@alyshanjahani-crl (Collaborator, Author)

Closing, as we went with the cluster setting approach (#100200).
