sqlproxyccl: Add support for limiting non root connections #99056

alyshanjahani-crl · 2023-03-20T20:15:40Z

This commit adds a cli flag --limit-non-root-conns to the sqlproxy. When set to true, this enables the sqlproxy to limit the number of non root connections to tenants. This is achieved by modifying the Directory interface to supply an additional field MaxNonRootConnections when getting tenants. And adding a watcher for tenants. The sqlproxy is then able to check if the current connection exceeds the MaxNonRootConnections limit for the tenant and can close it appropriately with a user friendly error.

This functionality is required for Serverless.
Part of: https://cockroachlabs.atlassian.net/browse/CC-9288

Release note: None

blathers-crl · 2023-03-20T20:15:45Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2023-03-20T20:15:50Z

This change is

alyshanjahani-crl

It is still TBD whether this is the route we want to go w.r.t limiting the number of connections for serverless clusters that have hit their resource limits.

I'm putting this PR mainly to get feedback on this implementation, and for others to see what a potential implementation in the sqlproxy may look like.

Reviewable status: complete! 0 of 0 LGTMs obtained

jaylim-crl

Overall approach LGTM. The only concern that I have is potentially adding all the modified tenants into memory even if we don't need it.

It is still TBD whether this is the route we want to go w.r.t limiting the number of connections for serverless clusters that have hit their resource limits.

What other alternatives do you have?

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl)

pkg/ccl/sqlproxyccl/authentication.go line 148 at r1 (raw file):

		// serve queries.
		case *pgproto3.ReadyForQuery:
			limitError := limitHook(crdbConn)

nit: would be nice to include the reasoning behind putting the limit hook here (before sending ReadyForQuery) vs after.

pkg/ccl/sqlproxyccl/proxy_handler.go line 561 at r1 (raw file):

		ctx,
		pgServerFeConn,
		io.Discard,

nit: I think we can reuse &errWriter{} here. That way, if the server writes something to the client that's unexpected, we can error out instead of letting the connection go through silently.

pkg/ccl/sqlproxyccl/tenant/directory.proto line 97 at r1 (raw file):

}

message Tenant {

nit: comments describing the Tenant message; same to all the other fields

pkg/ccl/sqlproxyccl/tenant/directory.proto line 143 at r1 (raw file):

  // WatchPods is first called, it returns notifications for all existing pods.
  rpc WatchPods(WatchPodsRequest) returns (stream WatchPodsResponse);
  rpc WatchTenants(WatchTenantsRequest) returns (stream WatchTenantsResponse);

nit: comments are missing

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 247 at r1 (raw file):

func (d *directoryCache) LookupTenant(ctx context.Context,

nit: This function is exported; need comments. See LookupTenantPods.

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 253 at r1 (raw file):

	if err != nil {
		return Tenant{}, err
	}

Two questions:

Can LookupTenant only take in a tenantID? Given that we have already validated the clusterName earlier (in order to connect to the cluster), I don't see a reason why we need to match on cluster name again here.
If (1) is valid, I think we can just call d.getEntry here with allowCreate=false.

Code quote:

	tenantID roachpb.TenantID, clusterName string) (Tenant, error) {
	// Ensure that a directory entry has been created for this tenant.
	entry, err := d.ensureDirectoryEntry(ctx, tenantID, clusterName)
	if err != nil {
		return Tenant{}, err
	}

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 255 at r1 (raw file):

	}

	return Tenant{

I'd recommend adding a ToProto() on the tenantEntry that returns this struct.

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 486 at r1 (raw file):

// sends modified events. Deleted events are of no interest since a deleted tenant would
// have no backend for the sqlproxy to connect to. Created events are of no interest since
// a tenant entry will get initialized in getEntry if it does not exist.

Regardless of implementation, this code should be able to handle all cases.

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 529 at r1 (raw file):

			// Update the directory entry for the tenant with the latest
			// information about this tenant.
			d.updateTenantEntryWithTenant(ctx, resp.Tenant)

I think we should check resp.Tenant != nil here.

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 584 at r1 (raw file):

func (d *directoryCache) updateTenantEntryWithTenant(ctx context.Context, tenant *Tenant) {
	// Ensure that a directory entry exists for this tenant.
	entry, err := d.getEntry(ctx, roachpb.MustMakeTenantID(tenant.TenantID), true /* allowCreate */)

If an entry does not exist in the directory cache, should we just ignore? Given that this sends modified events, there's a possibility where all the tenants in the host get modified, and we don't want to be creating entries for all of them here, especially when once created, they are not GC'ed. I'd imagine this should only care about tenants that are already in the cache. What do you think?

pkg/ccl/sqlproxyccl/tenant/entry.go line 96 at r1 (raw file):

		// we can take ClusterName from that.
		e.ClusterName = tenantResp.ClusterName
		e.SetMaxNonRootConnections(tenantResp.Tenant.MaxNonRootConns)

What if tenantResp.Tenant is nil?

pkg/ccl/sqlproxyccl/tenant/entry.go line 146 at r1 (raw file):

}

func (e *tenantEntry) SetMaxNonRootConnections(numNonRoot int32){

nit: These exported methods need comment.

pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go line 231 at r1 (raw file):

}

func (s *TestDirectoryServer) WatchTenants(

nit: comments

pkg/ccl/sqlproxyccl/tenantdirsvr/test_simple_directory_svr.go line 82 at r1 (raw file):

// WatchPods is a no-op for the simple directory.
//
// WatchPods implements the tenant.DirectoryServer interface.

s/WatchPods/WatchTenants

jeffswenson · 2023-03-21T14:53:45Z

pkg/ccl/sqlproxyccl/proxy_handler.go

@@ -162,11 +167,20 @@ const throttledErrorHint string = `Connection throttling is triggered by repeate
 sure the username and password are correct.
 `

+const resourceLimitedErrorHint string = `Connection limiting is triggered by insufficient resource limits. Make


We should ensure the error message is as useful as possible. Ideally this would say something like: cluster is throttled to one connection: <cluster_name> is out of storage.

If we want to keep the sqlproxy part generic, which makes sense to me, then we should let the tenant directory pick the error message.

jeffswenson · 2023-03-21T14:58:12Z

pkg/ccl/sqlproxyccl/tenant/entry.go

+	// indicate that only root connections can be made to the tenant.
+	// This field can be updated over time, so a lock must be obtained before
+	// accessing it.
+	MaxNonRootConnections struct {


nit: I think we should merge MaxNonRootConnections and pods into a single mutex protected structure. I prefer course grain locking to fine grain locking. fine grain locking seems nice on the surface, but it makes it easier to introduce race conditions and deadlocks when the code is modified in the future.

struct mu { syncutil.Mutex maxActiveConnections int32 pods []*Pod }

jeffswenson · 2023-03-21T15:02:36Z

pkg/ccl/sqlproxyccl/tenantdirsvr/test_static_directory_svr.go

+// WatchTenants allows callers to monitor for pod update events.
+//
+// WatchTenants implements the tenant.DirectoryServer interface.
+func (d *TestStaticDirectoryServer) WatchTenants(


comment: I wish we could simplify the sqlproxy -> tenantdir API. In an ideal world there would be two functions:

StartTenant(tenant): starts a pod for the tenant if none are ready.

WatchTenants() -> ([]Tenant, func(delta Tenant)): list the tenants and wait for changes. The entire tenant is updated any time the crdbtenant changes or one of its pod changes.

I don't think we need to rework this now, but it is something to consider as the API gets more complicated.

jeffswenson · 2023-03-21T15:06:44Z

pkg/ccl/sqlproxyccl/proxy_handler.go

@@ -106,6 +108,9 @@ type ProxyOptions struct {
 	ThrottleBaseDelay time.Duration
 	// DisableConnectionRebalancing disables connection rebalancing for tenants.
 	DisableConnectionRebalancing bool
+	// EnableLimitNonRootConns enables the limiting of non root connections for
+	// tenants who have that limit set on them.
+	EnableLimitNonRootConns bool


nit: consider generalizing the name of this feature to something like enable-active-connection-limit. We currently exclude the root user, but we may want to exclude more users in the future. For example: we may automatically create a cockroach-cloud-ui user and we would also want to exclude that from the connection limits.

jeffswenson · 2023-03-21T15:08:58Z

pkg/ccl/sqlproxyccl/proxy_handler.go

+	}
+
+	// If the tenant has a limit specified, count the non-root connections.
+	if tenantResp.MaxNonRootConns != -1 {


What do you think of making MaxNonRootConns nullable instead of using -1 as a sentinel value?

This commit adds a cli flag --limit-non-root-conns to the sqlproxy. When set to true, this enables the sqlproxy to limit the number of non root connections to tenants. This is achieved by modifying the Directory interface to supply an additional field MaxNonRootConnections when getting tenants. And adding a watcher for tenants. The sqlproxy is then able to check if the current connection exceeds the MaxNonRootConnections limit for the tenant and can close it appropriately with a user friendly error. This functionality is required for Serverless. Part of: https://cockroachlabs.atlassian.net/browse/CC-9288 Release note: None

alyshanjahani-crl

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @jeffswenson)

pkg/ccl/sqlproxyccl/proxy_handler.go line 533 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

What do you think of making MaxNonRootConns nullable instead of using -1 as a sentinel value?

LookupTenant returns the generated protobuf type Tenant . And protobuf fields can't be nullable AFAIK.

I suppose i could create a different type, but i don't think thats worth it.

pkg/ccl/sqlproxyccl/proxy_handler.go line 561 at r1 (raw file):

if the server writes something to the client that's unexpected, we can error out instead of letting the connection go through silently.

I'm not following. If the server sends something unexpected, wouldn't the subsequent steps in counting non root connections (expectDataRow expectCommandComplete expectReadyForQuery ) fail - and subsequently we'd just send back to the client internal server error .

Why would the connection go through?

pkg/ccl/sqlproxyccl/tenant/directory_cache.go line 253 at r1 (raw file):