sqlproxyccl: Add support for limiting non root connections #99056
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 81cac6c to 57c49ba.
It is still TBD whether this is the route we want to go w.r.t limiting the number of connections for serverless clusters that have hit their resource limits.
I'm putting this PR up mainly to get feedback on this implementation, and so that others can see what a potential implementation in the sqlproxy may look like.
Reviewable status: complete! 0 of 0 LGTMs obtained
Overall approach LGTM. The only concern that I have is potentially adding all the modified tenants into memory even if we don't need them.
It is still TBD whether this is the route we want to go w.r.t limiting the number of connections for serverless clusters that have hit their resource limits.
What other alternatives do you have?
Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl)
pkg/ccl/sqlproxyccl/authentication.go
line 148 at r1 (raw file):
// serve queries.
case *pgproto3.ReadyForQuery:
	limitError := limitHook(crdbConn)
nit: would be nice to include the reasoning behind putting the limit hook here (before sending ReadyForQuery) vs after.
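For context, a minimal sketch of the placement under discussion; the surrounding handler and the limitHook signature here are assumptions, not the actual diff:

import (
	"net"

	"github.com/jackc/pgproto3/v2"
)

// handleBackendMessage is a sketch only: the limit check runs when the backend
// announces ReadyForQuery, i.e. after authentication has succeeded but before
// the client is told it may start issuing queries, so an over-limit connection
// fails with a clean error instead of being torn down mid-session.
func handleBackendMessage(
	msg pgproto3.BackendMessage, crdbConn net.Conn, limitHook func(net.Conn) error,
) error {
	switch msg.(type) {
	case *pgproto3.ReadyForQuery:
		if err := limitHook(crdbConn); err != nil {
			return err // rejected before ReadyForQuery ever reaches the client
		}
		// ... forward ReadyForQuery to the client and finish the handshake.
	}
	return nil
}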
pkg/ccl/sqlproxyccl/proxy_handler.go
line 561 at r1 (raw file):
ctx, pgServerFeConn, io.Discard,
nit: I think we can reuse &errWriter{} here. That way, if the server writes something to the client that's unexpected, we can error out instead of letting the connection go through silently.
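A minimal illustration of that idea; this is not the package's actual errWriter definition, just the shape of it:

import "errors"

// errWriter fails every write. Passed in place of io.Discard as the
// client-side writer during the limit check, any unexpected server-to-client
// traffic surfaces as an error instead of being silently swallowed.
type errWriter struct{}

func (errWriter) Write(p []byte) (int, error) {
	return 0, errors.New("unexpected write to client during connection-limit check")
}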
pkg/ccl/sqlproxyccl/tenant/directory.proto
line 97 at r1 (raw file):
} message Tenant {
nit: add a comment describing the Tenant message; the same goes for all the other fields.
pkg/ccl/sqlproxyccl/tenant/directory.proto
line 143 at r1 (raw file):
// WatchPods is first called, it returns notifications for all existing pods.
rpc WatchPods(WatchPodsRequest) returns (stream WatchPodsResponse);
rpc WatchTenants(WatchTenantsRequest) returns (stream WatchTenantsResponse);
nit: comments are missing
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 247 at r1 (raw file):
func (d *directoryCache) LookupTenant(ctx context.Context,
nit: This function is exported and needs a comment. See LookupTenantPods.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 253 at r1 (raw file):
if err != nil { return Tenant{}, err }
Two questions:
- Can LookupTenant only take in a tenantID? Given that we have already validated the clusterName earlier (in order to connect to the cluster), I don't see a reason why we need to match on cluster name again here.
- If (1) is valid, I think we can just call d.getEntry here with allowCreate=false.
Code quote:
tenantID roachpb.TenantID, clusterName string) (Tenant, error) {
// Ensure that a directory entry has been created for this tenant.
entry, err := d.ensureDirectoryEntry(ctx, tenantID, clusterName)
if err != nil {
return Tenant{}, err
}
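Roughly what that suggestion could look like; this is a sketch that assumes getEntry returns a nil entry when allowCreate is false and the tenant is not cached, and it uses the ToProto() helper proposed later in this review:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// LookupTenant (sketch): look up by tenant ID alone and avoid creating a new
// cache entry, since the cluster name was already validated by this point.
func (d *directoryCache) LookupTenant(ctx context.Context, tenantID roachpb.TenantID) (Tenant, error) {
	entry, err := d.getEntry(ctx, tenantID, false /* allowCreate */)
	if err != nil {
		return Tenant{}, err
	}
	if entry == nil {
		return Tenant{}, errors.Newf("tenant %d not found in directory cache", tenantID.ToUint64())
	}
	return entry.ToProto(), nil
}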
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 255 at r1 (raw file):
}
return Tenant{
I'd recommend adding a ToProto() on the tenantEntry that returns this struct.
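Something along these lines, as a sketch; the field names come from snippets elsewhere in this review and the getter is assumed:

// ToProto converts the cached entry into the wire-format Tenant message so
// LookupTenant can return it directly.
func (e *tenantEntry) ToProto() Tenant {
	return Tenant{
		TenantID:        e.TenantID.ToUint64(),
		ClusterName:     e.ClusterName,
		MaxNonRootConns: e.MaxNonRootConnections(),
	}
}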
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 486 at r1 (raw file):
// sends modified events. Deleted events are of no interest since a deleted tenant would
// have no backend for the sqlproxy to connect to. Created events are of no interest since
// a tenant entry will get initialized in getEntry if it does not exist.
Regardless of what the implementation sends today, this code should be able to handle all event types.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 529 at r1 (raw file):
// Update the directory entry for the tenant with the latest
// information about this tenant.
d.updateTenantEntryWithTenant(ctx, resp.Tenant)
I think we should check resp.Tenant != nil here.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 584 at r1 (raw file):
func (d *directoryCache) updateTenantEntryWithTenant(ctx context.Context, tenant *Tenant) {
	// Ensure that a directory entry exists for this tenant.
	entry, err := d.getEntry(ctx, roachpb.MustMakeTenantID(tenant.TenantID), true /* allowCreate */)
If an entry does not exist in the directory cache, should we just ignore? Given that this sends modified events, there's a possibility where all the tenants in the host get modified, and we don't want to be creating entries for all of them here, especially when once created, they are not GC'ed. I'd imagine this should only care about tenants that are already in the cache. What do you think?
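In sketch form, that suggestion (with the resp.Tenant nil guard from the earlier comment folded in; helper names follow the snippet above):

// updateTenantEntryWithTenant (sketch): only refresh tenants that are already
// cached. A MODIFIED event for a tenant the proxy has never looked up should
// not allocate a new entry, since entries are never garbage collected.
func (d *directoryCache) updateTenantEntryWithTenant(ctx context.Context, tenant *Tenant) {
	if tenant == nil {
		return
	}
	entry, err := d.getEntry(ctx, roachpb.MustMakeTenantID(tenant.TenantID), false /* allowCreate */)
	if err != nil || entry == nil {
		// Not in the cache (or the lookup failed): nothing to update.
		return
	}
	entry.SetMaxNonRootConnections(tenant.MaxNonRootConns)
}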
pkg/ccl/sqlproxyccl/tenant/entry.go
line 96 at r1 (raw file):
// we can take ClusterName from that.
e.ClusterName = tenantResp.ClusterName
e.SetMaxNonRootConnections(tenantResp.Tenant.MaxNonRootConns)
What if tenantResp.Tenant is nil?
pkg/ccl/sqlproxyccl/tenant/entry.go
line 146 at r1 (raw file):
}

func (e *tenantEntry) SetMaxNonRootConnections(numNonRoot int32) {
nit: These exported methods need comments.
pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go
line 231 at r1 (raw file):
}

func (s *TestDirectoryServer) WatchTenants(
nit: comments
pkg/ccl/sqlproxyccl/tenantdirsvr/test_simple_directory_svr.go
line 82 at r1 (raw file):
// WatchPods is a no-op for the simple directory.
//
// WatchPods implements the tenant.DirectoryServer interface.
s/WatchPods/WatchTenants
@@ -162,11 +167,20 @@ const throttledErrorHint string = `Connection throttling is triggered by repeate
sure the username and password are correct.
`

const resourceLimitedErrorHint string = `Connection limiting is triggered by insufficient resource limits. Make
We should ensure the error message is as useful as possible. Ideally this would say something like: cluster is throttled to one connection: <cluster_name> is out of storage.
If we want to keep the sqlproxy part generic, which makes sense to me, then we should let the tenant directory pick the error message.
pkg/ccl/sqlproxyccl/tenant/entry.go
Outdated
// indicate that only root connections can be made to the tenant.
// This field can be updated over time, so a lock must be obtained before
// accessing it.
MaxNonRootConnections struct {
nit: I think we should merge MaxNonRootConnections and pods into a single mutex-protected structure. I prefer coarse-grained locking to fine-grained locking. Fine-grained locking seems nice on the surface, but it makes it easier to introduce race conditions and deadlocks when the code is modified in the future.
mu struct {
	syncutil.Mutex
	maxActiveConnections int32
	pods []*Pod
}
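With the merged structure, the setter discussed elsewhere in this review would look roughly like this (sketch):

// SetMaxNonRootConnections updates the connection limit under the same mutex
// that guards the pod list, keeping all mutable tenant state behind one lock.
func (e *tenantEntry) SetMaxNonRootConnections(n int32) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.mu.maxActiveConnections = n
}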
// WatchTenants allows callers to monitor for pod update events.
//
// WatchTenants implements the tenant.DirectoryServer interface.
func (d *TestStaticDirectoryServer) WatchTenants(
comment: I wish we could simplify the sqlproxy -> tenantdir API. In an ideal world there would be two functions:
- StartTenant(tenant): starts a pod for the tenant if none are ready.
- WatchTenants() -> ([]Tenant, func(delta Tenant)): list the tenants and wait for changes. The entire tenant is updated any time the crdb tenant changes or one of its pods changes.
I don't think we need to rework this now, but it is something to consider as the API gets more complicated.
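In Go terms, the shape being described might look something like the following; purely illustrative, none of these signatures exist today:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Directory sketches the simplified sqlproxy -> tenant directory API described
// above: one call to ensure a pod is running, and one call that both lists
// tenants and registers for whole-tenant updates.
type Directory interface {
	// StartTenant starts a pod for the tenant if none are ready.
	StartTenant(ctx context.Context, id roachpb.TenantID) error

	// WatchTenants lists the current tenants and registers a callback that is
	// invoked with the full Tenant whenever the tenant or one of its pods
	// changes.
	WatchTenants(ctx context.Context, onChange func(Tenant)) ([]Tenant, error)
}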
@@ -106,6 +108,9 @@ type ProxyOptions struct {
ThrottleBaseDelay time.Duration
// DisableConnectionRebalancing disables connection rebalancing for tenants.
DisableConnectionRebalancing bool
// EnableLimitNonRootConns enables the limiting of non root connections for
// tenants who have that limit set on them.
EnableLimitNonRootConns bool
nit: consider generalizing the name of this feature to something like enable-active-connection-limit. We currently exclude the root user, but we may want to exclude more users in the future. For example, we may automatically create a cockroach-cloud-ui user and we would also want to exclude that from the connection limits.
}

// If the tenant has a limit specified, count the non-root connections.
if tenantResp.MaxNonRootConns != -1 {
What do you think of making MaxNonRootConns nullable instead of using -1 as a sentinel value?
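One way to read "nullable" here, sketched in Go; the wrapper-based field is an assumption about how the proto could be declared, not what the PR does:

import "google.golang.org/protobuf/types/known/wrapperspb"

// tenantWithWrapper stands in for a variant of the generated Tenant message
// where the limit field is a google.protobuf.Int32Value wrapper rather than a
// plain int32 with -1 meaning "unset".
type tenantWithWrapper struct {
	MaxNonRootConns *wrapperspb.Int32Value
}

// maxNonRootConns reports the configured limit and whether one exists at all:
// presence is carried by the nil-ness of the wrapper, so no sentinel is needed.
func maxNonRootConns(t *tenantWithWrapper) (limit int32, ok bool) {
	if t == nil || t.MaxNonRootConns == nil {
		return 0, false // no limit configured
	}
	return t.MaxNonRootConns.GetValue(), true
}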
This commit adds a CLI flag --limit-non-root-conns to the sqlproxy. When set to true, this enables the sqlproxy to limit the number of non-root connections to tenants. This is achieved by modifying the Directory interface to supply an additional field, MaxNonRootConnections, when getting tenants, and by adding a watcher for tenants. The sqlproxy can then check whether the current connection exceeds the MaxNonRootConnections limit for the tenant and close it with a user-friendly error. This functionality is required for Serverless. Part of: https://cockroachlabs.atlassian.net/browse/CC-9288 Release note: None
Force-pushed from 57c49ba to 7dc9fee.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @jeffswenson)
pkg/ccl/sqlproxyccl/proxy_handler.go
line 533 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
What do you think of making MaxNonRootConns nullable instead of using -1 as a sentinel value?
LookupTenant returns the generated protobuf type Tenant, and protobuf fields can't be nullable AFAIK. I suppose I could create a different type, but I don't think that's worth it.
pkg/ccl/sqlproxyccl/proxy_handler.go
line 561 at r1 (raw file):
if the server writes something to the client that's unexpected, we can error out instead of letting the connection go through silently.
I'm not following. If the server sends something unexpected, wouldn't the subsequent steps in counting non-root connections (expectDataRow, expectCommandComplete, expectReadyForQuery) fail, and we'd then just send internal server error back to the client? Why would the connection go through?
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 253 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
Two questions:
- Can LookupTenant only take in a tenantID? Given that we have already validated the clusterName earlier (in order to connect to the cluster), I don't see a reason why we need to match on cluster name again here.
- If (1) is valid, I think we can just call d.getEntry here with allowCreate=false.
Yes; however, I don't want to bake in the assumption that LookupTenant will always be called after LookupTenantPods.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 255 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I'd recommend adding a ToProto() on the tenantEntry that returns this struct.
Done.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 529 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I think we should check resp.Tenant != nil here.
Done in updateTenantEntryWithTenant
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 584 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
If an entry does not exist in the directory cache, should we just ignore? Given that this sends modified events, there's a possibility where all the tenants in the host get modified, and we don't want to be creating entries for all of them here, especially when once created, they are not GC'ed. I'd imagine this should only care about tenants that are already in the cache. What do you think?
Yes, changed to set allowCreate to false.
pkg/ccl/sqlproxyccl/tenant/entry.go
line 96 at r1 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
What if tenantResp.Tenant is nil?
Oh, I didn't realize that it could be... I thought protobuf fields always defaulted to zero values if not present.
What other alternatives do you have?
So I discussed with @rafiss a bit and came across this cluster setting that limits connections per gateway/node (#76401).
So another approach would be to use a similar cluster setting; see #100200.
In that approach, on the Cockroach Cloud side we would limit the maximum number of SQL pods for a tenant and use that new cluster setting to limit the non-root connections per pod.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl and @jeffswenson)
Closing, as we went with the cluster setting approach (#100200).
This commit adds a CLI flag --limit-non-root-conns to the sqlproxy. When set to true, this enables the sqlproxy to limit the number of non-root connections to tenants. This is achieved by modifying the Directory interface to supply an additional field, MaxNonRootConnections, when getting tenants, and by adding a watcher for tenants. The sqlproxy can then check whether the current connection exceeds the MaxNonRootConnections limit for the tenant and close it with a user-friendly error.
This functionality is required for Serverless.
Part of: https://cockroachlabs.atlassian.net/browse/CC-9288
Release note: None
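For reference, a rough sketch of the check the description outlines; apart from MaxNonRootConns, resourceLimitedErrorHint, and the LookupTenant signature, every name below is an assumption rather than the PR's actual code:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/tenant"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/errors"
)

// checkConnectionLimit (sketch): after authentication, fetch the tenant's
// metadata from the directory cache and reject a non-root connection once the
// tenant has used up its limit. -1 is the "no limit configured" sentinel used
// in this PR.
func checkConnectionLimit(
	ctx context.Context,
	dir tenant.DirectoryCache,
	tenantID roachpb.TenantID,
	clusterName, user string,
	activeNonRoot int32,
) error {
	t, err := dir.LookupTenant(ctx, tenantID, clusterName)
	if err != nil {
		return err
	}
	if user == "root" || t.MaxNonRootConns == -1 {
		return nil
	}
	if activeNonRoot >= t.MaxNonRootConns {
		return errors.WithHint(errors.New("connection limit reached"), resourceLimitedErrorHint)
	}
	return nil
}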