rpc: implement tenant access control policies at KV RPC boundary #52094
Reviewed 22 of 24 files at r1, 7 of 7 files at r3, 29 of 29 files at r4, 28 of 28 files at r5, 3 of 5 files at r7, 1 of 1 files at r9, 9 of 14 files at r10, 21 of 21 files at r11, 2 of 2 files at r12, 8 of 8 files at r13, 6 of 6 files at r14.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @nvanbenschoten)
pkg/rpc/auth.go, line 95 at r14 (raw file):
        }
    } else {
        // TODO DURING REVIEW: is this a typo or intentional?
This must be in error - we're only calling requireSuperUser when !Insecure, so in that case certainly it should insist on there being TLSInfo?
I wonder if this can be exploited. Probably not, unless we allow HTTP connections to get this far somehow, which I don't think we do.
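To make the concern concrete, here is a minimal sketch of the shape the check could take. Everything in it is a hypothetical stand-in (`peerInfo`, `checkSuperUser`, the `insecure` flag, and the assumed `node`/`root` user names); it is not the code in pkg/rpc/auth.go. The only point is that in secure mode a peer without TLS info is rejected instead of falling through.

```go
package main

import (
	"errors"
	"fmt"
)

// peerInfo is a hypothetical stand-in for the connection metadata the RPC
// layer inspects. A nil certCommonName models a peer without TLS info.
type peerInfo struct {
	certCommonName *string
}

var errNoTLSInfo = errors.New("no TLS info found on secure connection")

// checkSuperUser sketches the restructured check: when the server is not
// running insecure, missing TLS info is an error rather than a silent pass.
func checkSuperUser(insecure bool, p peerInfo) error {
	if insecure {
		return nil
	}
	if p.certCommonName == nil {
		return errNoTLSInfo
	}
	// Assumption: only the "node" and "root" users count as superusers here.
	if cn := *p.certCommonName; cn != "node" && cn != "root" {
		return fmt.Errorf("user %q is not allowed to perform this RPC", cn)
	}
	return nil
}

func main() {
	root := "root"
	fmt.Println(checkSuperUser(false, peerInfo{}))
	fmt.Println(checkSuperUser(false, peerInfo{certCommonName: &root}))
}
```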
pkg/rpc/auth_tenant.go, line 86 at r14 (raw file):
        return roachpb.TenantID{}, authErrorf("could not parse tenant ID from Common Name (CN): %s", err)
    }
    if tenID < roachpb.MinTenantID.ToUint64() || tenID > roachpb.MaxTenantID.ToUint64() {
Since I'm seeing this now, I noticed that throughout SQL DInt is an int64, so won't we break things if we actually create a MaxUint64 tenant?
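A small sketch of the width mismatch being asked about. The helper `fitsInInt64` is hypothetical and only illustrates Go's conversion behavior; it says nothing about how the SQL layer actually handles the value.

```go
package main

import (
	"fmt"
	"math"
)

// fitsInInt64 reports whether an unsigned tenant ID survives a round trip
// through int64, the width the comment attributes to DInt.
func fitsInInt64(id uint64) bool {
	return id <= math.MaxInt64
}

func main() {
	var id uint64 = math.MaxUint64
	// Converting a uint64 variable to int64 wraps instead of failing.
	fmt.Println(int64(id))       // prints -1
	fmt.Println(fitsInInt64(id)) // prints false
}
```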
pkg/rpc/auth_tenant.go, line 132 at r14 (raw file):
            }
        }
        return authErrorf("requested key span %s not fully contained in tenant keypace %s", rSpan, tenSpan)
space
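For readers following along, the check that produces this error is a span containment test. Below is a minimal sketch, assuming half-open byte-string spans; the `span` type, `validateSpan`, and the literal key prefixes are hypothetical and much simpler than roachpb.RSpan and the real tenant key encoding.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a hypothetical half-open key range [key, endKey).
type span struct {
	key, endKey []byte
}

// contains reports whether s fully contains o.
func (s span) contains(o span) bool {
	return bytes.Compare(s.key, o.key) <= 0 &&
		bytes.Compare(o.endKey, s.endKey) <= 0
}

// validateSpan returns an error when the requested span is not fully
// inside the tenant's keyspace.
func validateSpan(tenSpan, req span) error {
	if tenSpan.contains(req) {
		return nil
	}
	return fmt.Errorf(
		"requested key span [%q, %q) not fully contained in tenant keyspace [%q, %q)",
		req.key, req.endKey, tenSpan.key, tenSpan.endKey)
}

func main() {
	// Illustrative prefixes only; real tenant keys are binary-encoded.
	ten := span{key: []byte("/tenant/5/"), endKey: []byte("/tenant/6/")}
	inside := span{key: []byte("/tenant/5/a"), endKey: []byte("/tenant/5/b")}
	outside := span{key: []byte("/tenant/5/a"), endKey: []byte("/tenant/7/")}
	fmt.Println(validateSpan(ten, inside))
	fmt.Println(validateSpan(ten, outside))
}
```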
pkg/rpc/auth_tenant.go, line 139 at r14 (raw file):
    var batchSpanAllowlist = []roachpb.RSpan{
        // TODO(nvanbenschoten): Explore whether we can get rid of this by no longer
        // reading this key in sqlServer.start.
I think the only use of this in SQL is the sqlmigration to populate the version setting, which we will never have to run from a SQL tenant:
cockroach/pkg/sqlmigrations/migrations.go
Lines 108 to 113 in 9547502
    {
        // Introduced in v1.1. Permanent migration.
        name: "populate initial version cluster setting table entry",
        workFn: populateVersionSetting,
        clusterWide: true,
    },
I'm not sure how we ever get past this line though:
cockroach/pkg/sqlmigrations/migrations.go
Line 1684 in 9547502
    ctx, "set-setting", "SET CLUSTER SETTING version = $1", v.String(),
Surely that should fail?
[email protected]:46257/defaultdb> set cluster setting version='foo';
ERROR: only the system tenant can SET CLUSTER SETTING
pkg/rpc/auth_tenant.go, line 160 at r14 (raw file):
            }
        }
        return authErrorf("requested key %s not fully contained in tenant keypace %s", args.Key, tenSpan)
space
pkg/rpc/auth_tenant.go, line 210 at r14 (raw file):
    var gossipSubscriptionPatternAllowlist = []string{
        "node:.*",
        "system-db",
Do we have an issue for narrowing this down?
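As a rough illustration of how coarse such an allowlist is, here is a sketch that assumes each requested subscription pattern must match an allowlist entry by exact string comparison. That matching rule, and the names `patternAllowlist` and `checkPatterns`, are assumptions for illustration, not a description of the tenantAuth code.

```go
package main

import "fmt"

// patternAllowlist mirrors the two entries quoted above.
var patternAllowlist = []string{"node:.*", "system-db"}

// checkPatterns rejects the request if any requested pattern is not
// literally present in the allowlist (assumed policy: exact match).
func checkPatterns(requested []string) error {
	for _, pat := range requested {
		allowed := false
		for _, a := range patternAllowlist {
			if pat == a {
				allowed = true
				break
			}
		}
		if !allowed {
			return fmt.Errorf("requested pattern %q not permitted", pat)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkPatterns([]string{"node:.*", "system-db"}))
	fmt.Println(checkPatterns([]string{"store:.*"}))
}
```

Under this assumption a tenant can still subscribe to every `node:.*` key, which is why narrowing the list is worth tracking.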
pkg/sql/drop_index.go, line 456 at r12 (raw file):
    // Unsplit all manually split ranges in the index so they can be
    // automatically merged by the merge queue. Gate this on being the
    // system tenant because secondary tenants aren't allowed to scan
Are secondary tenants allowed manual splits? I suspect they should not. If the answer is yes and it's not trivial to fix, could you file an issue?
23017e8
to
ca28f07
Compare
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @miretskiy, and @tbg)
pkg/rpc/auth.go, line 95 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This must be in error - we're only calling requireSuperUser when !Insecure, so in that case certainly it should insist on there being TLSInfo?
I wonder if this can be exploited. Probably not, unless we allow HTTP connections to get this far somehow, which I don't think we do.
Addressed in new commit.
pkg/rpc/auth_tenant.go, line 86 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Since I'm seeing this now, I noticed that throughout SQL DInt is an int64, so won't we break things if we actually create a MaxUint64 tenant?
I don't think anything would break, but it might be impossible to actually create one of these MaxUint64 tenants:
cockroach/pkg/sql/sem/builtins/builtins.go
Lines 3280 to 3283 in 986e308
    sTenID := int64(tree.MustBeDInt(args[0]))
    if sTenID <= 0 {
        return nil, pgerror.New(pgcode.InvalidParameterValue, "tenant ID must be positive")
    }
Maybe that's enough of a reason to define MaxTenantID as MakeTenantID(math.MaxInt64). What do you think?
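A sketch of what the proposed bound would look like at the point where the tenant ID is parsed from the certificate's Common Name. The constants and `tenantIDFromCommonName` are hypothetical; in particular, the minimum of 2 assumes that ID 1 is reserved for the system tenant.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

const (
	// Assumption: 1 is the system tenant, so secondary tenants start at 2.
	minTenantID uint64 = 2
	// The proposal: cap tenant IDs at MaxInt64 so they always fit in an int64.
	maxTenantID uint64 = math.MaxInt64
)

// tenantIDFromCommonName parses a decimal tenant ID and enforces the bounds.
func tenantIDFromCommonName(cn string) (uint64, error) {
	id, err := strconv.ParseUint(cn, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("could not parse tenant ID from Common Name (CN): %s", err)
	}
	if id < minTenantID || id > maxTenantID {
		return 0, fmt.Errorf("invalid tenant ID %d in Common Name (CN)", id)
	}
	return id, nil
}

func main() {
	fmt.Println(tenantIDFromCommonName("10"))
	fmt.Println(tenantIDFromCommonName("18446744073709551615"))
}
```

With this cap, an ID that the SQL builtin would see as negative after the int64 conversion is already rejected at the RPC boundary.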
pkg/rpc/auth_tenant.go, line 132 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
space
Done.
pkg/rpc/auth_tenant.go, line 139 at r14 (raw file):
I'm not sure how we ever get past this line though:
That's a clusterWide migration, so secondary tenants don't run it. So we don't have to worry about that one.
However, there's also this use, which seems harder to avoid:
cockroach/pkg/server/server_sql.go
Lines 672 to 677 in b380060
    var bootstrapVersion roachpb.Version
    if err := s.execCfg.DB.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
        return txn.GetProto(ctx, keys.BootstrapVersionKey, &bootstrapVersion)
    }); err != nil {
        return err
    }
The bootstrapVersion there is plugged in to determine which SQL migrations to run based on each migration's includedInBootstrap value. I don't see an obvious way to avoid this. But it raises an interesting question: does the cluster's global bootstrap version even apply to secondary tenants? I think the answer is: kind of. The cluster's global bootstrap version places a lower bound on the binary version of the node that created the tenant, but it doesn't place an upper bound. So for example, even if the cluster was bootstrapped at version 20.2, a tenant may be created by a 21.1 binary, whose MetadataSchema may already bake in a migration that was introduced in 21.1.
So I think the "proper" fix for this would be to actually introduce a tenant-scoped bootstrapVersion key that is written when the tenant is created using the binary version of the gateway that is creating the tenant. A tenant would then consult its own bootstrap key.
But I don't really want to do that. And I don't think we need to. This whole bootstrapVersion was introduced in #41914, and it seems like an optimization. Each of these migrations should be idempotent, so they should be fine to run even if already baked-in. So we actually have options here. We could:
- continue using the cluster-wide BootstrapVersionKey as a lower bound for the bootstrap version of each individual tenant with an understanding (and some updated exposition in comments) that this bootstrapVersion is no longer exact.
- set bootstrapVersion to a hardcoded value (VersionStart20_2?) and accept that we'll run a few (idempotent) baked in migrations. This would allow us to get rid of this allowlist entirely, so I'm leaning towards that approach. The only issue is that it causes issues in the TestTenantLogic/3node-tenant/jobs and TestTenantLogic/3node-tenant/truncate logictests because we now see extra jobs for secondary tenants. So we'd need to disable those tests or at least add some conditional logic into them, which isn't yet supported in logictests (cc. @asubiotto).
I'm curious if you have opinions here.
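To illustrate why a lower bound on bootstrapVersion is safe, here is a simplified sketch of the skip decision. The `version` and `migration` types and `shouldRun` are hypothetical reductions of the real migration manager; the assumed rule is that a migration is skipped only when the bootstrap version is at or above its includedInBootstrap version.

```go
package main

import "fmt"

// version is a hypothetical two-component version, e.g. 20.2 or 21.1.
type version struct{ major, minor int32 }

// less reports whether v sorts strictly before o.
func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	return v.minor < o.minor
}

// migration carries the version at which it was baked into bootstrap.
// The zero value means it was never baked in.
type migration struct {
	name                string
	includedInBootstrap version
}

// shouldRun skips a migration only if the data was bootstrapped at a
// version that already includes it.
func shouldRun(m migration, bootstrapVersion version) bool {
	if m.includedInBootstrap == (version{}) {
		return true
	}
	return bootstrapVersion.less(m.includedInBootstrap)
}

func main() {
	m := migration{name: "example", includedInBootstrap: version{21, 1}}
	// Exact bootstrap version of a tenant created by a 21.1 binary.
	fmt.Println(shouldRun(m, version{21, 1})) // prints false
	// Lower bound (20.2) for the same tenant: the migration runs again.
	fmt.Println(shouldRun(m, version{20, 2})) // prints true
}
```

A bound that is too low can only turn a skip into a run, never the reverse, so the cost is redundant work as long as every migration is idempotent.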
pkg/rpc/auth_tenant.go, line 160 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
space
Done.
pkg/rpc/auth_tenant.go, line 210 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Do we have an issue for narrowing this down?
No, just this TODO. I'll open one.
EDIT: #52361.
pkg/sql/drop_index.go, line 456 at r12 (raw file):
Are secondary tenants allowed manual splits?
They shouldn't be, but they're not prevented from doing so right now. I opened #52360 to track this.
This requires us to scan the meta ranges, which is not something that secondary tenants are allowed to do. Leaving the ranges split until their sticky bit expires is fine, the existing code is just an optimization.
This does serve as a reminder that:
1. we should make sure to remove sticky bits when we destroy a tenant's keyspace (see cockroachdb#48775).
2. we should probably not allow tenants to run `ALTER TABLE ... SPLIT AT` statements because we're not placing hard range count limits on tenants anywhere else. If others agree, I'll file an issue.
This was badly needed in a number of places. I found myself reaching for it again, so I figured it was time to add it.
Fixes cockroachdb#47898.
Rebased on cockroachdb#51503 and cockroachdb#52034. Ignore all but the last 3 commits.
This commit adds a collection of access control policies for the newly exposed tenant RPC server. These authorization policies ensure that an authenticated tenant is only able to access keys within its keyspace and that no tenant is able to access data from another tenant's keyspace through the tenant RPC server. This is a major step in providing crypto-backed logical isolation between tenants in a multi-tenant cluster.
The existing auth mechanism is retained on the standard RPC server, which means that the system tenant is still able to access any key in the system.
It's unclear whether this was exploitable. It probably wasn't because we should have already insisted on TLSInfo being present, but it can't hurt to restructure the code to prevent this kind of bug.
ca28f07
to
44983c4
Compare
There are still two open questions about the TFTR! bors r+
Build succeeded:
See cockroachdb#52094 (review).
We don't currently track the bootstrap version of each secondary tenant. For this to be meaningful, we'd need to record the binary version of the SQL gateway that processed the crdb_internal.create_tenant function which created the tenant, as this is what dictates the MetadataSchema that was in effect when the secondary tenant was constructed. This binary version very well may differ from the cluster-wide bootstrap version at which the system tenant was bootstrapped.
Since we don't record this version anywhere, we do the next-best thing and pass a lower-bound on the bootstrap version. We know that no tenants could have been created before the start of the v20.2 dev cycle, so we pass VersionStart20_2. bootstrapVersion is only used to avoid performing superfluous but necessarily idempotent SQL migrations, so at worst, we're doing more work than strictly necessary during the first time that the migrations are run.
Now that we don't query BootstrapVersionKey, we don't need to have it in the allowlists in the tenantAuth policy for Batch and RangeLookup RPCs.
52595: sql: don't query BootstrapVersionKey on tenant SQL startup r=nvanbenschoten a=nvanbenschoten
See #52094 (review).
We don't currently track the bootstrap version of each secondary tenant. For this to be meaningful, we'd need to record the binary version of the SQL gateway that processed the crdb_internal.create_tenant function which created the tenant, as this is what dictates the MetadataSchema that was in effect when the secondary tenant was constructed. This binary version very well may differ from the cluster-wide bootstrap version at which the system tenant was bootstrapped.
Since we don't record this version anywhere, we do the next-best thing and pass a lower-bound on the bootstrap version. We know that no tenants could have been created before the start of the v20.2 dev cycle, so we pass VersionStart20_2. bootstrapVersion is only used to avoid performing superfluous but necessarily idempotent SQL migrations, so at worst, we're doing more work than strictly necessary during the first time that the migrations are run.
Now that we don't query BootstrapVersionKey, we don't need to have it in the allowlists in the tenantAuth policy for Batch and RangeLookup RPCs.
52616: CODEOWNERS: add notification patterns for SQL syntax and APIs r=rohany a=knz
Release note: None
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>