rpc: implement tenant access control policies at KV RPC boundary #52094

Merged

Conversation

nvanbenschoten (Member)

Fixes #47898.

Rebased on #51503 and #52034. Ignore all but the last 3 commits.

This commit adds a collection of access control policies for the newly exposed tenant RPC server. These authorization policies ensure that an authenticated tenant is only able to access keys within its keyspace and that no tenant is able to access data from another tenant's keyspace through the tenant RPC server. This is a major step in providing crypto-backed logical isolation between tenants in a multi-tenant cluster.

The existing auth mechanism is retained on the standard RPC server, which means that the system tenant is still able to access any key in the system.
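
At its core, such a policy reduces to a span-containment check; a minimal sketch, with illustrative names rather than the PR's actual identifiers:

package tenantauth

import "bytes"

// span is an illustrative stand-in for roachpb.RSpan.
type span struct {
	key, endKey []byte
}

// contains reports whether req lies entirely within the tenant's
// keyspace [tenant.key, tenant.endKey). A request touching any key
// outside that range is rejected by the authorizer.
func contains(tenant, req span) bool {
	return bytes.Compare(tenant.key, req.key) <= 0 &&
		bytes.Compare(req.endKey, tenant.endKey) <= 0
}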

@nvanbenschoten nvanbenschoten requested review from tbg and a team July 29, 2020 22:20
@nvanbenschoten nvanbenschoten requested review from a team as code owners July 29, 2020 22:20
@nvanbenschoten nvanbenschoten requested review from miretskiy and removed request for a team July 29, 2020 22:20
@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg (Member) left a comment

:lgtm_strong:

Reviewed 22 of 24 files at r1, 7 of 7 files at r3, 29 of 29 files at r4, 28 of 28 files at r5, 3 of 5 files at r7, 1 of 1 files at r9, 9 of 14 files at r10, 21 of 21 files at r11, 2 of 2 files at r12, 8 of 8 files at r13, 6 of 6 files at r14.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @nvanbenschoten)


pkg/rpc/auth.go, line 95 at r14 (raw file):

			}
		} else {
			// TODO DURING REVIEW: is this a typo or intentional?

This must be an error - we only call requireSuperUser when !Insecure, so in that case it should certainly insist on there being TLSInfo, right?
I wonder if this can be exploited. Probably not, unless we allow HTTP connections to get this far somehow, which I don't think we do.
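
For context, the hardened shape the eventual fix describes roughly amounts to treating missing TLS credentials as an auth error in secure mode. A hedged sketch using the standard gRPC peer/credentials APIs, not the PR's exact code; authErrorf is the helper this package already uses:

import (
	"context"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// requireTLSInfo (sketch): in secure mode, absence of TLS credentials
// on the connection is an authentication error, never a silently
// skipped branch.
func requireTLSInfo(ctx context.Context) (credentials.TLSInfo, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return credentials.TLSInfo{}, authErrorf("no peer info in context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok {
		return credentials.TLSInfo{}, authErrorf("client is not using TLS")
	}
	return tlsInfo, nil
}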


pkg/rpc/auth_tenant.go, line 86 at r14 (raw file):

		return roachpb.TenantID{}, authErrorf("could not parse tenant ID from Common Name (CN): %s", err)
	}
	if tenID < roachpb.MinTenantID.ToUint64() || tenID > roachpb.MaxTenantID.ToUint64() {

Seeing this now, I noticed that throughout SQL, DInt is an int64 - so won't we break things if we actually create a MaxUint64 tenant?


pkg/rpc/auth_tenant.go, line 132 at r14 (raw file):

		}
	}
	return authErrorf("requested key span %s not fully contained in tenant keypace %s", rSpan, tenSpan)

typo: "keypace" → "keyspace"


pkg/rpc/auth_tenant.go, line 139 at r14 (raw file):

var batchSpanAllowlist = []roachpb.RSpan{
	// TODO(nvanbenschoten): Explore whether we can get rid of this by no longer
	// reading this key in sqlServer.start.

I think the only use of this in SQL is the sqlmigration to populate the version setting, which we will never have to run from a SQL tenant:

	{
		// Introduced in v1.1. Permanent migration.
		name:        "populate initial version cluster setting table entry",
		workFn:      populateVersionSetting,
		clusterWide: true,
	},

I'm not sure how we ever get past this line though:

ctx, "set-setting", "SET CLUSTER SETTING version = $1", v.String(),

Surely that should fail?

[email protected]:46257/defaultdb> set cluster setting version='foo';
ERROR: only the system tenant can SET CLUSTER SETTING

pkg/rpc/auth_tenant.go, line 160 at r14 (raw file):

		}
	}
	return authErrorf("requested key %s not fully contained in tenant keypace %s", args.Key, tenSpan)

typo: "keypace" → "keyspace"


pkg/rpc/auth_tenant.go, line 210 at r14 (raw file):

var gossipSubscriptionPatternAllowlist = []string{
	"node:.*",
	"system-db",

Do we have an issue for narrowing this down?
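
Assuming requested patterns are matched verbatim against this list (a guess at the enforcement, not confirmed from the diff), the check would look something like:

func authGossipSubscription(patterns []string) error {
	for _, pattern := range patterns {
		allowed := false
		for _, allow := range gossipSubscriptionPatternAllowlist {
			if pattern == allow {
				allowed = true
				break
			}
		}
		if !allowed {
			return authErrorf("requested pattern %q not permitted", pattern)
		}
	}
	return nil
}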


pkg/sql/drop_index.go, line 456 at r12 (raw file):

			// Unsplit all manually split ranges in the index so they can be
			// automatically merged by the merge queue. Gate this on being the
			// system tenant because secondary tenants aren't allowed to scan

Are secondary tenants allowed manual splits? I suspect they should not. If the answer is yes and it's not trivial to fix, could you file an issue?

@nvanbenschoten nvanbenschoten (Member, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @miretskiy, and @tbg)


pkg/rpc/auth.go, line 95 at r14 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This must be in error - we're only calling requireSuperUser when !Insecure so in that case certainly it should insist on there being TLSInfo?
I wonder if this can be exploited. Probably not, unless we allow HTTP connections to get this far somehow, which I don't think we do.

Addressed in new commit.


pkg/rpc/auth_tenant.go, line 86 at r14 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Since I'm seeing this now, I noticed that throughout SQL DInt is an int64, so won't we break things if we actually create a MaxUint64 tenant?

I don't think anything would break, but it might be impossible to actually create one of these MaxUint64 tenants:

	sTenID := int64(tree.MustBeDInt(args[0]))
	if sTenID <= 0 {
		return nil, pgerror.New(pgcode.InvalidParameterValue, "tenant ID must be positive")
	}

Maybe that's enough of a reason to define MaxTenantID as MakeTenantID(math.MaxInt64). What do you think?
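
The overflow in question is concrete: converting MaxUint64 to SQL's int64-backed DInt wraps to a negative value.

package main

import (
	"fmt"
	"math"
)

func main() {
	id := uint64(math.MaxUint64)
	fmt.Println(int64(id)) // -1: a MaxUint64 tenant ID can't round-trip through DInt.
}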


pkg/rpc/auth_tenant.go, line 132 at r14 (raw file):

Previously, tbg (Tobias Grieger) wrote…

space

Done.


pkg/rpc/auth_tenant.go, line 139 at r14 (raw file):

I'm not sure how we ever get past this line though:

That's a clusterWide migration, so secondary tenants don't run it. So we don't have to worry about that one.

However, there's also this use, which seems harder to avoid:

var bootstrapVersion roachpb.Version
if err := s.execCfg.DB.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	return txn.GetProto(ctx, keys.BootstrapVersionKey, &bootstrapVersion)
}); err != nil {
	return err
}

The bootstrapVersion there is plugged in to determine which SQL migrations to run based on each migration's includedInBootstrap value. I don't see an obvious way to avoid this. But it raises an interesting question – does the cluster's global bootstrap version even apply to secondary tenants? I think the answer is: kind of. The cluster's global bootstrap version places a lower bound on the binary version of the node that created the tenant, but it doesn't place an upper bound. So for example, even if the cluster was bootstrapped at version 20.2, a tenant may be created by a 21.1 binary, whose MetadataSchema may already bake in a migration that was introduced in 21.1.

So I think the "proper" fix for this would be to actually introduce a tenant-scoped bootstrapVersion key that is written when the tenant is created using the binary version of the gateway that is creating the tenant. A tenant would then consult its own bootstrap key.

But I don't really want to do that. And I don't think we need to. This whole bootstrapVersion was introduced in #41914, and it seems like an optimization. Each of these migrations should be idempotent, so they should be fine to run even if already baked-in. So we actually have options here. We could:

  1. continue using the cluster-wide BootstrapVersionKey as a lower bound for the bootstrap version of each individual tenant with an understanding (and some updated exposition in comments) that this bootstrapVersion is no longer exact.
  2. set bootstrapVersion to a hardcoded value (VersionStart20_2?) and accept that we'll run a few (idempotent) baked-in migrations. This would allow us to get rid of this allowlist entirely, so I'm leaning towards that approach. The only problem is that it breaks the TestTenantLogic/3node-tenant/jobs and TestTenantLogic/3node-tenant/truncate logictests, because we now see extra jobs for secondary tenants. So we'd need to disable those tests or add some conditional logic to them, which isn't yet supported in logictests (cc. @asubiotto).

I'm curious if you have opinions here.
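
A sketch of option 2, which is roughly what the follow-up referenced at the bottom of this thread ended up doing; the codec.ForSystemTenant branch shape and the clusterversion identifiers are assumed here, not copied from the change:

var bootstrapVersion roachpb.Version
if codec.ForSystemTenant() {
	// The system tenant still reads the recorded cluster-wide value.
	if err := s.execCfg.DB.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		return txn.GetProto(ctx, keys.BootstrapVersionKey, &bootstrapVersion)
	}); err != nil {
		return err
	}
} else {
	// Secondary tenants use a hardcoded lower bound: no tenant predates
	// the 20.2 dev cycle, and the gated migrations are idempotent anyway.
	bootstrapVersion = clusterversion.VersionByKey(clusterversion.VersionStart20_2)
}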


pkg/rpc/auth_tenant.go, line 160 at r14 (raw file):

Previously, tbg (Tobias Grieger) wrote…

space

Done.


pkg/rpc/auth_tenant.go, line 210 at r14 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Do we have an issue for narrowing this down?

No, just this TODO. I'll open one.

EDIT: #52361.


pkg/sql/drop_index.go, line 456 at r12 (raw file):

Are secondary tenants allowed manual splits?

They shouldn't be, but they're not prevented from doing so right now. I opened #52360 to track this.

This requires us to scan the meta ranges, which is not something that
secondary tenants are allowed to do. Leaving the ranges split until
their sticky bit expires is fine; the existing code is just an
optimization.

This does serve as a reminder that:
1. we should make sure to remove sticky bits when we destroy a tenant's
   keyspace (see cockroachdb#48775).
2. we should probably not allow tenants to run `ALTER TABLE ... SPLIT AT`
   statements, because we're not placing hard range count limits on tenants
   anywhere else. If others agree, I'll file an issue.

This was badly needed in a number of places. I found myself reaching for
it again, so I figured it was time to add it.

It's unclear whether this was exploitable. It probably wasn't because we
should have already insisted on TLSInfo being present, but it can't hurt
to restructure the code to prevent this kind of bug.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/kvserverAccessControl branch from ca28f07 to 44983c4 Compare August 4, 2020 21:55
@nvanbenschoten (Member, Author)

There are still two open questions, about the BootstrapVersionKey and about whether we should set MaxTenantID to MakeTenantID(math.MaxInt64), but neither is a blocker, so we can continue those discussions here even after this has merged.

TFTR!

bors r+

@craig (Contributor)

craig bot commented Aug 5, 2020

Build succeeded.

@craig craig bot merged commit 67a92cd into cockroachdb:master Aug 5, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/kvserverAccessControl branch August 7, 2020 21:10
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 10, 2020
See cockroachdb#52094 (review).

We don't currently track the bootstrap version of each secondary tenant.
For this to be meaningful, we'd need to record the binary version of the
SQL gateway that processed the crdb_internal.create_tenant function which
created the tenant, as this is what dictates the MetadataSchema that was
in effect when the secondary tenant was constructed. This binary version
very well may differ from the cluster-wide bootstrap version at which the
system tenant was bootstrapped.

Since we don't record this version anywhere, we do the next-best thing
and pass a lower-bound on the bootstrap version. We know that no tenants
could have been created before the start of the v20.2 dev cycle, so we
pass VersionStart20_2. bootstrapVersion is only used to avoid performing
superfluous but necessarily idempotent SQL migrations, so at worst, we're
doing more work than strictly necessary during the first time that the
migrations are run.

Now that we don't query BootstrapVersionKey, we don't need to have it
in the allowlists in the tenantAuth policy for Batch and RangeLookup
RPCs.
craig bot pushed a commit that referenced this pull request Aug 11, 2020
52595: sql: don't query BootstrapVersionKey on tenant SQL startup r=nvanbenschoten a=nvanbenschoten

See #52094 (review).

52616: CODEOWNERS: add notification patterns for SQL syntax and APIs r=rohany a=knz

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
RichardJCai pushed a commit to RichardJCai/cockroach that referenced this pull request Aug 11, 2020
madelineliao pushed a commit to madelineliao/cockroach that referenced this pull request Aug 12, 2020

Labels: A-multitenancy (Related to multi-tenancy)

Successfully merging this pull request may close these issues:
kvserver: authN and authZ for incoming tenant connections