kvserver: authN and authZ for incoming tenant connections #47898

Closed
tbg opened this issue Apr 22, 2020 · 9 comments · Fixed by #52094
Labels: A-multitenancy Related to multi-tenancy

tbg commented Apr 22, 2020

See here for background

SQL tenant servers will make outgoing connections, most prominently to the KV layer. We need to come up with a way to encrypt and authenticate these connections.

This is still in the design phase, but the approach will generally be based on some cryptographic scheme. The design will apply more broadly than to the SQL tenant connections; this issue tracks only the work to apply it to these connections.

An internal, early work-in-progress strawman is here.

The auth broker is tracked in #49105.

tbg added the A-multitenancy and A-multitenancy-blocker labels Apr 22, 2020
tbg assigned knz, tbg and aaron-crl Apr 22, 2020
knz changed the title from "security: connections from SQL tenants" to "security: connections from SQL tenant servers" Apr 22, 2020
nvanbenschoten commented

Per our discussion today, it does look like gRPC supports authentication on connection startup. From https://grpc.io/docs/guides/auth/:

Credentials can be of two types:

  • Channel credentials, which are attached to a Channel, such as SSL credentials.
  • Call credentials, which are attached to a call (or ClientContext in C++).

I think you're aware of this because you use both in https://github.com/cockroachdb/cockroach/compare/master...tbg:poc/tokenauth?expand=1#diff-14fd680bac02bef9b58a5fd4fed0ff8bR141. So it seems like it should be possible to include the tenant auth token during connection startup and avoid passing an extra 64 bytes on each individual RPC.

There would be some complication around wrapping a tlsCreds (return value of credentials.NewTLS) with our own TransportCredentials implementation that adds the auth token to the initial handshake, but this all seems tractable.


tbg commented May 14, 2020

Oh, nice. I will give that a try. Thanks!

tbg changed the title from "security: connections from SQL tenant servers" to "kvserver: authN and authZ for incoming tenant connections" May 15, 2020

tbg commented May 15, 2020

That looks like it should work, but it brings us back to "why don't we just use client certificates?" It really starts to smell like we're rebuilding client certs, and maybe we should just use them. The real difficulty with client certs (or, equivalently, your proposal) is that the expiration check needs to happen per-RPC (plus a little work on the RPCContext to proactively close conns that are close to expiring). I think we can make information from the conn available to the per-RPC interceptors, though. Everything there is marked as experimental, but let's assume that's fine:

The per-RPC interceptor can call

https://pkg.go.dev/google.golang.org/grpc/credentials?tab=doc#RequestInfoFromContext

which should really be a *TLSInfo:

https://pkg.go.dev/google.golang.org/grpc/credentials?tab=doc#TLSInfo

which contains AuthInfo, which in turn contains the TLS ConnectionState:

https://pkg.go.dev/crypto/tls?tab=doc#ConnectionState

which has the VerifiedChains, which are x509.Certificates:

https://pkg.go.dev/crypto/x509?tab=doc#Certificate

and those have the NotBefore and NotAfter fields. The tenant ID can be put into https://pkg.go.dev/crypto/x509/pkix?tab=doc#Name

I tried this out just now and immediately got a reality check. We're on gRPC v1.21.2, but all of this API is just four months old at the time of writing, and (I think, judging from the milestone they gave the PR) is in gRPC v1.27. Possibly good news is that the current release is v1.29 and there's already a v1.27.1 (so v1.27 is possibly a reasonable candidate to upgrade to) but it leaves me uneasy to have a gRPC upgrade in the critical path here.

@petermattis I would be curious how anxious you are about upgrading gRPC. It is about the right time in the cycle to do so, yet it's unclear that we have a strong incentive (aside from this issue). There are a few bug fixes/improvements possibly worth picking up:


tbg commented May 15, 2020

Ha! Double-good news. I missed another API that gives us this, and which we are already using today:

```go
} else if peer, ok := peer.FromContext(ctx); ok {
	if tlsInfo, ok := peer.AuthInfo.(credentials.TLSInfo); ok {
```

So it doesn't seem like we need to upgrade after all.

petermattis commented

> @petermattis I would be curious how anxious you are about upgrading gRPC. It is about the right time in the cycle to do so, yet it's unclear that we have a strong incentive (aside from this issue). There are a few bug fixes/improvements possibly worth picking up:

I have no objections to upgrading gRPC. This would be the right time to do an upgrade.


tbg commented May 15, 2020

Ok, I opened a PR: #49114

Note that this issue no longer depends on the upgrade.


tbg commented May 15, 2020

I added a commit to https://github.com/tbg/cockroach/tree/poc/tokenauth that prototypes the use of client certs using mostly cobbled-together pieces of our existing security package. I verified that the *Peer lets us see the tenant ID and expiration in the server-side interceptors.

I must say that the security package is pretty hostile to reuse, but that can be fixed. After that, we'll be using pretty much our existing infra throughout, which is nice.


tbg commented May 27, 2020

Reminder to self: as we do this, also make sure the RPC heartbeats between SQL and KV stop checking the cluster name, and can never bring down a KV node due to clock mismatch.

tbg added a commit to tbg/cockroach that referenced this issue Jun 4, 2020
This commit introduces methods that create a CA certificate for use with
the auth broker envisioned in cockroachdb#49105 and, from that CA, client
certificates for use in cockroachdb#47898. They are not exposed through the cli
(yet), though the certificate loader already supports them.

Touches cockroachdb#49105.
Touches cockroachdb#47898.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jun 5, 2020
This adds a CLI command to start a SQL tenant in a standalone process.

The tenant currently communicates with the KV layer "as a node"; this
will only change with cockroachdb#47898. As is, the tenant has full access to the
KV layer and so a few things may work that shouldn't as a result of
that.

Additionally, the tenant runs a Gossip client as a temporary stopgap
until we have removed the remaining dependencies on it (cockroachdb#49693 has
the details on what those are).

Apart from that, though, this is the real thing and can be used to
start setting up end-to-end testing of ORMs as well as the deployments.

A word on the choice of the `mt` command: `mt` is an abbreviation for
`multi-tenant` which was considered too long; just `tenant` was
considered misleading - there will be multiple additional subcommands
housing the other services required for running the whole infrastructure
including certs generation, the KV auth broker server, and the SQL
proxy. Should a lively bikeshed around naming break out we can rename
the commands later when consensus has been reached.

For those curious to try it out the following will be handy:

```bash
rm -rf ~/.cockroach-certs cockroach-data && \
(killall -9 cockroach || true) && \
export COCKROACH_CA_KEY=$HOME/.cockroach-certs/ca.key && \
./cockroach cert create-ca && \
./cockroach cert create-client root && \
./cockroach cert create-node 127.0.0.1 && \
./cockroach start --host 127.0.0.1 --background && \
./cockroach sql --host 127.0.0.1 -e 'select crdb_internal.create_tenant(123);'
./cockroach mt start-sql --tenant-id 123 --kv-addrs 127.0.0.1:26257 --sql-addr :5432 &
sleep 1 && \
./cockroach sql --host 127.0.0.1 --port 5432
```

This roughly matches what the newly introduced `acceptance/multitenant`
roachtest does as well.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jul 21, 2020
As of this commit, tenants would use their proper tenant client certs if
it weren't for a manual override that was added.

This override exists because the KV layer can not yet authenticate
tenant client certs (this will change soon, in a follow-up to cockroachdb#50503).

However, uncommenting both the override and the hack in
`pkg/security/securitytest/test_certs/regenerate.sh` to make the tenant
client certs match those used by the KV nodes gives early validation
that this "will work" once the KV side plays ball.

Touches cockroachdb#47898.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jul 21, 2020
With this PR, tenant SQL servers use tenant client certificates to
connect to the tenant server endpoint at the KV layer. They were
previously using the KV-internal node certs.

No validation is performed as of this PR, but this is the obvious
next step.

Follow-up work will add assertions that make sure we don't see
tenants "accidentally" use the node certs for some operations
when they are available (as is typically the case during testing).

Finally, there will be some work on the heartbeats exchanged by
the RPC context. We don't want a SQL tenant's time signal to
ever trigger KV nodes to crash, for example.

Touches cockroachdb#47898.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jul 28, 2020
tbg added a commit to tbg/cockroach that referenced this issue Jul 28, 2020
tbg removed their assignment Jul 28, 2020
nvanbenschoten pushed a commit to nvanbenschoten/cockroach that referenced this issue Jul 29, 2020
nvanbenschoten pushed a commit to nvanbenschoten/cockroach that referenced this issue Jul 29, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 29, 2020
Fixes cockroachdb#47898.

Rebased on cockroachdb#51503 and cockroachdb#52034. Ignore all but the last 3 commits.

This commit adds a collection of access control policies for the newly
exposed tenant RPC server. These authorization policies ensure that an
authenticated tenant is only able to access keys within its keyspace and
that no tenant is able to access data from another tenant's keyspace
through the tenant RPC server. This is a major step in providing
crypto-backed logical isolation between tenants in a multi-tenant
cluster.

The existing auth mechanism is retained on the standard RPC server,
which means that the system tenant is still able to access any key in
the system.
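A toy model of such an access control policy is a span-containment check against the tenant's key prefix. The key encoding below (a tag byte plus a big-endian tenant ID) is a deliberate simplification for illustration, not CockroachDB's actual key format:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// tenantPrefix returns the start key of the tenant's keyspace in this
// sketch's simplified encoding.
func tenantPrefix(tenantID uint64) []byte {
	p := make([]byte, 9)
	p[0] = 0xfe // illustrative tenant-keyspace tag byte
	binary.BigEndian.PutUint64(p[1:], tenantID)
	return p
}

// authorizeSpan checks that the half-open span [start, end) lies
// entirely inside the authenticated tenant's keyspace.
func authorizeSpan(tenantID uint64, start, end []byte) error {
	prefix := tenantPrefix(tenantID)
	next := tenantPrefix(tenantID + 1) // exclusive upper bound of the keyspace
	if bytes.Compare(start, prefix) < 0 || bytes.Compare(end, next) > 0 {
		return fmt.Errorf("tenant %d may not access span [%x, %x)", tenantID, start, end)
	}
	return nil
}

func main() {
	ownKey := append(tenantPrefix(123), 'a')
	fmt.Println(authorizeSpan(123, ownKey, append(ownKey, 0x00)))             // <nil>
	fmt.Println(authorizeSpan(123, tenantPrefix(5), tenantPrefix(6)) != nil) // true
}
```

In the real server this check runs in the RPC interceptor, after the tenant ID has been authenticated from the client cert, so a tenant cannot reach another tenant's keys no matter what spans its requests claim.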
tbg added a commit to tbg/cockroach that referenced this issue Jul 31, 2020
tbg added a commit to tbg/cockroach that referenced this issue Jul 31, 2020
tbg added a commit to tbg/cockroach that referenced this issue Aug 4, 2020
tbg added a commit to tbg/cockroach that referenced this issue Aug 4, 2020
craig bot pushed a commit that referenced this issue Aug 4, 2020
51503: rpc: use tenant client/server certs r=nvanbenschoten a=tbg


With this PR, tenant SQL servers use tenant client certificates to
connect to the tenant server endpoint at the KV layer. They were
previously using the KV-internal node certs.

No validation is performed as of this PR, but this is the obvious
next step.

Follow-up work will add assertions that make sure we don't see
tenants "accidentally" use the node certs for some operations
when they are available (as is typically the case during testing).

Finally, there will be some work on the heartbeats exchanged by
the RPC context. We don't want a SQL tenant's time signal to
ever trigger KV nodes to crash, for example.

Touches #47898.

Release note: None

- cli/flags,config: new flag for tenant KV listen addr
- sql: route tenant KV traffic to tenant KV address
- roachtest: configure --tenant-addr flag in acceptance/multitenant
- rpc: add TenantID to rpc.ContextOptions
- security: slight test improvement
- rpc: pass TenantID to SecurityContext to CertManager
- security: use a single test_certs dir
- security: embed certs for a few hard-coded tenants
- rpc: *almost* use tenant client certs (on tenants)
- rpc: use tenant client/server certs where appropriate


52281: bulkio: Correctly handle exhausting retries when reading from HTTP. r=rohany a=miretskiy

Fixes #52279

Return an error if we exhaust the retry budget when reading from HTTP.

Release Notes: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
craig bot pushed a commit that referenced this issue Aug 4, 2020

52108: sql: cleanup handling of EXPLAIN in the new factory r=yuzefovich a=yuzefovich

**sql: cleanup ordinality code a bit**

Release note: None

**sql: cleanup handling of EXPLAIN in the new factory**

Previously, several EXPLAIN variants were mishandled in the new factory:
instead of being wrapped as other planNodes are they were kept separate
which resulted in a "mixed"-representation plan. This is now fixed by
properly wrapping them.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Aug 5, 2020
51803: cmd/docgen: add HTTP extractor r=mjibson a=mjibson

Add a way to extract docs from the status.proto HTTP endpoint. These
can be imported into the docs project as needed.

Release note: None

52083: roachtest: small misc r=andreimatei a=andreimatei

See individual commits.

52094: rpc: implement tenant access control policies at KV RPC boundary r=nvanbenschoten a=nvanbenschoten

Fixes #47898.

Rebased on #51503 and #52034. Ignore all but the last 3 commits.

This commit adds a collection of access control policies for the newly exposed tenant RPC server. These authorization policies ensure that an authenticated tenant is only able to access keys within its keyspace and that no tenant is able to access data from another tenant's keyspace through the tenant RPC server. This is a major step in providing crypto-backed logical isolation between tenants in a multi-tenant cluster.

The existing auth mechanism is retained on the standard RPC server, which means that the system tenant is still able to access any key in the system.

52352: sql/pgwire: add regression test for varchar OIDs in RowDescription r=jordanlewis a=rafiss

See issue #51360. The bug described in it was fixed somewhat
accidentally, so this test will verify that we don't regress again.

Release note: None

52386: opt: add SerializingProject exec primitive r=RaduBerinde a=RaduBerinde

The top-level projection of a query has a special property - it can project away
columns that we want an ordering on (e.g. `SELECT a FROM t ORDER BY b`).

The distsql physical planner was designed to tolerate such cases, as they were
much more common with the heuristic planner. But the new distsql exec factory
does not; it currently relies on a hack: it detects this case by checking if the
required output ordering is `nil`. This is fragile and doesn't work in all
cases.

This change adds a `SerializingProject` primitive which is like a SimpleProject
but it forces serialization of all parallel streams into one. The new primitive
is used to enforce the final query presentation. We only need to pass column
names for the presentation, so we remove `RenameColumns` and remove the column
names argument from `SimpleProject` (simplifying some execbuilder code).

We also fix a bug in `ConstructSimpleProject` where we weren't taking the
`PlanToStreamColMap` into account when building the projection.

Release note: None

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
craig bot closed this as completed in f21b856 on Aug 5, 2020