
rpc,security: allow setting the tenant ID via gRPC metadata #96153

Merged: 2 commits, Feb 4, 2023

Conversation

@knz (Contributor) commented Jan 29, 2023

Fixes #75795.
Fixes #91996.
Epic: CRDB-14537
Informs #77173.

When running with shared-process multitenancy (with secondary tenant
servers running in the same process as the KV node), we want to allow
tenant servers to dial RPC connections with the same TLS client cert
as the KV node.

To make this possible, it becomes necessary for an RPC client
to identify itself as a secondary tenant by another means than
the CN field in the TLS client cert.

This patch does so by introducing a new optional gRPC metadata header,
"client-tenant".

  • When absent, we continue to use the tenant ID in the client
    cert as the claimed tenant identity of the client,
    as previously.

  • When present, it is used as the claimed tenant identity of the
    client. In that case, we allow two TLS situations:

    • either the client is also using a client tenant cert,
      in which case we verify that the tenant ID in the cert
      matches the one in the metadata;

    • or, the client is using a regular client TLS cert,
      in which case we verify that it is using a 'root' or
      'node' cert, since only these principals are allowed
      to perform RPCs in the first place.

The authentication rules are summarized in the following table.

This matrix was designed with the following properties:

  • a rogue client cannot get more access by adding gRPC metadata than
    it would have if it didn't pass gRPC metadata at all. This can be
    checked by observing that for a given TLS client cert, the addition of
    gRPC metadata always results in authn and authz rules that are at
    least as restrictive.

  • the gRPC metadata remains optional, so as to allow previous version
    SQL pods to continue to connect with just a valid TLS cert.

+------------------+---------------+-------------------------------------+--------------------------------------+
|                  |               | Server is system tenant             | Server is secondary tenant           |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| TLS client cert  | gRPC metadata | Authn result        | Authz rule    | Authn result            | Authz rule |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| Tenant client    | None          | OK                  | tenant-filter | OK if client tenant     | allow      |
|                  |               |                     |               | ID matches server       |            |
|                  |               |                     |               |                         |            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| `root` or `node` |               | OK                  | allow         | OK if user scope        | allow      |
| client           |               |                     |               | matches server tenant ID|            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| other client     |               | deny                | N/A           | deny                    | N/A        |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| Tenant client    | Client        | OK if TLS tenant ID | tenant-filter | OK if TLS, MD and       | allow      |
|                  | specified     | matches MD          |               | server tenant IDs match |            |
+------------------+ tenant ID     +---------------------+---------------+-------------------------+------------+
| `root` or `node` |               | OK                  | tenant-filter | OK if MD and server     | allow      |
| client           |               |                     |               | tenant IDs match        |            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| other client     |               | deny                | N/A           | deny                    | N/A        |
+------------------+---------------+---------------------+---------------+-------------------------+------------+

Release note: None
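The authn/authz matrix above can be condensed into a small decision function. The sketch below is illustrative only: the type names, string labels, and the simplified treatment of cert user scopes are assumptions for readability, not the actual CockroachDB identifiers.

```go
package main

import "fmt"

// Inputs to the authn/authz decision, mirroring the matrix in the
// commit message. All names here are illustrative, not the actual
// CockroachDB types.
type peer struct {
	certKind     string // "tenant", "root-or-node", or "other"
	certTenantID uint64 // tenant ID in the TLS cert, 0 if none
	mdTenantID   uint64 // tenant ID in gRPC metadata, 0 if absent
}

type server struct {
	isSystem bool
	tenantID uint64
}

// decide returns the authn result ("ok" or "deny") and the authz rule
// ("tenant-filter", "allow", or "n/a"), following the table row by row.
func decide(p peer, s server) (authn, authz string) {
	deny := func() (string, string) { return "deny", "n/a" }

	switch p.certKind {
	case "tenant":
		// A tenant client cert pins the claimed identity; if metadata
		// is also present, the two must agree.
		if p.mdTenantID != 0 && p.mdTenantID != p.certTenantID {
			return deny()
		}
		if s.isSystem {
			return "ok", "tenant-filter"
		}
		if p.certTenantID == s.tenantID {
			return "ok", "allow"
		}
		return deny()
	case "root-or-node":
		if p.mdTenantID != 0 {
			// Metadata claims a tenant identity on behalf of a
			// privileged cert: the system tenant filters by tenant,
			// a secondary tenant requires an exact ID match.
			if s.isSystem {
				return "ok", "tenant-filter"
			}
			if p.mdTenantID == s.tenantID {
				return "ok", "allow"
			}
			return deny()
		}
		// No metadata: privileged certs get full access (assuming the
		// cert's user scope covers this server's tenant).
		return "ok", "allow"
	default:
		return deny()
	}
}

func main() {
	// A rogue client adding metadata never gains access it lacked.
	fmt.Println(decide(peer{certKind: "tenant", certTenantID: 10, mdTenantID: 11}, server{isSystem: true}))
}
```

Note how, for every cert kind, adding metadata can only keep the outcome the same or make it stricter, which is the invariant the table was designed around.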

@knz knz requested review from a team as code owners January 29, 2023 12:12
@cockroach-teamcity (Member): This change is Reviewable

@knz knz force-pushed the 20230129-tenant-md branch 5 times, most recently from 8cc535b to 48b9c74 on January 29, 2023 18:34
@knz knz requested a review from a team January 29, 2023 18:34
@knz knz force-pushed the 20230129-tenant-md branch from 48b9c74 to a51599d on January 29, 2023 18:50
@stevendanna (Collaborator) left a comment

Thanks for jumping on this. Just one question about whether we need to do any due diligence around other locations where we construct outgoing contexts.

// gateway forward to n2, we don't want n1 to authenticate itself
// as "tenant 10" to n2. At this point, the tenant information
// should not be used any more.
delete(md, rpc.ClientTIDMetadataHeaderKey)
@stevendanna (Collaborator):

Should we audit other calls to NewOutgoingContext?
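For context, the snippet under discussion removes one key before re-sending metadata. Since gRPC's `metadata.MD` is a `map[string][]string` underneath, the operation can be sketched with plain maps; the header name below is taken from the patch description ("client-tenant"), and the function name is hypothetical.

```go
package main

import "fmt"

// Hypothetical stand-in for rpc.ClientTIDMetadataHeaderKey; the patch
// description names the header "client-tenant".
const clientTenantHeader = "client-tenant"

// stripTenantHeader copies incoming metadata minus the tenant-identity
// header, so a fan-out RPC does not re-authenticate to the next node
// as the original tenant.
func stripTenantHeader(md map[string][]string) map[string][]string {
	out := make(map[string][]string, len(md))
	for k, v := range md {
		if k == clientTenantHeader {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	md := map[string][]string{
		clientTenantHeader: {"10"},
		"user-agent":       {"grpc-go"},
	}
	fmt.Println(len(stripTenantHeader(md))) // only the non-tenant header remains
}
```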

pkg/rpc/auth.go Outdated
Comment on lines 135 to 137
if val := metadata.ValueFromIncomingContext(ctx, ClientTIDMetadataHeaderKey); len(val) > 0 {
var err error
wantedTenantID, err = tenantIDFromString(val[0], "gRPC metadata")
@stevendanna (Collaborator):

[💭 ] I could see having a method in rpc/context for getting and removing the client tID from the context rather than exposing the header key and having the caller do the manipulation. Fine as-is though.

@andreimatei (Contributor) left a comment

All commits except for the last are from #96207

Not the first one.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson and @knz)


-- commits line 2 at r4:
Does this fix #95465 ?


-- commits line 39 at r4:
does this patch mean that shared-proc tenants now work in insecure clusters? Or are they already working today somehow?


pkg/rpc/auth.go line 155 at r1 (raw file):

			// serve requests for a client coming from tenant 456).
			if tid != a.tenant.tenantID {
				return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)

consider not leaking the server's tenant ID in this error, as a matter of principle


pkg/rpc/auth.go line 158 at r1 (raw file):

			}

			// We return an unset tenant ID, to bypass authorization checks:

I found these semantics (sometimes returning an empty tenant to mean "authorized to do anything") to be dubious, particularly since the method is called authenticate. In other words, I think the distinction between this authenticate and the authorization step later on is confusing. At the very least, consider expanding the comment on this function saying that a tenant ID is returned only for KV nodes (or generally expand that comment; the logic in this function is convoluted enough that I'm not even sure how to summarize it or what to suggest).

Also, I can't tell if the cases where an empty tenant ID is returned match the if localRequest {} case at the top of the function.


pkg/rpc/auth.go line 208 at r4 (raw file):

	}

	// Here are the remaining cases:

Before, there was a checkRootOrNodeInScope() call that was clearly guarding this seemingly dangerous return roachpb.TenantID{}, nil. Now, I can't really tell any more who falls through. It'd be good if we could structure the code to make that more obvious. Unfortunately it's too convoluted for me to make a simple suggestion.


pkg/rpc/context.go line 676 at r4 (raw file):

}

// tenantClientCred is responsible for passing the tenant ID as

I thought I convinced you of 80 cols :)


pkg/rpc/context.go line 1739 at r4 (raw file):

	if rpcCtx.clientCreds != nil {
		dialOpts = append(dialOpts, grpc.WithPerRPCCredentials(rpcCtx.clientCreds))

From what I remember, uses of gRPC metadata are quite expensive in terms of memory allocation. Consider running BenchmarkSQL or such to see if it shows anything.
I think it'd be good if we only used this metadata when we have to. I think we don't have to when calling to the local node (either through the internalClientAdapter or through gRPC), or when using a client certificate.


pkg/server/status.go line 139 at r4 (raw file):

we don't want n1 to authenticate itself as "tenant 10" to n2. At this point, the tenant information

Why not?
This propagateGatewayMetadata function is used by RPCs that do fan-out, right? Wouldn't it make sense for all the nodes to treat the request as coming from a tenant?

@knz knz force-pushed the 20230129-tenant-md branch from a51599d to b4f4d6b on January 30, 2023 23:03
@knz (Contributor, Author) left a comment

Not the first one.

That was already merged as part of #96152.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jeffswenson, and @stevendanna)


-- commits line 2 at r4:

Previously, andreimatei (Andrei Matei) wrote…

Does this fix #95465 ?

It does! I had forgotten you had filed the issue.
This is PR #96152 btw.


-- commits line 39 at r4:

Previously, andreimatei (Andrei Matei) wrote…

does this patch mean that shared-proc tenants now work in insecure clusters? Or are they already working today somehow?

This is one half of the puzzle. The other half will be fixing this issue: #96215.
I plan to do this together with @stevendanna some time this week.


pkg/rpc/auth.go line 155 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider not leaking the server's tenant ID in this error, as a matter of principle

Both tenant IDs in this error message are coming from the client.

However there's another one where your comment applies. I'm changing it as per your recommendation.


pkg/rpc/auth.go line 158 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I found these semantics (sometimes returning an empty tenant to mean "authorized to do anything") to be dubious, particularly since the method is called authenticate. In other words, I think the distinction between this authenticate and the authorization step later on is confusing. At the very least, consider expanding the comment on this function saying that a tenant ID is returned only for KV nodes (or generally expand that comment; the logic in this function is convoluted enough that I'm not even sure how to summarize it or what to suggest).

Also, I can't tell if the cases where an empty tenant ID is returned match the if localRequest {} case at the top of the function.

I have reworked this code to use named/enum types and split the code into multiple functions to clarify. See if you like it better.


pkg/rpc/auth.go line 208 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Before, there was a checkRootOrNodeInScope() call that was clearly guarding this seemingly dangerous return roachpb.TenantID{}, nil. Now, I can't really tell any more who falls through. It'd be good if we could structure the code to make that more obvious. Unfortunately it's too convoluted for me to make a simple suggestion.

I have reworked this to clarify. See if you like it better.


pkg/rpc/context.go line 1739 at r4 (raw file):

From what I remember, uses of gRPC metadata are quite expensive in terms of memory allocation.

My manual inspection of the gRPC code suggests that you are right, because of a poorly designed metadata.FromIncomingContext. However, the proper way to go about this is to patch gRPC to add an alternative impl to FromIncomingContext which avoids these extra allocations (it clones the metadata map that is already in the context. We could consume it as-is instead.). This would be a better use of our time than creating many more conditional paths around here. I plan to look into this during stability period.


pkg/server/status.go line 139 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

we don't want n1 to authenticate itself as "tenant 10" to n2. At this point, the tenant information

Why not?
This propagateGatewayMetadata function is used by RPCs that do fan-out, right? Wouldn't it make sense for all the nodes to treat the request as coming from a tenant?

This is a good question!

Generally this would be incorrect, because we don't only do fan-out for incoming requests. We also do fan-out to handle a request internally to the KV layer / system tenant, in response to a SQL statement.

But then even if we focus specifically on SQL-to-SQL request fanout (within the tenant), we still would not need to propagate the metadata here. This is because the outgoing RPC connection (as initiated by rpc.Dial) would re-inject the tenant ID in the outgoing request anyway.

Because it's such a good question, I extended the comment to explain more.


pkg/server/status.go line 142 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Should we audit other calls to NewOutgoingContext?

I have audited them all.

@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @stevendanna)


pkg/rpc/auth.go line 155 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Both tenant IDs in this error message are coming from the client.

However there's another one where your comment applies. I'm changing it as per your recommendation.

a.tenant.tenantID is coming from the client? That's because we went through a demultiplexer in order to even get to this server? Remind me pls, how does the client specify what server it wants to talk to?

@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jeffswenson, and @stevendanna)


pkg/rpc/auth.go line 155 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

a.tenant.tenantID is coming from the client? That's because we went through a demultiplexer in order to even get to this server? Remind me pls, how does the client specify what server it wants to talk to?

I don't know what you're talking about. In my reviewable session, I see your comment above this code:
[screenshot]

@knz knz added the A-multitenancy Related to multi-tenancy label Jan 30, 2023
@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @stevendanna)


pkg/rpc/auth.go line 155 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I don't know what you're talking about. In my reviewable session, I see your comment above this code:
image.png

I think maybe you're not looking at the commit on which I commented? The comment is on r1, on the line that reads

return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)

@knz (Contributor, Author) commented Jan 31, 2023

I think maybe you're not looking at the commit on which I commented? The comment is on r1, on the line that reads

return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)

Gotcha. Yes i changed that as you suggested.

@knz knz requested a review from andreimatei January 31, 2023 00:02
@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @stevendanna)


pkg/server/status.go line 139 at r4 (raw file):

Generally this would be incorrect, because we don't only do fan-out for incoming requests. We also do fan-out to handle a request internally to the KV layer / system tenant, in response to a SQL statement.

But this function doesn't seem to be used for KV calling KV, is it? At least not when calling Batch, which is the case where I see why propagating a tenant's identity would be bad (because for example reading from the meta ranges would fail). For RPCs that follow the pattern of having a NodeID in the request and forwarding the request to that node (or fanning out to all nodes if NodeID is zero), it seems to me that maintaining the tenant's identity would be a good thing (so that each node can individually do its own auth rather than rely on only the first KV node to do it).

But then even if we focus specifically on SQL-to-SQL request fanout (within the tenant), we still would not need to propagate the metadata here. This is because the outgoing RPC connection (as initiated by rpc.Dial) would re-inject the tenant ID in the outgoing request anyway.

Well but it'd probably be cheaper for that function to check whether the desired tenant ID is already in the ctx, and not add it again if it is.


Does this function serve any purpose? I think this function would really benefit from a comment explaining what it's about. I think it was introduced in conjunction with server.remote_debugging.mode which has since gone away; I think it was about propagating a piece of metadata injected by the grpc-gateway that was to be checked on all nodes. I suspect we can get rid of the function and then we don't have to argue about what it should propagate and what it shouldn't.
If the function serves some purpose, I would really benefit from a comment explaining what the deal is.

@knz knz force-pushed the 20230129-tenant-md branch from b4f4d6b to 021383a on February 1, 2023 00:05
craig bot pushed a commit that referenced this pull request Feb 1, 2023
95582: loqrecovery,server: apply staged loqrecovery plans on start r=erikgrinaker a=aliher1911

This commit adds loss of quorum recovery plan application functionality
to server startup. Plans are staged during the application phase and then
applied when a rolling restart of the affected nodes is performed.
Plan application performs the same actions that are performed by debug
recovery apply-plan on offline storage. It also updates the loss of
quorum recovery status stored in a store-local key.

Release note: None

Fixes #93121

96126: sql: refactor FunctionProperties and Overload r=mgartner a=mgartner

#### sql: remove FuncExpr.IsGeneratorApplication

`FuncExpr.IsGeneratorApplication` has been removed and its single usage
has been replaced with `FuncExpr.IsGeneratorClass`.

Release note: None

#### sql: move FunctionClass from FunctionProperties to Overload

`FunctionProperties` are attached to a `FunctionDefinition`, which is
simply a collection of overloads that share the same name. Most of the
fields in `FunctionProperties` are, however, overload-specific. They
should be moved to the `Overload` struct. In the long term,
the hierarchy of function definitions, each with child function overloads,
should be flattened to just overloads.

This commit takes one step in this direction.

Epic: CRDB-20370

Release note: None


96207: rpc,security: simplify and enhance the client cert validation r=stevendanna a=knz

Fixes #96174.
Prerequisite for #96153.
Epic: CRDB-14537

TLDR: this change enhances various aspects of TLS client cert
validation. Among other things, it makes the error clearer in case
of authentication failure.

Example, before:
```
ERROR: requested user demo is not authorized for tenant {2}
```

Example, after:
```
ERROR: certificate authentication failed for user "demo"
DETAIL: This is tenant system; cert is valid for "root" on all tenants.
```


Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
}
return authnSuccessPeerIsPrivileged{}, nil
@stevendanna (Collaborator):

Some of the previous discussion is around it being hard to see what cases lead us to hit this case. I think part of the difficulty is tracking when wantedTenantID is set and unset. I wonder if a simple refactoring might help (warning: code written directly in browser). I still don't love it, but it is nice to be able to directly scan for return authnSuccessPeerIsPrivileged. Anyway, perhaps this is just pushing peas around the plate.

// authenticate verifies the credentials of the client and performs
// some consistency check with the information provided.
func (a kvAuth) authenticate(ctx context.Context) (authnResult, error) {
	var ar authnResult
	if clientTenantID, localRequest := grpcutil.IsLocalRequestContext(ctx); localRequest {
		var err error
		ar, err = a.authenticateLocalRequest(ctx, clientTenantID)
		if err != nil {
			return nil, err
		}
	} else {
		var err error
		ar, err = a.authenticateRemoteRequest(ctx)
		if err != nil {
			return nil, err
		}
	}

	switch res := ar.(type) {
	case authnSuccessPeerIsTenantServer:
		if roachpb.TenantID(res) != a.tenant.tenantID {
			log.Ops.Infof(ctx, "rejected incoming request from tenant %v (misconfiguration?)", res)
			return nil, authErrorf("client tenant identity (%v) does not match server", res)
		}
	case authnSuccessPeerIsPrivileged:
	default:
		return nil, errors.AssertionFailedf("programming error: unhandled case %T", ar)
	}

	return ar, nil

}

// Deal with local requests done through the
// internalClientAdapter. There's no TLS for these calls, so the
// regular authentication code path doesn't apply. The clientTenantID
// should be the result of a call to grpcutil.IsLocalRequestContext.
func (a kvAuth) authenticateLocalRequest(ctx context.Context, clientTenantID roachpb.TenantID) (authnResult, error) {
	if clientTenantID.IsSystem() {
		return authnSuccessPeerIsPrivileged{}, nil
	}
	
	// Sanity check: verify that we do not also have gRPC network credentials
	// in the context. This would indicate that metadata was improperly propagated.
	maybeTid, err := getTenantIDFromNetworkCredentials(ctx)
	if err != nil || maybeTid.IsSet() {
		return nil, authErrorf("programming error: network credentials in internal adapter request (%v, %v)", maybeTid, err)
	}
	return authnSuccessPeerIsTenantServer(clientTenantID), nil
}

// authenticateRemoteRequest authenticates requests made over a TLS connection.
func (a kvAuth) authenticateRemoteRequest(ctx context.Context) (authnResult, error) {
	// We will need to look at the TLS cert in any case, so
	// extract it first.
	clientCert, err := getClientCert(ctx)
	if err != nil {
		return nil, err
	}
	tenantIDFromCredentials, err := getTenantIDFromNetworkCredentials(ctx)
	if err != nil {
		return nil, authErrorf("client provided invalid tenant ID: %v", err)
	}
	
	// Did the client peer use a tenant client cert?
	if security.IsTenantCertificate(clientCert) {
		// If the peer is using a client tenant cert, in any case we
		// validate the tenant ID stored in the CN for correctness.
		tlsID, err := tenantIDFromString(clientCert.Subject.CommonName, "Common Name (CN)")
		if err != nil {
			return nil, err
		}

		// If the peer is using a TenantCertificate and also
		// provided a tenant ID via gRPC metadata, they must
		// match.
		if tenantIDFromCredentials.IsSet() && tenantIDFromCredentials != tlsID {
			return nil, authErrorf(
				"client wants to authenticate as tenant %v, but is using TLS cert for tenant %v",
				tenantIDFromCredentials, tlsID)
		}
		return authnSuccessPeerIsTenantServer(tlsID), nil
	} else {
		// We are using TLS, but the peer is not using a client tenant
		// cert. In that case, we only allow RPCs if the principal is
		// 'node' or 'root' and the tenant scope in the cert matches
		// this server (either the cert has scope "global" or its scope
		// tenant ID matches our own).
		//
		// TODO(benesch): the vast majority of RPCs should be limited to just
		// NodeUser. This is not a security concern, as RootUser has access to
		// read and write all data, merely good hygiene. For example, there is
		// no reason to permit the root user to send raw Raft RPCs.
		certUserScope, err := security.GetCertificateUserScope(clientCert)
		if err != nil {
			return nil, err
		}
		if err := checkRootOrNodeInScope(certUserScope, a.tenant.tenantID); err != nil {
			return nil, err
		}
		
		if tenantIDFromCredentials.IsSet() {
			return authnSuccessPeerIsTenantServer(tenantIDFromCredentials), nil
		} else {
			return authnSuccessPeerIsPrivileged{}, nil
		}
	}
}

@knz (Contributor, Author):

Not bad. Doing this simplification in a separate PR: #96338

@knz knz force-pushed the 20230129-tenant-md branch 3 times, most recently from 6cbd72c to 97f60e7 on February 1, 2023 22:01
@knz knz requested review from dhartunian and andreimatei February 1, 2023 22:08
@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @dhartunian, @jeffswenson, and @stevendanna)


pkg/rpc/auth.go line 155 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think maybe you're not looking at the commit on which I commented? The comment is on r1, on the line that reads

return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)

Done.


pkg/rpc/auth.go line 137 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

[💭 ] I could see having a method in rpc/context for getting and removing the client tID from the context rather than exposing the header key and having the caller do the manipulation. Fine as-is though.

Done.


pkg/server/status.go line 139 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Generally this would be incorrect, because we don't only do fan-out for incoming requests. We also do fan-out to handle a request internally to the KV layer / system tenant, in response to a SQL statement.

But this function doesn't seem to be used for KV calling KV, is it? At least not when calling Batch, which is the case where I see why propagating a tenant's identity would be bad (because for example reading from the meta ranges would fail). For RPCs that follow the pattern of having a NodeID in the request and forwarding the request to that node (or fanning out to all nodes if NodeID is zero), it seems to me that maintaining the tenant's identity would be a good thing (so that each node can individually do its own auth rather than rely on only the first KV node to do it).

But then even if we focus specifically on SQL-to-SQL request fanout (within the tenant), we still would not need to propagate the metadata here. This is because the outgoing RPC connection (as initiated by rpc.Dial) would re-inject the tenant ID in the outgoing request anyway.

Well but it'd probably be cheaper for that function to check whether the desired tenant ID is already in the ctx, and not add it again if it is.


Does this function serve any purpose? I think this function would really benefit from a comment explaining what it's about. I think it was introduced in conjunction with server.remote_debugging.mode which has since gone away; I think it was about propagating a piece of metadata injected by the grpc-gateway that was to be checked on all nodes. I suspect we can get rid of the function and then we don't have to argue about what it should propagate and what it shouldn't.
If the function serves some purpose, I would really benefit from a comment explaining what the deal is.

We've discussed this elsewhere with David and a suggestion was made to do this modification right after authentication, instead of here. I've adapted the PR accordingly.
(And added an explanatory comment here as requested by Andrei)

@knz knz force-pushed the 20230129-tenant-md branch 3 times, most recently from 56d514d to 334d1d7 on February 3, 2023 13:30
@knz (Contributor, Author) commented Feb 3, 2023

I have updated/upgraded this PR based off the changes in #96318, #96451, and some newfound lessons from #96427.

PTAL

In the previous change, the logging tag was just "tenant".
This is somewhat confusing/ambiguous with the tenant identity
of the server itself.

Release note: None
@knz knz force-pushed the 20230129-tenant-md branch from 334d1d7 to 82cbb17 on February 3, 2023 21:07
@knz (Contributor, Author) commented Feb 3, 2023

RFAL, Andrei asked me to not be too prescriptive about the context upon authentication, and instead only erase the excess metadata in the internal client adapter. I've updated the PR accordingly.

TLDR: this patch introduces a mechanism through which a secondary
tenant server can identify itself to its peer over an RPC using a
different mechanism than a TLS "tenant client" certificate.

When running with shared-process multitenancy (with secondary tenant
servers running in the same process as the KV node), we want to allow
tenant servers to dial RPC connections with the same TLS client cert
as the KV node.

To make this possible, it becomes necessary for an RPC client
to identify itself as a secondary tenant by another means than
the CN field in the TLS client cert.

This patch does so by introducing a new optional gRPC metadata header,
"client-tenant".

- When absent, we continue to use the tenant ID in the client
  cert as the claimed tenant identity of the client,
  as previously.

- When present, it is used as the claimed tenant identity of the
  client. In that case, we allow two TLS situations:

  - either the client is _also_ using a client tenant cert,
    in which case we verify that the tenant ID in the cert
    matches the one in the metadata;

  - or, the client is using a regular client TLS cert,
    in which case we verify that it is using a 'root' or
    'node' cert, since only these principals are allowed
    to perform RPCs in the first place.

The authentication rules are summarized in the following table.

This matrix was designed with the following properties:

- *a rogue client cannot get more access by adding gRPC metadata than
  it would have if it didn't pass gRPC metadata at all*. This can be
  checked by observing that for a given TLS client cert, the addition of
  gRPC metadata always results in authn and authz rules that are at
  least as restrictive.

- the gRPC metadata remains optional, so as to allow previous version
  SQL pods to continue to connect with just a valid TLS cert.

```
+------------------+---------------+-------------------------------------+--------------------------------------+
|                  |               | Server is system tenant             | Server is secondary tenant           |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| TLS client cert  | gRPC metadata | Authn result        | Authz rule    | Authn result            | Authz rule |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| Tenant client    | None          | OK                  | tenant-filter | OK if client tenant     | allow      |
|                  |               |                     |               | ID matches server       |            |
|                  |               |                     |               |                         |            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| `root` or `node` |               | OK                  | allow         | OK if user scope        | allow      |
| client           |               |                     |               | matches server tenant ID|            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| other client     |               | deny                | N/A           | deny                    | N/A        |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| Tenant client    | Client        | OK if TLS tenant ID | tenant-filter | OK if TLS, MD and       | allow      |
|                  | specified     | matches MD          |               | server tenant IDs match |            |
+------------------+ tenant ID     +---------------------+---------------+-------------------------+------------+
| `root` or `node` |               | OK                  | tenant-filter | OK if MD and server     | allow      |
| client           |               |                     |               | tenant IDs match        |            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| other client     |               | deny                | N/A           | deny                    | N/A        |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
```
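The decision logic in this matrix can be sketched as a small, self-contained Go function. This is purely illustrative and not the actual CockroachDB code: the type and function names are invented for the sketch, the user-scope check is reduced to a boolean parameter, tenant ID 1 stands in for the system tenant, and `mdTenantID == 0` stands in for "no metadata supplied".

```go
package main

import "fmt"

// certKind models the three TLS client cert categories from the matrix.
type certKind int

const (
	tenantClientCert certKind = iota // TLS "tenant client" cert
	rootOrNodeCert                   // regular client cert for 'root' or 'node'
	otherClientCert                  // any other client cert
)

// authnAllowed reports whether the connection passes authn per the
// matrix above (authz filtering is a separate, later step).
func authnAllowed(cert certKind, certTenantID, mdTenantID, serverTenantID uint64, userScopeMatchesServer bool) bool {
	serverIsSystem := serverTenantID == 1
	switch cert {
	case tenantClientCert:
		if mdTenantID != 0 && mdTenantID != certTenantID {
			return false // metadata must agree with the cert's tenant ID
		}
		// A secondary tenant server only accepts its own tenant.
		return serverIsSystem || certTenantID == serverTenantID
	case rootOrNodeCert:
		if serverIsSystem {
			return true // always accepted; authz may still filter
		}
		if mdTenantID != 0 {
			return mdTenantID == serverTenantID
		}
		return userScopeMatchesServer
	default:
		return false // other client certs are always denied
	}
}

func main() {
	fmt.Println(authnAllowed(tenantClientCert, 2, 3, 1, false)) // TLS/MD mismatch: false
	fmt.Println(authnAllowed(rootOrNodeCert, 0, 2, 2, false))   // MD matches server: true
	fmt.Println(authnAllowed(otherClientCert, 0, 0, 1, false))  // always denied: false
}
```

Note how the first property holds in the sketch: adding metadata can only add a check (the `mdTenantID != certTenantID` and `mdTenantID == serverTenantID` comparisons), never remove one.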

Release note: None
@knz knz force-pushed the 20230129-tenant-md branch from 82cbb17 to f3fa310 on February 3, 2023 at 21:11

@dhartunian dhartunian left a comment


:lgtm: just have a question re: stripping metadata when running rpcs over network from tenant->kv.

the code iterations really made things a lot more readable. now that I walked through the call chain in auth.go things are much clearer.

Reviewed 5 of 9 files at r12, 4 of 4 files at r13, 3 of 11 files at r14, 8 of 8 files at r15, 3 of 7 files at r16, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @jeffswenson, @knz, and @stevendanna)


pkg/rpc/context.go line 822 at r17 (raw file):

```go
			// metadata in the context, but we need to get rid of it
			// before we let the call go through KV.
			ctx = grpcutil.ClearIncomingContext(ctx)
```

Is this also done in the case where the tenant is out-of-process? This code path is for local calls only, right?


@knz knz left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @dhartunian, @jeffswenson, and @stevendanna)


pkg/rpc/context.go line 822 at r17 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Is this also done in the case where the tenant is out-of-process? This code path is for local calls only, right?

This code path is only done for local calls, yes.

We do not need to do this for network calls. gRPC differentiates between "incoming contexts" and "outgoing contexts". The data from an incoming context does not carry over to an outgoing context in a network call automatically.
(Which is why we have this function forwardSQLIdentityThroughRPCCalls in pkg/server to copy the SQL username from the incoming metadata to outgoing context.)


knz commented Feb 3, 2023

TFYR!

bors r=dhartunian


craig bot commented Feb 4, 2023

Build succeeded:

@craig craig bot merged commit 581560b into cockroachdb:master Feb 4, 2023
@knz knz deleted the 20230129-tenant-md branch February 4, 2023 01:27
Labels
A-multitenancy Related to multi-tenancy