rpc,security: allow setting the tenant ID via gRPC metadata #96153
Conversation
Force-pushed 8cc535b to 48b9c74
Force-pushed 48b9c74 to a51599d
Thanks for jumping on this. Just one question about whether we need to do any due diligence around other locations where we construct an outgoing context.
pkg/server/status.go (Outdated)

// gateway forward to n2, we don't want n1 to authenticate itself
// as "tenant 10" to n2. At this point, the tenant information
// should not be used any more.
delete(md, rpc.ClientTIDMetadataHeaderKey)
Should we audit other calls to NewOutgoingContext?
pkg/rpc/auth.go (Outdated)

if val := metadata.ValueFromIncomingContext(ctx, ClientTIDMetadataHeaderKey); len(val) > 0 {
	var err error
	wantedTenantID, err = tenantIDFromString(val[0], "gRPC metadata")
[💭 ] I could see having a method in rpc/context for getting and removing the client tID from the context rather than exposing the header key and having the caller do the manipulation. Fine as-is though.
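For illustration, a helper along the lines suggested here might look like the sketch below. It is hypothetical, not the PR's actual rpc API: the function name is made up, and the header key is taken as a parameter instead of the package constant.

```go
import (
	"context"

	"google.golang.org/grpc/metadata"
)

// extractClientTenant is a hypothetical helper: it reads the claimed client
// tenant ID from the incoming gRPC metadata and returns a context with that
// header stripped, so callers never manipulate the header key themselves.
func extractClientTenant(ctx context.Context, headerKey string) (string, context.Context) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return "", ctx
	}
	vals := md.Get(headerKey)
	if len(vals) == 0 {
		return "", ctx
	}
	md = md.Copy() // do not mutate the map shared with the context
	md.Delete(headerKey)
	return vals[0], metadata.NewIncomingContext(ctx, md)
}
```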
All commits except for the last are from #96207
Not the first one.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson and @knz)
-- commits
line 2 at r4:
Does this fix #95465 ?
-- commits
line 39 at r4:
does this patch mean that shared-proc tenants now work in insecure clusters? Or are they already working today somehow?
pkg/rpc/auth.go
line 155 at r1 (raw file):
// serve requests for a client coming from tenant 456).
if tid != a.tenant.tenantID {
	return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)
consider not leaking the server's tenant ID in this error, as a matter of principle
pkg/rpc/auth.go
line 158 at r1 (raw file):
}
// We return an unset tenant ID, to bypass authorization checks:

I found these semantics (sometimes returning an empty tenant to mean "authorized to do anything") to be dubious, particularly since the method is called `authenticate`. In other words, I find the distinction between this `authenticate` and the authorization step later on confusing. At the very least, consider expanding the comment on this function to say that a tenant ID is returned only for KV nodes (or generally expand that comment; the logic in this function is convoluted enough that I'm not even sure how to summarize it or what to suggest).
Also, I can't tell whether the cases where an empty tenant ID is returned match the `if localRequest {}` case at the top of the function.
pkg/rpc/auth.go
line 208 at r4 (raw file):
}
// Here are the remaining cases:

Before, there was a `checkRootOrNodeInScope()` call that was clearly guarding this seemingly dangerous `return roachpb.TenantID{}, nil`. Now, I can't really tell any more who falls through. It'd be good if we could structure the code to make that more obvious. Unfortunately it's too convoluted for me to make a simple suggestion.
pkg/rpc/context.go
line 676 at r4 (raw file):
}
// tenantClientCred is responsible for passing the tenant ID as
I thought I convinced you of 80 cols :)
pkg/rpc/context.go
line 1739 at r4 (raw file):
if rpcCtx.clientCreds != nil {
	dialOpts = append(dialOpts, grpc.WithPerRPCCredentials(rpcCtx.clientCreds))

From what I remember, uses of gRPC metadata are quite expensive in terms of memory allocation. Consider running `BenchmarkSQL` or such to see if it shows anything.
I think it'd be good if we only used this metadata when we have to. I think we don't have to when calling the local node (either through the `internalClientAdapter` or through gRPC), or when using a client certificate.
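For context, the clientCreds above are gRPC per-RPC credentials. A minimal sketch of that pattern is shown below; the type name and the "client-tenant" header key are illustrative (taken from the commit message), not the PR's actual identifiers.

```go
import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// tenantIDCreds attaches a claimed tenant ID to every outgoing RPC as gRPC
// metadata. This is the mechanism behind grpc.WithPerRPCCredentials above.
type tenantIDCreds struct {
	tenantID string // e.g. "10"
}

var _ credentials.PerRPCCredentials = tenantIDCreds{}

// GetRequestMetadata is invoked by gRPC for each RPC; the returned map is
// merged into the outgoing request metadata.
func (c tenantIDCreds) GetRequestMetadata(ctx context.Context, uri ...string) (map[string]string, error) {
	return map[string]string{"client-tenant": c.tenantID}, nil
}

// RequireTransportSecurity reports that the claimed identity should only be
// sent over an authenticated (TLS) transport.
func (c tenantIDCreds) RequireTransportSecurity() bool { return true }

// Example dial option, mirroring the quoted code:
var dialOpt = grpc.WithPerRPCCredentials(tenantIDCreds{tenantID: "10"})
```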
pkg/server/status.go
line 139 at r4 (raw file):
we don't want n1 to authenticate itself as "tenant 10" to n2. At this point, the tenant information
Why not?
This `propagateGatewayMetadata` function is used by RPCs that do fan-out, right? Wouldn't it make sense for all the nodes to treat the request as coming from a tenant?
Force-pushed a51599d to b4f4d6b
Not the first one.
That was already merged as part of #96152.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jeffswenson, and @stevendanna)
Previously, andreimatei (Andrei Matei) wrote…
Does this fix #95465 ?
It does! I had forgotten you had filed the issue.
This is PR #96152 btw.
Previously, andreimatei (Andrei Matei) wrote…
does this patch mean that shared-proc tenants now work in insecure clusters? Or are they already working today somehow?
This is one half of the puzzle. The other half will be fixing this issue: #96215.
I plan to do this together with @stevendanna some time this week.
pkg/rpc/auth.go
line 155 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider not leaking the server's tenant ID in this error, as a matter of principle
Both tenant IDs in this error message are coming from the client.
However there's another one where your comment applies. I'm changing it as per your recommendation.
pkg/rpc/auth.go
line 158 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I found these semantics (sometimes returning an empty tenant to mean "authorized to do anything") to be dubious, particularly since the method is called `authenticate`. In other words, I find the distinction between this `authenticate` and the authorization step later on confusing. At the very least, consider expanding the comment on this function to say that a tenant ID is returned only for KV nodes (or generally expand that comment; the logic in this function is convoluted enough that I'm not even sure how to summarize it or what to suggest).
Also, I can't tell whether the cases where an empty tenant ID is returned match the `if localRequest {}` case at the top of the function.
I have reworked this code to use named/enum types and split the code into multiple functions to clarify. See if you like it better.
pkg/rpc/auth.go
line 208 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Before, there was a `checkRootOrNodeInScope()` call that was clearly guarding this seemingly dangerous `return roachpb.TenantID{}, nil`. Now, I can't really tell any more who falls through. It'd be good if we could structure the code to make that more obvious. Unfortunately it's too convoluted for me to make a simple suggestion.
I have reworked this to clarify. See if you like it better.
pkg/rpc/context.go
line 1739 at r4 (raw file):
From what I remember, uses of gRPC metadata are quite expensive in terms of memory allocation.
My manual inspection of the gRPC code suggests that you are right, because of a poorly designed `metadata.FromIncomingContext`. However, the proper way to go about this is to patch gRPC to add an alternative implementation of `FromIncomingContext` which avoids these extra allocations (it clones the metadata map that is already in the context; we could consume it as-is instead). This would be a better use of our time than creating many more conditional paths around here. I plan to look into this during the stability period.
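To make the allocation concern concrete, here is a small sketch contrasting the two lookup styles. It assumes a grpc-go version that provides metadata.ValueFromIncomingContext (the single-key lookup this PR uses for the tenant ID header); the literal "client-tenant" key is for illustration only.

```go
import (
	"context"

	"google.golang.org/grpc/metadata"
)

// clientTenantFromIncoming shows the cheap single-key lookup next to the
// whole-map path, for comparison.
func clientTenantFromIncoming(ctx context.Context) string {
	// Cheap: fetches one key without cloning the metadata map.
	if vals := metadata.ValueFromIncomingContext(ctx, "client-tenant"); len(vals) > 0 {
		return vals[0]
	}
	// Expensive (shown only for contrast): FromIncomingContext copies every
	// key/value pair of the incoming metadata before we can read one value.
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if vals := md.Get("client-tenant"); len(vals) > 0 {
			return vals[0]
		}
	}
	return ""
}
```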
pkg/server/status.go
line 139 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
we don't want n1 to authenticate itself as "tenant 10" to n2. At this point, the tenant information
Why not?
This `propagateGatewayMetadata` function is used by RPCs that do fan-out, right? Wouldn't it make sense for all the nodes to treat the request as coming from a tenant?
This is a good question!
Generally this would be incorrect, because we don't only do fan-out for incoming requests. We also do fan-out to handle a request internally to the KV layer / system tenant, in response to a SQL statement.
But then even if we focus specifically on SQL-to-SQL request fanout (within the tenant), we still would not need to propagate the metadata here. This is because the outgoing RPC connection (as initiated by rpc.Dial) would re-inject the tenant ID in the outgoing request anyway.
Because it's such a good question, I extended the comment to explain more.
pkg/server/status.go
line 142 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Should we audit other calls to NewOutgoingContext?
I have audited them all.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @stevendanna)
pkg/rpc/auth.go
line 155 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Both tenant IDs in this error message are coming from the client.
However there's another one where your comment applies. I'm changing it as per your recommendation.
`a.tenant.tenantID` is coming from the client? That's because we went through a demultiplexer in order to even get to this server? Remind me pls, how does the client specify what server it wants to talk to?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jeffswenson, and @stevendanna)
pkg/rpc/auth.go
line 155 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
`a.tenant.tenantID` is coming from the client? That's because we went through a demultiplexer in order to even get to this server? Remind me pls, how does the client specify what server it wants to talk to?
I don't know what you're talking about. In my reviewable session, I see your comment above this code:
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @stevendanna)
pkg/rpc/auth.go
line 155 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I don't know what you're talking about. In my reviewable session, I see your comment above this code:
I think maybe you're not looking at the commit on which I commented? The comment is on r1, on the line that reads:
`return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)`
Gotcha. Yes, I changed that as you suggested.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @stevendanna)
pkg/server/status.go
line 139 at r4 (raw file):
Generally this would be incorrect, because we don't only do fan-out for incoming requests. We also do fan-out to handle a request internally to the KV layer / system tenant, in response to a SQL statement.
But this function doesn't seem to be used for KV calling KV, is it? At least not when calling `Batch`, which is the case where I see why propagating a tenant's identity would be bad (because, for example, reading from the meta ranges would fail). For RPCs that follow the pattern of having a `NodeID` in the request and forwarding the request to that node (or fanning out to all nodes if `NodeID` is zero), it seems to me that maintaining the tenant's identity would be a good thing (so that each node can individually do its own auth rather than rely on only the first KV node to do it).
But then even if we focus specifically on SQL-to-SQL request fanout (within the tenant), we still would not need to propagate the metadata here. This is because the outgoing RPC connection (as initiated by rpc.Dial) would re-inject the tenant ID in the outgoing request anyway.
Well, but it'd probably be cheaper for that function to check whether the desired tenant ID is already in the ctx, and not add it again if it is.
Does this function serve any purpose? I think this function would really benefit from a comment explaining what it's about. I think it was introduced in conjunction with `server.remote_debugging.mode`, which has since gone away; I think it was about propagating a piece of metadata injected by the grpc-gateway that was to be checked on all nodes. I suspect we can get rid of the function, and then we don't have to argue about what it should propagate and what it shouldn't.
If the function serves some purpose, I would really benefit from a comment explaining what the deal is.
Force-pushed b4f4d6b to 021383a
95582: loqrecovery,server: apply staged loqrecovery plans on start r=erikgrinaker a=aliher1911

This commit adds loss of quorum recovery plan application functionality to server startup. Plans are staged during the application phase and then applied when a rolling restart of affected nodes is performed. Plan application performs the same actions that are performed by `debug recovery apply-plan` on offline storage. It also updates the loss of quorum recovery status stored in a store-local key.

Release note: None
Fixes #93121

96126: sql: refactor FunctionProperties and Overload r=mgartner a=mgartner

#### sql: remove FuncExpr.IsGeneratorApplication

`FuncExpr.IsGeneratorApplication` has been removed and its single usage has been replaced with `FuncExpr.IsGeneratorClass`.

Release note: None

#### sql: move FunctionClass from FunctionProperties to Overload

`FunctionProperties` are attached to a `FunctionDefinition`, which is simply a collection of overloads that share the same name. Most of the fields in `FunctionProperties` are, however, overload-specific. They should be moved to the `Overload` struct. In the long term, the hierarchy of function definitions, each with child function overloads, should be flattened to just overloads. This commit takes one step in this direction.

Epic: CRDB-20370
Release note: None

96207: rpc,security: simplify and enhance the client cert validation r=stevendanna a=knz

Fixes #96174.
Prerequisite for #96153.
Epic: CRDB-14537

TLDR: this change enhances various aspects of TLS client cert validation. Among other things, it makes the error clearer in case of authentication failure.

Example, before:
```
ERROR: requested user demo is not authorized for tenant {2}
```
Example, after:
```
ERROR: certificate authentication failed for user "demo"
DETAIL: This is tenant system; cert is valid for "root" on all tenants.
```

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
}
return authnSuccessPeerIsPrivileged{}, nil
Some of the previous discussion is around it being hard to see what cases lead us to hit this case. I think part of the difficulty is tracking when `wantedTenantID` is set and unset. I wonder if a simple refactoring might help (warning: code written directly in browser). I still don't love it, but it is nice to be able to directly scan for `return authnSuccessPeerIsPrivileged`. Anyway, perhaps this is just pushing peas around the plate.
// authenticate verifies the credentials of the client and performs
// some consistency check with the information provided.
func (a kvAuth) authenticate(ctx context.Context) (authnResult, error) {
var ar authnResult
if clientTenantID, localRequest := grpcutil.IsLocalRequestContext(ctx); localRequest {
var err error
ar, err = a.authenticateLocalRequest(ctx, clientTenantID)
if err != nil {
return nil, err
}
} else {
var err error
ar, err = a.authenticateRemoteRequest(ctx)
if err != nil {
return nil, err
}
}
switch res := ar.(type) {
case authnSuccessPeerIsTenantServer:
// res carries the tenant ID claimed by the client; it must match the tenant served by this server.
if tid := roachpb.TenantID(res); tid != a.tenant.tenantID {
log.Ops.Infof(ctx, "rejected incoming request from tenant %v (misconfiguration?)", tid)
return nil, authErrorf("client tenant identity (%v) does not match server", tid)
}
case authnSuccessPeerIsPrivileged:
default:
return nil, errors.AssertionFailedf("programming error: unhandled case %T", ar)
}
return ar, nil
}
// Deal with local requests done through the
// internalClientAdapter. There's no TLS for these calls, so the
// regular authentication code path doesn't apply. The clientTenantID
// should be the result of a call to grpcutil.IsLocalRequestContext.
func (a kvAuth) authenticateLocalRequest(ctx context.Context, clientTenantID roachpb.TenantID) (authnResult, error) {
if clientTenantID.IsSystem() {
return authnSuccessPeerIsPrivileged{}, nil
}
// Sanity check: verify that we do not also have gRPC network credentials
// in the context. This would indicate that metadata was improperly propagated.
maybeTid, err := getTenantIDFromNetworkCredentials(ctx)
if err != nil || maybeTid.IsSet() {
return nil, authErrorf("programming error: network credentials in internal adapter request (%v, %v)", maybeTid, err)
}
return authnSuccessPeerIsTenantServer(clientTenantID), nil
}
// authenticateRemoteRequest authenticates requests made over a TLS connection.
func (a kvAuth) authenticateRemoteRequest(ctx context.Context) (authnResult, error) {
// We will need to look at the TLS cert in any case, so
// extract it first.
clientCert, err := getClientCert(ctx)
if err != nil {
return nil, err
}
tenantIDFromCredentials, err := getTenantIDFromNetworkCredentials(ctx)
if err != nil {
return nil, authErrorf("client provided invalid tenant ID: %v", err)
}
// Did the client peer use a tenant client cert?
if security.IsTenantCertificate(clientCert) {
// If the peer is using a client tenant cert, in any case we
// validate the tenant ID stored in the CN for correctness.
tlsID, err := tenantIDFromString(clientCert.Subject.CommonName, "Common Name (CN)")
if err != nil {
return nil, err
}
// If the peer is using a TenantCertificate and also
// provided a tenant ID via gRPC metadata, they must
// match.
if tenantIDFromCredentials.IsSet() && tenantIDFromCredentials != tlsID {
return nil, authErrorf(
"client wants to authenticate as tenant %v, but is using TLS cert for tenant %v",
tenantIDFromCredentials, tlsID)
}
return authnSuccessPeerIsTenantServer(tlsID), nil
} else {
// We are using TLS, but the peer is not using a client tenant
// cert. In that case, we only allow RPCs if the principal is
// 'node' or 'root' and the tenant scope in the cert matches
// this server (either the cert has scope "global" or its scope
// tenant ID matches our own).
//
// TODO(benesch): the vast majority of RPCs should be limited to just
// NodeUser. This is not a security concern, as RootUser has access to
// read and write all data, merely good hygiene. For example, there is
// no reason to permit the root user to send raw Raft RPCs.
certUserScope, err := security.GetCertificateUserScope(clientCert)
if err != nil {
return nil, err
}
if err := checkRootOrNodeInScope(certUserScope, a.tenant.tenantID); err != nil {
return nil, err
}
if tenantIDFromCredentials.IsSet() {
return authnSuccessPeerIsTenantServer(tenantIDFromCredentials), nil
} else {
return authnSuccessPeerIsPrivileged{}, nil
}
}
}
Not bad. Doing this simplification in a separate PR: #96338
Force-pushed 6cbd72c to 97f60e7
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @dhartunian, @jeffswenson, and @stevendanna)
pkg/rpc/auth.go
line 155 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think maybe you're not looking at the commit on which I commented? The comment is on r1, on the line that reads:
`return roachpb.TenantID{}, authErrorf("this tenant (%v) cannot serve requests from a server for tenant %v", a.tenant.tenantID, tid)`
Done.
pkg/rpc/auth.go
line 137 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[💭 ] I could see having a method in rpc/context for getting and removing the client tID from the context rather than exposing the header key and having the caller do the manipulation. Fine as-is though.
Done.
pkg/server/status.go
line 139 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Generally this would be incorrect, because we don't only do fan-out for incoming requests. We also do fan-out to handle a request internally to the KV layer / system tenant, in response to a SQL statement.
But this function doesn't seem to be used for KV calling KV, is it? At least not when calling `Batch`, which is the case where I see why propagating a tenant's identity would be bad (because, for example, reading from the meta ranges would fail). For RPCs that follow the pattern of having a `NodeID` in the request and forwarding the request to that node (or fanning out to all nodes if `NodeID` is zero), it seems to me that maintaining the tenant's identity would be a good thing (so that each node can individually do its own auth rather than rely on only the first KV node to do it).
But then even if we focus specifically on SQL-to-SQL request fanout (within the tenant), we still would not need to propagate the metadata here. This is because the outgoing RPC connection (as initiated by rpc.Dial) would re-inject the tenant ID in the outgoing request anyway.
Well, but it'd probably be cheaper for that function to check whether the desired tenant ID is already in the ctx, and not add it again if it is.
Does this function serve any purpose? I think this function would really benefit from a comment explaining what it's about. I think it was introduced in conjunction with `server.remote_debugging.mode`, which has since gone away; I think it was about propagating a piece of metadata injected by the grpc-gateway that was to be checked on all nodes. I suspect we can get rid of the function, and then we don't have to argue about what it should propagate and what it shouldn't.
If the function serves some purpose, I would really benefit from a comment explaining what the deal is.
We've discussed this elsewhere with David and a suggestion was made to do this modification right after authentication, instead of here. I've adapted the PR accordingly.
(And added an explanatory comment here as requested by Andrei)
Force-pushed 56d514d to 334d1d7
In the previous change, the logging tag was just "tenant". This is somewhat confusing/ambiguous with the tenant identity of the server itself.
Release note: None
Force-pushed 334d1d7 to 82cbb17
RFAL, Andrei asked me to not be too prescriptive about the context upon authentication, and instead only erase the excess metadata in the internal client adapter. I've updated the PR accordingly.
TLDR: this patch introduces a mechanism through which a secondary tenant server can identify itself to its peer over RPC using a different mechanism than a TLS "tenant client" certificate.

When running with shared-process multitenancy (with secondary tenant servers running in the same process as the KV node), we want to allow tenant servers to dial RPC connections with the same TLS client cert as the KV node.

To make this possible, it becomes necessary for an RPC client to identify itself as a secondary tenant by another means than the CN field in the TLS client cert.

This patch does so by introducing a new optional gRPC metadata header, "client-tenant".

- When absent, we continue to use the tenant ID in the client cert as the claimed tenant identity of the client, as previously.
- When present, it is used as the claimed tenant identity of the client. In that case, we allow two TLS situations:
  - either the client is _also_ using a client tenant cert, in which case we verify that the tenant ID in the cert matches the one in the metadata;
  - or, the client is using a regular client TLS cert, in which case we verify that it is using a 'root' or 'node' cert, since only these principals are allowed to perform RPCs in the first place.

The authentication rules are summarized in the following table. This matrix was designed with the following properties:

- *a rogue client cannot get more access by adding gRPC metadata than it would have if it didn't pass gRPC metadata at all*. This can be checked by observing that for a given TLS client cert, the addition of gRPC metadata always results in authn and authz rules that are at least as restrictive.
- the gRPC metadata remains optional, so as to allow previous-version SQL pods to continue to connect with just a valid TLS cert.

```
+------------------+---------------+-------------------------------------+--------------------------------------+
|                  |               | Server is system tenant             | Server is secondary tenant           |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| TLS client cert  | gRPC metadata | Authn result        | Authz rule    | Authn result            | Authz rule |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| Tenant client    | None          | OK                  | tenant-filter | OK if client tenant     | allow      |
|                  |               |                     |               | ID matches server       |            |
|                  |               |                     |               |                         |            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| `root` or `node` |               | OK                  | allow         | OK if user scope        | allow      |
| client           |               |                     |               | matches server tenant ID|            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| other client     |               | deny                | N/A           | deny                    | N/A        |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
| Tenant client    | Client        | OK if TLS tenant ID | tenant-filter | OK if TLS, MD and       | allow      |
|                  | specified     | matches MD          |               | server tenant IDs match |            |
+------------------+ tenant ID     +---------------------+---------------+-------------------------+------------+
| `root` or `node` |               | OK                  | tenant-filter | OK if MD and server     | allow      |
| client           |               |                     |               | tenant IDs match        |            |
+------------------+               +---------------------+---------------+-------------------------+------------+
| other client     |               | deny                | N/A           | deny                    | N/A        |
+------------------+---------------+---------------------+---------------+-------------------------+------------+
```

Release note: None
Force-pushed 82cbb17 to f3fa310
Just have a question re: stripping metadata when running RPCs over the network from tenant->KV.
The code iterations really made things a lot more readable. Now that I walked through the call chain in auth.go, things are much clearer.
Reviewed 5 of 9 files at r12, 4 of 4 files at r13, 3 of 11 files at r14, 8 of 8 files at r15, 3 of 7 files at r16, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @jeffswenson, @knz, and @stevendanna)
pkg/rpc/context.go
line 822 at r17 (raw file):
// metadata in the context, but we need to get rid of it
// before we let the call go through KV.
ctx = grpcutil.ClearIncomingContext(ctx)
Is this also done in the case where the tenant is out-of-process? This code path is for local calls only, right?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @dhartunian, @jeffswenson, and @stevendanna)
pkg/rpc/context.go
line 822 at r17 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Is this also done in the case where the tenant is out-of-process? This code path is for local calls only, right?
This code path is only done for local calls, yes.
We do not need to do this for network calls. gRPC differentiates between "incoming contexts" and "outgoing contexts". The data from an incoming context does not carry over to an outgoing context in a network call automatically.
(Which is why we have the function `forwardSQLIdentityThroughRPCCalls` in pkg/server to copy the SQL username from the incoming metadata to the outgoing context.)
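To illustrate the incoming-vs-outgoing distinction, a forwarding helper in the spirit of forwardSQLIdentityThroughRPCCalls might look like the sketch below (the function and header names here are illustrative, not the actual pkg/server code):

```go
import (
	"context"

	"google.golang.org/grpc/metadata"
)

// forwardHeader copies one header from the incoming (server-side) metadata
// into the outgoing (client-side) metadata. Without an explicit step like
// this, a value received on an incoming RPC is not carried over when the
// server in turn dials another node.
func forwardHeader(ctx context.Context, key string) context.Context {
	if vals := metadata.ValueFromIncomingContext(ctx, key); len(vals) > 0 {
		return metadata.AppendToOutgoingContext(ctx, key, vals[0])
	}
	return ctx
}
```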
TFYR! bors r=dhartunian
Build succeeded:
Fixes #75795.
Fixes #91996.
Epic: CRDB-14537
Informs #77173.
When running with shared-process multitenancy (with secondary tenant
servers running in the same process as the KV node), we want to allow
tenant servers to dial RPC connections with the same TLS client cert
as the KV node.
To make this possible, it becomes necessary for an RPC client
to identify itself as a secondary tenant by another means than
the CN field in the TLS client cert.
This patch does so by introducing a new optional gRPC metadata header,
"client-tenant".
- When absent, we continue to use the tenant ID in the client
  cert as the claimed tenant identity of the client,
  as previously.
- When present, it is used as the claimed tenant identity of the
  client. In that case, we allow two TLS situations (a minimal
  client-side sketch follows this list):
  - either the client is also using a client tenant cert,
    in which case we verify that the tenant ID in the cert
    matches the one in the metadata;
  - or, the client is using a regular client TLS cert,
    in which case we verify that it is using a 'root' or
    'node' cert, since only these principals are allowed
    to perform RPCs in the first place.
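As referenced in the list above, here is a minimal client-side sketch of attaching the claimed identity as plain metadata. It is illustrative only: the header key is spelled as in this description ("client-tenant"), and the PR itself injects the header via per-RPC credentials on the connection rather than per call.

```go
import (
	"context"

	"google.golang.org/grpc/metadata"
)

// withClaimedTenant returns a context whose outgoing metadata carries the
// claimed tenant identity; the server validates it against the TLS client
// cert according to the rules described here.
func withClaimedTenant(ctx context.Context, tenantID string) context.Context {
	return metadata.AppendToOutgoingContext(ctx, "client-tenant", tenantID)
}
```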
The authentication rules are summarized in the following table.
This matrix was designed with the following properties:
- a rogue client cannot get more access by adding gRPC metadata than
  it would have if it didn't pass gRPC metadata at all. This can be
  checked by observing that for a given TLS client cert, the addition of
  gRPC metadata always results in authn and authz rules that are at
  least as restrictive.
- the gRPC metadata remains optional, so as to allow previous-version
  SQL pods to continue to connect with just a valid TLS cert.
Release note: None