-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proto/agent: add privileged API #7
proto/agent: add privileged API #7
Conversation
Yes... 🤔 Delegation API? Platform API?
I was wondering how to deal with this too. I bet @azdagron has an opinion here. Overall, I think the alignment with the workload api makes sense as a first step. It will be a smaller lift. Eventually, we may think about a different shape in order to better facilitate callers that are subscribing to multiple SVIDs, but that can be introduced later as a new RPC. |
I had a quick peek at this, happy to see that it seems to have been relatively straightforward! Some minor things here and there, but overall looking good IMO. |
Yes, it's something we probably need in Cilium's use case. I agree to do this incrementally. |
I was leaning the other direction 😬 There are a few bits of "baggage" in the Workload API that we may or may not want to replicate here until we need it. A few examples:
I guess it depends on whether we want this API to look more like the workload API or the other SPIRE APIs. Another thing we may wish to change is that currently, federated trust domain bundles are keyed by the SPIFFE ID or the trust domain ( I'm sort of leaning towards making it look and feel like other SPIRE APIs, but I could see how one might want to go the other direction, depending on how the delegatee is going to pass along these materials to the workload.
I agree. What would you all think of something the following, declared locally in this API definition:
|
Aren't bundles rotated?
I don't have strong opinions. Whatever shape that contains the data we need is good for me.
Seems good to me. I only have the question about the bundle rotation above. |
I presume a bundle rotation would result in a new response on the stream, similar to how the Workload API operates. The difference between the two is that instead of repeating that same bundle on each X509SVID in the response, we'd just return one bundle. |
This makes me wonder if we need to have any special consideration around scale and grpc message size |
I tried to put this idea in proto terms and ended up with something like:
I'm not sure if we should be mixing different messages over the same rpc, what about having a separated rpc to get bundle updates? Edit: I'm not that sure that having a different rpc would be good. It just adds complexity. Maybe an option to avoid sending the same bundle again and again would be to make it optional and only send it when there is a change. It'll need some logic on the spire side to understand that it was changed and the client should always check this field. I'm happy to hear more ideas on this @azdagron. |
Yes I agree. I think perhaps @azdagron was suggesting not a separate message for bundle, but having it in the top-level message instead e.g. message FetchX509SVIDsBySelectorsResponse {
repeated X509SVIDWithKey svids = 1;
map<string, bytes> bundles = 2;
...
}
message X509SVIDWithKey {
spire.api.types.X509SVID x509_svid = 1;
// @azdagron why no type for this one?
bytes x509_svid_key = 2;
string spiffe_id = 3;
string trust_domain = 4;
} |
f6af4bc
to
c27f666
Compare
I think you're right, I was over-complicating it.
What about the bundle for this trust domain?, isn't it better to have it as a separated field to avoid always having to perform the lookup in the map? (No strong opinion here, just wondering) If we put the federated bundles here then we have to include a list of
spiffe_id and trust_domain are part of |
Historically, we have the local bundle and the federated bundles in our response message(s). What we've observed is that this arrangement leads to client implementations that don't support federation, because federation ends up being a totally different codepath. As a result, we've been leaning towards an "always check the map" approach, as it unifies the "this" trust domain authN and the federated codepaths rather than special casing the latter.
Hmm ... yes this is a good point. Reading this, I've also realized that I think our bundle repetition optimization is probably in vain because we expect callers to hit this RPC once per workload, rather than once for all workloads... So I think you were probably right in your original assertion of a different RPC, if the goal is to reduce overhead there. It definitely causes client friction. @azdagron and I are chatting offline and will provide some more feedback here soon |
I should clarify that it is not fully in vain, as we get the optimization over SVID responses on a per-workload basis, but in practice there is usually a very small number (almost always 1) of these. |
Having some more time to think on this and chatting with @evan2645 a little bit, I think maybe I was overzealous in my suggestions. I think we should model this closely to what is needed for cilium's use case and emphasize somehow, either through the name and shape of the RPC, and in the documentation that the intention of this API is for a delegatee to obtain the SVID+bundle materials on a per-workload basis. I also worry that I could personally yak-shave this thing to death trying to find the optimal shape that covers all use cases and think that we should just settle on whatever makes sense for cilium for now. If that means a shape identical to the Workload API, so be it. If at a future point we determine a better API, we can always introduce additional RPCs or a V2 API. |
I have been experimenting a bit more with the definition of this API, the different optimization ideas and the Cilium needs. I think there is a good coupling of the Cilium needs and the optimization ideas.
For the cilium use case we need
I think points 1 and 2 are not that difficult, all the different shapes of the API we have discussed include this information. I wanted to avoid discussing point 3 but I think it's better to have that discussion now because it could also affect the definition of this API. I fear having a single gRPC stream for each workload won't escalate well and we'd need to "multiplex" updates for multiple workloads in a single stream. (Any input of your previous experience on this point is more than welcome). We'd need then to define an API that sends updates for multiple workloads on the same stream and that allows the client to add and remove "watched" workloads. After some thinking I have the following idea on mind: service Privileged {
//... other rpcs not important for this discussion
// Fetch X.509-SVIDs for multiple workloads.
rpc FetchX509SVIDs(stream FetchX509SVIDsRequest) returns (stream FetchX509SVIDsResponse);
}
// FetchX509SVIDsRequest is used by the clients to modify the set of workloads
// that they receive SVIDs for.
message FetchX509SVIDsRequest {
enum Operation {
UNSPEC = 0;
// Add a new workload to the watched set.
ADD = 1;
// Delete a workload from the watched set.
DEL = 2;
}
// Operation to perform. Add/delete a workload to/from the watched set.
Operation operation = 1;
// Workload's identifier. The meaning of this field depends on the
// operation:
// ADD: The server will include this id in the messages containing updates
// for this workload.
// DEL: The id of the workload to remove from the watched set.
uint64 id = 2;
// Selectors of the workload to add to the watched set. Only used when
// operation is ADD.
repeated spire.api.types.Selector selectors = 3;
}
message FetchX509SVIDsResponse {
// Id of the workload. This value corresponds to the one used when adding
// the workload to the watched set.
uint64 id = 1;
repeated X509SVIDWithKey x509_svids = 2;
} The rpc is bidirectional streaming to allow the client to add / remove workloads. It's needed in the cilium case as the workloads to watch change when pods are created/deleted. The I'm not sure if it makes sense. What do you think about it? Do you have any different idea to implement a scalable way to watch multiple workloads? Is that feasible to implement in SPIRE? Another possibility is to continue with the one stream per workload rpc and see how far we can get with that before looking for a more optimized way... |
Ok, this is good ... I feel like we are making some progress. What I'm hearing is:
I think the single stream with subscription control messages in one direction and multi-tenant SVID data in the other makes sense for this goal... I also think that we will want a separate stream for bundle. Some considerations are
The consequence will be that clients will have some codepaths for SVID management and bundle management, and those paths may be separate. Here's an idea that iterates on @mauriciovasquezbernal's latest example service Privileged {
// Fetch X.509-SVIDs for multiple workloads.
rpc FetchX509SVIDs(stream FetchX509SVIDsRequest) returns (stream FetchX509SVIDsResponse);
// Fetch local and federated bundles
rpc FetchX509Bundles(empty) returns (stream FetchX509BundlesResponse);
}
// FetchX509SVIDsRequest is used by the clients to modify the set of workloads
// that they receive SVIDs for.
message FetchX509SVIDsRequest {
enum Operation {
UNSPEC = 0;
// Add a new workload to the watched set.
ADD = 1;
// Delete a workload from the watched set.
DEL = 2;
}
// Operation to perform. Add/delete a workload to/from the watched set.
Operation operation = 1;
// Workload's identifier. The meaning of this field depends on the
// operation:
// ADD: The server will include this id in the messages containing updates
// for this workload.
// DEL: The id of the workload to remove from the watched set.
uint64 id = 2;
// Selectors of the workload to add to the watched set. Only used when
// operation is ADD.
repeated spire.api.types.Selector selectors = 3;
}
message FetchX509SVIDsResponse {
// Id of the workload. This value corresponds to the one used when adding
// the workload to the watched set.
uint64 id = 1;
repeated X509SVIDWithKey x509_svids = 2;
}
message X509SVIDWithKey {
spire.api.types.X509SVID x509_svid = 1;
bytes x509_svid_key = 2;
string spiffe_id = 3;
string trust_domain = 4;
repeated string federates_with = 5;
}
message FetchX509BundlesResponse {
string trust_domain_name = 1;
bytes bundle = 2;
} Consumer calls ADD/DEL on fetch svid to manage the stream subscriptions. Consumer also attaches to fetch bundles to receive updates on all bundles the agent has. They are sent one bundle per message (this lets us send update of a single bundle without retransmitting all the rest). X509SVIDWithKey carries a repeated string that states the trust domain names that workload should federate with. The consumer manages the lookup against data received from the bundle stream. There are some smaller gotchas in here, like the id field you added to track state from consumer side ... and some typing related stuff (and naming of the messages) .. mostly curious to hear thoughts on the overall shape and interaction pattern of an approach like above, given everything we've discussed. |
How does a caller know when it can drop a previously received bundle from the state it is caching? Relying on the aggregated |
Hmm that is a good question ... in general, i expected client to track all bundles that agent tracks ... do we have mechanisms for bundle redaction in agent core? we could send an update for td name with empty bundle? Open to other ideas :) I'm having a hard time coming up with an efficient answer that includes bundle data in SVID response... |
Other option is to package all bundle data in one message, but still have them separate from the SVID message(s) |
I think the agent implementation is already going to require some level of "diffing" old state v.s. new on the agent side. We could probably send a deletion event... empty bundle field in the response sounds just as good as anything. |
@mauriciovasquezbernal curious to hear if you have any concerns about handling svid and bundle data separately? It will push some complexity on to the client, but the end result I think will be much more efficient than the proposals we had that include bundle and svid data together |
The goal here is to avoid sending the full selectors each time. I considered also the case where it's the server who creates this ID field, however I found it more difficult because the server will have to tell the client which is the ID for a given set of selectors before sending any update for those. Another idea that could work is to use the hash of the selectors instead. @evan2645 any comment?
Yeah, this could be a little bit difficult from the client side. For instance when there is an SVID that federates with a trust domain that the client hasn't received the bundle for. IMO it's something that the clients could live with anyway.
@evan2645 I don't have any concerns so far. I think it makes a lot more sense now that we would be able to watch multiple SVIDs from the same client. Actually my concern is more about the spire-side implementation. I trust blindly your assessment there. I'll update this PR so we can discuss directly on the code. |
13845f5
to
b9ccd1f
Compare
Add proto definitios for new delegation API. This API includes two rpcs: FetchX509SVIDs and FetchX509Bundles. The FetchX509SVIDs rpc allows a privileged client to get X509-SVIDs for a given set of workloads. This method uses a bidirectional gRPC stream. The server to client flow is used to send SVID updates and the client to server one is used as a control mechanism to add / remove workloads from the watched set. The client can add workloads to the watched set by specifying a list of workload's selectors. X509-SVIDs for registration entries matching *all* the passed selectors are returned. The FetchX509Bundles rpc streams all bundles (the federated and the trust domain one) to the client. Signed-off-by: Mauricio Vásquez <[email protected]> Signed-off-by: Mauricio Vásquez <[email protected]>
b9ccd1f
to
6fd689f
Compare
Hi @azdagron @evan2645. I have been working with the AccuKnox folks (@nyrahul @navarrothiago @rscampos) on this PR and they are going to take over the task. I'll try to participate in the discussions as much as possible. Thanks so much for the help and keep rocking with this project! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @mauriciovasquezbernal - this is looking pretty great. I left a handful of comments around naming and encoding. Many of the questions are focused on making sure we have the behavior and encoding etc documented as proto comment. I'd like to get @azdagron's thoughts as he is better at this kind of thing than I am :)
Also wanted to apologize for the delay last couple weeks. US holiday plus SPIRE 1.0 release tied us up for quite a while.
I have been working with the AccuKnox folks (@nyrahul @navarrothiago @rscampos) on this PR and they are going to take over the task. I'll try to participate in the discussions as much as possible. Thanks so much for the help and keep rocking with this project!
Thank you very much for everything thus far @mauriciovasquezbernal. It has been very nice working with you.
And also nice to meet you @nyrahul @navarrothiago and @rscampos. Perhaps you'd like to have a quick introductory call to get things handed off?
// clients on the spire-agent configuration. | ||
service Delegation { | ||
// Fetch X.509-SVIDs for multiple workloads. | ||
rpc FetchX509SVIDs(stream FetchX509SVIDsRequest) returns (stream FetchX509SVIDsResponse); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since have a significant departure from the workload API here, perhaps a different RPC name makes sense?
To me, this RPC has a sessionized feel to it.
SubscribeToX509SVIDs
?
StartSessionX509SVID
?
ReceiveX509SVIDs
?
AttachX509SVIDStream
?
The bundle RPC would be updated as well to match e.g. AttachBundleStream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of those, I'm leaning towards SubscribeToX509SVIDs the most I think?
// | ||
// The caller must be local and its identity must be listed in the allowed | ||
// clients on the spire-agent configuration. | ||
service Delegation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure the idiomatic naming here ... is it Delegate
or Delegation
? @azdagron do you have a preference?
Maybe DelegateSVIDIssuance
or DelegateIssuance
or DelegateWorkloadIdentity
? I do think the term "delegate" is appropriate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sticking with nouns is important, so I'd rather see "Delegated*" than "Delegate*". Even though this isn't quite shaped like our other resource-oriented APIs, I wonder a resource-oriented name makes sense. Maybe DelegatedIdentity
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good, we think DelegatedIdentity
is more explicit, tks :D
// - ADD: The server will include this id in the messages containing updates | ||
// for this workload. | ||
// - DEL: The id of the workload to remove from the watched set. | ||
uint64 id = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this field, the more it feels like we're building a multiplexed stream on top of a multiplexed stream (i.e. our custom scheme on top of HTTP/2). I can't help but wonder if there are other options that utilize this facility at a lower level rather than duplicate.
Putting that aside, I think this field name could be a bit more descriptive like workload_identifier
or selector_set_id
. Is the field optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the first line, about building a multiplexed stream, do you think that would be better to split the FetchX509SVIDs? Doing so, the new shape will contains the following the RPCs:
rpc SubscribeToX509SVIDs(stream SubscribeToX509SVIDsRequest) returns (SubscribeToX509SVIDsResponse);
rpc UnsubscribeToX509SVIDs(stream UnsubscribeToX509SVIDsRequest) returns (UnsubscribeToX509SVIDsResponse);
rpc AttachX509SVIDsStream(AttachX509SVIDsStreamRequest) returns (stream AttachX509SVIDsStreamResponse);
The SubscribeToX509SVIDs
would be the same as FetchX509SVIDs (operation=ADD)
The UnsubscribeToX509SVIDs
would be the same as FetchX509SVIDs (operation=DEL)
The AttachX509SVIDsStream
would be the FetchX509SVIDs
, but without the operations.
Doing in this way, we just remove this field operation
in the AttachX509SVIDsStream
(the old API FetchX509SVIDs
).
If we just want to subscribe a pod, just use SubscribeToX509SVIDs
. If we just want to fetch the certificate, just use AttachX509SVIDsStream. In this way we can separate all the logics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this field, the more it feels like we're building a multiplexed stream on top of a multiplexed stream (i.e. our custom scheme on top of HTTP/2). I can't help but wonder if there are other options that utilize this facility at a lower level rather than duplicate.
I understand your point. I tried to think about that but I didn't find a solution. Perhaps the input of a grpc / HTTP/2 expert could help here.
Is the field optional?
I don't think it should be optional. If we make this optional, then how the client will be able to create the response-to-workloads relationship?
I think we have two options regarding this field:
- Keep is mandatory and specified by the client as it's now in the PR.
- Remove it completely and replace by something like
selectors_hash
. It'll serve for the same purpose (establishing the response-to-workload relationship) but it won't be specified by the client but instead calculated from the set of selectors. We'll have to define how to calculate the hash of a set of selectors of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rpc SubscribeToX509SVIDs(stream SubscribeToX509SVIDsRequest) returns (SubscribeToX509SVIDsResponse); rpc UnsubscribeToX509SVIDs(stream UnsubscribeToX509SVIDsRequest) returns (UnsubscribeToX509SVIDsResponse); rpc AttachX509SVIDsStream(AttachX509SVIDsStreamRequest) returns (stream AttachX509SVIDsStreamResponse);
I should work fine too. I kept everything in a single stream because some possible race conditions, but now that the id
(workload_identifier
/ selector_set_id
) is set by the client those are not relevant anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should strive very hard to bind the "scope" of the subscription to an individual RPC, otherwise it puts the burden of cleanup on the caller, which could go away at any time. We don't want the server holding onto a subscription because it doesn't know that the caller disappeared.
I think the choice here is either 1) have callers start a stream per subscription or 2) have callers aggregate all subscriptions to a single stream, which is how it is currently defined. Both have tradeoffs. I think the only thing you really save with the aggregate approach is maybe some resource overhead imposed by the individual gRPC streams. I don't have a strong preference. From the server implementation perspective, option 1 seems a little easier to implement. Option 2 at first glance seemed easier to implement from a client perspective, but I think it still requires the client to implement code to demux responses and manage subscriptions. It isn't a clear win for me. I a-little-more-than-slightly prefer option 1, but could be convinced to stick with option 2. I also think we could easily change this up later once we learn more. We don't need to get it perfect now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont know if we got your point, when you say individual RPC, is just only one RPC to do all the functionality (the way its done today - subscribe/unsubscribe/get x509 cert)? In this way, we just use the rpc SubscribeToX509SVIDs
to do all the functionality (as you commented above about the name of the rpc).
Or, having 3 differents streams, one for each rpcs:SubscribeToX509SVIDs
, UnsubscribeToX509SVIDs
and AttachX509SVIDsStream
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry I was confusing :)
What I mean is that the lifetime of a subscription should be tied to an invocation of a gRPC method. If we split the individual RPC we have now into a Subscribe and Unsubscribe RPC, then we have to handle clients never calling Unsubscribe, which complicates the server needlessly. If we only have a single RPC that they interact with, then the lifetime of the subscription naturally aligns to the lifetime of the stream; when the stream is closed, all of the subscriptions opened by that stream can be cleaned up.
Considering all of that, I'm a fan a single RPC. I slightly prefer having a single subscription for each invocation instead of multiplexing multiple subscriptions over the same gRPC stream, but like I mentioned above, I can live with the latter. I don't see how it really reduces code complexity in the client that much, but maybe it saves per-gRPC stream overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem @azdagron, I think now I got it :)
So for now we just let SubscribeToX509SVIDs
and do all the things in only one gRPC.
In addition, today, our current PoC is sending all request in only one stream. Doing so its possible to do the subscribe process for multiple client - informing the id
of the client. Should it be ok?
// empty bundle indicates that it was removed. | ||
message FetchX509BundlesResponse { | ||
string trust_domain_name = 1; | ||
bytes bundle = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bundle encoding and format?
I think workload api does DER-encoded X.509 certs here which would probably be fine.
Should it be repeated bytes ca_certificates
or something? The term "bundle" is pretty overloaded, but it still feels appropriate as the message name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we think repeated bytes ca_certificates
would be better, to be more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like repeated
here.
// X.509 SPIFFE Verifiable Identity Document with the private key. | ||
message X509SVIDWithKey { | ||
spire.api.types.X509SVID x509_svid = 1; | ||
bytes x509_svid_key = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key encoding? I recommend DER PKCS#8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying we should define this encoding (DER PKCS#8) in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is what @evan2645 intends. Otherwise it isn't immediately clear to callers how to parse the result.
spire.api.types.X509SVID x509_svid = 1; | ||
bytes x509_svid_key = 2; | ||
|
||
repeated string federates_with = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more description here would be good
repeated string federates_with = 3; | |
// Names of the trust domains that this workload should federates with | |
repeated string federates_with = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more general comment. Once we finalize the field names here, we should go through and add Require.
prefix to the field comments for required fields. It makes it easy to see at a glance what is minimally expected in each request/response. See the other API descriptions for examples.
Guys, we just send a new PR #8 related to all considerations discussed here :D |
Gonna go ahead and close this pr in favor of #8 Thank you @rscampos and @navarrothiago for preserving @mauriciovasquezbernal's authorship. And special thank you to @mauriciovasquezbernal for kicking this off and getting everyone to a point where we're almost ready to go 😄 🚀 |
Signed-off-by: Andrew Harding <[email protected]>
Add proto definitios for new privileged API. This API includes (for now)
the FetchX509SVIDBySelectors that allows a privileged client to get
the X509-SVID of a given workload by passing a set of selectors.
X509-SVID for registration entries matching all the passed selectors
are returned.
Fixes #4
Questions
message X509SVID
, it seems wrong to have it as it's a common type. However the existing one doesn't include the private key, crl nor the federated bundles. Should we create a newX509SVIDWithKey
or similar message type?I implemented a POC of this API in spire -> https://github.com/kinvolk/spire/commits/mauricio-privileged-attestation-api-poc-v3.