cpu: Expose the total number of keys for TDX #1079
Conversation
Hi @fidencio. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
✅ Deploy Preview for kubernetes-sigs-nfd ready!
Thanks @fidencio for the PR.
How are you planning to utilize this label in practice? Is the label actually useful for anything by itself, or are you planning to turn it into an extended resource or something? What I'm trying to understand is whether this is something we want to have as a built-in feature label, enabled by default for all users.
Also, we need to document this new feature in docs/usage/features.md and docs/usage/customization-guide.md.
Force-pushed from 2ca797e to 4db8927
That's a very good question, and sorry for my lack of explanation on this.
No problem, that's what the reviews are for
I think my kinda (so far hidden) idea for filling these kinds of usage scenarios involving extended resources would be to leverage the NodeFeatureRule (and possibly NodeFeature) CRD, i.e. to have support for specifying ExtendedResources in the NodeFeatureRule object, basically following the lines of Labels, Taints (#540) and Annotations (#863). But we don't have that capability yet. I just realized that there's not even an issue tracking that yet, so I will create one. How would NodeFeatureRule serve your use case? One plan could be to merge this PR without the labeling part and try to squeeze in support for ExtendedResources (in the NodeFeatureRule CRD) for the next v0.13.0 release.
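To make the idea above concrete, here is a rough sketch of what such a rule might eventually look like. The extendedResources field is purely hypothetical (no such field exists in the NodeFeatureRule CRD at this point), and its name, placement and value handling are assumptions:
```
# Hypothetical sketch only: NodeFeatureRule has no extendedResources field yet;
# the field name and the hardcoded value are assumptions for illustration.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: tdx-total-keys
spec:
  rules:
    - name: "expose TDX keys as an extended resource"
      # Assumed field, following the lines of labels, taints and annotations;
      # a real implementation would presumably take the value dynamically
      # from the discovered cpu.security.tdx.total_keys attribute.
      extendedResources:
        feature.node.kubernetes.io/cpu-security.tdx.total_keys: "31"
      matchFeatures:
        - feature: cpu.security
          matchExpressions:
            tdx.enabled: {op: IsTrue}
```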
/ok-to-test
just one nit
@ArangoGutierrez We need that for SNP as well, the total number of ASIDs.
You mean extended resources (#1081)?
@marquiz Yes and no:
(1) We need the extended resource via Rule.
(2) cpu: Expose the total number of ASIDs for SNP. You can only run a specific number of VMs on a host, so we need to expose that as well.
Force-pushed from 4db8927 to 5860678
@marquiz, I've updated the PR and I'll test it together with @ArangoGutierrez's work and report back very soon.
rm trailing white space
Force-pushed from 5860678 to 9079836
Using @ArangoGutierrez's PR, with the following rule ...
... I was able to get ...
There's still one bit missing, though: getting the number of used keys. I'm working on that right now.
No, nevermind. I've done some tests here, let me share the results.
I'm using the following Kata Containers runtime class definition:
```
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
    feature.node.kubernetes.io/cpu-security.tdx.total_keys: "1"
```
Doing a ...
Then when I start a pod, which has the following definition:
```
apiVersion: v1
kind: Pod
metadata:
  name: nginx-qemu-tdx
spec:
  runtimeClassName: kata
  containers:
  - name: nginx-qemu-tdx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
```
I can then see, as a result of kubectl describe node:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                                                 Requests     Limits
  --------                                                 --------     ------
  cpu                                                      1200m (0%)   0 (0%)
  memory                                                   450Mi (0%)   340Mi (0%)
  ephemeral-storage                                        0 (0%)       0 (0%)
  hugepages-1Gi                                            0 (0%)       0 (0%)
  hugepages-2Mi                                            0 (0%)       0 (0%)
  feature.node.kubernetes.io/cpu-security.tdx.total_keys   1            0
```
If I try to start a second pod, now requesting 30 keys (+1 coming from the podOverhead definition), using the definition shown below:
Then I can see it doesn't start:
It means the bookkeeping is correctly done.
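The second pod's exact definition and the scheduling error aren't preserved above, but a minimal sketch of how a pod could request the keys directly as an extended resource might look like the following. The pod and container names are hypothetical; only the 30-key figure comes from the description above:
```
apiVersion: v1
kind: Pod
metadata:
  name: nginx-qemu-tdx-2   # hypothetical name, for illustration only
spec:
  runtimeClassName: kata
  containers:
  - name: nginx-qemu-tdx-2
    image: nginx:1.14.2
    resources:
      limits:
        # 30 keys requested here, plus 1 from the RuntimeClass podOverhead;
        # with 1 key already consumed by the first pod, this exceeds the
        # node's total capacity of 31, so the pod cannot be scheduled.
        feature.node.kubernetes.io/cpu-security.tdx.total_keys: "30"
```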
One thing we may want to figure out later is why, although the bookkeeping is correctly done, the values of "feature.node.kubernetes.io/cpu-security.tdx.total_keys" (coming from ...
/lgtm
Last famous words @marquiz
LGTM label has been added. Git tree hash: 9a1ce5e4555613fa6cbdcf08fa38d76fd3f53a37
The total amount of keys that can be used on a specific TDX system is exposed via the cgroups misc.capacity. See:
```
$ cat /sys/fs/cgroup/misc.capacity
tdx 31
```
The first step to properly manage the amount of keys present in a node is exposing it via the NFD, and that's exactly what this commit does.
An example of how it ends up being exposed via the NFD:
```
$ kubectl get node 984fee00befb.jf.intel.com -o jsonpath='{.metadata.labels}' | jq | grep tdx.total_keys
  "feature.node.kubernetes.io/cpu-security.tdx.total_keys": "31",
```
Signed-off-by: Fabiano Fidêncio <[email protected]>
Force-pushed from 9079836 to 10672e1
/lgtm
LGTM label has been added. Git tree hash: 175450f2c7e94f409a6f2eccdd2f0101a0e50b1a
Thanks @fidencio for the enhancement, looks good to me 👍
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fidencio, marquiz
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.