Pass down resources to CRI #4113
base: master
marquiz commented Jun 28, 2023
- One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
- Issue link: Pass down resources to CRI #4112
- Other comments:
Co-authored-by: Antti Kervinen <[email protected]>
@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
should the keys here be a special type instead of unstructured?
I don't think it's possible to have something like type ResourceName string in protobuf. Please correct me if I'm wrong.
ping @haircommander, are you satisfied with the reply (close as resolved)?
/retitle Pass down resources to CRI
@marquiz We need to check how this will work with DRA and CDI devices. If we have enough information to know which devices need to be added to the sandbox just by the resource claim name.
@marquiz There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: kubernetes/kubernetes#104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?
Thanks @marquiz.
It would be good to get more concrete details on the use cases that this would enable.
There is also the question of complex devices that are managed by device plugins where there isn't a clear mapping from the resources entry (e.g. vendor.com/xpu: 1) to the resources added to the container, or DRA where the associated resources.requests.claims entry is not mentioned.
#### Story 3

As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI
Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?
The CRI changes (in this KEP) would not directly translate to anything in the OCI config. It's just "informational" that a possible hook/wrapper/plugin can then use to tweak the OCI config. Say you want to do customized cpu pinning in your plugin. I'll come up with some more flesh on this section...
@elezar I updated Story 3, PTAL
Thanks.
Resolved?
ping @elezar may I close this as resolved?
requests:
  cpu: 100m
  memory: 100M
  vendor.com/xpu: 1
For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?
That's very much true. I think I'll add a note about this in the KEP somewhere
@elezar I added a note about device plugin resources after this example. WDYT?
ping @elezar may I close this as resolved?
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
@MikeZappa87 since you shared recently something along these lines for the networking capabilities, this KEP also means to interface with NRI
Another point to consider is how we're going to integrate or not these enhancements with the new containerd Sandbox API.
@zvonkok that one is just the native resources and gives the resources in the "obfuscated" form, i.e. not telling the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. keep it.
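To illustrate what "obfuscated" means here, the following is a minimal Python sketch, not kubelet code. It assumes the usual cgroup convention of 1024 CPU shares per full CPU with a floor of 2 shares; the function names and constants are hypothetical and exist only for this example. It shows that a CPU request converted to shares cannot be converted back exactly, and that an extended resource such as vendor.com/xpu has no native equivalent at all.

```python
# Hypothetical constants, assuming the cgroup convention of 1024 shares per CPU.
SHARES_PER_CPU = 1024
MILLI_CPU_PER_CPU = 1000
MIN_SHARES = 2  # assumed lower bound for a CPU shares value


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a CPU shares value."""
    if milli_cpu == 0:
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(shares, MIN_SHARES)


def shares_to_milli_cpu(shares: int) -> int:
    """Best-effort inverse; integer rounding makes the round trip lossy."""
    return shares * MILLI_CPU_PER_CPU // SHARES_PER_CPU


requested = 100                          # "cpu: 100m" in the pod spec
shares = milli_cpu_to_shares(requested)  # what a runtime would see today
recovered = shares_to_milli_cpu(shares)  # what it could guess back
print(requested, shares, recovered)
```

The runtime only sees the derived value, so anything that needs the original request (or a non-native resource name) has to get it from somewhere else, which is the gap this KEP targets.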
OK, PRR looks good - I will approve after the SIG gives approval.
+ INIT_CONTAINER = 0;
+ SIDECAR_CONTAINER = 1;
is this distinction one we want to make? does the kubelet make that distinction? cc @SergeyKanzhelev
Inside the kubelet, it is known whether an "init" container is "one shot" or will be running during the whole pod life.
When passing this down to CRI, we need that info for sizing VMs (to know how many containers are expected to run during the whole life of the pod).
it is a slippery slope to start introducing new types of containers. With the sidecar containers you need to know two things:
- that it is a sidecar
- position of a container in the list of init containers.
Position is very important as it defines how the total resources will be calculated.
Keeping the structure as close to the Pod Spec as possible would be helpful if we introduce any other container types in the future where ordering or any other attributes will also be important.
/hold
to think about whether we can change the structure to make it closer to how the pod spec looks
and thus that structure version number thing I was thinking we might need...
not necessarily to version it the same as pod. I think once we get to Pod v2, we may have a new CRI API as well
The proposal had separate lists for init container and containers. Based on review feedback (e.g. #4113 (comment)) it was changed in b3703c6 to a single list, with a "container type" field.
The order IS important. Kubelet creates containers sequentially. The order of the single list is supposed to be that order (effectively append(initContainers, containers...)).
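The single ordered list described above can be sketched as follows. This is an illustrative Python model, not the CRI API: the enum values follow the ContainerType enum quoted in this review (with REGULAR_CONTAINER renamed to CONTAINER), while ContainerEntry, flatten_containers and the (name, restartable) tuple input are hypothetical names introduced for the example. It assumes that a sidecar is an init container marked as restartable.

```python
from dataclasses import dataclass
from enum import Enum


class ContainerType(Enum):
    INIT_CONTAINER = 0
    SIDECAR_CONTAINER = 1
    CONTAINER = 2


@dataclass
class ContainerEntry:
    name: str
    type: ContainerType


def flatten_containers(init_containers, containers):
    """Build one list in creation order: init containers first, then the rest.

    init_containers: list of (name, restartable) tuples in pod spec order.
    containers: list of container names in pod spec order.
    """
    entries = []
    for name, restartable in init_containers:
        kind = (ContainerType.SIDECAR_CONTAINER if restartable
                else ContainerType.INIT_CONTAINER)
        entries.append(ContainerEntry(name, kind))
    for name in containers:
        entries.append(ContainerEntry(name, ContainerType.CONTAINER))
    return entries


entries = flatten_containers([("setup", False), ("proxy", True)], ["app"])
for entry in entries:
    print(entry.name, entry.type.name)
```

Keeping the position of each entry intact is what lets a consumer reason about which containers can overlap in time.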
Was the suggestion to mark init containers as sidecars? I am thinking - what if we will introduce a new property of a container that can be exhibited by a container of any type? Will it require us to introduce 6 more container types? Or we will just add a new field? If a new field - why not have a field "restartable" on init container in this implementation?
The motivation for the proposed way is to keep it simple here, have the minimum to enable right-sizing of the Pod. Encode the expected lifecycle of each container in the Pod, enabling the calculation of "how many containers will/can be running at the same time".
FWIW, the only new type of container I came up with (but didn't include here) would be a "teardown container", running after all regular containers have finished.
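As a rough illustration of the right-sizing calculation discussed in this thread, here is a simplified Python sketch. It is an assumption about how a runtime might use the ordered list, not something the KEP specifies: init containers run one at a time, a sidecar keeps running once started, and regular containers all run together with every sidecar. The function name and the (kind, cpu_millis) tuples are hypothetical.

```python
def peak_cpu_millis(entries):
    """Estimate the peak CPU (in millicores) needed over the pod lifetime.

    entries: ordered list of (kind, cpu_millis), kind being one of
    "init", "sidecar" or "container".
    """
    running_sidecars = 0   # sidecars started so far keep running
    peak_during_init = 0   # highest demand seen while init steps run
    regular_total = 0      # regular containers all run concurrently
    for kind, cpu in entries:
        if kind == "sidecar":
            running_sidecars += cpu
            peak_during_init = max(peak_during_init, running_sidecars)
        elif kind == "init":
            peak_during_init = max(peak_during_init, running_sidecars + cpu)
        elif kind == "container":
            regular_total += cpu
        else:
            raise ValueError(f"unknown container kind: {kind}")
    return max(peak_during_init, running_sidecars + regular_total)


# Same containers, different position of the sidecar in the init list.
sidecar_late = [("init", 800), ("sidecar", 100), ("container", 200)]
sidecar_early = [("sidecar", 100), ("init", 800), ("container", 200)]
print(peak_cpu_millis(sidecar_late), peak_cpu_millis(sidecar_early))
```

The two example pods differ only in where the sidecar sits in the list, yet the estimated peak differs, which is why the position of a container matters for the total.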
+enum ContainerType {
+ INIT_CONTAINER = 0;
+ SIDECAR_CONTAINER = 1;
+ REGULAR_CONTAINER = 2;
IMO I'd just call this CONTAINER
Agree regular is extraneous/default. Typing the container could prove useful. What happens when a container is not enumerated? I wonder if any container not defined at the time of running the pod be considered ephemeral?
K, changed as suggested to plain CONTAINER
I wonder if any container not defined at the time of running the pod be considered ephemeral?
Yes, I think that's the case
a couple of questions about the exact API, but no need to address those here, we can come to consensus on implementation. /approve from my side, @mikebrow any additional thoughts?
- add UpdatePodSandboxResources CRI rpc (this is covered by [KEP-1287][kep-1287], [PR][kep-1287-beta-pr])
- add pod-level resource requirements (this is covered by [KEP-2837][kep-2837], [PR][kep-2837-alpha-pr])
Can you please clarify what those two statements mean? Do you mean that those KEPs will need to contribute to this API?
KEP-1287 is targeting beta, while this one is targeting alpha. So this KEP will need to clarify how it will work with in-place updates.
cc: @tallclair @ndixita
I could also remove these from non-goals. I added these here because we talk about the UpdatePodSandboxResources message and the pod-level resource requirements later on in the proposal - I wanted to clarify that these additions are not part of this proposal (KEP-4112).
UpdatePodSandboxResources is not yet in the CRI API. KEP-1287 proposes to add that in Beta. So we'd want to sync with that.
WDYT @SergeyKanzhelev ?
I mostly want to make sure we are not making changes that are not compatible with these two KEPs. So listing them as non-goals sounds like a bad idea. Maybe not a goal for alpha, but definitely a goal for beta.
That's certainly what I want, too. To be in sync between all the KEPs. I tried to state that this KEP (4112)
- is NOT proposing to add the new UpdatePodSandboxResources CRI rpc
- is NOT proposing to add pod-level resource requirements
- is proposing to align with those features when they land
The design details section describes what changes (to align) would be made.
I now dropped these two non-goals as they cause more confusion than clarity. Maybe, instead, we should add in Goals to align with them?
With this information, the runtime can for example do detailed resource
allocation so that CPU, memory and other resources for each container are
optimally aligned.
how does it work with CPU manager? Is there any conflicts or any information we need to pass earlier than we do today w.r.t. decisions CPU manager made?
cc @ffromani
CPU manager is not affected. It does its work before we pass the info down to the runtime. This particular sentence is referring to scenarios where the CPU manager none policy is used in Kubernetes. Do you think that this should be mentioned in the proposal?
should be mentioned.. with (story 1/2) and without cpu manager (story 3)
+1 to mention. Any hints that kubelet has needs to be passed into the sandbox creation now?
+1 to mention.
I added the following sentence to this paragraph:
This applies to scenarios where the kubelet CPU manager is disabled (by using the none CPU manager policy).
Any hints that kubelet has needs to be passed into the sandbox creation now?
No. I don't think we want/need to spill these details over. Also, the runtime (or NRI plugin) might have a better understanding of the HW topology.
Also the CreateContainer request is extended to include the unmodified resource
requirements. This makes it possible for the CRI runtime to detect any changes
in the pod resources that happen between the Pod creation and container
creation in e.g. scenarios where in-place pod updates are involved.
is it a possibility? Would there be a sandbox adjustment call in-between? Or it is not expected?
@tallclair are we allowing the container resource changes between the sandbox creation and container creation? Maybe while an init container is executing?
TBH, I'm not entirely sure about all the dirty details here. But that exactly was the thinking: a long-running init container, and the pod gets updated before that is completed. And, because there are a lot of delicate details, this would also make small adjustments/touch the container lifecycle handling in the future.
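The change detection mentioned above could look roughly like this on the runtime side. This is a hedged Python sketch under stated assumptions: the resource maps are plain dicts of resource name to quantity string, as in the requests example earlier in this review, and diff_resources is a hypothetical helper, not part of any CRI runtime. What the runtime does with a detected difference is left open.

```python
def diff_resources(informed, requested):
    """Return {name: (informed_value, requested_value)} for entries that differ.

    informed: resources seen at sandbox creation (RunPodSandbox).
    requested: resources seen at container creation (CreateContainer).
    Values are compared as strings, so equivalent quantities written
    differently (e.g. "1" and "1000m") would be reported as changed.
    """
    changed = {}
    for name in sorted(set(informed) | set(requested)):
        before = informed.get(name)
        after = requested.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


at_sandbox_creation = {"cpu": "100m", "memory": "100M", "vendor.com/xpu": "1"}
at_container_creation = {"cpu": "200m", "memory": "100M", "vendor.com/xpu": "1"}
print(diff_resources(at_sandbox_creation, at_container_creation))
```

A real implementation would likely parse quantities before comparing them; the string comparison here only keeps the example short.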
and for ephemeral?
Currently, ephemeral containers cannot have any resource requirements (in the Kubernetes API/PodSpec level). But your comment @mikebrow is good in that it might change in the future(?) Nothing in CRI or OCI prevents ephemeral containers having resources.
would have to check but I think the extra mounts may be different to include debug tools missing from a copied container in the pod..
True, at least the mount for the logfile is probably different
nod.. so a certain exception for that container type.. also not clear we should explicitly fail if the mounts have changed which they might on a container restart due to modified volume drivers..?
detect sure.. what to do with that detection is probably a long running set of discussions and use case definitions. Stretch goals.
No no, there are no changes wrt mounts (or devices) in the CreateContainer request. Everything continues to work as it does now. Not related to this KEP, but with ephemeral containers there might be scenarios where they cannot be supported, e.g. CoCo.
as @mikebrow said, if we have this information on sandbox creation and for some reason anything took a dependency on it, it will open the whole can of worms on how to handle all sorts of edge cases
in the pod resources that happen between the Pod creation and container
creation in e.g. scenarios where in-place pod updates are involved.

The UpdatePodSandboxResources CRI message is also updated when/if that is
updated in a sense that there will be an additional call in-between?
No, the message will be updated. Re-phrased this paragraph now to:
[KEP-1287][kep-1287] Beta ([PR][kep-1287-beta-pr]) proposes to add new
UpdatePodSandboxResources rpc to the CRI API. If/when KEP-1287 is implemented
as proposed, the UpdatePodSandboxResources CRI message is updated to include
the resource information of all containers (aligning with
UpdateContainerResourcesRequest).
The UpdatePodSandboxResources CRI message is also updated when/if that is
introduced by the [KEP-1287][kep-1287] Beta ([PR][kep-1287-beta-pr]).

Information about the Pod-level resources are added when/if the Pod-level
@ndixita can you please confirm it is tracked in your KEP?
Re-phrased this paragraph to
[KEP-2837][kep-2837] Alpha ([PR][kep-2837-alpha-pr]) proposes to add new
Pod-level resource requirements field to the PodSpec. This information will be
added to the PodResourceConfig message, similar to the container resource
information, if/when KEP-2837 is implemented as proposed.
The intention is to sync with the other KEPs
This is not tracked in Pod Level Resources KEP. While I am trying to understand this KEP, I have a question: Will the VM based runtimes benefit from knowing the overall requirements of all the containers in a pod and not requirements of individual containers? Again, this question is just for my understanding...
request.

```diff
message PodSandboxConfig {
how big this structure will be? What would be the max number of containers we can report in this structure? (what is the message size limit today?)
By quick search the gRPC message size limit is 4MB. This is a lot more than the maximum size of a Pod object in Kubernetes. I believe we should not hit a limit in any practical scenario. WDYT?
4MB may be OK.
- rename enum REGULAR_CONTAINER -> CONTAINER
- update references to other related keps under Design Details
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
What is the runtime supposed to do when this does not match the values provided in resources of LinuxPodSandboxConfig?
There should be no overlapping information between these two structures. Thus a conflict should not be possible
I am not sure I understand. Maybe I am missing something here, but you should be able to calculate what is in resources based on the pod spec. Is that correct? So some implementation may take a dependency on the specific way it is calculated. Thus the question: what if they do not match at some point? Also, if they are inter-dependent, perhaps some of it may be deprecated?
The runtime is not calculating resources of LinuxPodSandboxConfig. They are set by the kubelet.
The resources (mounts, devices, CDI devices, Kubernetes resources) in the
CreateContainer request should be identical to what was (pre-)informed in the
RunPodSandbox request. If they are different, the CRI runtime may fail the
how would we separate cases when VM was restarted and we don't really know what was originally passed to the sandbox creation any longer? I do not understand this consistency check argument
Yes, this paragraph seems outdated, especially in the light of the in-place pod updates. The paragraph was added as a response to some review feedback (I just couldn't find that comment now)
How about re-phrasing it for example:
In some usage scenarios the container creation may fail if
the resources (mounts, devices, CDI devices, Kubernetes resources) in the
CreateContainer request do not match what was (pre-)informed in the
RunPodSandbox request. This may be the case e.g. when
changes are not allowed after a VM-based Pod has been created.
WDYT @SergeyKanzhelev ?
I believe he means the original pod metadata plus any pod updates must be applied to the pod before the container is created.
### kubelet

Kubelet code is refactored/modified so that all container resources are known
before sandbox creation. This mainly consists of preparing all mounts (of all
Is there a possibility that today some mounts can ONLY be created when a container is being created (something that an init container needs to initialize)? Will this break this behavior?
This was discussed/investigated in #4113 (comment)
I think there should be no breakages because of this.
WDYT @SergeyKanzhelev ?
When appropriate, image mounts included in the container config are added at create container time when we build the rootfs.. additionally oci level plugins, nri plugins, cdi, and cri proxy tools may add mounts. But these are not the "extra" mounts passed in from the host level CRI client (kubelet), additionally the OCI image volume mount is added by the container runtime during create container. In the OCI volume case this is why I'm asking for the oci volume image reference..
The Mount structure is exactly the same (with the same information) as what will be passed as part of the CreateContainerRequest/ContainerConfig message.
that works..
it actually goes back to this comment: #4113 (comment)
Would it be expected that runtimes will need to mount everything at pod sandbox creation time now? Or the behavior is not defined? Will we require that the old way (without passing all this extra context in sandbox creation) should continue working?
The runtimes do not need to change the way they operate. They don't need to mount everything at sandbox creation.
All of the data is something that the runtime CAN use if it needs it. E.g. this paragraph of the proposal tries to state that. Maybe we need to state that more clearly in more places(?) And certainly comments in the cri proto file.
The CAN part worries me. If something CAN be used the only assumption we can make is that it IS BEING USED. This is why maybe we should have two modes of operations
@SergeyKanzhelev as it was mentioned above, this information is really needed in some scenarios, and OCI runtimes that are dependent on that information will be using it, to fix current problem. Other runtimes that don't have problems now are expected to ignore it.
Besides comments above, I think it is important to clarify and even test as part of critest the requirement for Container Runtime. It goes to this question: #4113 (comment) Some critest ideas might be:
I understand the challenge with the VM-based runtimes, but I am also worried that we are going against the direction of making the Pod more dynamic. All the duplication and mismatch between what we thought the Pod and containers were before and what they will be is scary to me. Another alternative would be to have two modes of runtime. One mode with all information available, where every dynamic change recreates the Pod. And another mode - what we have today. And the runtime will need to tell on startup which mode it wants to work in.
Can you elaborate on the direction of "making Pod more dynamic"?
What are you scared of, or what are you imagining where this is going?
We will have more and more sandboxed environments that will need such information. The use case with Confidential Containers and GPUs (or any other accelerator) is not realizable with the current architecture. Additionally, there is HW that you cannot hot-plug during runtime, and you need a static configuration at sandbox creation time.
This thing is targeted to solve the problem that CoCo VMs have (and impossible to fix without additional info passed down), and not trying to deprecate or change meaning of any other Pod lifecycle messages or events meaning.
If there are no bugs in the kubelet code, there should be no mismatches between raw data derived from pod spec with fields calculated to
Device and mount paths are provided at pod sandbox creation time as additional information. At the time when
In fact, we are not going against the direction of pods and containers being dynamic. It is the opposite: in every piece of additional data passed, we are reflecting the current state of the object, e.g. "At pod creation we know this much info about the workload", and on "container creation, something already got changed, so here is the current state of that container". This allows us to do rightsizing of the VM during creation including all possible heuristics based on information known at that time, as well as failing creation of a container on (re-)start if something changed since
We don't really need two modes. We just need enough information to make decisions at existing pod lifecycle events, as well as proper error handling for scenarios where dynamic resize of the pod or container is not feasible due to some reasons.
LGTM for moving forward with the KEP under the proviso that a follow up KEP update is required to sync with 1287 and 2837, to cover message/api details not yet discovered, and other dangling issues still in open discussion. I also expect us to find needed additions when we have a cc prototype.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: haircommander, marquiz, mikebrow The full list of commands accepted by this bot can be found here.
Interesting points. Mode works to a certain degree and we already have handler as just such a mode selection. I think the kep will definitely need to be updated to handle the dynamic pod resource changes.. (new networks, different resource claims/devices) and when a handler is found to not be dynamic.. we'll just have to cycle the pod by returning error handler does not support this update type and / or tainting the handler type ahead of time.
@SergeyKanzhelev @mrunalp How to move this forward?
Hi folks, any updates on this KEP? Any ideas to help move it forward?
- update kep.yaml to target v1.33
- update references to kep-1287 and kep-2837
Updated:
The related changes in other referenced KEPs (#1287, #2837) are now merged, which hopefully makes this slightly easier to read. Those changes/KEPs (for Kubernetes v1.32) are still mentioned in the corresponding sections.