- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Naming
- Risks and Mitigations
- Scenario 1. User decides to use sidecars as a way to run regular containers
- Scenario 1.a
- Scenario 2. Balloon sidecars
- Scenario 3. Long initialization tasks running in-parallel
- Scenario 4. Sidecar that never becomes ready
- Scenario 5. Intentional failing or terminating sidecars
- Scenario 6. Keeping a sidecar alive to keep consuming cycles on termination
- Scenario 7. Risk of porting existing sidecars to the new mechanism naively
- Design Details
- Backward compatibility
- kubectl changes
- Resources calculation for scheduling and pod admission
- Exposing Pod Resource requirements
- Topology and CPU managers
- Termination of containers
- Other
- Test Plan
- Prerequisite testing updates
- Unit tests
- Integration tests
- e2e tests
- Ready state of a sidecar container is properly used to create/delete endpoints
- Pod lifecycle scenarios without sidecar containers
- Pod lifecycle scenarios with sidecar containers
- Kubelet restart test cases
- API server is down: failure to update containers status during initialization
- Resource usage testing
- Upgrade/downgrade testing
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Future use of restartPolicy field
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Sidecar containers are a new type of container that starts among the init containers, runs through the lifecycle of the Pod, and doesn't block pod termination. The kubelet makes a best effort to keep them alive and running while other containers are running.
The concept of sidecar containers has been around since the early days of Kubernetes. A clear example is this Kubernetes blog post from 2015 mentioning the sidecar pattern.
Over the years the sidecar pattern has become more common in applications, gained popularity, and the use cases have become more diverse. The current Kubernetes primitives handled that well initially, but they fall short for several use cases and force awkward workarounds in applications.
The next sections expand on what the current problems are. But, to give more context, it is important to highlight that some companies are already using a fork of Kubernetes with this sidecar functionality added (not all implementations are the same, but more than one company has a fork for this).
Imagine you have a Job with two containers: one which does the main processing of the job and the other is just a sidecar facilitating it. This sidecar could be a service mesh, a metrics gathering statsd server, etc.
When the main processing finishes, the pod won't terminate until the sidecar container finishes too. This is problematic for sidecar containers that run continuously.
There is no simple way to handle this on Kubernetes today. There are work-arounds for this problem, most of them consist of some form of coupling between the containers to add some logic where a container that finishes communicates it so other containers can react. But it gets tricky when you have more than one sidecar container or want to auto-inject sidecars.
The sidecar will also not be restarted for Jobs with `restartPolicy: Never` when it is OOM-killed, which may render the pod unusable if the sidecar provided secure communication to other containers. The issue is complicated by the fact that sidecar containers typically have smaller requests, making them the first OOM-kill target, as the OOM score adjustment uses the request as the main input for its calculation.
A log forwarding sidecar should start before several other containers, to simplify getting logs from the startup of other applications and from the init containers. Let's call the app that produces logs the "main" container and the sidecar that facilitates this the "logging" container.
If the logging container starts after the main app, special logic needs to be implemented to gather logs from the main app. Furthermore, if the logging container is not yet started and the main app crashes on startup, those logs are more likely to be lost (depending on whether logs go to a shared volume, over the network on localhost, etc.). While you can modify your application to handle this scenario during startup (as it is probably the same change you need to handle sidecar crashes), this approach won't work for shutdown.
On shutdown the ordering behavior is arguably more important: if the logging container is stopped first, logs for other containers are lost. No matter if those containers queue them and retry to send them to the logging container, or if they are persisted to a shared volume. The logging container is already killed and will not be restarted, as the pod is shutting down. In these cases, logs are lost.
The same things regarding startup and shutdown apply for a metrics container.
Service mesh presents a similar problem: you want the service mesh container to be running and ready before other containers start, so that any inbound/outbound connections that a container can initiate go through the service mesh.
A similar problem happens for shutdown: if the service mesh container is terminated prior to the other containers, outgoing traffic from other apps will be blackholed or not use the service mesh.
However, as none of this can be guaranteed today, most service meshes (like Linkerd and Istio) use fragile and platform-specific workarounds to provide the basic functionality. For example, for termination, projects like kubeexit or custom solutions are used.
Some service meshes depend on a secret (like a certificate) downloaded by another init container to establish secure communication between services. This makes the ordering problem between sidecar and init containers harder.
Another complication is the interaction between log forwarding and service mesh sidecars: the service mesh sidecar provides the networking, while log forwarding needs to be active to upload logs. Sequencing the startup and teardown of these two may be a complicated problem.
Some pods use init containers to pull down configuration/secrets and update them before the main container gets them, and then use sidecars to continue watching for changes, perform the updates, and push them to the main container. This requires two separate code paths today; perhaps the same sidecar could be used for both cases.
This proposal aims to:
- make containers implementing the sidecar pattern first class citizens inside a Pod
- solve a Job completion issue when sidecars should run continuously
- allow mixing initContainers and sidecars for a choreographed startup sequence
- allow easy injection of sidecar containers in any Pod
- (to be evaluated after alpha) allow implementing sidecar containers that are guaranteed to run longer than regular containers
This proposal doesn't aim to:
- support arbitrary dependency graphs between containers
- act as a security control to enforce that pod containers only run while the sidecar is healthy; restarting sidecar containers is a best effort
- allow enforcing security boundaries for sidecar containers that differ from other containers. For example, allowing Istio to run privileged to configure iptables while preventing other containers from doing so
- (alpha) support containers termination ordering
The proposal is to introduce a `restartPolicy` field for init containers and use it to indicate that an init container is a sidecar container. Kubelet will start init containers with `restartPolicy: Always` in order with the other init containers, but instead of waiting for the container's completion, it will wait for the container's startup to complete.

The condition for startup completion will be that the startup probe has succeeded (or that no startup probe is defined) and the `postStart` handler has completed. This condition is represented by the `Started` field of the `ContainerStatus` type. See the section "Pod startup completed condition" for considerations on picking this signal.

The `restartPolicy` field will only be accepted on init containers as part of this KEP. The only supported value proposed in this KEP is `Always`; no other values will be defined as part of this KEP. Moreover, the field will be nullable, so the default value will be "no value".

Other values for the `restartPolicy` of containers will not be accepted, and containers will follow the logic currently implemented (documented here; more details can be found in the section "Future use of restartPolicy field").
Sidecar containers will not block Pod completion - if all regular containers complete, sidecar containers will be terminated.
The `restartPolicy` field of individual init containers can override the Pod-level `restartPolicy` for sidecar containers. As a result, even if the Pod's `restartPolicy` is set to `Never` or `OnFailure`, sidecar containers will still be restarted.
Note, a separate KEP #4438 will enable sidecar containers to be restarted even during Pod termination.
In order to minimize OOM kills of sidecar containers, the OOM score adjustment for these containers will match or exceed the OOM score adjustment of regular containers in the Pod. This is intended to solve kubernetes/kubernetes#111356.
As part of this KEP we will also enable the following for sidecar containers (these will not be allowed for other init containers):
- `PostStart` and `PreStop` lifecycle handlers
- All probes (startup, readiness, liveness)
- Readiness probes of sidecars will contribute to determining overall Pod readiness.

See the example below and the expanded sketch that follows it.
kind: Pod
spec:
initContainers:
- name: vault-agent
image: hashicorp/vault:1.12.1
- name: istio-proxy
image: istio/proxyv2:1.16.0
args: ["proxy", "sidecar"]
restartPolicy: Always
containers:
...
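For illustration only, a sketch of how these capabilities could appear on a sidecar; the image names, port, probe paths, and preStop command are placeholders and are not prescribed by this KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-with-probes
spec:
  initContainers:
  - name: log-forwarder                  # hypothetical sidecar
    image: example.com/log-forwarder:1.0 # placeholder image
    restartPolicy: Always
    startupProbe:                        # gates the start of the next init container
      httpGet:
        path: /healthz
        port: 2020
    readinessProbe:                      # contributes to overall Pod readiness
      httpGet:
        path: /ready
        port: 2020
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # placeholder flush/drain step
  containers:
  - name: main
    image: example.com/app:1.0           # placeholder image
```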
This section explains the motivation for the naming, assuming that sidecar containers and init containers belong to the same collection and are distinguished using a field. Other alternatives are outlined in the Alternatives section.
For this KEP it is important to have sidecar containers defined among the other init containers in order to express the initialization order of containers. The name `initContainers` is not a good fit for sidecar containers, as they typically do more than initialization; a better name might be "infrastructure" containers. The current idea is to implement sidecars as part of `initContainers` and, if this introduces too much trouble, a new collection name may replace the old one in the future.
An alternative is to introduce a new collection, `infrastructureContainers`, that semantically replaces `initContainers` and deprecates it. This collection would allow both init containers and sidecar containers. The decision on this alternative can be postponed until after the alpha implementation.
Another alternative is to insert placeholders into the `initContainers` collection instead of containers, with the containers themselves defined in another collection. This option will likely confuse end users more than it helps.
Per-container restart policy was a long-standing request from the community. Implementing it introduces a set of challenging problems for the pod lifecycle, for example, keeping state for containers that have already run to completion. Introducing sidecar containers is another scenario where this field can be used semantically.
Introducing this field on containers opens up opportunities to implement those long-standing requests from the community.
An alternative is to introduce a new field, `ambient: true`, on all containers. This property would make containers be restarted all the time and not block pod completion.
- Pros: detaches the sidecar proposal from the per-container `restartPolicy` proposal.
- Cons: this field is a new concept introduced for the same property that is typically controlled by `restartPolicy`.
The semantics of the `Always` enum value are very close to what sidecars require. This is why reusing `Always` to represent sidecars makes a lot of sense for init containers.

There are a few pros and cons to reusing `Always` as the value instead of introducing a new enum value such as `UntilPodTermination`, `UntilPodShutdown`, `WithPod`, or `Ambient` for `restartPolicy` on containers.

Pros:
- Allows the same enum values to be used for the `restartPolicy` of both pods and containers, with semantics that are mostly the same.
- Fewer enum values to maintain.
Cons:
- There are slight differences between the semantics of `Always` for containers and `Always` for pods. The main difference is that an init container with `Always` will be terminated when the pod terminates and has no influence over the pod lifecycle. Also, for Pods the default `restartPolicy` is documented as `Always`, but for init containers it will default to `OnFailure`. We believe this can be addressed in the documentation of the existing enum values.
If in the future we support `Always` on regular containers, there will be an interesting case of `Always` meaning a non-blocking container for a Pod with `restartPolicy: Never` or `OnFailure`. This is easy to explain - the Pod lifecycle is controlled by the containers with a matching `restartPolicy` - but it may be slightly confusing and needs to be carefully reviewed when that feature is considered.
The following is a list of hypothetical scenarios in which users may decide to abuse the feature or use it in a way it was not designed for. This is an exercise to understand potential side effects of implementing the feature. We are looking for side effects such as abuse of the functionality to implement something error-prone that sidecars were not designed for, or issues like noisy neighbors.
At least one regular container is required in a Pod. Users may decide to run all other containers as sidecars and have a single regular container with sleep logic inside, with a timeout or waiting for some cancellation signal from a custom job orchestration portal. This way users may implement execution with a time limit defined by that logic, something users cannot express "declaratively" today and which will become possible with the new sidecar containers. One may imagine a scenario where a third-party job orchestration tool converts all containers to sidecar containers programmatically and injects a single job orchestration container into the Pod. This can be seen as an abuse of the concept and may lead to issues such as pods terminating unexpectedly, or Jobs meant to run as regular containers instead running as sidecars and being restarted on completion. However, we do not see a major problem with it, as this use of sidecar containers is unlikely to lead to "unexpected" behavior - all side effects are quite clear.
Another reason for users to implement regular containers as sidecars would be to benefit from any special properties the kubelet may apply to sidecar containers. For example, if restart backoff timeouts were minimized for sidecar containers (which was proposed and rejected), users might decide to use this for regular containers by running them as sidecars. The current proposal as written doesn't introduce any special properties for sidecars that users may start abusing. The special OOM score adjustment is unlikely to be useful here, as this kind of abuse would mostly apply to bigger containers that do not have the problem of an OOM score adjustment that is too low.
A user may decide to run a regular container as a sidecar simply to guarantee the startup sequence of the containers. Users can already implement startup sequencing by using the blocking nature of the `postStart` hook (see the sketch below), so using sidecar containers for this adds no benefit.
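For reference, a minimal sketch of that existing workaround; the images, port, and readiness check are placeholders and assume the proxy image ships `/bin/sh` and `nc`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: poststart-sequencing
spec:
  containers:
  - name: proxy                          # must be up before "app" starts
    image: example.com/proxy:1.0         # placeholder image
    lifecycle:
      postStart:                         # blocks the start of the next container in the list
        exec:
          command: ["/bin/sh", "-c", "until nc -z localhost 15000; do sleep 1; done"]
  - name: app
    image: example.com/app:1.0           # placeholder image
```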
As for injecting sidecars into Pods where regular containers are run as sidecars - webhooks will likely inject sidecar containers first, so the risk is minimal here as well.
Users may decide to start "large request" sidecar containers early to pre-allocate resources for other containers, while requesting fewer resources for the regular containers in the hope of reusing what was allocated for the "balloon" sidecar. Pre-allocating resources like CPU before any init containers run was previously impossible and may matter more when resource managers such as the CPU or topology managers are in use. If this pattern has any benefits, people may be incentivized to abuse sidecars this way; however, it is unclear whether it has any benefits for users at all.
Users may decide to implement long initialization tasks that run in parallel with other initialization tasks. Users may also decide to run new types of initialization tasks, such as preloading images for workload containers, from a sidecar container so that they run in parallel with the initialization tasks. This is impossible today, as only a single init container runs at a time. For containers with lengthy initialization this pattern may be abused and can lead to a synchronization pattern where the first workload container waits for all init sidecar containers to complete. This may lead to race conditions and be error-prone.
Today, similar behavior can be achieved by running initialization tasks as regular containers, with a special container that blocks the workload from executing using synchronization logic in its `postStart` handler. Sidecar support makes implementing this pattern easier.
There is not much risk, however, even if the user abuses this pattern by converting all init containers into sidecars. Whenever this pattern is useful, the user will most likely need to spend time understanding the logic of the different init containers and making sure it does not conflict. This pattern is also likely prohibitive for large-scale customers, because sidecar-based initialization costs resources even after initialization is complete.
This is the failure mode where the sidecar never becomes ready. It is functionally equivalent to an init container that never completes and does not give users any new way to abuse the kubelet.
Users may implement a sidecar container that intentionally crashes or terminates often. This scenario is functionally similar to a regular container that restarts often, so it creates little additional overhead and few side effects.
On a multitenant cluster, you may have time limits placed on jobs to enable fair usage across tenants. If a sidecar can prevent termination indefinitely, it could be used to perform computation outside the allowed limit.
There is a risk associated with people porting sidecars as implemented today to the new pattern. We didn't receive any feedback on potential downsides. One scenario that may be affected is a sidecar that decides to terminate itself while the kubelet keeps trying to restart it because one of the main containers is still terminating. Based on current patterns for sidecar containers, this is unlikely to be a problem.
Another potential problem is that a sidecar may wait for some condition to mark itself "started" that cannot be met with the new pattern, for example waiting for other containers to fully start. As the sidecar will not be marked as having completed startup, other init containers will not run and the Pod will be stuck in initialization. For example, Knative's sidecar does aggressive probing of the user's container to ensure it is ready prior to marking the sidecar ready itself; this prevents the Pod from being included in the Kubernetes ready endpoints. See the section "Pod startup completed condition" for more details.
Switching to the new sidecars approach will slow down Pod start for Istio. Today, Istio does not block other containers from starting during its initialization. With the switch to the new model, the separation between the initialization stage and the main-containers-running stage will be more explicit, and many implementations will likely wait for sidecar initialization, effectively slowing down Pod startup compared to the current approach. This can be eliminated by a slight redesign of the sidecars.
The new field means that any Pod that is not using this field will behave the same way as before.
The new field will only work with the proper control plane and kubelet. Upgrade and downgrade scenarios will be covered in the further sections.
Outside of Kubernetes-controlled code, there might be third party controllers or existing containers relying on the current behavior of init containers. Behaviors they can rely on:
- Assuming nothing is running in the Pod when an init container starts, for instance taking the PID of a process assuming there are no sidecars running.
- Incorrectly calculating the resource usage of a Pod by not taking the new type of containers into account (e.g. some Grafana dashboards may need modification).
- Stripping `restartPolicy: Always` from an init container as an unknown field, rendering the Pod non-functional as it will never pass the init stage.
- OPA rules may reject Pods with the new field because of a failure to parse it.
These potential incompatibilities will be documented.
`kubectl get pods` filters all init containers from the output when the Pod is Running. As part of this KEP, the output will be extended to include the status of sidecar containers.
For the Pod:
initContainers:
- name: init-config
containers:
- name: sidecar-1
- name: sidecar-2
- name: main
Initialization (Waiting)
NAME READY STATUS RESTARTS AGE
test 0/3 Init:0/1 0 0s
Running
NAME READY STATUS RESTARTS AGE
test 3/3 Running 0 35s
For the Pod:
initContainers:
- name: init-config
- name: sidecar-1
restartPolicy: Always
- name: sidecar-2
restartPolicy: Always
containers:
- name: main
What we have today:
Initialization (Waiting)
NAME READY STATUS RESTARTS AGE
test 0/1 Init:0/3 0 0s
NAME READY STATUS RESTARTS AGE
test 0/1 Init:1/3 0 5s
NAME READY STATUS RESTARTS AGE
test 0/1 Init:2/3 0 10s
Running
NAME READY STATUS RESTARTS AGE
test 1/1 Running 0 35s
What will be returned as part of the KEP implementation:
Initialization (Waiting)
NAME READY STATUS RESTARTS AGE
test 0/3 Init:0/3 0 0s
NAME READY STATUS RESTARTS AGE
test 0/3 Init:1/3 0 5s
NAME READY STATUS RESTARTS AGE
test 0/3 Init:2/3 0 10s
Running
NAME READY STATUS RESTARTS AGE
test 3/3 Running 0 35s
When determining whether a Pod will fit on a Node, resource limits and requests are calculated.
Resource calculation will change for Pods with sidecar containers. Today, resources are calculated as the maximum of the largest init container use and the sum of all regular containers:
Max ( Max(initContainers), Sum(Containers) )
With sidecar containers the formula will change.
The easiest formula would be to assume that sidecars run for the duration of the init stage as well as alongside the regular containers:
Max ( Max(nonSidecarInitContainers) + Sum(Sidecar Containers), Sum(Sidecar Containers) + Sum(Containers) )
However, the true calculation is different: init containers that complete before the first sidecar container starts do not need to account for any sidecar containers in the maximum-value calculation.
So, defining the function:
InitContainerUse(i) = Sum(sidecar containers with index < i) + InitContainer(i)
the formula is:
Max ( Max( each InitContainerUse ) , Sum(Sidecar Containers) + Sum(Containers) )
Pod overhead is also added to the resource usage. This section assumes it is added by the kubelet independently of this formula's computation.
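As a purely illustrative worked example (the numbers are hypothetical), consider a pod whose init containers are declared in the order init-1 (200m CPU), sidecar (100m CPU, `restartPolicy: Always`), init-2 (500m CPU), plus two regular containers requesting 300m CPU each:

```
InitContainerUse(init-1)  = 0m   + 200m = 200m   (no sidecars started yet)
InitContainerUse(sidecar) = 0m   + 100m = 100m
InitContainerUse(init-2)  = 100m + 500m = 600m   (sidecar keeps running)
Sum(Sidecar Containers) + Sum(Containers) = 100m + 300m + 300m = 700m

Effective CPU request = Max(200m, 100m, 600m, 700m) = 700m (plus pod overhead)
```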
It’s currently not straightforward for users to know the effective resource requirements for a pod. The formula for this is:
Max ( Max(initContainers), Sum(Containers) ) + pod overhead
This is derived from the fact that init containers run serially and to completion before non-init containers. The effective request for each resource is then the maximum of the largest request for that resource in any init container and the sum of that resource across all non-init containers.
The introduction of in-place pod updates of resource requirements in KEP#1287 further complicates effective resource requirement calculation, as `Pod.Spec.Containers[i].Resources` becomes a desired-state field and may not represent the actual resources in use. The KEP notes that "Schedulers should use the larger of `Spec.Containers[i].Resources` and `Status.ContainerStatuses[i].ResourcesAllocated` when considering available space on a node."
We will introduce `ContainerUse` to represent this value:
ContainerUse(i) = Max(Spec.Containers[i].Resources, Status.ContainerStatuses[i].ResourcesAllocated)
In the absence of KEP 1287, or if the feature is disabled, `ContainerUse` is simply:
ContainerUse(i) = Spec.Containers[i].Resources
The sidecar KEP also makes that calculation more complicated, as sidecar containers are init containers that do not terminate. Since init containers start in order, sidecar resource usage needs to be summed into those init containers that start after the sidecar. Defining `InitContainerUse` as:
InitContainerUse(i) = Sum(sidecar containers with index < i) + Max(Spec.InitContainers[i].Resources, Status.InitContainerStatuses[i].ResourcesAllocated)
allows representing the new formula for a pod's resource usage:
Max ( Max( each InitContainerUse ) , Sum(Sidecar Containers) + Sum(each ContainerUse) ) + pod overhead
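Continuing the hypothetical numbers from the earlier worked example, and assuming the regular containers' allocated resources equal their spec, suppose an in-place resize leaves the sidecar's spec requesting 100m CPU while 150m is reported as `ResourcesAllocated`; the larger value is used:

```
Sidecar contribution             = Max(100m, 150m) = 150m
InitContainerUse(init-2)         = 150m + 500m     = 650m
Sum(Sidecar) + Sum(ContainerUse) = 150m + 600m     = 750m

Effective CPU request = Max(200m, 150m, 650m, 750m) = 750m (plus pod overhead)
```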
Even now, users sometimes find the calculation of a pod's effective resource requirements confusing or are simply unaware of the formula. The mitigating factor is that init container resource requests are usually lower than the sum of non-init container resource requests and can be ignored by users in those cases. Software that requires accurate pod resource requirement information (e.g. kube-scheduler, kubelet, autoscalers) doesn't have that luxury. It is too much to ask users to perform this even more complex calculation simply to know how much free capacity they need for a given resource for a pod to schedule.
This KEP will not expose a total resource requests field to end users, as many decisions about such a field need to be made by other KEPs (In-Place Pod Update and Pod-Level Resources). We do not want to make those KEPs harder to implement by exposing this field prematurely.
Sidecar containers will be taken into account for the Pod QoS calculation like all other containers. The logic in `GetPodQOS` will likely not need changes, but it needs to be tested with sidecar containers.
In the case of a version skew between the scheduler and the kubelet, or when the scheduler and the kubelet have different values set for the `SidecarContainers` feature gate, the calculation of the resources required for a Pod will differ between the scheduler and the kubelet when a sidecar container is created.

When the scheduler "knows" about the sidecar and the kubelet doesn't, there will unlikely be any issues: the scheduler will calculate resource usage for the Pod that is equal to or greater than what the kubelet will require to run it, so there will be no overbooking.

If the scheduler has the `SidecarContainers` feature gate disabled, a Pod that has a sidecar container will not be admitted, as validation of the new field will fail.

We will recommend in the documentation not to disable the feature gate on the scheduler while any Pods with sidecar containers are running.

The NodeResourcesFit scheduler plugin will need to be updated to take sidecar container resource requests into consideration.
Preliminary code analysis didn't expose any issues with introducing sidecar containers. The biggest question is resource reuse between sidecar containers and other init containers, especially in cases with single-NUMA-node requirements and the like. This may be non-trivial, but the decision on it does not block the KEP.
From the code, it appears that init containers are treated exactly like application containers, so we don't need to change anything from a resource management point of view. I found references in the code where all the containers (init containers and application containers) are coalesced before resources (CPUs, memory, and devices) are allocated to them. Here are a few examples:
- Container Manager: https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/container_manager_linux.go#L708
- CPU Manager: https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/cpumanager/policy_static.go#L490
- Memory Manager: https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/memorymanager/policy_static.go#L372
- Topology Manager:
- https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/topologymanager/scope.go#L137
- https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/topologymanager/scope_container.go#L52
- https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/topologymanager/scope_pod.go#L58
In Alpha sidecar containers will be terminated as regular containers. No special or additional signals will be supported.
In Beta we considered introducing additional termination grace period fields to manage termination duration (draft proposal) and leveraging these fields to add reverse-order termination of sidecar containers after the primary containers terminate.
However, we decided on an alternative that doesn't require additional fields or hooks while keeping the desired behaviors when Pods with sidecars are terminated. While the original approach works better for truly graceful termination, where consistency is more important than time taken, the proposed approach works for that scenario as well as for the increasingly popular scenario of a limited time to terminate, where the graceful termination period is set by an external requirement and Pods need to terminate as gracefully as possible within it (think of Spot instances with a 30-second notification).
Here is the proposed approach:
- Sidecar containers that have a `PreStop` hook will be notified that the Pod has begun terminating by executing the `PreStop` hook. This happens at the same time as for regular containers and begins the Pod's termination grace period countdown.
- Once the last primary container terminates, the last started sidecar container is notified by sending a `SIGTERM` signal.
- The next sidecar (in reverse order) is notified by sending a `SIGTERM` signal after the previous sidecar container terminates.
- This continues until all sidecar containers have terminated, or the Pod's termination grace period expires.
- In the latter case, all remaining containers are notified with a `SIGTERM`, followed by a fixed grace period of 2s, and are finally terminated.
- The Pod will be terminated after that.
Pseudocode for the above:
func terminatePod() {
// notify all sidecar containers with preStop hook, asynchronously
for sidecar in sidecarContainers {
if sidecar has preStop hook {
go execute preStop hook // async
}
}
// notify all containers with preStop hook and then SIGTERM, asynchronously
for container in containers {
if container has preStop hook {
go func(container) { // async
execute preStop hook
send SIGTERM
}
}
}
for {
switch {
case grace period expired:
for anyContainer in sidecarContainers + containers {
if anyContainer is running {
send SIGTERM
}
}
sleep 2s
for anyContainer in sidecarContainers + containers {
if anyContainer is running {
send SIGKILL
}
}
return
case all containers are terminated:
// sidecars are terminated in reverse order
for sidecar in reverse(sidecarContainers) {
// sidecar is already terminating, let it finish
if sidecar is terminating {
break
}
// next sidecar to terminate
else if sidecar is running {
send SIGTERM
break
}
}
sleep 1s
case all sidecarContainers are terminated:
return
default:
sleep 1s
}
}
}
It is worth noting that, like with regular containers, the `PreStop` hook must complete before the `SIGTERM` signal to stop the sidecar container can be sent. Therefore, ordering and graceful termination of sidecars can only be guaranteed if the `PreStop` hook completes within the Pod's termination grace period.
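A minimal sketch of a pod that budgets for this, assuming a hypothetical image and a short placeholder preStop command (nothing here is prescribed by the KEP):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-sidecar
spec:
  terminationGracePeriodSeconds: 60      # preStop plus sidecar shutdown must fit in this budget
  initContainers:
  - name: log-forwarder
    image: example.com/log-forwarder:1.0 # placeholder image
    restartPolicy: Always
    lifecycle:
      preStop:
        exec:                            # keep this short; it runs when Pod termination begins
          command: ["/bin/sh", "-c", "sleep 5"]
  containers:
  - name: main
    image: example.com/app:1.0           # placeholder image
```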
Sidecars continue to be restarted until they enter the `Terminated` state, which they are notified to do by a `SIGTERM` signal. This is to ensure that sidecars that fail are restarted until the TGPS expires.
We might postpone running the `livenessProbe` for restarted sidecar containers during termination until GA, depending on the implementation complexity.
If we compare this to the initial proposal, the following behaviors are preserved:
- Sidecars should not begin termination until all primary containers have
terminated.
- Implicit in this is that sidecars should continue to be restarted until all primary containers have terminated.
- Sidecars should terminate serially and in reverse order. I.e. the first sidecar to initialize should be the last sidecar to terminate.
The additional benefits of this approach compared to the initial proposal:
- If the graceful termination period is short and mostly consumed by the main container, the sidecar containers have more time to gracefully terminate, for example to flush the buffers of a logging container.
- There is absolutely no change in the behavior of main containers - they start graceful termination at exactly the same time as before and can use as much of the graceful termination period as they need. The semantics of the Pod graceful termination period also stay unchanged.
This behavior needs to be adjusted: #3676
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
<package>: <date> - <test coverage>
Many packages will be touched in the process. A few that are easy to identify by area of change:
Admission:
- `k8s.io/kubernetes/pkg/kubelet/lifecycle`: 61.7

Enable probes for sidecar containers:
- `k8s.io/kubernetes/pkg/kubelet/prober`: 02/07/2023 - 79.9

Include sidecars into QoS decision:
- `k8s.io/kubernetes/pkg/kubelet/qos`: 02/07/2023 - 100

Include sidecar in resources calculation and policy decisions:
- `k8s.io/kubernetes/pkg/kubelet/cm/topologymanager`: 02/07/2023 - 93.2
- `k8s.io/kubernetes/pkg/kubelet/cm/memorymanager`: 02/07/2023 - 81.2
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: 02/07/2023 - 86.4

Update OOM score adjustment:
- `k8s.io/kubernetes/pkg/kubelet/oom`: 02/07/2023 - 57.1
No integration tests are planned. We'll cover this with e2e_node tests.
- Test failures: https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=SidecarContainers
- All related tests can be filtered with `SidecarContainers`
The sidecar container can expose ports and can be used to handle external traffic (a sketch of such a Pod and Service follows the list below).
- A Pod with a sidecar container exposing a port can receive Service traffic to that port
- A Pod with a sidecar container exposing a port with a readiness probe marks the Endpoint as not ready when the probe fails and as ready again when the readiness probe succeeds
- A Pod with a sidecar container exposing a port can receive Service traffic to that port during Pod termination (during the graceful termination period)
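A sketch of a Pod and Service that these tests could exercise; the names, images, and ports are placeholders, not taken from the actual test code:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sidecar-svc
spec:
  selector:
    app: sidecar-demo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo
  labels:
    app: sidecar-demo
spec:
  initContainers:
  - name: proxy                          # sidecar serving the Service traffic
    image: example.com/proxy:1.0         # placeholder image
    restartPolicy: Always
    ports:
    - containerPort: 8080
    readinessProbe:                      # drives endpoint readiness for the Service
      httpGet:
        path: /ready
        port: 8080
  containers:
  - name: main
    image: example.com/app:1.0           # placeholder image
```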
- Init containers should start after the previous init container has completed
- Regular containers should start in parallel after all init containers have completed
- Restart behavior of the init containers
- Init containers should start after the previous restartable init container has started
- Regular containers should start in parallel after all regular init containers have completed and all restartable init containers have started
- Restartable init containers should restart always
- It should restart the containers in the right order after the node reboot
- It should not restart any completed init containers after the kubelet restart
From @rata: an interesting test case is when the API server is down and the kubelet wants to start a pod with sidecars. That was problematic in previous sidecar KEP iterations.
Let's say the API server is up and a pod is being started by the kubelet:
- The kubelet still needs to start 3 sidecars in the pod
- The API server crashes (it is not restarted yet)
- The kubelet continues to try to start the sidecars
- The pod is started correctly
- The API server becomes reachable again
I think testing this scenario works is important in early phases (alpha), as in the past that proved to be tricky. The kubelet is authoritative on which containers are started and sends it to the API server, but I hit some bugs in the past where if the API server was down, we couldn't read that a new container was ready from the kubelet. The code to do it was there even if API server was down, but something was not working and I didn't debug it. And the end result was that the pod startup didn't finish until the API server was up again, as that is when we realized from the kubelet that the sidecar was ready.
I think this code has changed since 1.17 when I tested this, but I don't know if this issue is fixed. It is a non-trivial scenario that, if it happens to need more serious code changes in the kubelet to handle it correctly, it will be good to know in early stages of the KEP IMHO.
- Validate that the pod overhead will be accounted for when scheduling a Pod with sidecar container.
- Validate that LimitRanger will apply defaults and consider limits for the Pod with the sidecar containers.
- Kubelet and control plane reject the Pod with the new field if the feature gate is disabled.
- kubelet and control plane reject the Pod with the new field if the feature gate was disabled AFTER the Pod with the new field was added.
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
- E2e testing of the existing scenarios with and without the feature gate turned on
- Implement proper termination ordering.
- Add tests with feature activation and deactivation (see Feature Enablement and Rollback).
- All known issues are fixed
- Production use feedback addressed
Existing sidecars (implemented as regular containers) will still work as intended, just as in the past, since they don't use the new `restartPolicy` field.
Upgrade will not change any current behaviors.
First, there will be no effect on any workload that doesn't use a new field. Any combination of feature gate enabled/disabled or version skew will work as usual for that workload.
If the new functionality hasn't been used yet, downgrade will not be affected.
Versions of Kubernetes that do not have this feature implemented will ignore and strip out the new field on `initContainers`.
Pods that have already been created will remain scheduled after the downgrade - they will not be rejected by the control plane nor by the kubelet. Both will treat the sidecar container as an init container. This may render the Pod unusable, as it will be stuck in initialization forever - sidecar containers never exit. This behavior was documented for the Alpha release, but we don't see it as a major issue requiring us to wait for 3 releases so that the kubelet has the logic to reject such Pods when the feature gate is disabled in order to keep downgrades safe.
Note: we have implemented logic for the kubelet to reject Pods with sidecar containers when the feature gate is turned off. On the control plane, kube-apiserver drops the field (if it wasn't set before) and kube-scheduler keeps pods with the field set unschedulable. See the Upgrade/downgrade testing section.
Workloads will have to be deleted and recreated with the old way of handling sidecars. Once there are no more Pods using sidecars, the node can be downgraded without side effects.
If the downgrade happens from a version with the feature enabled to a previous version that supports the feature but has the feature gate disabled, the kubelet and control plane will reject these Pods.
Note: a downgrade requires a node drain, so we will not support scenarios where a Pod already running on the node needs to be handled by a restarted kubelet that doesn't know about sidecar containers.
Version skew is possible between the control plane and worker nodes, as both should be aware of the new field used to flag sidecars inside `initContainers`.
Therefore all cluster nodes, including control plane nodes, must be upgraded before the user can deploy sidecars using the new syntax.
Also, since the feature flag applies to both kubelet and the control plane, similarly all cluster nodes need to have it enabled before deploying Pods with sidecars.
For the scenarios when the feature gate is disabled on control plane, but not disabled on kubelet, users will not be able to schedule Pods with the new field.
For the scenarios when the feature gate is disabled on kubelet, but enabled on control plane, users will be able to create these Pods, but kubelet will reject those.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `SidecarContainers`
  - Components depending on the feature gate:
    - kubelet
    - kube-apiserver
    - kube-controller-manager
    - kube-scheduler
No.
Yes. Pods that had sidecars will need to be deleted and recreated without them.
Disabling the feature gate requires a kubelet restart. When the kubelet starts, it will fail to reconcile Pods with the new field and will clean up all running containers.
If a version downgrade is involved, the node must be drained. Pods with the new field will not be accepted by the kubelet once the feature is disabled.
Pods that were stuck in the Pending state because the kubelet rejected them due to the "unknown" field may become schedulable again and will work as expected.
Not yet, but it is planned/required to add them before graduation to Beta.
Rollout could fail for multiple reasons:
- webhooks that are not recompiled with the new field will strip it out
- bug in the resource calculation or CPU reservation logic could render the Pod unschedulable
- bug in the kubelet affecting the pod lifecycle could cause the Pod to be stuck in initialization
However, we have tried to maintain a high coverage of unit tests to ensure we catch these.
Rollback can fail if a Pod with sidecars is scheduled on a node where the feature is disabled. In that case the Pod will be rejected by kubelet and will be stuck in Pending state. Therefore, we advise to first disable the feature gate on the control plane and then proceed with the nodes.
Running workloads are not impacted.
Pods with sidecars might take a long time to exit and exceed the TGPS; a new event should be added in Beta to help administrators diagnose this issue. Rather than rolling back the feature, administrators should work on the graceful termination of their main containers to ensure sidecars have enough time to be notified and exit on their own.
- Metrics
  - Metric name: kubelet_started_containers_errors_total
    - Type: Counter
    - Labels: code, container_type (should be `init_container`)
    - Components exposing the metric: kubelet-metrics
  - Symptoms: a high number of errors indicates that the kubelet is unable to start the sidecar containers
- API objects
  - Pods stuck in the Pending state with an init container running
    - Type: API objects
    - Symptoms: when the new field `restartPolicy: Always` is mistakenly stripped out by a webhook, the Pod will get stuck
Upgrade->downgrade->upgrade testing was done manually using the following steps:
Kubelet specific:
- Deploy k8s 1.29-alpha
- Enable the `SidecarContainers` feature gate on the control plane and kubelet
- Deploy a Pod with sidecar containers using a Deployment
- Disable the `SidecarContainers` feature gate on the kubelet (requires a restart)
- Drain the node
- Pod is rejected by kubelet
- Enable the `SidecarContainers` feature gate on the kubelet (requires a restart)
- Pod is scheduled and works as expected

Control plane specific:
- Deploy k8s 1.29-alpha
- Enable the `SidecarContainers` feature gate on the control plane and kubelet
- Deploy a Pod with sidecar containers using a Deployment
- Disable the `SidecarContainers` feature gate on the control plane
- Delete the Pod
- Pod is created without the new field - init containers are not recognized as sidecars and block the Pod in initialization
- Modify the Deployment by moving the sidecar containers to the regular containers section
- Pod is scheduled and works (without the sidecar support)
- Enable the `SidecarContainers` feature gate on the control plane
- Delete the Pod
- Pod is scheduled and works (without the sidecar support)
- Modify the Deployment by moving the sidecar containers to the init containers section
- Pod is scheduled and works (with the sidecar support)
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
By checking if `.spec.initContainers[i].restartPolicy` is set to `OnFailure` or `Always`.
End users can also check whether components that use the new feature, such as Istio, run istio-proxy as a sidecar container:
$ kubectl get pod -o "custom-columns="\
"NAME:.metadata.name,"\
"INIT:.spec.initContainers[*].name,"\
"CONTAINERS:.spec.containers[*].name"
NAME INIT CONTAINERS
sleep-7656cf8794-8fhdk istio-init,istio-proxy sleep
- number of running containers should not change by more than 10% throughout the day, as measured by the number of running containers at the beginning and end of the day
- error rate for containers of type init_container should be less than 1%, as measured by the number of errors divided by the total number of init_container containers
- number of events indicating that TGPS has been exceeded should be less than 10 per day, as measured by the number of events logged in the kubelet log
- 99% of jobs with sidecars should complete successfully, as measured by the number of jobs that complete successfully divided by the total number of jobs with sidecars
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: kubelet_running_containers
    - Type: Gauge
    - Labels: container_state
    - Components exposing the metric: kubelet-metrics
  - Metric name: kubelet_started_containers_errors_total
    - Type: Counter
    - Labels: code, container_type (should be `init_container`)
    - Components exposing the metric: kubelet-metrics
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
No.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Graceful Pod termination might take longer with sidecars since their exit sequence starts after the last main container has stopped. The impact should be negligible because the TGPS is enforced in all cases.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No, since the KEP only enables a new way to run containers as sidecars instead of as regular containers. Resource consumption can even be lower, since various tricks using emptyDir volumes to perform synchronization (as with istio-proxy) are no longer needed.
Nothing changes compared to the current kubelet behavior.
- Main containers don't exit within TGPS, leading to sidecars being terminated
- Detection: high number of events indicating TGPS has been exceeded
- Mitigations: ensure timely termination of main containers
- Diagnostics: Events
- Testing: TBD
- Main container or sidecar use a preStop hook consuming TGPS, leading to remaining sidecars being terminated
- Detection: high number of events indicating TGPS has been exceeded
- Mitigations: ensure preStop hooks are not delaying termination
- Diagnostics: Events
- Testing: TBD
- Sidecar container uses a preStop hook that makes the container exit during Pod shutdown; the sidecar is restarted, leading to a CrashLoopBackOff
  - Detection: sidecar in CrashLoopBackOff during termination
  - Mitigations: ensure preStop hooks are not making the container exit; document best practices
- Diagnostics: Events
- Testing: TBD
None.
- 2018-05-14: First proposal.
- 2023-06-09: Target 1.28 for Alpha.
- 2023-07-08: Alpha implementation merged.
- 1.29: feature is in Beta
In the Alpha version, no values other than `Always` will be accepted. Going forward, the following semantics may be supported for other values of the `restartPolicy` field of init containers. The `restartPolicy` may be set to one of `Never`, `OnFailure`, and `Always`. The meaning of these values is similar to the Pod `restartPolicy` field:
- `Never` indicates that if the container returns a non-zero exit code, pod initialization has failed. This value can be supported for Pods with a `Never` or `OnFailure` `restartPolicy`, indicating a critical initialization step. Supporting this value on Pods with the `restartPolicy` `Always` is not planned, as it changes the semantics of that policy and may affect how deployments control the number of Pods in the deployment.
- `OnFailure` indicates the init container will be restarted until it completes successfully. This value may be used for Pods with the `restartPolicy` `Never`, indicating that the initialization step is flaky and should be retried.
- `Always` indicates that the container will be considered initialized the moment the startup probe indicates it is started, and that it will be restarted if it exits for any reason until the pod terminates.
- If the init container's `restartPolicy` is unset, the pod's `restartPolicy` is used to determine the init container restart policy (for `Always`, `OnFailure` is used; for `OnFailure`, `OnFailure`; for `Never`, `Never`).
Since init containers and regular containers share the same schema, the `restartPolicy` field will be available for regular containers as well. From a scenarios perspective, the first set of scenarios would be for containers that can only override the `restartPolicy` to a "higher" value: from `Never` to `OnFailure` or `Always`, and from `OnFailure` to `Always`. An override from `Always` to `Never` will not be allowed. This eliminates situations where the kubelet restarts and cannot decide whether a container with the value overridden to `Never` needs to be restarted for a Pod with the restart policy `Always`. It is also likely that we will require at least one container to have a `restartPolicy` matching the Pod's `restartPolicy`. We can re-evaluate this in the future if scenarios arise.

Note that by implementing these scenarios, the meaning of `restartPolicy: Always` for containers in a Pod with `restartPolicy: Never` will not be exactly the same as the meaning of `restartPolicy: Always` for a Pod with `restartPolicy: Always`. In the first case, the containers will not block Pod termination; in the second, they will. In other words, a container with `restartPolicy: Always` will effectively become a sidecar container, similar to an init container with the same field value.
Additional considerations can be found in the older document that was a previous attempt to implement the sidecar containers: https://docs.google.com/document/d/1gX_SOZXCNMcIe9d8CkekiT0GU9htH_my8OB7HSm3fM8/edit#heading=h.71o8fcesvgba.
As part of this KEP, we needed to decide on a condition that the Pod needs to meet to be recognized as "started". This will be used to sequence init containers properly and to give the sidecar container enough time to complete its initialization.
The easiest signal we have today is a Container Ready state. Semantically this signal means that the container is fully prepared to handle traffic.
The risks of using the Ready state as the startup-completed condition are:
- Today the ready state is used to control traffic, but not for the container lifecycle (as opposed to the startup and liveness probes). If a container's readiness probe depends on external services, and those services are not yet available or depend on other containers in the Pod, the sidecar container will never be considered started.
- It may be semantically harder to understand what a not-Ready sidecar container means if it becomes not ready after it has started, while other init containers are still being run.
- The scenario where a sidecar container has started but wants to delay readiness until later would not be possible to implement.
The second possible signal we can use is the startup probe. The kubelet will wait for the startup probe to succeed to mark the container as started. Readiness probes will only start once the Pod's initialization stage completes and regular containers start. There is a `ContainerStatus` field, `Started`, indicating this state.
The easiest approach would be to simply wait for postStart hook to complete.
This is already implemented for regular containers - they are started sequentially, with the `postStart` hook being a blocking call.
Using `postStart` has some debugging challenges in how logs are reported (they aren't) and how the status is displayed. If this option is used, we should consider improving this situation.
With `postStart`, simple sidecars will start faster than if any probes are used. However, it brings additional challenges for sidecar authors, as much of the logic may be duplicated and has to be authored by hand, e.g. a shell binary plus a loop script to retry `curl`.
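As a sketch of the duplication this implies; the image, port, and endpoint below are placeholders, and the image is assumed to ship `/bin/sh` and `curl`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: poststart-readiness
spec:
  initContainers:
  - name: proxy
    image: example.com/proxy:1.0   # placeholder image
    restartPolicy: Always
    lifecycle:
      postStart:
        exec:                      # hand-rolled startup check: retry until a local endpoint answers
          command: ["/bin/sh", "-c", "until curl -sf http://localhost:8081/healthz; do sleep 1; done"]
  containers:
  - name: main
    image: example.com/app:1.0     # placeholder image
```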
This alternative was rejected for the following reasons:
- it introduces a whole new collection of containers, which is a big change, and it doesn't allow different startup sequencing - before or after init containers, or before or after a specific init container. Making a single decision for the entire collection of containers and all scenarios was not possible.
- A single semantic for the entire collection of containers may make it harder to introduce new properties for containers in this collection without changing its semantics. It was preferable to introduce new properties representing specific behaviors on the existing containers.
- Each time a new collection of containers is added to Pod, it increases the size of the OpenAPI schema of pods and all resources containing PodTemplate significantly, which has negative downstream implications on systems consuming OpenAPI schemas and on CRDs that embed PodTemplate.
This alternative is trying to replicate systemd dependencies. The alternative was rejected for the following reasons:
- Universal sidecar injection is not possible for such a system.
- There was a lack of scenarios requiring more than general container sequencing and two stages - Init and Main execution.
This alternative introduces semantically meaningful phases like Network unavailable, Network configured, Storage configured, etc., and fits containers into these phases. It was rejected because it was hard or impossible to fit all containers and scenarios into predefined buckets like this. The opposite was also true - many scenarios didn't need this level of complexity and separation.
This alternative (#3582) was one of the attempts to scope the problem to the absolute minimum needed to solve the termination problems of Jobs with sidecars. The idea is to implement a specific property on a container that terminates the whole Pod when that container exits. This idea was rejected:
- Automatic injection is hard
- Only a subset of scenarios would be solved, and the introduction of built-in sidecar container support would disregard most of the work
- There is a desire to be able to distinguish sidecar containers from other containers
The idea of this alternative is to allow configuring some containers for the Pod using another object, so the kubelet combines the resulting Pod definition as a multi-part object. This alternative offers great security-boundary properties between objects, allowing different permission sets for different containers.
Scenarios this and similar alternatives may cover are:
More effectively separate access to privileged / secure resources between containers. Filesystem resources, secrets, or more privileged process behaviors can be separated into a container that has higher access and is available before the regular serving/job container and until it has completed.
This is somewhat part of the service mesh use case (especially those that perform privileged iptables manipulation and hide access to secrets).
Other scenarios:
- Run an agent that exchanges pod identity information to gain a time or space limited security credential (such as an SSL cert, remote token, whatever) alongside a web serving container that has no access to the pod identity credential, and thus can only expose the bound credential if compromised.
- Start a FUSE daemon in a sidecar that mounts an object storage bucket into one of the shared pod emptydir volumes - have another container running with much lower privileges that can use that
- Run a container image build with privileges in a sidecar container that responds to commands on a unix socket shared in a pod emptydir and takes commands from an unprivileged container to snapshot the state of the container.
However this alternative was rejected as the change introduced is beyond the scope of the sidecar KEP and will be needed for many more scenarios. Implementing sidecar containers as proposed in this KEP does not make this alternative harder to implement in future.
A little more context can be found in this document: https://docs.google.com/document/d/10xqIf19HPPUa58GXtKGO7D1LGX3t4wfmqiJUN4PJfSA/edit#heading=h.xdsy4mbwha7q