Pods cannot be restarted when using scheduling gates #2937

bacherfl · 2024-02-01T09:31:47Z

After the initial deployment of a pod for a workload (i.e. once the related KeptnWorkloadVersion is in a completed state), pods belonging to that workload stay stuck in a pending state when they are restarted.

This is because the mutating webhook always adds the scheduling gate to the pod, even though the related KeptnWorkloadVersion is already in a completed state. See lifecycle-toolkit/lifecycle-operator/webhooks/pod_mutator/pod_mutating_webhook.go, line 94 in 9095a00:

    if scheduled := handleScheduling(a.SchedulingGatesEnabled, a.Log, pod); scheduled {

In this case, the fix in #2926 does not help either, since the replica set of a pod does not change when the pod is restarted.

A potential solution is to look up a matching KeptnWorkloadVersion for the pod before adding the scheduling gate, as is done in the scheduler. See lifecycle-toolkit/scheduler/pkg/klcpermit/workflow_manager.go, line 75 in 9095a00:

    func (sMgr *WorkloadManager) Permit(ctx context.Context, pod *corev1.Pod) Status {
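To make the proposed lookup concrete, here is a minimal sketch of the idea using only the Go standard library. It is not the operator's code: the `Pod` and `WorkloadVersion` structs, the `Phase` values, the gate name, and the functions `findWorkloadVersion` and `maybeAddGate` are all hypothetical stand-ins, and a real webhook would query the cluster for the KeptnWorkloadVersion instead of searching a slice.

```go
package main

// Phase is a hypothetical stand-in for the status of a KeptnWorkloadVersion.
type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseCompleted Phase = "Completed"
)

// hypotheticalGate is an assumed gate name used only for this illustration.
const hypotheticalGate = "example.com/pre-deployment-checks"

// WorkloadVersion is a simplified, hypothetical model of a KeptnWorkloadVersion.
type WorkloadVersion struct {
	App      string
	Workload string
	Version  string
	Phase    Phase
}

// Pod is a simplified, hypothetical model of a pod seen by the webhook.
type Pod struct {
	Name            string
	App             string
	Workload        string
	Version         string
	SchedulingGates []string
}

// findWorkloadVersion returns the workload version matching the pod's
// app, workload, and version, and reports whether one was found.
func findWorkloadVersion(versions []WorkloadVersion, pod Pod) (WorkloadVersion, bool) {
	for _, v := range versions {
		if v.App == pod.App && v.Workload == pod.Workload && v.Version == pod.Version {
			return v, true
		}
	}
	return WorkloadVersion{}, false
}

// maybeAddGate adds the scheduling gate unless a matching workload version
// has already completed. It returns true if the pod is gated afterwards.
func maybeAddGate(pod *Pod, versions []WorkloadVersion) bool {
	if v, ok := findWorkloadVersion(versions, *pod); ok && v.Phase == PhaseCompleted {
		// The checks already ran for this version, so a restarted pod
		// is admitted without a gate and can be scheduled right away.
		return false
	}
	for _, g := range pod.SchedulingGates {
		if g == hypotheticalGate {
			return true // already gated, do not add the gate twice
		}
	}
	pod.SchedulingGates = append(pod.SchedulingGates, hypotheticalGate)
	return true
}
```

Under these assumptions, calling `maybeAddGate` for a restarted pod whose version is already completed leaves `SchedulingGates` empty, while a pod for a new or still pending version receives the gate exactly once.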

bacherfl added bug Something isn't working lifecycle-operator labels Feb 1, 2024

bacherfl self-assigned this Feb 1, 2024

keptn-bot added this to Keptn Lifecycle Toolkit Feb 1, 2024

bacherfl mentioned this issue Feb 2, 2024

fix(lifecycle-operator): introduce separate controller for removing scheduling gates from pods #2946

Merged

bacherfl closed this as completed in #2946 Feb 7, 2024

github-project-automation bot moved this to ✅ Done in Keptn Lifecycle Toolkit Feb 7, 2024
