produce error if encounter ImagePullBackoff #522
Conversation
This correctly has the work unit fail if the container has an ImagePullBackOff error:

{'DVxy89tF': {'Detail': 'Error creating pod: container failed to start with ImagePullBackoff error, ImagePullBackOff',
              'ExtraData': {'Command': '', 'Image': '', 'KubeConfig': '',
                            'KubeNamespace': 'controller-integration-1641911565',
                            'KubePod': '', 'Params': '', 'PodName': 'kube-6tw5v'},
              'State': 3, 'StateName': 'Failed', 'StdoutSize': 0, 'WorkType': 'kube'}}
@shanemcd I'd like to get your thoughts on this one before merging. Basically, the idea is that if the pod cannot enter the "running" state due to an ImagePullError (incorrect image name in the podspec, incorrect credentials supplied, registry is down, etc.), then we fail the work unit. The system will retry the image pull a number of times, currently set by [...].

This feature is similar to the pod_pending_timeout option we currently have in receptor. However, there is a good use case where users will want to set pod_pending_timeout to something much longer, say 1 hour, but still duck out early if images cannot be pulled. For example, k8s might not be starting pods because resources are not available (if cpu/memory requests are set in the podspec), so pods will stay pending until resources are freed up. In that situation, users will want to bump pod_pending_timeout from 5 minutes to something much longer, but having to wait 1 hour to learn that an image cannot be pulled because you typed the image name incorrectly is a bad experience. So this feature targets the ImagePullError case specifically, whereas pod_pending_timeout is a general catch-all for any situation where the pod cannot be started.
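For concreteness, here is a minimal sketch of the kind of check described above: it inspects a pod's container statuses for an image pull failure. The helper name podHasImagePullFailure and its placement are illustrative assumptions, not code from this PR.

```go
// Hypothetical helper: detect containers stuck on an image pull.
package kubesketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podHasImagePullFailure reports whether any container in the pod is waiting
// because its image cannot be pulled (ErrImagePull or ImagePullBackOff), and
// returns a human-readable reason if so.
func podHasImagePullFailure(pod *corev1.Pod) (bool, string) {
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil {
			continue
		}
		if w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff" {
			return true, fmt.Sprintf("container %q failed to start: %s: %s",
				cs.Name, w.Reason, w.Message)
		}
	}
	return false, ""
}
```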
}
}
}

return false, nil
Under which conditions would we hit this? Should the error really be nil if the pod can't start?
podRunningAndReady is being passed into a watcher function:

ev, err := watch2.UntilWithSync(ctxPodReady, lw, &corev1.Pod{}, nil, podRunningAndReady())

UntilWithSync calls podRunningAndReady in a loop, once for each "event" it pulls from the k8s API, and it only terminates when podRunningAndReady returns true or returns an error. So returning false, nil means, "no, the pod isn't ready, but there wasn't an error, so keep polling events from the k8s API."
fixes: #521
This correctly has the work unit fail if the container has an ImagePullBackOff error.