produce error if encounter ImagePullBackoff #522
Conversation
This correctly has the work unit fail if the container has an ImagePullBackOff error:

{'DVxy89tF': {'Detail': 'Error creating pod: container failed to start with ImagePullBackoff error, ImagePullBackOff',
              'ExtraData': {'Command': '', 'Image': '', 'KubeConfig': '',
                            'KubeNamespace': 'controller-integration-1641911565',
                            'KubePod': '', 'Params': '', 'PodName': 'kube-6tw5v'},
              'State': 3, 'StateName': 'Failed', 'StdoutSize': 0, 'WorkType': 'kube'}}
@shanemcd I'd like to get your thoughts on this one before merging. Basically, the idea is that if the pod cannot enter the "running" state due to an ImagePullError (incorrect image name in the podspec, incorrect credentials supplied, registry is down, etc.), then we fail the work unit. The system will retry the image pull a number of times, currently set by [...].

This feature is similar to the pod_pending_timeout option we currently have in receptor. However, there is a good use case where users will want to set pod_pending_timeout to something much longer, say 1 hour, but still duck out early if images cannot be pulled. For example, k8s might not be starting pods because resources are not available (if cpu/memory requests are set in the podspec), so pods will stay pending until resources are freed up. In that situation, users will want to bump pod_pending_timeout from 5 minutes to something much longer, but having to wait 1 hour to learn that an image cannot be pulled because you typed the image name incorrectly is a bad experience. So this feature targets the ImagePullError case specifically, whereas pod_pending_timeout is a general catch-all for any situation where the pod cannot be started.
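For concreteness, here is a minimal sketch of the kind of check described above: it inspects a pod's container statuses for an image pull failure. The helper name podHasImagePullFailure and its placement are illustrative assumptions, not code from this PR.

```go
// Hypothetical helper: detect containers stuck on an image pull.
package kubesketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podHasImagePullFailure reports whether any container in the pod is waiting
// because its image cannot be pulled (ErrImagePull or ImagePullBackOff), and
// returns a human-readable reason if so.
func podHasImagePullFailure(pod *corev1.Pod) (bool, string) {
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil {
			continue
		}
		if w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff" {
			return true, fmt.Sprintf("container %q failed to start: %s: %s",
				cs.Name, w.Reason, w.Message)
		}
	}
	return false, ""
}
```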
}
}
}

return false, nil
Under which conditions would we hit this? Should the error really be nil if the pod can't start?
podRunningAndReady is being passed into a watcher function:

ev, err := watch2.UntilWithSync(ctxPodReady, lw, &corev1.Pod{}, nil, podRunningAndReady())

UntilWithSync calls podRunningAndReady in a loop, once for each "event" it pulls from the k8s API, and it only terminates when podRunningAndReady returns true or returns an error. So returning false, nil means, "no, the pod isn't ready, but there wasn't an error, so keep polling events from the k8s API."
fixes: #521
This correctly has the work unit fail if the container has an ImagePullBackOff error.