Kubernetes pod operator: More than one pod running with labels #10544
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Looking at the code I notice a discrepancy between master and release 1.10.11.

In master I find this:

```python
if len(pod_list.items) == 1 and \
        self._try_numbers_do_not_match(context, pod_list.items[0]) and \
        self.reattach_on_restart:
    self.log.info("found a running pod with labels %s but a different try_number"
                  "Will attach to this pod and monitor instead of starting new one", labels)
    final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
elif len(pod_list.items) == 1:
    self.log.info("found a running pod with labels %s."
                  "Will monitor this pod instead of starting new one", labels)
    final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
else:
    self.log.info("creating pod with labels %s and launcher %s", labels, launcher)
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
```

On 1.10.11:

```python
if len(pod_list.items) == 1 and \
        self._try_numbers_do_not_match(context, pod_list.items[0]) and \
        self.reattach_on_restart:
    self.log.info("found a running pod with labels %s but a different try_number"
                  "Will attach to this pod and monitor instead of starting new one", labels)
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
elif len(pod_list.items) == 1:
    self.log.info("found a running pod with labels %s."
                  "Will monitor this pod instead of starting new one", labels)
    final_state, result = self.monitor_launched_pod(launcher, pod_list[0])
else:
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
```

So in the first case, master will always attach to the existing pod, while 1.10.11 will create a new pod. I assume the 1.10.11 behaviour is the one you want, because on master a new pod will never be launched once the pod has failed; it will just monitor the old one and fail again.
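For illustration only, here is a minimal sketch (not code from either branch) of the kind of phase check that would avoid reattaching to a pod that has already failed. It reuses `pod_list`, `launcher`, and `labels` from the snippets above; `RESUMABLE_PHASES` is an assumption for the sketch, not an existing constant:

```python
# Sketch: only reattach when the existing pod can still make progress,
# otherwise create a fresh one.
RESUMABLE_PHASES = ("Pending", "Running")  # assumption for this sketch

if len(pod_list.items) == 1 and pod_list.items[0].status.phase in RESUMABLE_PHASES:
    final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
else:
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
```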
I just noticed that what I planned to do, filtering on pending and running pods, breaks the flow when restarting the scheduler, since it will not find your completed pod. I am thinking the flow should change into:
Looks like it is fixed by #10230.
We've tested this with
The pod is in
The problem still exists on Airflow version 1.10.14+composer |
Happens on |
I guess it happens when I use manual retry. |
This should be fixed by #18070. |
Apache Airflow version: 1.10.11
Kubernetes version (if you are using kubernetes) (use `kubectl version`):
Environment (`uname -a`): not relevant

What happened:
When launching a failing KubernetesPodOperator job with 4 retries, you get the following message on the second retry. This is because the failed pods still exist on Kubernetes, so the list call in the pod operator:
also returns the failed objects. A fix would be to filter on only pods that are not completed or failed.
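For illustration, a hedged sketch of what such filtering could look like with the Kubernetes Python client; `namespace` and `label_selector` are placeholders rather than the operator's actual attributes:

```python
# Sketch of the proposed fix: list only pods that have not already finished,
# by adding a field selector on status.phase to the existing label lookup.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core_v1 = client.CoreV1Api()

pod_list = core_v1.list_namespaced_pod(
    namespace=namespace,            # placeholder: the operator's target namespace
    label_selector=label_selector,  # placeholder: the task's identifying labels
    field_selector="status.phase!=Succeeded,status.phase!=Failed",
)
```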
What you expected to happen:
To start my container anew.
How to reproduce it:
To reproduce, start a DAG with a KubernetesPodOperator that fails and has retries. Set the retries to more than 2, as it starts happening on the third try.
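A minimal reproduction sketch for Airflow 1.10.x (the DAG id, namespace, and image are placeholders): a task that always exits non-zero, with more than 2 retries and the failed pods left in place.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="kpo_retry_repro",          # placeholder DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    always_fails = KubernetesPodOperator(
        task_id="always_fails",
        name="always-fails",
        namespace="default",           # placeholder namespace
        image="busybox",               # placeholder image
        cmds=["sh", "-c", "exit 1"],   # force a failure on every attempt
        retries=4,                     # more than 2, so the third try hits the issue
        retry_delay=timedelta(minutes=1),
        is_delete_operator_pod=False,  # failed pods remain on the cluster, as in the report
    )
```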
Anything else we need to know:
I will make a PR myself to fix this.