Kubernetes pod operator: More than one pod running with labels #10544

Closed
stijndehaes opened this issue Aug 25, 2020 · 9 comments
Labels
kind:bug This is a clearly a bug

Comments

@stijndehaes
Contributor

Apache Airflow version: 1.10.11

Kubernetes version (if you are using kubernetes) (use kubectl version):

Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.9-eks-4c6976", GitCommit:"4c6976793196d70bc5cd29d56ce5440c9473648e", GitTreeState:"clean", BuildDate:"2020-07-17T18:46:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): not relevant
  • Kernel (e.g. uname -a): not relevant
  • Install tools: not relevant
  • Others: not relevant

What happened:

When launching a failing KubernetesPodOperator task with 4 retries, you get the following message on the second retry. This happens because the failed pods still exist on Kubernetes, so the list call in the pod operator:

pod_list = client.list_namespaced_pod(self.namespace, label_selector=label_selector)

also returns the failed pods. A fix would be to filter for only those pods that are not completed or failed.

[2020-08-25 08:51:17,856] {taskinstance.py:1150} ERROR - Pod Launching failed: More than one pod running with labels: dag_id=sample-python-failing,execution_date=2020-08-24T0300000000-71ba3e273,task_id=ingest-weather
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 276, in execute
    '{label_selector}'.format(label_selector=label_selector))
airflow.exceptions.AirflowException: More than one pod running with labels: dag_id=sample-python-failing,execution_date=2020-08-24T0300000000-71ba3e273,task_id=ingest-weather
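A minimal sketch of the proposed filtering, using the official kubernetes Python client (the namespace and label selector below are illustrative, not taken from the operator's code):

from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

label_selector = "dag_id=sample-python-failing,task_id=ingest-weather"
pod_list = core_v1.list_namespaced_pod("default", label_selector=label_selector)

# Ignore pods that have already finished (Succeeded or Failed) so that
# leftover pods from failed attempts do not trigger
# "More than one pod running with labels".
active_pods = [p for p in pod_list.items if p.status.phase not in ("Succeeded", "Failed")]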

What you expected to happen:

To start my container anew.

How to reproduce it:
To reproduce, start a DAG with a KubernetesPodOperator that has retries and fails. Set retries to more than 2, as the issue starts happening on the third try.

Anything else we need to know:

I will make a PR myself to fix this.

stijndehaes added the kind:bug label on Aug 25, 2020

boring-cyborg bot commented Aug 25, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@stijndehaes
Contributor Author

Looking at the code, I notice a discrepancy between master and release 1.10.11.

In master I find this:

            if len(pod_list.items) == 1 and \
                    self._try_numbers_do_not_match(context, pod_list.items[0]) and \
                    self.reattach_on_restart:
                self.log.info("found a running pod with labels %s but a different try_number"
                              "Will attach to this pod and monitor instead of starting new one", labels)
                final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
            elif len(pod_list.items) == 1:
                self.log.info("found a running pod with labels %s."
                              "Will monitor this pod instead of starting new one", labels)
                final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
            else:
                self.log.info("creating pod with labels %s and launcher %s", labels, launcher)
                final_state, _, result = self.create_new_pod_for_operator(labels, launcher)

on 1.10.11:

            if len(pod_list.items) == 1 and \
                    self._try_numbers_do_not_match(context, pod_list.items[0]) and \
                    self.reattach_on_restart:
                self.log.info("found a running pod with labels %s but a different try_number"
                              "Will attach to this pod and monitor instead of starting new one", labels)
                final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
            elif len(pod_list.items) == 1:
                self.log.info("found a running pod with labels %s."
                              "Will monitor this pod instead of starting new one", labels)
                final_state, result = self.monitor_launched_pod(launcher, pod_list[0])
            else:
                final_state, _, result = self.create_new_pod_for_operator(labels, launcher)

So in the first case, master will always attach to the pod, while 1.10.11 will create a new pod. I assume 1.10.11 is actually the behaviour you want, because on master a new pod is never launched when the previous pod has failed: it just monitors the old one and fails again.

@stijndehaes
Contributor Author

I just noticed that my proposed fix, filtering on pending and running pods, breaks the flow when restarting the scheduler, since it will no longer find your completed pod. I am thinking the flow should change to the following (see the sketch after this list):

  • Search for a pod with the same labels (including the try_id); if it is found, attach to that pod.
  • If it is not found, we should probably always start a new one. I don't think it makes sense to attach to the pod of another try?
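A minimal sketch of that flow, assuming the variables from the snippets above (client, labels, launcher, and the operator's self) and that labels already include the try number:

# Sketch only: the selector is assumed to include the try number, so it
# matches at most the pod belonging to this exact attempt.
label_selector = ",".join("{}={}".format(k, v) for k, v in labels.items())
pod_list = client.list_namespaced_pod(self.namespace, label_selector=label_selector)

if pod_list.items:
    # A pod for this try already exists (e.g. after a scheduler restart):
    # attach to it and monitor it instead of starting a new one.
    final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
else:
    # No pod for this exact try: always create a new one instead of
    # attaching to a pod left over from a previous try.
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)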

@stijndehaes
Contributor Author

Looks like it is fixed by: #10230

@jrzdudek

We've tested this with 1.10.12, which includes the #10230 fix, and the issue still exists. When the first job errors out, the subsequent retries reattach to the errored-out pod.

found a running pod with labels {'dag_id': 'xxxx, 'task_id': 'xxxxx', 'execution_date': '2020-09-02T1000000000-7b0e1e2af', 'try_number': '2'} but a different try_number. Will attach to this pod and monitor instead of starting new one

The pod is in Error status in the cluster, so it should not be considered a candidate for re-attachment.

@tduriez-bc

The problem still exists on Airflow version 1.10.14+composer

@matan129

matan129 commented Apr 19, 2021

Happens on 2.0.1 as well. Any updates?

@tfedyanin

I guess it happens when I use a manual retry.
There is no try_number=1 label to provide uniqueness between runs, so on the first manual retry the pod is created correctly, but uniqueness is violated on the next manual retries.
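For illustration only (hypothetical DAG and task values, assuming the official kubernetes Python client): a selector without a try_number label matches the pods from every attempt, while one that includes it matches at most a single attempt's pod:

from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

base_selector = "dag_id=sample-python-failing,task_id=ingest-weather"
# Matches the pods from every attempt of this task instance.
all_attempts = core_v1.list_namespaced_pod("default", label_selector=base_selector)
# Matches at most the single pod belonging to try number 2.
one_attempt = core_v1.list_namespaced_pod("default", label_selector=base_selector + ",try_number=2")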

@jedcunningham
Member

This should be fixed by #18070.
