Kubernetes pod operator: More than one pod running with labels #10544

Closed
stijndehaes opened this issue Aug 25, 2020 · 9 comments
Labels
kind:bug This is a clearly a bug

Comments

@stijndehaes
Contributor

Apache Airflow version: 1.10.11

Kubernetes version (if you are using kubernetes) (use kubectl version):

Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.9-eks-4c6976", GitCommit:"4c6976793196d70bc5cd29d56ce5440c9473648e", GitTreeState:"clean", BuildDate:"2020-07-17T18:46:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): not relevant
  • Kernel (e.g. uname -a): not relevant
  • Install tools: not relevant
  • Others: not relevant

What happened:

When launching a failing KubernetesPodOperator task with 4 retries, you get the following message on the second retry. This happens because the failed pods still exist on Kubernetes, so the list call in the pod operator:

pod_list = client.list_namespaced_pod(self.namespace, label_selector=label_selector)

also returns the failed pods. A fix would be to filter for only those pods that are not completed or failed.

[2020-08-25 08:51:17,856] {taskinstance.py:1150} ERROR - Pod Launching failed: More than one pod running with labels: dag_id=sample-python-failing,execution_date=2020-08-24T0300000000-71ba3e273,task_id=ingest-weather
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 276, in execute
    '{label_selector}'.format(label_selector=label_selector))
airflow.exceptions.AirflowException: More than one pod running with labels: dag_id=sample-python-failing,execution_date=2020-08-24T0300000000-71ba3e273,task_id=ingest-weather
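A minimal sketch of the proposed filtering, using the official kubernetes Python client (the namespace and label selector below are illustrative, not taken from the operator's code):

from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

label_selector = "dag_id=sample-python-failing,task_id=ingest-weather"
pod_list = core_v1.list_namespaced_pod("default", label_selector=label_selector)

# Ignore pods that have already finished (Succeeded or Failed) so that
# leftover pods from failed attempts do not trigger
# "More than one pod running with labels".
active_pods = [p for p in pod_list.items if p.status.phase not in ("Succeeded", "Failed")]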

What you expected to happen:

To start my container anew.

How to reproduce it:
To reproduce, start a DAG with a KubernetesPodOperator that has retries and fails. Set retries to more than 2, as the issue starts happening on the third try.

Anything else we need to know:

I will make a PR myself to fix this.

stijndehaes added the kind:bug label on Aug 25, 2020

boring-cyborg bot commented Aug 25, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@stijndehaes
Contributor Author

Looking at the code, I notice a discrepancy between master and release 1.10.11.

In master I find this:

            if len(pod_list.items) == 1 and \
                    self._try_numbers_do_not_match(context, pod_list.items[0]) and \
                    self.reattach_on_restart:
                self.log.info("found a running pod with labels %s but a different try_number"
                              "Will attach to this pod and monitor instead of starting new one", labels)
                final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
            elif len(pod_list.items) == 1:
                self.log.info("found a running pod with labels %s."
                              "Will monitor this pod instead of starting new one", labels)
                final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
            else:
                self.log.info("creating pod with labels %s and launcher %s", labels, launcher)
                final_state, _, result = self.create_new_pod_for_operator(labels, launcher)

on 1.10.11:

            if len(pod_list.items) == 1 and \
                    self._try_numbers_do_not_match(context, pod_list.items[0]) and \
                    self.reattach_on_restart:
                self.log.info("found a running pod with labels %s but a different try_number"
                              "Will attach to this pod and monitor instead of starting new one", labels)
                final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
            elif len(pod_list.items) == 1:
                self.log.info("found a running pod with labels %s."
                              "Will monitor this pod instead of starting new one", labels)
                final_state, result = self.monitor_launched_pod(launcher, pod_list[0])
            else:
                final_state, _, result = self.create_new_pod_for_operator(labels, launcher)

So in the first case, master will always attach to the pod, while 1.10.11 will create a new pod. I assume 1.10.11 is actually the behaviour you want, because on master a new pod is never launched when the previous pod has failed: it just monitors the old one and fails again.

@stijndehaes
Contributor Author

I just noticed that my proposed fix, filtering on pending and running pods, breaks the flow when restarting the scheduler, since it will no longer find your completed pod. I am thinking the flow should change to the following (see the sketch after this list):

  • Search for a pod with the same labels (including the try_id); if it is found, attach to that pod.
  • If it is not found, we should probably always start a new one. I don't think it makes sense to attach to the pod of another try?
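A minimal sketch of that flow, assuming the variables from the snippets above (client, labels, launcher, and the operator's self) and that labels already include the try number:

# Sketch only: the selector is assumed to include the try number, so it
# matches at most the pod belonging to this exact attempt.
label_selector = ",".join("{}={}".format(k, v) for k, v in labels.items())
pod_list = client.list_namespaced_pod(self.namespace, label_selector=label_selector)

if pod_list.items:
    # A pod for this try already exists (e.g. after a scheduler restart):
    # attach to it and monitor it instead of starting a new one.
    final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
else:
    # No pod for this exact try: always create a new one instead of
    # attaching to a pod left over from a previous try.
    final_state, _, result = self.create_new_pod_for_operator(labels, launcher)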

@stijndehaes
Contributor Author

Looks like it is fixed by: #10230

@jrzdudek

We've tested this with 1.10.12, which includes the #10230 fix, and the issue still exists. When the first job errors out, the subsequent retries reattach to the errored-out pod.

found a running pod with labels {'dag_id': 'xxxx, 'task_id': 'xxxxx', 'execution_date': '2020-09-02T1000000000-7b0e1e2af', 'try_number': '2'} but a different try_number. Will attach to this pod and monitor instead of starting new one

The pod is in Error status in the cluster, so it should not be considered a candidate for re-attachment.

@tduriez-bc

The problem still exists on Airflow version 1.10.14+composer

@matan129

matan129 commented Apr 19, 2021

Happens on 2.0.1 as well. Any updates?

@tfedyanin

I guess it happens when I use a manual retry.
There is no try_number=1 label to provide uniqueness between runs, so on the first manual retry the pod is created correctly, but uniqueness is violated on the next manual retries.
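For illustration only (hypothetical DAG and task values, assuming the official kubernetes Python client): a selector without a try_number label matches the pods from every attempt, while one that includes it matches at most a single attempt's pod:

from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

base_selector = "dag_id=sample-python-failing,task_id=ingest-weather"
# Matches the pods from every attempt of this task instance.
all_attempts = core_v1.list_namespaced_pod("default", label_selector=base_selector)
# Matches at most the single pod belonging to try number 2.
one_attempt = core_v1.list_namespaced_pod("default", label_selector=base_selector + ",try_number=2")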

@jedcunningham
Member

This should be fixed by #18070.
