Network instability can freeze the KubernetesJobWatcher #12644
Comments
@guillemborrell We're also seeing this with kubernetes. We seem to hit:

We're on version:

We're seeing this regularly when kubernetes undergoes maintenance on the master node. There is a similar issue here against the python k8s client, which I believe airflow is using:

Separately, we're seeing issues with the pod operator when using the kubernetes task executor. The parent pod appears to stop seeing events from the child pod it created. Thus it never sees that the child pod completes, leaving the parent pod, and hence the job, hanging forever.
There is also this issue, which seems to match what we see when using the kubernetes task executor and launching child pods with the pod operator: kubernetes-client/python#1234
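For reference, a minimal sketch (not Airflow's own code; the namespace and pod name are placeholders) of the watch pattern the pod operator and the job watcher both rely on, showing where a silently dropped connection leaves the loop waiting forever:

```python
from kubernetes import client, config, watch

# Minimal sketch, not Airflow's actual code. The namespace and pod name are
# placeholders for a child pod launched by the pod operator.
config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
# No _request_timeout is set here: if the connection dies silently (for example
# during API-server maintenance), this loop never receives the final
# "Succeeded" event and the caller hangs forever, which is the behaviour
# described in kubernetes-client/python#1234.
for event in w.stream(
    v1.list_namespaced_pod,
    namespace="airflow",
    field_selector="metadata.name=my-child-pod",
):
    phase = event["object"].status.phase
    print(event["type"], phase)
    if phase in ("Succeeded", "Failed"):
        break
```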
Looking at our logs from when the k8s maintenance happens, we also get this error once the API server comes back:

Not sure whether this means it no longer sees the events of pods that were created before maintenance started and have since stopped. Maintenance seems to take around 3 or 4 minutes; if pods stop during that time, will airflow notice once it reconnects, or will it only see events from after reconnecting? Either way, we end up in a state where no new pods are launched and old completed pods are not cleaned up. The cleanup only happens once the scheduler pod is terminated; at that point it does notice that the completed pods have finished and cleans them up.

There might be a fix for this already in master, or at least there seems to be a change to how orphaned pods are collected. When we restart the scheduler we see:

This then identifies the completed pods. However, it only runs on startup. In master this has been modified and now runs every 5 minutes by default.
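To check whether a deployment is stuck in this state, here is a small sketch (the namespace and label selector are assumptions for a typical KubernetesExecutor setup) that lists the completed worker pods such a cleanup pass would pick up:

```python
from kubernetes import client, config

# Sketch only: list worker pods left in a terminal phase, i.e. the pods the
# scheduler's startup cleanup reports. The namespace and the label selector
# are assumptions for a typical KubernetesExecutor deployment.
config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

completed = v1.list_namespaced_pod(
    namespace="airflow",
    field_selector="status.phase=Succeeded",
    label_selector="airflow-worker",
)
for pod in completed.items:
    print(pod.metadata.name, pod.status.phase)
```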
We consistently experienced kubernetes executor slot starvation, as described above, where worker pods get stuck in a completed state and are never deleted due to indefinite blocking in the KubernetesJobWatcher watch: https://github.com/apache/airflow/blob/1.10.14/airflow/executors/kubernetes_executor.py#L315-L322

The indefinite blocking is due to the lack of tcp keepalives or of a default _request_timeout (socket timeout) in kube_client_request_args. We were able to reproduce this behaviour consistently by injecting network faults, or by clearing the conntrack state on the node where the scheduler was running as part of an overlay network. Setting a socket timeout (_request_timeout in kube_client_request_args) prevents executor slot starvation, since the KubernetesJobWatcher recovers once the timeout is reached and properly cleans up worker pods stuck in the completed state.

We currently set the _request_timeout to 10 minutes, so we won't see a timeout unless there is a network fault, since the kubernetes watch itself expires before that (after 5 minutes). I think it makes sense to consider setting a default _request_timeout, even if the value is high, to protect against executor slot starvation and unavailability during network faults.
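For reference, a minimal sketch of the effect of that setting at the client level, assuming the standard kubernetes Python client; the namespace and the 600-second value are only examples, and in Airflow itself the value comes from kube_client_request_args rather than being hard-coded:

```python
import urllib3
from kubernetes import client, config, watch

# Sketch only, assuming the standard kubernetes Python client. In Airflow the
# same effect comes from airflow.cfg, e.g.
#   [kubernetes]
#   kube_client_request_args = { "_request_timeout": 600 }
# which is passed through to the watch call below; it is not hard-coded.
config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
try:
    # With a read timeout, a silently dead connection raises instead of
    # blocking forever, so the watcher can be restarted and the completed
    # worker pods get cleaned up.
    for event in w.stream(v1.list_namespaced_pod, namespace="airflow", _request_timeout=600):
        print(event["type"], event["object"].metadata.name)
except urllib3.exceptions.ReadTimeoutError:
    print("watch timed out; the caller would re-open the watch here")
```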
Hello, any update on this issue? I have the same problem in Airflow 2.0.0.
Is this still reproducible in the latest Airflow version?
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version: 1.10.10 and 1.10.12 so far
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.16.10
Environment:
Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:
What happened:
We are running Airflow with the KubernetesExecutor on a Kubernetes cluster with a somewhat unreliable network. The scheduler runs smoothly until it logs the following line:
From then on, the executor is still able to schedule jobs, but the job watcher stops processing events. The resulting behaviour is that new jobs are scheduled, but all jobs with status Completed remain in the cluster instead of being terminated. We mitigate this condition by periodically restarting the scheduler, which cleans up all the completed tasks correctly.

What you expected to happen:
Either the KubernetesJobWatcher opens a new connection and continues processing events, or it raises an exception that shuts the process down so it can be restarted.
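A minimal sketch of the first option, assuming the standard kubernetes Python client (the namespace, label selector, and timeout value are placeholders, and this is not the actual KubernetesJobWatcher code): re-open the watch whenever the socket timeout fires, resuming from the last seen resourceVersion.

```python
import urllib3
from kubernetes import client, config, watch

# Hedged sketch of the expected behaviour, not the actual KubernetesJobWatcher
# code: keep watching worker pods and re-open the watch when the read times out.
config.load_incluster_config()
v1 = client.CoreV1Api()
resource_version = None

while True:
    kwargs = {
        "namespace": "airflow",              # assumption
        "label_selector": "airflow-worker",  # assumption
        "_request_timeout": 600,             # socket timeout so a dead connection cannot block forever
    }
    if resource_version:
        kwargs["resource_version"] = resource_version
    w = watch.Watch()
    try:
        for event in w.stream(v1.list_namespaced_pod, **kwargs):
            resource_version = event["object"].metadata.resource_version
            print(event["type"], event["object"].metadata.name)
    except urllib3.exceptions.ReadTimeoutError:
        # Dead or idle connection: fall through and start a fresh watch.
        continue
```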
How to reproduce the bug:
Take down the connection between the scheduler pod and the Kubernetes API.
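If staging a real cluster fault is inconvenient, a rough, cluster-free analogue (plain Python sockets, not Airflow code) shows why the missing read timeout is what turns a transient fault into a permanent hang:

```python
import socket
import threading
import time

# A peer that accepts the connection and then goes silent, roughly the way a
# watch connection behaves after its conntrack state is wiped: no data, no FIN,
# no RST ever reaches the client.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def silent_peer():
    conn, _ = server.accept()
    time.sleep(3600)  # never send anything, never close

threading.Thread(target=silent_peer, daemon=True).start()

conn = socket.create_connection(("127.0.0.1", port))
conn.settimeout(10)  # analogous to _request_timeout; remove this and recv() blocks forever
try:
    conn.recv(4096)
except socket.timeout:
    print("read timed out; a watcher with a timeout set would now restart its watch")
```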