ECONNRESET error in scheduler using KubernetesExecutor on AKS #13916
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
I have exactly the same problem.
@gillbuchanan Thank you for reporting this. Just to clarify the source of the issue, does Airflow run correctly on your AKS setup with any of the other executors?
Yes. I've tried this using
For me it fails no matter which executor. Here's my helm command:
I added the ClusterRoleBinding but it doesn't help. Any help appreciated!
We also have this problem. Additionally, DAG pods receive a SIGTERM and get killed after running for 30 minutes.
I also have the same issue with KubernetesExecutor on AKS... It happens every 15 mins for some reason...
I'm having a similar issue. Sometimes tasks are tagged as success and some as failed, and they seem to be getting SIGTERMs. This is also on AKS.
We're having the same issue running Airflow 2.0.1 on AKS.
It seems you linked back to this same issue. Is there another issue that is related to this?
@gillbuchanan
Any movement here? Currently I'm using
Hello everyone. I'm having the same problem and I can't find the reason why either.
This issue seems identical to mine. Please try applying the patch in #14974.
Did the patch @mrpowerus suggested work for you, @luis-serra-ki @gillbuchanan?
I have pre-emptively marked this for 2.0.2, but it might not make it in time for that release; let's see.
Turning on TCP keepalive may (should) also help:
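(A minimal sketch of what turning that on looks like. The `[kubernetes] enable_tcp_keepalive` option is the one behind the AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE env var mentioned later in this thread; the snippet itself is illustrative, not a quote from the docs:)

```ini
# airflow.cfg
[kubernetes]
enable_tcp_keepalive = True
```

Equivalently, via environment variable: `AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE=True`.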
@jedcunningham, I agree this should help. But for me it unfortunately doesn't.
@gillbuchanan Can you test it with Airflow 2.0.2 - https://github.com/apache/airflow/blob/2.0.2/UPDATING.md#airflow-202 - where we updated the default for
I'm experiencing something similar, but in 1.10.14, with the same symptoms @mrpowerus described in #14974 (like the absence of …). The problem is that according to the 1.10.14 docs, there's no AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE env to set, and I'm not sure if there's another way to set it.
Guys, I work with @alete89. Another solution for this, especially if you are on older Airflow versions that still don't have the AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE configuration key, is to execute the following in a Python script at some point during Airflow startup:

```python
import socket

from urllib3.connection import HTTPConnection

# Enable TCP keepalive on every socket urllib3 opens.
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 20),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10),
]
```

This apparently worked for us. It basically sets on urllib3 (the library Airflow uses for connectivity under the hood) the same parameters mentioned in this issue and in other places on the internet. In our case there was apparently some TCP hangup that caused all available executor parallelism capacity to be consumed.
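(For the snippet above to take effect, it has to run in the scheduler process before the Kubernetes client opens its connection. One candidate place that Airflow imports early, in both 1.10 and 2.x, is an optional airflow_local_settings.py module on the PYTHONPATH; whether that is early enough for a given setup is something to verify, so treat this placement as a suggestion rather than a confirmed mechanism.)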
Is this still an issue in the latest Airflow version & Kubernetes provider?
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version: 2.0.0
Kubernetes version:
Environment:
What happened:
After installing Airflow on AKS via the Helm chart, the webserver and scheduler start up as expected. After some time (with activity, or while sitting idle) the scheduler spits out the following:
scheduler error messages
What you expected to happen:
The scheduler should run (or sit idle) without errors.
How to reproduce it:
Unknown
Anything else we need to know:
Steps I've taken to debug:
Based on the location of the errors in the stack trace, I assumed the error was related to the KubernetesExecutor making an API request for a list of pods. To debug this, I exec'ed into the scheduler pod and ran the equivalent bash commands, which initially gave me a 403 Forbidden error. I then created the following ClusterRoleBinding (rbac-read.yaml):
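(The contents of rbac-read.yaml weren't captured in this thread. As a rough sketch, a read-only ClusterRoleBinding could look like the following; the binding, service account, and namespace names are assumptions, not the reporter's actual file:)

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rbac-read              # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                   # built-in read-only role
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler    # assumed service account
    namespace: airflow         # assumed namespace
```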
Afterward, the above bash commands successfully returned a list of pods in the cluster. I then opened a Python shell (still within the scheduler pod) and successfully ran the same pod-list call. Given that this ran successfully, I'm at a loss as to why I'm still getting the ECONNRESET error.
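(For reference, a pod-list call from a Python shell along these lines; a minimal sketch using the official kubernetes client, with the namespace name being an assumption:)

```python
from kubernetes import client, config

# Inside the scheduler pod, use the mounted service-account credentials;
# outside the cluster you would call config.load_kube_config() instead.
config.load_incluster_config()

v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace="airflow")  # assumed namespace
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```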