
ECONNRESET error in scheduler using KubernetesExecutor on AKS #13916

Closed
will-m-buchanan opened this issue Jan 26, 2021 · 24 comments
Labels
affected_version:2.0 · kind:bug · pending-response · priority:high · provider:cncf-kubernetes · stale

Comments

will-m-buchanan commented Jan 26, 2021

Apache Airflow version: 2.0.0

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.13", GitCommit:"37c06f456fdb4d25e402b5fbcb72cd6a77a021a9", GitTreeState:"clean", BuildDate:"2020-09-18T21:59:14Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: Azure Kubernetes Service
  • Image: apache/airflow:2.0.0-python3.6
  • Config Variables:
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
AIRFLOW__CORE__DONOT_PICKLE=false
AIRFLOW__CORE__ENABLE_XCOM_PICKLING=false
AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
AIRFLOW__CORE__FERNET_KEY=*****
AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=bash -c 'eval "$DATABASE_SQLALCHEMY_CMD"'
AIRFLOW__ELASTICSEARCH__WRITE_STDOUT=True
AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF=my-name-env
AIRFLOW__KUBERNETES__NAMESPACE=airflow
AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE=/home/airflow/scripts/pod-template.yaml
AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME=my-name
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=wasb://airflow-logs@******.blob.core.windows.net
AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
AIRFLOW__WEBSERVER__BASE_URL=http://****/my-name
AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8080

What happened:

After installing Airflow on AKS via the Helm chart, the webserver and scheduler start up as expected. After some time (with activity, or while sitting idle) the scheduler spits out the following:

scheduler error messages
[2021-01-26 16:22:08,620] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
    raise SocketError(str(e))
OSError: (104, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
    kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 436, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 763, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 693, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 318, in recv_into
    raise SocketError(str(e))
OSError: (104, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
    kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 792, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, \'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
[2021-01-26 16:22:10,177] {kubernetes_executor.py:266} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2021-01-26 16:22:10,189] {kubernetes_executor.py:126} INFO - Event: and now my watch begins starting at resource_version: 0
[2021-01-26 16:23:00,720] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs

What you expected to happen:

The scheduler should run (or sit idle) without errors.

How to reproduce it:
Unknown

Anything else we need to know:

Steps I've taken to debug:
Based on the location of the errors in the stack trace, I assumed the error was related to the KubernetesExecutor making an API request for a list of pods. To debug this, I exec'd into the scheduler pod and ran

KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/pods/

which initially gave me a 403 Forbidden error. I then created the following ClusterRoleBinding:

rbac-read.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: system:serviceaccount:airflow:my-name:read-pods
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: my-name
    namespace: airflow
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io

Afterward, the above bash commands successfully returned a list of pods in the cluster. I then opened a Python shell (still within the scheduler pod) and successfully ran

>>> from kubernetes import client, config
>>> config.load_incluster_config()
>>> v1 = client.CoreV1Api()
>>> pods = v1.list_pod_for_all_namespaces(watch=False)
>>> airflow_pods = v1.list_namespaced_pod("airflow")

Given that this ran successfully, I'm at a loss as to why I'm still getting the ECONNRESET error.
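
Worth noting: the traceback above fails inside kubernetes/watch/watch.py, i.e. in the long-lived watch stream, not in a one-shot list call like the ones tested here. A closer reproduction of what KubernetesJobWatcher does would be a sketch like the one below (assuming the same kubernetes client and the airflow namespace), which holds a streaming connection open and should hit the same ECONNRESET if idle connections are being reset:

>>> from kubernetes import client, config, watch
>>> config.load_incluster_config()
>>> v1 = client.CoreV1Api()
>>> w = watch.Watch()
>>> # keeps a streaming connection to the API server open, like the scheduler's pod watcher
>>> for event in w.stream(v1.list_namespaced_pod, namespace="airflow", timeout_seconds=600):
...     print(event["type"], event["object"].metadata.name)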

will-m-buchanan added the kind:bug label Jan 26, 2021
boring-cyborg bot commented Jan 26, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

ams0 commented Jan 27, 2021

I have exactly the same problem; KubernetesExecutor doesn't run any job. I applied the ClusterRoleBinding (using the airflow-scheduler service account) but no luck.

vikramkoka added the affected_version:2.0 label Jan 27, 2021
vikramkoka (Contributor) commented:

@gillbuchanan Thank you for reporting this. Just to clarify the source of the issue, does Airflow run correctly on your AKS setup with any of the other executors?

vikramkoka added the provider:cncf-kubernetes label Jan 27, 2021
will-m-buchanan (Author) commented:

> @gillbuchanan Thank you for reporting this. Just to clarify the source of the issue, does Airflow run correctly on your AKS setup with any of the other executors?

Yes. I've tried this using the SequentialExecutor and did not run into issues (obviously I wouldn't use this in production).

ams0 commented Jan 27, 2021

For me it fails no matter which executor. Here's my helm command:

helm upgrade --install airflow . --namespace airflow --create-namespace \
  --values ../../airflow-values.yaml \
  --set executor="SequentialExecutor" \
  --set webserver.allowPodLogReading=true \
  --set webserver.defaultUser.password="xxx"  \
  --set ingress.enabled=true \
  --set ingress.web.host="airflow.ingress.xxx.com" \
  --set ingress.web.tls.enabled=true \
  --set ingress.web.annotations."cert-manager\.io/cluster-issuer"=letsencrypt-prod \
  --set ingress.web.tls.secretName=airflow-tls-secret \
  --set dags.persistence.enabled=false \
  --set dags.gitSync.enabled=true \
  --set dags.gitSync.repo="https://github.com/ams0/dags.git" \
  --set dags.gitSync.subPath="dags" \
  --set dags.gitSync.branch=main \
  --set images.gitSync.repository="k8s.gcr.io/git-sync/git-sync" \
  --set images.gitSync.tag="v3.2.0" \
  --set airflow.fernetKey="xxx="

I added the ClusterRoleBinding but it doesn't help. Any help appreciated!

newhardwarefound commented Feb 10, 2021

We also have this problem. Additionally, DAG pods receive SIGTERM and get killed after running for 30 minutes.

stoiandl commented:

I also have the same issue with KubernetesExecutor on AKS... It happens every 15 mins for some reason...

grillorafael commented:

I'm having a similar issue. Sometimes tasks are tagged as success, some are tagged as failed, and they seem to be getting SIGTERMs. Also using AKS.

StephanZaat commented Mar 1, 2021

We're having the same issue running Airflow 2.0.1 on AKS.
It seems this issue is related:
#13916

will-m-buchanan (Author) commented:

> We're having the same issue running Airflow 2.0.1 on AKS.
> It seems this issue is related:
> #13916

It seems you linked back to this same issue. Is there another issue that is related to this?

StephanZaat commented:

@gillbuchanan
Indeed. Too many tabs and too late! ;)
#14175

will-m-buchanan (Author) commented:

Any movement here? Currently I'm using the CeleryExecutor instead of the KubernetesExecutor and am not facing the same issue, but I would like to move back to the KubernetesExecutor.

serrovsky commented:

Hello everyone. I'm having the same problem and I can't find the reason why either.

mrpowerus commented Mar 30, 2021

This issue seems identical to mine. Please try applying the patch in #14974.
It would be nice to know if it helps! @gillbuchanan

vikramkoka added the priority:high label Mar 30, 2021
kaxil (Member) commented Mar 31, 2021

Did the patch @mrpowerus suggested work for you @luis-serra-ki @gillbuchanan?

kaxil added this to the Airflow 2.0.2 milestone Mar 31, 2021
kaxil (Member) commented Mar 31, 2021

I have pre-emptively marked this for 2.0.2, but the fix might not land in time for it; let's see.

jedcunningham (Member) commented:

mrpowerus commented:

@jedcunningham, I agree this should help. But for me it unfortunately doesn't.

kaxil (Member) commented Apr 21, 2021

@gillbuchanan Can you test it with Airflow 2.0.2 - https://github.com/apache/airflow/blob/2.0.2/UPDATING.md#airflow-202 - where we updated the default for [kubernetes] enable_tcp_keepalive to True?
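
For reference, if anyone wants to set this explicitly rather than rely on the new default, the corresponding environment variables would look roughly like this (a sketch based on the 2.0.2 config reference; only enable_tcp_keepalive needs to be True, and the probe values shown are illustrative, not the shipped defaults):

AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE=True
AIRFLOW__KUBERNETES__TCP_KEEP_IDLE=120
AIRFLOW__KUBERNETES__TCP_KEEP_INTVL=30
AIRFLOW__KUBERNETES__TCP_KEEP_CNT=6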

alete89 commented Apr 29, 2021

I'm experiencing something similar, but on 1.10.14, with the same symptoms @mrpowerus described in #14974 (like the absence of Event: lines in the logging), and also open_slots going down to 0: it seems the executor never realizes tasks are getting done and never frees the slots (although they are marked as done in the UI).

The problem is that, according to the 1.10.14 docs, there's no AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE env var to set, and I'm not sure if there's another way to set it.

Frietziek commented Apr 30, 2021

Guys, I work with @alete89. Another solution for this, especially if you are on an older Airflow version that still doesn't have the AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE configuration key, is to execute the following in a Python script at some point during Airflow startup:

from urllib3.connection import HTTPConnection
import socket

# Enable TCP keepalive on every connection urllib3 opens, so that idle watch
# connections to the Kubernetes API server are not silently dropped.
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # turn keepalive on
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 20),   # seconds idle before the first probe
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5),   # seconds between probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10),    # failed probes before dropping the connection
]

This apparently worked for us; it basically sets on urllib3 (the library Airflow uses for connectivity under the hood) the same parameters mentioned in this issue and in other places on the internet.

In our case, apparently, some TCP hang-ups were causing all of the executor's available parallelism slots to be consumed.
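
(For anyone applying this on 1.10.x: one place that should run early enough, assuming a standard setup, is $AIRFLOW_HOME/config/airflow_local_settings.py, which Airflow imports during startup, so the patch is applied before the scheduler opens its pod-watch connection.)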

ashb modified the milestones: Airflow 2.0.3, Airflow 2.1.1 May 7, 2021
kaxil removed this from the Airflow 2.1.1 milestone Jun 22, 2021
eladkal (Contributor) commented Nov 5, 2021

Is this still an issue with the latest Airflow version & Kubernetes provider?

github-actions bot commented Dec 9, 2021

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

github-actions bot added the stale label Dec 9, 2021
github-actions bot commented:

This issue has been closed because it has not received a response from the issue author.
