
KubernetesExecutor: All task pods are terminating with error while task succeed #16020

Closed
andormarkus opened this issue May 24, 2021 · 12 comments
Labels
kind:bug This is clearly a bug provider:cncf-kubernetes Kubernetes provider related issues

Comments

@andormarkus
Contributor

Apache Airflow version: 2.0.2+
Kubernetes version: 1.20
Helm chart version: 1.0.0

What happened:
Successful task pods are terminating with error.

I did further testing with different versions; my test results are below:

  • 2.0.1-python3.8 - OK
  • 2.0.2-python3.6 - NOK (helm chart default image)
  • 2.0.2-python3.8 - NOK
  • 2.1.0-python3.8 - NOK

[Screenshot: 2021-05-24 at 12:40:30]

▶ kubectl -n airflow get pods

NAME                                              READY   STATUS      RESTARTS   AGE
airflow-s3-sync-1621853400-x8hzv                  0/1     Completed   0          11s
airflow-scheduler-865c754f55-6fdkt                2/2     Running     0          5m45s
airflow-scheduler-865c754f55-hqbv2                2/2     Running     0          5m45s
airflow-scheduler-865c754f55-hw65l                2/2     Running     0          5m45s
airflow-statsd-84f4f9898-r9xxm                    1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-28jxv                1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-d8wrf                1/1     Running     0          5m45s
airflow-webserver-7c66d4cd99-xn2hq                1/1     Running     0          5m45s
simpledagsleep.4862fcd4ec8c4adfb10e421feee88745   0/1     Error       0          2m25s
▶ kubectl -n airflow logs simpledagsleep.4862fcd4ec8c4adfb10e421feee88745

BACKEND=postgresql
DB_HOST=XXXXXXXXXXXXXXXXXXXXXXXX
DB_PORT=5432

[2021-05-24 10:47:57,843] {dagbag.py:451} INFO - Filling up the DagBag from /opt/airflow/dags/simple_dag.py
[2021-05-24 10:47:58,147] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=aws_default
[2021-05-24 10:47:58,780] {base_aws.py:391} WARNING - Unable to use Airflow Connection for credentials.
[2021-05-24 10:47:58,780] {base_aws.py:392} INFO - Fallback on boto3 credential strategy
[2021-05-24 10:47:58,781] {base_aws.py:395} INFO - Creating session using boto3 credential strategy region_name=eu-central-1
Running <TaskInstance: simple_dag.sleep 2021-05-24T10:47:46.486143+00:00 [queued]> on host simpledagsleep.4862fcd4ec8c4adfb10e421feee88745
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 235, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 142, in _execute
    self.on_kill()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 157, in on_kill
    self.task_runner.on_finish()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/base_task_runner.py", line 178, in on_finish
    self._error_file.close()
  File "/usr/local/lib/python3.8/tempfile.py", line 499, in close
    self._closer.close()
  File "/usr/local/lib/python3.8/tempfile.py", line 436, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpt63agqia'
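The FileNotFoundError at the bottom of that traceback comes from tempfile's own cleanup: NamedTemporaryFile(delete=True) unlinks its backing file inside close(), so if the error file has already been removed by the time on_finish() runs, close() itself raises. The tempfile behaviour (not the Airflow code) can be reproduced in isolation:

```python
import os
import tempfile

# NamedTemporaryFile(delete=True) unlinks its backing file inside close().
# If something else removed the file first, close() raises FileNotFoundError,
# which is exactly the error that marks the pod as failed above. (POSIX.)
f = tempfile.NamedTemporaryFile(delete=True)
os.unlink(f.name)  # simulate the file being cleaned up elsewhere
try:
    f.close()
except FileNotFoundError as exc:
    print(f"close() raised: {exc}")
```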

How to reproduce it:

simple_dag.py

import time

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "owner"           : "airflow",
    "depends_on_past" : False,
    "start_date"      : datetime(2020, 1, 1),
    "email"           : ["[email protected]"],
    "email_on_failure": False,
    "email_on_retry"  : False,
    "retries"         : 1,
    "retry_delay"     : timedelta(minutes=5)
}


def sleep():
    time.sleep(60)
    return True


with DAG("simple_dag", default_args=default_args, schedule_interval="@once", catchup=False) as dag:
    t1 = PythonOperator(task_id="sleep", python_callable=sleep)

myconf.yaml

executor: KubernetesExecutor
fernetKey: "XXXXXXXXXX"

defaultAirflowTag: "2.0.2-python3.8"
airflowVersion: "2.0.2"


config:
  logging:
    colored_console_log: "True"
    remote_logging: "True"
    remote_base_log_folder: "cloudwatch://${log_group_arn}"
    remote_log_conn_id: "aws_default"
  core:
    load_examples: "False"
    store_dag_code: "True"
    parallelism: "1000"
    dag_concurrency: "1000"
    max_active_runs_per_dag: "1000"
    non_pooled_task_slot_count: "1000"
  scheduler:
    job_heartbeat_sec: 5
    scheduler_heartbeat_sec: 5
    parsing_processes: 2
  webserver:
    base_url: "http://${web_url}/airflow"
  secrets:
    backend: "airflow.contrib.secrets.aws_systems_manager.SystemsManagerParameterStoreBackend"
    backend_kwargs: XXXXXXXXXX

webserver:
  replicas: 3
  nodeSelector:
    namespace: airflow
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
  service:
    type: NodePort
ingress:
  enabled: true
  web:
    precedingPaths:
      - path: "/*"
        serviceName: "ssl-redirect"
        servicePort: "use-annotation"
    path: "/airflow/*"
    annotations:
      external-dns.alpha.kubernetes.io/hostname: ${web_url}
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
      alb.ingress.kubernetes.io/certificate-arn: ${aws_acm_certificate_arn}
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
scheduler:
  replicas: 3
  nodeSelector:
    namespace: airflow
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
workers:
  serviceAccount:
    name: ${service_account_name}
    annotations:
      eks.amazonaws.com/role-arn: ${service_account_iamrole_arn}
dags:
  persistence:
    enabled: true
    storageClassName: ${storage_class_dags}
logs:
  persistence:
    enabled: true
    storageClassName: ${storage_class_logs}
postgresql:
  enabled: false
data:
  metadataSecretName: ${metadata_secret_name}
@kaxil
Member

kaxil commented May 24, 2021

cc @ephraimbuddy @Dr-Denzy If you two can pair up and take a look when you have time, see if you can reproduce it

@ephraimbuddy
Contributor

cc @ephraimbuddy @Dr-Denzy If you two can pair up and take a look when you have time, see if you can reproduce it

Cool. @Dr-Denzy let's try this tomorrow

@cccs-cat001
Contributor

Hi, we're experiencing this issue as well on 2.0.2 and 2.1.0. Are there any updates on a potential fix? It's blocking us from upgrading to the latest versions.

@ephraimbuddy
Contributor

Hi, we're experiencing this issue as well on 2.0.2 and 2.1.0. Are there any updates on a potential fix? It's blocking us from upgrading to the latest versions.

Please share your log messages

@cccs-cat001
Contributor

@ephraimbuddy Here's the logs (sorry I'm a little late!)

Traceback (most recent call last):
  File "/opt/conda/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/cli.py", line 91, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 237, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 102, in _execute
    self.task_runner.start()
  File "/opt/conda/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 41, in start
    self.process = self._start_by_fork()
  File "/opt/conda/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 92, in _start_by_fork
    logging.shutdown()
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 2127, in shutdown
    h.close()
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 1163, in close
    stream.close()
  File "/opt/conda/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1266, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
Running <TaskInstance: demo_git_notebook_parameterized.demo_git_notebook_parameterized 2021-06-28T14:00:08.688806+00:00 [queued]> on host demogitnotebookparameterizeddemogitnotebookparameterized.b1a282
Traceback (most recent call last):
  File "/opt/conda/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/cli.py", line 91, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 237, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 147, in _execute
    self.on_kill()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 166, in on_kill
    self.task_runner.on_finish()
  File "/opt/conda/lib/python3.8/site-packages/airflow/task/task_runner/base_task_runner.py", line 179, in on_finish
    self._error_file.close()
  File "/opt/conda/lib/python3.8/tempfile.py", line 499, in close
    self._closer.close()
  File "/opt/conda/lib/python3.8/tempfile.py", line 436, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpcrf6qxnn'
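The first traceback here is Airflow's SIGTERM handler in taskinstance.py re-raising the signal as an AirflowException; the second is the same tempfile cleanup failure as in the original report. The handler pattern can be sketched in isolation (a simplified illustration, not the actual Airflow code):

```python
import os
import signal


class AirflowException(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


def signal_handler(signum, frame):
    # Mirrors the handler registered in airflow/models/taskinstance.py:
    # a SIGTERM delivered to the task process is re-raised as a Python
    # exception so the task fails with a traceback instead of dying silently.
    raise AirflowException("Task received SIGTERM signal")


signal.signal(signal.SIGTERM, signal_handler)

try:
    os.kill(os.getpid(), signal.SIGTERM)  # deliver SIGTERM to ourselves
except AirflowException as exc:
    print(exc)  # Task received SIGTERM signal
```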

@ephraimbuddy
Contributor

Can you check if it still happens in main? I'm thinking it's related to #16227

@trucnguyenlam

trucnguyenlam commented Jul 16, 2021

@ephraimbuddy How is it going with this issue? We are also experiencing it on version 2.1.1.

@ephraimbuddy
Contributor

@ephraimbuddy How is it going with this issue? We are also experiencing it on version 2.1.1.

I believe this issue was fixed in #16289 and will be available in 2.1.3.
In the meantime, set this environment variable: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False
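With the official Helm chart used above, that variable can be injected through the chart's env value. A minimal sketch of the addition to myconf.yaml (assuming the chart's standard top-level env list):

```yaml
# Workaround until the fix ships: disable mini-scheduling after task execution.
env:
  - name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
    value: "False"
```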

@ephraimbuddy
Contributor

Still happens. It will be fixed by #18269

@deveshbajaj59

deveshbajaj59 commented Sep 24, 2021

@ephraimbuddy Any update on this issue? It still persists in Airflow 2.1.4. I am specifically getting this error when passing a pod template.

@ephraimbuddy
Contributor

@ephraimbuddy Any update on this issue? It still persists in Airflow 2.1.4. I am specifically getting this error when passing a pod template.

It'll be out in 2.2.0

@kaxil kaxil closed this as completed Dec 30, 2021
@kaxil
Member

kaxil commented Dec 30, 2021

Closing since 2.2.0 is out, but we can reopen if you are facing the same issue.
