
KubernetesExecutor does not spawn Kubernetes Pod Operator pods #17629

Closed
SimonOsipov opened this issue Aug 15, 2021 · 4 comments
Labels
affected_version:2.1 (Issues Reported for 2.1) · affected_version:2.2 (Issues Reported for 2.2) · area:helm-chart (Airflow Helm Chart) · kind:bug (This is clearly a bug) · provider:cncf-kubernetes (Kubernetes provider related issues)

Comments

@SimonOsipov

Apache Airflow version: 2.1.2 (but happened on 2.0.2 as well)

OS: debian

Apache Airflow Provider versions:

apache-airflow-providers-celery==2.0.0
apache-airflow-providers-cncf-kubernetes==2.0.1
apache-airflow-providers-docker==2.1.0
apache-airflow-providers-elasticsearch==2.0.2
apache-airflow-providers-ftp==2.0.0
apache-airflow-providers-google==5.0.0
apache-airflow-providers-hashicorp==2.0.0
apache-airflow-providers-http==2.0.0
apache-airflow-providers-imap==2.0.0
apache-airflow-providers-mysql==2.1.0
apache-airflow-providers-postgres==2.0.0
apache-airflow-providers-redis==2.0.0
apache-airflow-providers-slack==4.0.0
apache-airflow-providers-sqlite==2.0.0
apache-airflow-providers-ssh==2.1.0

Deployment:

GKE

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.900", GitCommit:"28ab8501be88ea42e897ca8514d7cd0b436253d9", GitTreeState:"clean", BuildDate:"2021-06-30T09:23:36Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}

What happened:
Airflow was working fine, but out of the blue it stopped working properly.
The DAG starts and then its first task fails, even though the same tasks were running fine a week ago. The KubernetesExecutor stopped spawning KubernetesPodOperator pods, and there are no logs or errors. If I apply the template I use for the operator directly (kubectl apply -f), it runs successfully.
Airflow uses the same service account for the operator and the executor, so if the executor pod is created successfully, the service account itself should not be the problem.
Also, since the DAG syncs successfully, git-sync is not the problem.
kubectl get events does not show any attempt to create pods for the task.
dag_processor_manager.log is also free of any errors.

So the executor pod is created successfully and runs to Completed, but no Pod Operator pod is ever created.
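
To double-check the service-account point above, here is a minimal sketch (my own, not part of the DAG) that could be run from inside the worker pod, assuming the kubernetes Python client is available in the image; it asks the API server whether the mounted service account is allowed to create pods:

from kubernetes import client, config

# Use the token and CA that automountServiceAccountToken mounts into the pod.
config.load_incluster_config()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="airflow",  # hypothetical namespace, adjust to the real one
            verb="create",
            resource="pods",
        )
    )
)
result = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("can create pods:", result.status.allowed)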

Executor

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  restartPolicy: Never
  serviceAccountName: airflow # this account has rights to create pods
  automountServiceAccountToken: true
  volumes:
      - name: dags
        emptyDir: {}
      - name: logs
        emptyDir: {}
      - configMap:
          name: airflow-git-sync-configmap
        name: airflow-git-sync-configmap
  initContainers:
    - name: git-sync-clone
      securityContext:
        runAsUser: 65533
        runAsGroup: 65533
      image: k8s.gcr.io/git-sync/git-sync:v3.3.1
      imagePullPolicy: Always
      volumeMounts:
        - mountPath: /tmp/git
          name: dags
      resources:
         ...
      args: ["--one-time"]
      envFrom:
      - configMapRef:
          name: airflow-git-sync-configmap
      - secretRef:
          name: airflow-git-sync-secret
  containers:
    - name: base
      image: <artifactory_url>/airflow:latest
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags
        - name: logs
          mountPath: /opt/airflow/logs
      imagePullPolicy: Always

Pod template

apiVersion: v1
kind: Pod
metadata:
   ....
spec:
  serviceAccountName: airflow
  automountServiceAccountToken: true
  volumes:
    - name: sql
      emptyDir: {}
  initContainers:
    - name: git-sync
      image: k8s.gcr.io/git-sync/git-sync:v3.3.1
      imagePullPolicy: Always
      args: ["--one-time"]
      volumeMounts:
        - name: sql
          mountPath: /tmp/git/
      resources:
        requests:
          memory: 300Mi
          cpu: 500m
        limits:
          memory: 600Mi
          cpu: 1000m
      envFrom:
      - configMapRef:
          name: git-sync-configmap
      - secretRef:
          name: airflow-git-sync-secret
  containers:
    - name: base
      imagePullPolicy: Always
      image: <artifactory_url>/clickhouse-client-gcloud:20.6.4.44
      volumeMounts:
        - name: sql
          mountPath: /opt/sql
      resources:
         ....
      env:
        - name: GS_SERVICE_ACCOUNT
          valueFrom:
            secretKeyRef:
              name: gs-service-account
              key: service_account.json
        - name: DB_CREDENTIALS
          valueFrom:
            secretKeyRef:
              name: estimation-db-secret
              key: db_cred.json

DAG

from textwrap import dedent

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

TEMPLATE_PATH = "/opt/airflow/dags/airflow-dags.git/pod_templates"

args = {
    ...
}


def create_pipeline(dag_: DAG):

    task_startup_client = KubernetesPodOperator(
        name="clickhouse-client",
        task_id="clickhouse-client",
        labels={"application": "clickhouse-client-gsutil"},
        pod_template_file=f"{TEMPLATE_PATH}/template.yaml",
        cmds=["sleep", "60000"],
        reattach_on_restart=True,
        is_delete_operator_pod=False,
        get_logs=True,
        log_events_on_failure=True,
        dag=dag_,
    )
  
    task_startup_client


with DAG(
    dag_id="MANUAL-GKE-clickhouse-client",
    default_args=args,
    schedule_interval=None,
    max_active_runs=1,
    start_date=days_ago(2),
    tags=["utility"],
) as dag:

    create_pipeline(dag)
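
Side note: the import above uses the old airflow.contrib path, which still resolves on Airflow 2.x but is deprecated; with apache-airflow-providers-cncf-kubernetes installed, the provider-style import would be:

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

This is only import hygiene and does not change the behaviour reported here.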

In the scheduler logs:

...
Event: manualgkeclickhouseclientaticlickhouseclient.9959fa1fd13a4b6fbdaf40549a09d2f9 Succeeded
...

Executor logs:

[2021-08-15 18:40:27,045] {settings.py:208} DEBUG - Setting up DB connection pool (PID 1)
[2021-08-15 18:40:27,046] {settings.py:276} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=1
[2021-08-15 18:40:27,095] {cli_action_loggers.py:40} DEBUG - Adding <function default_action_log at 0x7f0556c5e280> to pre execution callback
[2021-08-15 18:40:28,070] {cli_action_loggers.py:66} DEBUG - Calling callbacks: [<function default_action_log at 0x7f0556c5e280>]
[2021-08-15 18:40:28,106] {settings.py:208} DEBUG - Setting up DB connection pool (PID 1)
[2021-08-15 18:40:28,107] {settings.py:244} DEBUG - settings.prepare_engine_args(): Using NullPool
[2021-08-15 18:40:28,109] {dagbag.py:496} INFO - Filling up the DagBag from /opt/airflow/dags/ati-airflow-dags.git/dag_clickhouse-client.py
[2021-08-15 18:40:28,110] {dagbag.py:311} DEBUG - Importing /opt/airflow/dags/ati-airflow-dags.git/dag_clickhouse-client.py
/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py:26 DeprecationWarning: This module is deprecated. Please use `kubernetes.client.models.V1Volume`.
/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py:27 DeprecationWarning: This module is deprecated. Please use `kubernetes.client.models.V1VolumeMount`.
[2021-08-15 18:40:28,135] {dagbag.py:461} DEBUG - Loaded DAG <DAG: MANUAL-GKE-clickhouse-client>
[2021-08-15 18:40:28,176] {plugins_manager.py:281} DEBUG - Loading plugins
[2021-08-15 18:40:28,176] {plugins_manager.py:225} DEBUG - Loading plugins from directory: /opt/airflow/plugins
[2021-08-15 18:40:28,177] {plugins_manager.py:205} DEBUG - Loading plugins from entrypoints
[2021-08-15 18:40:28,238] {plugins_manager.py:418} DEBUG - Integrate DAG plugins
Running <TaskInstance: MANUAL-GKE-clickhouse-client.clickhouse-client 2021-08-15T18:39:38.150950+00:00 [queued]> on host manualgkeclickhouseclientclickhouseclient.9959fa1fd13a4b6fbd
[2021-08-15 18:40:28,670] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
[2021-08-15 18:40:28,670] {settings.py:302} DEBUG - Disposing DB connection pool (PID 1)

What you expected to happen:
I expected the executor to create a pod for the Kubernetes Pod Operator, or at least to raise some kind of error.

@SimonOsipov added the kind:bug label on Aug 15, 2021
@SimonOsipov (Author)

The only thing remotely related to errors I found:

[2021-08-15 22:09:15,049] {kubernetes_executor.py:172} ERROR - Encountered Error response from k8s list namespaced pod stream => {'type': 'ERROR', 'object': {'api_version': 'v1',
 'kind': 'Status',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'initializers': None,
              'labels': None,
              'managed_fields': None,
              'name': None,
              'namespace': None,
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': None,
 'status': {'conditions': None,
            'container_statuses': None,
            'host_ip': None,
            'init_container_statuses': None,
            'message': None,
            'nominated_node_name': None,
            'phase': None,
            'pod_ip': None,
            'qos_class': None,
            'reason': None,
            'start_time': None}}, 'raw_object': {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 1599063897 (1599080636)', 'reason': 'Expired', 'code': 410}}
[2021-08-15 22:09:15,049] {kubernetes_executor.py:175} INFO - Kubernetes resource version is too old, must reset to 0 => ('too old resource version: 1599063897 (1599080636)',)
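
For context, my understanding (a sketch of the pattern, not Airflow's actual code): the executor keeps a watch on its worker pods and remembers the last resourceVersion it saw; once the API server has compacted that version away, the watch gets a 410 "Expired" status like the one above and has to reset to 0 and re-list. With the kubernetes Python client, and a hypothetical namespace and label selector, the pattern looks roughly like this:

from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()

resource_version = "0"  # "0" means: start from the server's current state
w = watch.Watch()
while True:
    try:
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace="airflow",                        # hypothetical
            label_selector="kubernetes_executor=True",  # hypothetical
            resource_version=resource_version,
        ):
            # Remember where we are so the watch can resume after a disconnect.
            resource_version = event["object"].metadata.resource_version
            # ... handle ADDED / MODIFIED / DELETED pod events here ...
    except client.ApiException as exc:
        if exc.status == 410:        # "too old resource version" / Expired
            resource_version = "0"   # reset, which is what the log line above reports
        else:
            raise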

@SimonOsipov (Author)

I redeployed the ConfigMap used by the git-sync container, and the problem above (Kubernetes resource version is too old) is gone.
But the problem on Kubernetes 1.20 remains:
Airflow successfully creates executor pods, but it does not create any Kubernetes Pod Operator pods, and no specific errors are logged.

@mymzheka

mymzheka commented Mar 19, 2022

Same situation with Airflow 2.2.4 deployed from Helm chart 1.5.0,
OpenShift 3.11, Kubernetes v1.11.0.
Update: same result for k8s 1.21 and Airflow 2.1.4, no error message.
I have no idea where I am going wrong.
Update 2: it worked on 1.21 with in_cluster=True.
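
In case it helps anyone else: in_cluster is a regular KubernetesPodOperator argument, so, as a sketch reusing the names from the DAG above, the change would look like this:

task_startup_client = KubernetesPodOperator(
    name="clickhouse-client",
    task_id="clickhouse-client",
    pod_template_file=f"{TEMPLATE_PATH}/template.yaml",
    in_cluster=True,  # use the in-cluster config and service account instead of a kube_config file
    dag=dag_,
)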

@eladkal added the area:helm-chart and affected_version:2.2 labels on Mar 20, 2022
@apache locked and limited conversation to collaborators on Mar 21, 2022
@potiuk converted this issue into discussion #22407 on Mar 21, 2022
