
Kubernetes Invalid executor_config, pod_override filled with Encoding.VAR #26101

Closed · 1 of 2 tasks
hterik opened this issue Sep 1, 2022 · 2 comments · Fixed by #26191
hterik (Contributor) commented Sep 1, 2022

Apache Airflow version

2.3.4

What happened

Trying to start Kubernetes tasks using a pod_override results in pods not starting after upgrading from 2.3.2 to 2.3.4.

The pod_override looks very odd, filled with many Encoding.VAR objects; see the following scheduler log:

{kubernetes_executor.py:550} INFO - Add task TaskInstanceKey(dag_id='commit_check', task_id='sync_and_build', run_id='5776-2-1662037155', try_number=1, map_index=-1) with command ['airflow', 'tasks', 'run', 'commit_check', 'sync_and_build', '5776-2-1662037155', '--local', '--subdir', 'DAGS_FOLDER/dag_on_commit.py'] with executor_config {'pod_override': {'Encoding.VAR': {'Encoding.VAR': {'Encoding.VAR': {'metadata': {'Encoding.VAR': {'annotations': {'Encoding.VAR': {}, 'Encoding.TYPE': 'dict'}}, 'Encoding.TYPE': 'dict'}, 'spec': {'Encoding.VAR': {'containers': REDACTED 'Encoding.TYPE': 'k8s.V1Pod'}, 'Encoding.TYPE': 'dict'}}
{kubernetes_executor.py:554} ERROR - Invalid executor_config for TaskInstanceKey(dag_id='commit_check', task_id='sync_and_build', run_id='5776-2-1662037155', try_number=1, map_index=-1)

Looking in the UI, the task gets stuck in the scheduled state forever. Clicking Instance Details shows a similar state of the pod_override, with many Encoding.VAR entries.

This appears to be a recent change, introduced in 2.3.4 via #24356.
@dstandish, do you know whether this is connected?
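
For context, the pod_override is attached to the task roughly like this (a minimal sketch with placeholder container details and resources; the real DAG is redacted above):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

with DAG(dag_id="commit_check", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    sync_and_build = BashOperator(
        task_id="sync_and_build",
        bash_command="echo 'sync and build'",  # placeholder for the real build step
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "1", "memory": "16Gi"}  # placeholder resources
                            ),
                        )
                    ]
                )
            )
        },
    )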

What you think should happen instead

No response

How to reproduce

No response

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.0.0
apache-airflow-providers-cncf-kubernetes==4.3.0
apache-airflow-providers-common-sql==1.1.0
apache-airflow-providers-docker==3.1.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-postgres==5.2.0
apache-airflow-providers-sqlite==3.2.0
kubernetes==23.6.0

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

hterik added the area:core and kind:bug labels on Sep 1, 2022

joshzana commented Sep 1, 2022

+1, seeing the same thing on 2.3.4 with the KubernetesExecutor. Wanted to provide more info.

We run 40 DAGs with several thousand total daily tasks, and every day we have 5-10 tasks that get stuck in Queued state.

We see the following sequence in the logs:
First, this error appears the first time the task attempts to run:

Invalid executor_config for ["tenant_extraction_acme_recruiting_id_72","discovery_finder_on_29735_agg_month","scheduled__2022-08-31T11:00:00+00:00",1,-1]

Then, in a repeated loop, we see logs like this:

{base_executor.py:211} INFO - task TaskInstanceKey(dag_id='tenant_extraction_acme_recruiting_id_72', task_id='discovery_finder_on_29735_agg_month', run_id='scheduled__2022-08-31T11:00:00+00:00', try_number=1, map_index=-1) is still running

for each stuck task, followed shortly by:

{base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='tenant_extraction_acme_recruiting_id_72', task_id='discovery_finder_on_29735_agg_month', run_id='scheduled__2022-08-31T11:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)

This continues until the task is marked as failed in the Airflow UI.

The executor_config on the invalid tasks looks like this:

{'pod_override': {<Encoding.VAR: '__var'>: {'spec': {'containers': [{'name': 'base', 'resources': {'limits': {}, 'requests': {'memory': '16Gi', 'cpu': '1'}}}]}}, <Encoding.TYPE: '__type'>: <DagAttributeTypes.POD: 'k8s.V1Pod'>}}

Which is odd, because for the thousands of unaffected tasks it looks like this:

{'pod_override': {'api_version': None, 'kind': None, 'metadata': None, 'spec': {'active_deadline_seconds': None, 'affinity': None, 'automount_service_account_token': None, 'containers': [{'args': None, 'command': None, 'env': None, 'env_from': None, 'image': None, 'image_pull_policy': None, 'lifecycle': None, 'liveness_probe': None, 'name': 'base', 'ports': None, 'readiness_probe': None, 'resources': {'limits': {}, 'requests': {'cpu': '1', 'memory': '16Gi'}}, 'security_context': None, 'startup_probe': None, 'stdin': None, 'stdin_once': None, 'termination_message_path': None, 'termination_message_policy': None, 'tty': None, 'volume_devices': None, 'volume_mounts': None, 'working_dir': None}], 'dns_config': None, 'dns_policy': None, 'enable_service_links': None, 'ephemeral_containers': None, 'host_aliases': None, 'host_ipc': None, 'host_network': None, 'host_pid': None, 'hostname': None, 'image_pull_secrets': None, 'init_containers': None, 'node_name': None, 'node_selector': None, 'os': None, 'overhead': None, 'preemption_policy': None, 'priority': None, 'priority_class_name': None, 'readiness_gates': None, 'restart_policy': None, 'runtime_class_name': None, 'scheduler_name': None, 'security_context': None, 'service_account': None, 'service_account_name': None, 'set_hostname_as_fqdn': None, 'share_process_namespace': None, 'subdomain': None, 'termination_grace_period_seconds': None, 'tolerations': None, 'topology_spread_constraints': None, 'volumes': None}, 'status': None}}
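
For anyone puzzling over the bad shape: the Encoding.VAR / Encoding.TYPE keys are the __var/__type wrappers Airflow's serializer adds, so a config that still contains them was left in serialized form, and the deeply nested Encoding.VAR in the scheduler log at the top of this issue looks like that wrapping was applied more than once. A toy sketch (not Airflow's actual serializer, just the same idea) shows how re-applying the wrapper produces that nesting:

# Toy illustration only: a minimal __var/__type wrapper, applied twice,
# produces nested '__var' keys like the broken executor_config above.
def encode(value):
    if isinstance(value, dict):
        return {"__var": {k: encode(v) for k, v in value.items()}, "__type": "dict"}
    return value

pod_override = {"spec": {"containers": [{"name": "base"}]}}

once = encode(pod_override)    # normal single wrapping
twice = encode(once)           # wrapping an already-wrapped value
print(twice)                   # '__var' nested inside '__var', matching the broken shape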

@pacificsky

@dstandish this seems to introduce non-deterministic behavior for any pod that uses the KubernetesExecutor and specifies a pod override. Only pods that specify an override seem to be affected, and the issue appears to be that deserialization of the executor config is broken in some way in the new system.

This can be easily reproduced in the Airflow web UI by repeatedly reloading the task instance details pane for a task that uses the KubernetesExecutor and specifies an executor config. I was even able to crash the web UI this way.

airflow_pickle_bug.txt

@ashb ashb added this to the Airflow 2.4.0 milestone Sep 6, 2022
dstandish added a commit to astronomer/airflow that referenced this issue Sep 7, 2022
The bind processor logic had the effect of applying serialization logic multiple times, which produced an incorrect serialization output.

Resolves apache#26101.
jedcunningham pushed a commit that referenced this issue Sep 7, 2022
The bind processor logic had the effect of applying serialization logic multiple times, which produced an incorrect serialization output.

Resolves #26101.
jedcunningham pushed a commit to astronomer/airflow that referenced this issue Oct 4, 2022
The bind processor logic had the effect of applying serialization logic multiple times, which produced an incorrect serialization output.

Resolves apache#26101.

(cherry picked from commit 87108d7)
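
A rough sketch of the failure mode described in the commit message (illustrative only, not the actual ExecutorConfigType implementation): a SQLAlchemy bind processor that unconditionally serializes on write will wrap a value a second time if it was already serialized earlier in the write path.

# Illustrative sketch of the double-serialization, not the real column type.
import pickle
from sqlalchemy.types import TypeDecorator, LargeBinary

def wrap(value):
    # stand-in for Airflow's __var/__type serialization
    return {"__var": value, "__type": "dict"} if isinstance(value, dict) else value

class NaiveExecutorConfigType(TypeDecorator):
    impl = LargeBinary

    def process_bind_param(self, value, dialect):
        # Always serializing here is the problem: if the value was already
        # serialized upstream, it gets wrapped again before being stored.
        return pickle.dumps(wrap(value))

already_serialized = wrap({"pod_override": {"spec": {}}})
stored = NaiveExecutorConfigType().process_bind_param(already_serialized, dialect=None)
print(pickle.loads(stored))
# {'__var': {'__var': {'pod_override': {'spec': {}}}, '__type': 'dict'}, '__type': 'dict'}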
Sunil0612 added a commit to grofers/airflow that referenced this issue Jan 18, 2023