Airflow Scheduler with Kubernetes Executor has errors in logs and stuck slots with no running tasks #36478

Closed
1 of 2 tasks
crabio opened this issue Dec 28, 2023 · 14 comments · Fixed by #36240
Labels
area:core good first issue kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet pending-response provider:cncf-kubernetes Kubernetes provider related issues stale Stale PRs per the .github/workflows/stale.yml policy file


crabio commented Dec 28, 2023

Apache Airflow version

2.8.0

If "Other Airflow 2 version" selected, which one?

2.7.3

What happened?

I found a race condition between two schedulers:

  1. Scheduler 1 starts the task pod.
  2. The task finishes.
  3. Scheduler 2 processes the task completion and kills the pod.
  4. Scheduler 2 updates the task's status in the database, but Scheduler 1 still thinks the task is running.
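
The sequence above can be pictured with a toy model. This is only a sketch under assumptions: `ToyScheduler`, `launch`, and `handle_finished_pod` are hypothetical names, not Airflow classes or methods, and the real executor is far more involved. It only illustrates how a slot can stay occupied when the scheduler that launched a pod never processes its completion.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for one scheduler's executor state (hypothetical, not Airflow code)."""

    name: str
    parallelism: int
    running: set = field(default_factory=set)

    @property
    def open_slots(self) -> int:
        # Free slots are whatever remains after counting tasks believed to be running.
        return self.parallelism - len(self.running)

    def launch(self, task_id: str, pods: dict) -> None:
        # The launching scheduler reserves a slot and creates the pod.
        self.running.add(task_id)
        pods[task_id] = "Running"

    def handle_finished_pod(self, task_id: str, pods: dict, db: dict) -> None:
        # Record the final state, delete the pod, and free a local slot if one was held.
        db[task_id] = "success"
        pods.pop(task_id, None)
        self.running.discard(task_id)


def simulate() -> tuple[int, int, dict]:
    pods: dict = {}
    db: dict = {}
    s1 = ToyScheduler("scheduler-1", parallelism=2)
    s2 = ToyScheduler("scheduler-2", parallelism=2)

    s1.launch("task_a", pods)                   # Scheduler 1 starts the task pod
    pods["task_a"] = "Succeeded"                # the task finishes
    s2.handle_finished_pod("task_a", pods, db)  # Scheduler 2 handles it and kills the pod
    # Scheduler 1 never saw the completion, so its slot is still counted as used.
    return s1.open_slots, s2.open_slots, db


if __name__ == "__main__":
    print(simulate())
```

In this model the database already says the task succeeded, yet Scheduler 1 reports one slot fewer than its configured parallelism. Repeating the pattern would eventually leave it with no open slots and no running tasks.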

What you think should happen instead?

Multiple schedulers should handle concurrent work correctly with the Kubernetes Executor.

How to reproduce

  1. Deploy Airflow via Community Helm chart with Kubernetes Executor
  2. Add more than 50 tasks in 2-3 DAGs

Operating System

Docker based on apache/airflow:2.7.3

Versions of Apache Airflow Providers

apache-airflow == 2.7.3
dbt-core == 1.6.6
dbt-snowflake == 1.6.4
apache-airflow-providers-snowflake
apache-airflow[statsd]
facebook-business == 16.0.2
google-ads == 21.1.0
twitter-ads == 11.0.0
acryl-datahub-airflow-plugin
acryl-datahub[dbt]
checksumdir
filelock
openpyxl
cronsim
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-apache-kafka == 1.2.0
kubernetes
snowplow_analytics_sdk

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else?

logs:
Untitled discover search.csv

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@crabio crabio added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Dec 28, 2023
Author

crabio commented Dec 28, 2023

Discussed in #35426

Contributor

dirrao commented Dec 29, 2023

Hi @crabio,
The K8s executor runs out of open_slots over time. This is caused by schedulers adopting completed pods. We found a leak in the open slots: it accumulates over time until no slots are left to schedule tasks. This is fixed in #36240. Multiple fixes are available in CNCF Kubernetes provider 7.12.0. Please try it and let us know.

@RNHTTR RNHTTR added provider:cncf-kubernetes Kubernetes provider related issues and removed needs-triage label for new issues that we didn't triage yet labels Dec 29, 2023
Author

crabio commented Dec 29, 2023

Thank you very much, @dirrao!
I will try it and report back soon.

Author

crabio commented Jan 2, 2024

@dirrao It seems to work well, thank you!
I will watch the fix for a few days and reopen the issue if anything goes wrong.

@crabio crabio closed this as completed Jan 2, 2024
Author

crabio commented Jan 5, 2024

Hello @dirrao
I upgraded to apache-airflow-providers-cncf-kubernetes==7.13.0 and increased the scheduler replicas to 2. In our dev environment everything was fine, but in prod, where we have more load, the scheduler starts leaking again.
image

@crabio crabio reopened this Jan 5, 2024
@dirrao dirrao added the needs-triage label for new issues that we didn't triage yet label Jan 5, 2024
Contributor

dirrao commented Jan 5, 2024

@crabio,
Thanks for the update. This requires triaging. Meanwhile, you can bump the parallelism configuration to a higher number to outpace the leak, or restart the scheduler after a certain number of iterations to reset these values.
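
As a rough sketch, both workarounds could be expressed as environment overrides for the scheduler container. The variable names follow Airflow's `AIRFLOW__{SECTION}__{KEY}` convention for the `[core] parallelism` and `[scheduler] num_runs` options. The helper name `scheduler_env_overrides` and the numbers used are illustrative assumptions, and the restart only happens if something outside Airflow (for example the pod's restart policy) brings the scheduler back after it exits. Check that `num_runs` behaves this way in your Airflow version before relying on it.

```python
from __future__ import annotations


def scheduler_env_overrides(parallelism: int, num_runs: int) -> dict[str, str]:
    """Build environment overrides for the two workarounds (hypothetical helper).

    parallelism: per-scheduler cap on concurrently running task instances.
    num_runs: scheduler loops before the process exits; -1 means run forever.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be a positive integer in this sketch")
    if num_runs != -1 and num_runs < 1:
        raise ValueError("num_runs must be -1 (unlimited) or a positive integer")
    return {
        "AIRFLOW__CORE__PARALLELISM": str(parallelism),
        "AIRFLOW__SCHEDULER__NUM_RUNS": str(num_runs),
    }


if __name__ == "__main__":
    # Illustrative values only: a larger slot pool plus a periodic exit.
    for key, value in scheduler_env_overrides(parallelism=64, num_runs=5000).items():
        print(f"{key}={value}")
```

Raising parallelism only delays slot exhaustion, while the periodic exit clears the scheduler's in-memory slot accounting, so the two can be combined.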

Author

crabio commented Jan 5, 2024

@dirrao Yes, thank you!

I rolled our cluster back to 1 scheduler with parallelism 64 to keep the same capacity :)
As for restarting: I found that this was only possible in Airflow < 2.x.x, because the scheduler run duration option was removed from the configuration.
I could do it manually or via a Kubernetes job :)
But that does not seem like a good idea.

Contributor

dirrao commented Jan 10, 2024

> @dirrao Yes, thank you!
>
> I rolled our cluster back to 1 scheduler with parallelism 64 to keep the same capacity :) As for restarting: I found that this was only possible in Airflow < 2.x.x, because the scheduler run duration option was removed from the configuration. I could do it manually or via a Kubernetes job :) But that does not seem like a good idea.

You can deduce the number of iterations from the scheduler loop time metric.
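
A minimal sketch of that arithmetic, assuming you have an average loop time in seconds from your metrics backend and a target restart interval. The function name and the example numbers are hypothetical. The estimate is only as good as the measured loop time, which may not include idle time between loops.

```python
import math


def iterations_between_restarts(mean_loop_seconds: float, restart_every_seconds: float) -> int:
    """Estimate how many scheduler loops fit into the desired restart interval."""
    if mean_loop_seconds <= 0 or restart_every_seconds <= 0:
        raise ValueError("both durations must be positive")
    # Round down so the restart happens no later than the target interval,
    # but always allow at least one loop.
    return max(1, math.floor(restart_every_seconds / mean_loop_seconds))


if __name__ == "__main__":
    # Illustrative numbers: 1.5 s per loop, restart roughly every 6 hours.
    print(iterations_between_restarts(1.5, 6 * 60 * 60))
```

The result is the value you would use as the iteration limit for the scheduler.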


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 25, 2024

github-actions bot commented Feb 2, 2024

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 2, 2024
Author

crabio commented Apr 9, 2024

Hi!
I have tested this bug on Airflow 2.8.4 and it is still present...

Author

crabio commented Apr 9, 2024

@ephraimbuddy @dirrao could you reopen the issue?

Author

crabio commented Apr 9, 2024

duplicated in #36998

Contributor

duplicated in #36998
