
Scheduler not terminating in case of repeated DB errors. #43440

Closed
2 tasks done
iw-pavan opened this issue Oct 28, 2024 · 1 comment · Fixed by #43645
Labels
area:core, area:MetaDB, area:Scheduler, good first issue, kind:bug

Comments

@iw-pavan
Contributor

iw-pavan commented Oct 28, 2024

Apache Airflow version

2.10.2

If "Other Airflow 2 version" selected, which one?

No response

What happened?

The scheduler was running and launching tasks normally.
Suddenly there was an authentication error on database operations.

psycopg2.OperationalError: connection to server at "<Host>" (<IP>), port 6432 failed: FATAL:  server login has been failing, try again later (server_login_retry)
connection to server at "<HOST>" (<IP>), port 6432 failed: FATAL:  server login has been failing, try again later (server_login_retry)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/airflow/jobs/scheduler_job_runner.py", line 984, in _execute
    self._run_scheduler_loop()

After a few retries it exited the scheduler loop, but the process was not terminated.
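
The following is a standalone sketch in plain Python (not Airflow code, and the names are stand-ins) of what this failure mode looks like: the main loop exits on an error, but a leftover non-daemon worker, like the executor's event watcher, keeps the interpreter alive, so the process never terminates.

```python
import threading
import time


def watcher() -> None:
    # Stands in for the executor's event watcher, which keeps restarting its
    # watch on timeouts and never checks whether the main loop has died.
    for _ in range(10):  # bounded here so the sketch eventually ends
        print("watch timed out, restarting watch")
        time.sleep(1)


threading.Thread(target=watcher, daemon=False).start()  # non-daemon on purpose

try:
    raise RuntimeError("simulated repeated DB error")  # scheduler loop "dies"
except RuntimeError:
    print("scheduler loop exited")

# The script only finishes once the watcher loop completes; with an unbounded
# watcher (as in the report), the process would never terminate.
```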

What you think should happen instead?

After shutting down all executors and the dag_processor, the process should exit.

How to reproduce

Use hybrid executors with Celery and Kubernetes.

Introduce DB errors.

Operating System

Mac/Linux

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else?

The logs below keep repeating, which indicates that some threads have not exited.

[2024-10-27T06:17:10.658+0000] {kubernetes_executor_utils.py:101} INFO - Kubernetes watch timed out waiting for events. Restarting watch.
[2024-10-27T06:17:11.658+0000] {kubernetes_executor_utils.py:140} INFO - Event: and now my watch begins starting at resource_version: 0
[2024-10-27T06:17:11.702+0000] {kubernetes_executor_utils.py:309} INFO - Event: 666aac59b268675b6b2590ff-bs-8ace-s4sjuxfo is Running, annotations: <omitted>
[2024-10-27T06:17:11.712+0000] {kubernetes_executor_utils.py:309} INFO - Event: 666aac59b268675b6b2590ff-bs-44fe-iwuzjfao is Running, annotations: <omitted>
[2024-10-27T06:17:41.715+0000] {kubernetes_executor_utils.py:101} INFO - Kubernetes watch timed out waiting for events. Restarting watch.

I see an old PR for a similar issue: #28685.
Should I change the catch block to catch all exceptions?
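
As a hypothetical sketch only (this is not the actual scheduler_job_runner.py code, and the names below are stand-ins), the shape of the change being asked about would be to catch all exceptions around the scheduler loop and always shut the executors down in a `finally` block, so no leftover workers keep the process alive:

```python
import logging

log = logging.getLogger("scheduler-sketch")


class DummyExecutor:
    def end(self) -> None:
        # Stand-in for an executor's end(): stop watcher threads/processes.
        log.info("executor shut down")


def run_scheduler_loop() -> None:
    # Stand-in for the scheduler loop; here it just fails like in the report.
    raise ConnectionError("server login has been failing (server_login_retry)")


def execute(executors: list[DummyExecutor]) -> None:
    try:
        run_scheduler_loop()
    except Exception:
        # Broad catch so DB/auth errors also reach the cleanup path below.
        log.exception("Exception when executing the scheduler loop")
        raise
    finally:
        # Always tear the executors down so the process can actually exit.
        for executor in executors:
            executor.end()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        execute([DummyExecutor()])
    except Exception:
        pass
```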

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@iw-pavan added the area:core, kind:bug, and needs-triage labels on Oct 28, 2024
@dosubot (bot) added the area:MetaDB and area:Scheduler labels on Oct 28, 2024
@potiuk added the good first issue label and removed the needs-triage label on Oct 28, 2024
@potiuk
Member

potiuk commented Oct 28, 2024

Feel free to propose a PR. Details can be discussed when you propose it.
