
Scheduler not terminating in case of repeated DB errors. #43440

Closed
2 tasks done
iw-pavan opened this issue Oct 28, 2024 · 1 comment · Fixed by #43645
Labels
area:core, area:MetaDB, area:Scheduler, good first issue, kind:bug

Comments

@iw-pavan
Contributor

iw-pavan commented Oct 28, 2024

Apache Airflow version

2.10.2

If "Other Airflow 2 version" selected, which one?

No response

What happened?

The scheduler was running and launching tasks normally.
Suddenly there was an authentication error on database operations.

psycopg2.OperationalError: connection to server at "<Host>" (<IP>), port 6432 failed: FATAL:  server login has been failing, try again later (server_login_retry)
connection to server at "<HOST>" (<IP>), port 6432 failed: FATAL:  server login has been failing, try again later (server_login_retry)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/airflow/jobs/scheduler_job_runner.py", line 984, in _execute
    self._run_scheduler_loop()

After a few retries it exited the scheduler loop, but the process was not terminated.
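
The following is a standalone sketch in plain Python (not Airflow code, and the names are stand-ins) of what this failure mode looks like: the main loop exits on an error, but a leftover non-daemon worker, like the executor's event watcher, keeps the interpreter alive, so the process never terminates.

```python
import threading
import time


def watcher() -> None:
    # Stands in for the executor's event watcher, which keeps restarting its
    # watch on timeouts and never checks whether the main loop has died.
    for _ in range(10):  # bounded here so the sketch eventually ends
        print("watch timed out, restarting watch")
        time.sleep(1)


threading.Thread(target=watcher, daemon=False).start()  # non-daemon on purpose

try:
    raise RuntimeError("simulated repeated DB error")  # scheduler loop "dies"
except RuntimeError:
    print("scheduler loop exited")

# The script only finishes once the watcher loop completes; with an unbounded
# watcher (as in the report), the process would never terminate.
```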

What you think should happen instead?

After shutting down all executors and the dag_processor, the process should exit.

How to reproduce

Use hybrid executors with Celery and Kubernetes.

Introduce DB errors.

Operating System

Mac/Linux

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else?

The logs below keep repeating, which indicates that some threads have not exited.

[2024-10-27T06:17:10.658+0000] {kubernetes_executor_utils.py:101} INFO - Kubernetes watch timed out waiting for events. Restarting watch.
[2024-10-27T06:17:11.658+0000] {kubernetes_executor_utils.py:140} INFO - Event: and now my watch begins starting at resource_version: 0
[2024-10-27T06:17:11.702+0000] {kubernetes_executor_utils.py:309} INFO - Event: 666aac59b268675b6b2590ff-bs-8ace-s4sjuxfo is Running, annotations: <omitted>
[2024-10-27T06:17:11.712+0000] {kubernetes_executor_utils.py:309} INFO - Event: 666aac59b268675b6b2590ff-bs-44fe-iwuzjfao is Running, annotations: <omitted>
[2024-10-27T06:17:41.715+0000] {kubernetes_executor_utils.py:101} INFO - Kubernetes watch timed out waiting for events. Restarting watch.

I see an old PR for a similar issue: #28685.
Should I change the catch block to catch all exceptions?
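
As a hypothetical sketch only (this is not the actual scheduler_job_runner.py code, and the names below are stand-ins), the shape of the change being asked about would be to catch all exceptions around the scheduler loop and always shut the executors down in a `finally` block, so no leftover workers keep the process alive:

```python
import logging

log = logging.getLogger("scheduler-sketch")


class DummyExecutor:
    def end(self) -> None:
        # Stand-in for an executor's end(): stop watcher threads/processes.
        log.info("executor shut down")


def run_scheduler_loop() -> None:
    # Stand-in for the scheduler loop; here it just fails like in the report.
    raise ConnectionError("server login has been failing (server_login_retry)")


def execute(executors: list[DummyExecutor]) -> None:
    try:
        run_scheduler_loop()
    except Exception:
        # Broad catch so DB/auth errors also reach the cleanup path below.
        log.exception("Exception when executing the scheduler loop")
        raise
    finally:
        # Always tear the executors down so the process can actually exit.
        for executor in executors:
            executor.end()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        execute([DummyExecutor()])
    except Exception:
        pass
```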

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@iw-pavan added the area:core, kind:bug, and needs-triage labels on Oct 28, 2024
@dosubot (bot) added the area:MetaDB and area:Scheduler labels on Oct 28, 2024
@potiuk added the good first issue label and removed the needs-triage label on Oct 28, 2024
@potiuk
Member

potiuk commented Oct 28, 2024

Feel free to propose a PR. Details can be discussed when you propose it.
