Airflow 2.7.1 can not start Scheduler & trigger #34816

Closed
2 tasks done
ThuanDoHung opened this issue Oct 7, 2023 · 21 comments · Fixed by #34931
Labels: affected_version:2.6, area:core, area:Triggerer, kind:bug

@ThuanDoHung

Apache Airflow version

2.7.1

What happened

After upgrading from 2.6.0 to 2.7.1 (I ran pip uninstall apache-airflow, cleared the airflow directory, and removed airflow.cfg), I cannot start the scheduler and triggerer in daemon mode.
When I start them from the command line without --daemon they start, but they are killed as soon as I log out of the console.
What I tried:
airflow scheduler or airflow triggerer: starts, but is killed when I log out of the console.
airflow scheduler --daemon && airflow triggerer --daemon: fails, the scheduler and triggerer do not start (this worked on 2.6.0). Starting the webserver and the Celery worker as daemons works fine.

Help me

What you think should happen instead

No response

How to reproduce

  1. Run Airflow 2.6.0 on Ubuntu Server 22.04.3 LTS (works fine)
  2. Install Airflow 2.7.1
  3. The triggerer and scheduler can no longer be started as daemons

Operating System

ubuntu server 22.04.3 LTS

Versions of Apache Airflow Providers

No response

Deployment

Virtualenv installation

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

ThuanDoHung added the area:core, kind:bug, and needs-triage labels Oct 7, 2023
@boring-cyborg

boring-cyborg bot commented Oct 7, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@jscheffl
Contributor

jscheffl commented Oct 8, 2023

Some questions regarding your report:

  • Have you tried a clean install of Airflow 2.7.1 as well?
  • Are you running in a Venv or in a system site package install?
  • Can you please attach logs showing the problem, especially the logs of the daemon processes?
  • Can you also describe how your service is usually started? If you kill the terminal where you started it, it is normal that child processes are terminated as well. If you run in daemon mode, they usually stay alive.

@ThuanDoHung
Author

  • I run in a venv.
  • If I start airflow triggerer, it runs and logs to the console, but Ctrl+C or closing the terminal kills the triggerer.
  • If I start airflow triggerer --daemon, I don't see any logs, so I don't know what happens.

@ThuanDoHung
Author

airflow triggerer --capacity 5 --log-file ./logs/triggerer/triggerer.log --pid ./triggerer.pid --daemon


  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[2023-10-10T00:03:58.394+0700] {triggerer_job_runner.py:171} INFO - Setting up TriggererHandlerWrapper with handler <FileTaskHandler (NOTSET)>
[2023-10-10T00:03:58.397+0700] {triggerer_job_runner.py:227} INFO - Setting up logging queue listener with handlers [<RedirectStdHandler (NOTSET)>, <TriggererHandlerWrapper (NOTSET)>]

--> The log appears on the console only; nothing is written to the log file.

@jscheffl
Contributor

jscheffl commented Oct 9, 2023

Sorry, I cannot completely understand your last comment.

  • Can you please repeat and also add the verbose option via --verbose to the command?
  • And can you also please post the logs of the scheduler as well?
  • Have you tried a clean install of Airflow 2.7.1 as well?

@ThuanDoHung
Author

ThuanDoHung commented Oct 11, 2023

Hi,
I tried the following:
Installed a new server (Ubuntu Server 22.04.3 LTS) and PostgreSQL 16.
Installed Airflow:

wget https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.10.txt
pip3 install "apache-airflow[celery,cgroups,dask,statsd,virtualenv,redis]==2.7.1" --constraint "/home/constraints-3.10.txt"

then:

  • configured AIRFLOW_HOME in ~/.profile
  • set up the database on Postgres and ran airflow db migrate
  • configured the database in airflow.cfg
  • ran:
    +/ airflow webserver --daemon --> OK, the web UI shows in the browser
    +/ airflow scheduler --daemon --> OK, the Cluster Activity tab shows the scheduler as healthy
    +/ airflow triggerer --daemon --> NOT OK, the Cluster Activity tab shows the scheduler as healthy but the TRIGGERER as UNHEALTHY
    I also tried:
    • airflow triggerer --> the Cluster Activity tab shows both the scheduler and the triggerer as healthy, but closing the terminal kills the triggerer.

airflow triggerer --daemon, it's NOT OK:
Screenshot 2023-10-11 at 12 25 50

Screenshot 2023-10-11 at 12 29 54

airflow triggerer (without --daemon), it's OK:
Screenshot 2023-10-11 at 12 26 35

Run with -v (same error as without -v), it's NOT OK:
Screenshot 2023-10-11 at 12 36 04
Screenshot 2023-10-11 at 12 36 18

@ThuanDoHung
Author

I think that when the triggerer runs in daemon mode, gunicorn does not start properly (I looked for the gunicorn startup lines in the log file). When I run it without --daemon, I see gunicorn starting on the console.

@calfzhou
Contributor

The same issue occurs with 2.7.2: airflow triggerer -D no longer works. In the error log file, I see that the triggerer was killed by a TERM signal (Handling signal: term) right after the workers booted.

[2023-10-13 12:15:23 +0800] [722477] [INFO] Starting gunicorn 21.2.0
[2023-10-13 12:15:23 +0800] [722477] [INFO] Listening at: http://[::]:8794 (722477)
[2023-10-13 12:15:23 +0800] [722477] [INFO] Using worker: sync
[2023-10-13 12:15:23 +0800] [722479] [INFO] Booting worker with pid: 722479
[2023-10-13 12:15:23 +0800] [722480] [INFO] Booting worker with pid: 722480
[2023-10-13 12:15:23 +0800] [722477] [INFO] Handling signal: term
[2023-10-13 12:15:23 +0800] [722479] [INFO] Worker exiting (pid: 722479)
[2023-10-13 12:15:23 +0800] [722480] [INFO] Worker exiting (pid: 722480)
[2023-10-13 12:15:23 +0800] [722477] [INFO] Shutting down: Master

@Bisk1
Contributor

Bisk1 commented Oct 13, 2023

I tested a couple of versions and can confirm it, the regression took place between 2.6.2 and 2.6.3 - in 2.6.3 (and later), after starting triggerer with -D option, airflow-triggerer.pid is not created, the process dies immediately.

@ThuanDoHung
Author

I tested a couple of versions and can confirm it, the regression took place between 2.6.2 and 2.6.3 - in 2.6.3 (and later), after starting triggerer with -D option, airflow-triggerer.pid is not created, the process dies immediately.

Is there a way to fix it now, or do we have to wait for a new Airflow version?

@Bisk1
Contributor

Bisk1 commented Oct 13, 2023

I think I isolated the code change that caused the problem - the problem starts to appear after this commit: d6cc9e4#diff-9cea7921268261a177e82c16fd5111f8d3252e3ca0267bdfb397c379c5d70857R353-R355

When I comment out these lines, the triggerer starts and the daemon continues to run successfully.

@ThuanDoHung for now I'm trying to triage the bug, don't know how to fix it yet

@Bisk1
Contributor

Bisk1 commented Oct 13, 2023

After debugging I think I understand the problem fully. Here is what happens when we run triggerer in daemon mode on current main.

  1. Triggerer's main thread creates triggerer_job_runner, which internally creates TriggerRunner (an async thread) but doesn't start it yet.
  2. Triggerer's main thread enters daemon context
    daemon_context = daemon.DaemonContext(
    which internally forks and exits in the parent process, so that only the forked child continues to run: https://pagure.io/python-daemon/blob/main/f/daemon/daemon.py#_683
  3. After the fork, only the calling thread stays alive, as POSIX requires. Python reflects this by setting _is_stopped = True on all other threads: https://github.com/python/cpython/blob/b2ab210aaefb3b0e39f28e7946b7a531d7b2ab17/Lib/threading.py#L1690 This affects the TriggerRunner thread too, which is marked as stopped even though it has not been started yet.
  4. TriggerRunner is started successfully
    self.trigger_runner.start()
    Apparently being in a stopped state doesn't prevent it from starting.
  5. When the code reaches the is_alive() check, it returns False (because is_alive() inspects the internal state, where _is_stopped = True), so the triggerer finishes execution early: d6cc9e4#diff-9cea7921268261a177e82c16fd5111f8d3252e3ca0267bdfb397c379c5d70857R353-R355

Before change #32092 this wasn't a problem, because a thread marked as stopped apparently still runs fine as long as nobody checks its state (which is inconsistent).
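To make steps 3 to 5 concrete, here is a minimal standalone sketch (not Airflow code). Assumptions: plain os.fork() stands in for entering daemon.DaemonContext, and probe_thread_across_fork and worker are hypothetical names. It creates a thread either before or after the fork, starts it in the child, and reports whether the thread body ran and what is_alive() returned in the meantime. On interpreters that behave as described above, the thread created before the fork should run while is_alive() reports False; other CPython versions may behave differently, so treat the printed values as something to check rather than a guarantee.

```python
import json
import os
import threading


def probe_thread_across_fork(create_before_fork: bool) -> dict:
    """Fork, start a helper thread in the child, and report what the child saw."""
    events = {}

    def worker() -> None:
        events["running"].set()            # tell the child's main thread we are running
        events["release"].wait(timeout=5)  # stay alive until released

    # Hypothetical stand-in for TriggerRunner: created here, started after the fork.
    thread = threading.Thread(target=worker) if create_before_fork else None

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(read_fd)
        result = {"ran": False, "is_alive_while_running": None, "error": None}
        try:
            events["running"] = threading.Event()
            events["release"] = threading.Event()
            if thread is None:
                thread = threading.Thread(target=worker)
            thread.start()
            result["ran"] = events["running"].wait(timeout=5)
            result["is_alive_while_running"] = thread.is_alive()
            events["release"].set()
            thread.join(timeout=5)
        except Exception as exc:  # report instead of crashing the child
            result["error"] = repr(exc)
        finally:
            os.write(write_fd, json.dumps(result).encode())
            os._exit(0)  # never fall through into the parent's code path

    # parent process: read the child's report and reap it
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        payload = reader.read()
    os.waitpid(pid, 0)
    return json.loads(payload)


if __name__ == "__main__":
    print("created before fork:", probe_thread_across_fork(True))
    print("created after fork: ", probe_thread_across_fork(False))
```

The second call is the control case: a thread that is both created and started in the child has no stale pre-fork state, so is_alive() can be trusted there.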

My proposal to fix it:

  1. Switch the order of operations when running the triggerer command in daemon mode so that the async thread is created after entering the daemon context, e.g. move the thread initialization https://github.com/apache/airflow/blob/e9987d50598f70d84cbb2a5d964e21020e81c080/airflow/cli/commands/triggerer_command.py#L61C5-L61C25 from line 61 to lines 80 and 86 (this worked when I tested it locally).
  2. Refactor all the commands that have daemon mode (scheduler, webserver, etc.) to reuse the same pattern.
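A rough sketch of the reordering in proposal 1, outside Airflow. Assumptions: os.fork() stands in for entering the daemon context, and Runner and run_in_child are hypothetical names, not Airflow APIs. The object that owns the helper thread is built only after the fork, so no thread object exists in the parent that could be marked as stopped.

```python
import os
import threading


class Runner:
    """Hypothetical stand-in for a job runner that owns a helper thread."""

    def __init__(self) -> None:
        self._stop_event = threading.Event()
        # The thread object is created here and started later in start().
        self.helper = threading.Thread(target=self._stop_event.wait, daemon=True)

    def start(self) -> bool:
        self.helper.start()
        return self.helper.is_alive()

    def stop(self) -> None:
        self._stop_event.set()
        self.helper.join(timeout=5)


def run_in_child(runner_factory) -> bool:
    """Fork first, then build and start the runner inside the child.

    Returns True if the child saw its helper thread as alive.
    """
    pid = os.fork()
    if pid == 0:  # child: no thread objects were created before the fork
        exit_code = 1
        try:
            runner = runner_factory()  # construction happens after the fork
            alive = runner.start()
            runner.stop()
            exit_code = 0 if alive else 1
        finally:
            os._exit(exit_code)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("helper thread alive in child:", run_in_child(Runner))
```

Passing a factory rather than a ready-made object is one way to make the ordering hard to get wrong: the caller cannot accidentally construct the runner before the fork.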

@jens-scheffler-bosch I'm willing to work on it

@jscheffl
Contributor

Wow. Cool analysis. Respect. This is not an easy catch! Thanks for investing the time to locate the root cause!

Yes, with this explanation it all makes sense. Happy to see a contribution!

jscheffl added the affected_version:2.6 and area:Triggerer labels and removed the pending-response and needs-triage labels Oct 13, 2023
jscheffl added this to the Airflow 2.7.3 milestone Oct 13, 2023
Bisk1 pushed a commit to Bisk1/airflow that referenced this issue Oct 13, 2023
Change the order of operations so that async child thread is created after forking when entering daemon context.

This makes sure that the thread stays alive in the internal loop.
@Bisk1
Contributor

Bisk1 commented Oct 13, 2023

Also, contrary to the original issue description, I don't see any problem with the scheduler in daemon mode. But maybe it was a communication issue.

@potiuk
Member

potiuk commented Oct 14, 2023

Very cool analysis indeed. It all makes perfect sense. Thanks for both - analysis and the fix!

potiuk pushed a commit that referenced this issue Oct 14, 2023
* Fixes #34816

Change the order of operations so that async child thread is created after forking when entering daemon context.

This makes sure that the thread stays alive in the internal loop.

---------

Co-authored-by: daniel.dylag <[email protected]>
ephraimbuddy pushed a commit that referenced this issue Oct 29, 2023
(cherry picked from commit 9c1e8c2)
ephraimbuddy pushed a commit that referenced this issue Oct 30, 2023
(cherry picked from commit 9c1e8c2)
@ThuanDoHung
Author

Hi, on 2.7.3 I see that the scheduler is fixed, but the triggerer is not. I cannot start the triggerer as a daemon; the triggerer log is the same as on 2.7.2.

@potiuk
Member

potiuk commented Nov 7, 2023

Hi, on 2.7.3 I see that the scheduler is fixed, but the triggerer is not. I cannot start the triggerer as a daemon; the triggerer log is the same as on 2.7.2.

Can you please open a new issue about it, with logs showing what's going on? We are not going to re-open this one as it is already part of the released fix. Thanks in advance.

@soidamientrung

Hi, on 2.7.3 I see that the scheduler is fixed, but the triggerer is not. I cannot start the triggerer as a daemon; the triggerer log is the same as on 2.7.2.

Do you have a solution for this error, @ThuanDoHung?

@mvoitko

mvoitko commented Sep 5, 2024

The issue is still reproducible in Airflow 2.9.2 with AWS MWAA

@potiuk
Member

potiuk commented Sep 5, 2024

The issue is still reproducible in Airflow 2.9.2 with AWS MWAA

And this comment is still valid. Absolutely nothing has changed since:

Can you please open a new issue about it, with logs showing what's going on? We are not going to re-open this one as it is already part of the released fix. Thanks in advance.

@potiuk
Member

potiuk commented Sep 5, 2024

So if you really want the problem solved, @mvoitko, rather than just complaining that you have (possibly) the same issue as one that has been closed, I strongly recommend following the advice above.

7 participants