Use proper signal handling and cascade signals to children (Fix #852) #855

Merged
merged 1 commit into apache:master from bolkedebruin:ISSUE-852
Apr 6, 2016

Conversation

bolkedebruin
Contributor

This fixes #852 by starting gunicorn with a pid file and exiting immediately afterwards. This makes airflow's webserver behave more like a standard daemon. To retain the old behavior you can specify "-f" or "--foreground".

  • The pid file is placed in AIRFLOW_HOME/airflow-webserver.pid, or in /run/airflow/webserver.pid if you are running systemd. To store it somewhere else, pass "--pid" as an argument.
  • The systemd profile for the webserver is now a bit more secure.
  • If you are running systemd, make sure to follow the instructions; otherwise you might run into write-permission issues when storing the pid file.

@bolkedebruin
Contributor Author

@mistercrunch any thoughts?

@mistercrunch
Member

I'm trying to get someone from our DI team to weigh in.

@t1m0thy
Contributor

t1m0thy commented Jan 29, 2016

I agree that this PR makes airflow match common webservers that daemonize by default. However, it's worth noting that other commands such as 'airflow scheduler' and 'airflow flower' don't behave this way. I think it makes sense to handle all the sub-processes the same way, whichever way it goes. And in all cases, a switch to do the opposite (keep in foreground or daemonize) would be good.

@bolkedebruin
Contributor Author

The switch is already in place. I'm working on the other daemons.

@t1m0thy
Contributor

t1m0thy commented Jan 29, 2016

Cool, sounds good. I also made a note in a related issue that it would be awesome if the child process cleanup strategy (is it only gunicorn that spawns children?) worked universally rather than requiring systemd, eg. I use supervisor.

This is just another user's opinion, BTW. It sounds like @mistercrunch is looking for some other opinions on this from inside Airbnb. You might want to wait to hear from them before investing much time.

@bolkedebruin
Contributor Author

It is not systemd-bound; it can easily integrate with init.d scripts. I don't have any experience with Supervisor, but I will make the processes handle signals correctly and cascade them to their children, so Supervisor should be fine since this is quite a standard way of interfacing.

@t1m0thy
Contributor

t1m0thy commented Jan 29, 2016

Yay signals! That sounds awesome.

I was looking at the specific kill calls from systemd and thinking that other process managers would need to add similar explicit clean ups. But perhaps those are backup measures? Anyhow, don't mind me. I need to wrap my head around it better and just run the code to see. :)

I can test the PR with supervisor whenever you would like.

@plypaul
Contributor

plypaul commented Feb 4, 2016

I'm thinking that it might be relatively simple to handle this issue using the signal handler approach?

When the gunicorn process is launched, we record the PID in a variable and configure a signal handler using signal.signal to fire when airflow gets a SIGTERM. The signal handler calls os.kill with the PID that we saved earlier.
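A minimal sketch of that suggestion (the sleeping child below is a stand-in for the gunicorn process, not Airflow's real invocation):

```python
import os
import signal
import subprocess
import sys

# Stand-in for launching gunicorn: record the child's PID at launch time.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def forward_sigterm(signum, frame):
    # Relay SIGTERM to the recorded PID, reap the child, then exit.
    os.kill(child.pid, signal.SIGTERM)
    child.wait()
    sys.exit(0)

# Fire the handler when this (parent) process receives SIGTERM.
signal.signal(signal.SIGTERM, forward_sigterm)
```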

@plypaul
Contributor

plypaul commented Feb 4, 2016

Or would setting ExecStop to pkill -TERM -P $MAINPID have the same effect?
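For reference, that alternative would look roughly like this in a unit file (the paths and unit layout here are illustrative, not the ones shipped with Airflow):

```ini
[Service]
# Hypothetical airflow-webserver unit: systemd tracks the master via the
# pid file, and ExecStop sends TERM to $MAINPID's children on stop.
PIDFile=/run/airflow/webserver.pid
ExecStart=/usr/bin/airflow webserver --pid /run/airflow/webserver.pid
ExecStop=/usr/bin/pkill -TERM -P $MAINPID
```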

@bolkedebruin
Contributor Author

@plypaul the current state of this patch stores the PID in a file for the web server (gunicorn), as many others also do. This allows gunicorn's own signal handler to pick up signals and reduces complexity in airflow. I.e., this patch already solves the issue for the web server. I'm working on the other parts, which will require some adjustments or the addition of a signal handler. Hopefully I'll have a full patch available at the end of the week.

@plypaul
Contributor

plypaul commented Feb 4, 2016

Well, it's a little more complicated for the user to manage PID files, so I was looking to see if there were a simpler solution. With the signal handler approach, gunicorn will still get the same signals.

@bolkedebruin
Contributor Author

@plypaul if you start "airflow webserver" you will see that airflow itself will actually exit right away. gunicorn is the main process. So there is no reason to complicate matters by adding another wrapper handling signals.

@bolkedebruin
Contributor Author

@t1m0thy @plypaul @mistercrunch please check and let me know what you think.

@bolkedebruin
Contributor Author

@mistercrunch any feedback?

@mistercrunch
Member

@aoen thoughts? Can we test that as soon as we get our staging cluster up and running?

@aoen
Contributor

aoen commented Feb 16, 2016

I don't really have a strong preference either way, but pid files do have the pitfall of becoming stale, so your context manager approach sounded sane to me.

@bolkedebruin
Contributor Author

My implementation (with a context manager) uses TimedPIDLockFile to negate the issues of stale lock files: it checks whether the PID recorded in the lock file actually exists.

The older behavior is still available with "-f" or "--foreground", and the issues with LocalExecutors not dying have been solved here as well (multiprocessing daemon=True).

I would definitely test it. Please note that the systemd files still need to be updated for these changes.
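The stale-lock check boils down to asking the kernel whether the recorded PID still refers to a live process. A simplified, stdlib-only sketch of that idea (illustrative only — the actual patch relies on TimedPIDLockFile, which also handles locking and timeouts):

```python
import os

def pid_file_is_stale(path):
    """Return True when the PID recorded in `path` no longer refers to a
    live process. Hypothetical helper, not the TimedPIDLockFile API."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return True  # missing or unreadable pid file counts as stale
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without killing
    except ProcessLookupError:
        return True      # no process with that PID: the lock is stale
    except PermissionError:
        return False     # process exists but is owned by another user
    return False
```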

@bolkedebruin bolkedebruin changed the title Fix Issue 852 Use proper signal handling and cascade signals to children (Fix #852) Mar 30, 2016
@landscape-bot

Code Health
Repository health decreased by 0.27% when pulling 361ac79 on bolkedebruin:ISSUE-852 into ec2e02a on airbnb:master.

@bolkedebruin
Contributor Author

@aoen Have you been able to test this? @criccomini

(I'll clean up later)

@landscape-bot

Code Health
Repository health decreased by 0.28% when pulling 76e1b40 on bolkedebruin:ISSUE-852 into 3813d51 on airbnb:master.

@coveralls

Coverage Status

Coverage decreased (-0.8%) to 66.347% when pulling 361ac79 on bolkedebruin:ISSUE-852 into ec2e02a on airbnb:master.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 66.365% when pulling 76e1b40 on bolkedebruin:ISSUE-852 into 3813d51 on airbnb:master.

@aoen
Contributor

aoen commented Apr 1, 2016

Our staging cluster has been completed, so I'm going to get this testing done soon. Thanks for the poke.

@bolkedebruin bolkedebruin added this to the Airflow 2.0 milestone Apr 5, 2016
@criccomini
Contributor

+1 on my end. Ran it and I don't see zombies anymore (#1157)

@landscape-bot

Code Health
Repository health decreased by 0.27% when pulling 5443840 on bolkedebruin:ISSUE-852 into d49e654 on airbnb:master.

@landscape-bot

Code Health
Repository health decreased by 0.27% when pulling 1991c58 on bolkedebruin:ISSUE-852 into d49e654 on airbnb:master.

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 67.173% when pulling 1991c58bea53a5c349df2044fb853b2895d7cf13 on bolkedebruin:ISSUE-852 into d49e654 on airbnb:master.

@coveralls

Coverage Status

Coverage decreased (-0.6%) to 67.122% when pulling 1991c58bea53a5c349df2044fb853b2895d7cf13 on bolkedebruin:ISSUE-852 into d49e654 on airbnb:master.

@landscape-bot

Code Health
Repository health decreased by 0.29% when pulling 4ec3fd9 on bolkedebruin:ISSUE-852 into 8afee07 on airbnb:master.

@coveralls

Coverage Status

Coverage increased (+1.6%) to 69.289% when pulling 4ec3fd954273948140debbe22c807ffdb6319dcc on bolkedebruin:ISSUE-852 into 8afee07 on airbnb:master.

Airflow spawns children in the form of a webserver, a scheduler, and executors.
If the parent is terminated (SIGTERM) it needs to properly propagate the
signals to the children, otherwise they will be orphaned and end up as
zombie processes. This patch resolves that issue.

In addition, Airflow did not store the PID of its services, so they could
not be managed by traditional unix service managers like rc.d / upstart /
systemd and the like. This patch adds the "--pid" flag. By default it
stores the PID in ~/airflow/airflow-<service>.pid

Lastly, the patch adds support for different log file locations: log,
stdout, and stderr (respectively: --log-file, --stdout, --stderr). By
default these are stored in ~/airflow/airflow-<service>.log/out/err.

* Resolves ISSUE-852
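The cascading described in the commit message can be illustrated with a small sketch (the sleeping children below are stand-ins for the webserver/scheduler/executor processes; this is not the actual patch code):

```python
import os
import signal
import subprocess
import sys

# Stand-ins for Airflow's child services.
children = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]

def cascade(signum, frame):
    # Propagate the received signal to every child, then reap them so
    # none linger as zombies, and finally exit the parent.
    for child in children:
        os.kill(child.pid, signum)
    for child in children:
        child.wait()
    sys.exit(0)

# Terminating the parent (SIGTERM/SIGINT) now terminates the children too.
for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, cascade)
```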
@bolkedebruin bolkedebruin merged commit 4865ee6 into apache:master Apr 6, 2016
@landscape-bot

Code Health
Repository health decreased by 0.08% when pulling e8c1144 on bolkedebruin:ISSUE-852 into 81ff5cc on airbnb:master.

@artwr
Contributor

artwr commented Apr 6, 2016

Exciting. Thanks @bolkedebruin.

@bolkedebruin
Contributor Author

@artwr let's see how exciting it gets ;-)
