Use proper signal handling and cascade signals to children (Fix #852) #855
Conversation
@mistercrunch any thoughts?
I'm trying to get someone from our DI team to weigh in.
I agree that this PR makes airflow match common webservers that daemonize by default. However, it's worth noting that other commands such as 'airflow scheduler' and 'airflow flower' don't behave this way. I think it makes sense to handle all the sub-processes the same way, whichever way it goes. And in all cases, a switch to do the opposite (keep in foreground or daemonize) would be good.
The switch is already in place. I'm working on the other daemons.
|
Cool, sounds good. I also made a note in a related issue that it would be awesome if the child process cleanup strategy (is it only gunicorn that spawns children?) worked universally rather than requiring systemd; e.g., I use supervisor. This is just another user's opinion, BTW. It sounds like @mistercrunch is looking for some other opinions on this from inside Airbnb. You might want to wait to hear from them before investing much time.
|
It is not systemd bound; it can easily integrate with init.d scripts. I don't have any experience with Supervisor, but I will make the processes handle signals correctly and cascade them to their children, so Supervisor should be fine, as this is quite a standard way of interfacing.
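The cascade-to-children behavior described here can be sketched with the Python standard library alone; the handler and the `children` list below are illustrative, not Airflow's actual code:

```python
import signal
import subprocess
import sys

# Illustrative sketch of signal cascading: the parent tracks the children
# it spawned and forwards any SIGTERM/SIGINT it receives before exiting,
# so no child is orphaned and left as a zombie.
children = []

def cascade_signal(signum, frame):
    for child in children:
        if child.poll() is None:       # child is still running
            child.send_signal(signum)  # forward the same signal
            child.wait()               # reap it so no zombie remains
    sys.exit(0)

signal.signal(signal.SIGTERM, cascade_signal)
signal.signal(signal.SIGINT, cascade_signal)
```

With this pattern a process manager (systemd, init.d, or supervisor) only needs to signal the parent, which is why it interoperates with all of them.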
Yay signals! That sounds awesome. I was looking at the specific kill calls from systemd and thinking that other process managers would need to add similar explicit clean ups. But perhaps those are backup measures? Anyhow, don't mind me. I need to wrap my head around it better and just run the code to see. :) I can test the PR with supervisor whenever you would like.
|
I'm thinking that it might be relatively simple to handle this issue using the signal handler approach? When the |
Or would setting |
@plypaul the current state of this patch stores the PID in a file for the web server (gunicorn), as many others also do. This allows gunicorn's own signal handler to pick up signals and reduces complexity in airflow. I.e., this patch already solves the issue for the web server. I'm working on the other parts, which will require some adjustments or the addition of a signal handler. Hopefully I have a full patch available at the end of the week.
Well, it's a little more complicated for the user to manage PID files, so I was looking to see if there was a simpler solution. With the signal handler approach, gunicorn will still get the same signals.
@plypaul if you start "airflow webserver" you will see that airflow itself will actually exit right away. gunicorn is the main process. So there is no reason to complicate matters by adding another wrapper handling signals.
@t1m0thy @plypaul @mistercrunch please check and let me know what you think.
@mistercrunch any feedback?
@aoen thoughts? Can we test that as soon as we get our staging cluster up and running?
I don't really have a strong preference either way, but pid files do have the pitfall of becoming stale, so your context manager approach sounded sane to me.
The implementation I did (with a ContextManager) uses TimedPIDLockFile to negate the issues of stale lock files: it checks whether the PID recorded in the lock file actually belongs to a running process. The older behavior is still available with "-f" or "--foreground", but the issues with LocalExecutors not dying have been solved here as well (multiprocess daemon=True). I would definitely test it. Please note that the systemd files still need to be updated for these changes.
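The stale-lock check mentioned here, verifying that the PID recorded in the lock file still belongs to a live process, can be illustrated in plain Python. This is a sketch of the idea only, not the TimedPIDLockFile implementation itself:

```python
import errno
import os

def pid_file_is_stale(pid_file):
    """Return True if pid_file is absent, unreadable, or names a dead PID.

    Illustrative helper, not Airflow's actual code.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return True  # no lock file, or garbage inside it
    try:
        os.kill(pid, 0)  # signal 0: existence check, nothing is delivered
    except OSError as e:
        if e.errno == errno.ESRCH:
            return True  # no process with that PID: the lock is stale
    return False
```

A context manager built on a check like this can safely break a lock left behind by a crashed process instead of refusing to start.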
|
@aoen Have you been able to test this? @criccomini (I'll clean up later) |
|
|
Our staging cluster has been completed, so I'm going to get this testing done soon. Thanks for the poke.
+1 on my end. Ran it and I don't see zombies anymore (#1157) |
|
|
Coverage decreased (-0.5%) to 67.173% when pulling 1991c58bea53a5c349df2044fb853b2895d7cf13 on bolkedebruin:ISSUE-852 into d49e654 on airbnb:master. |
Coverage decreased (-0.6%) to 67.122% when pulling 1991c58bea53a5c349df2044fb853b2895d7cf13 on bolkedebruin:ISSUE-852 into d49e654 on airbnb:master. |
|
Coverage increased (+1.6%) to 69.289% when pulling 4ec3fd954273948140debbe22c807ffdb6319dcc on bolkedebruin:ISSUE-852 into 8afee07 on airbnb:master. |
Airflow spawns children in the form of a webserver, scheduler, and executors. If the parent gets terminated (SIGTERM), it needs to properly propagate the signals to the children; otherwise these will get orphaned and end up as zombie processes. This patch resolves that issue.

In addition, Airflow does not store the PID of its services, so they cannot be managed by traditional unix system services like rc.d / upstart / systemd and the like. This patch adds the "--pid" flag. By default it stores the PID in ~/airflow/airflow-<service>.pid.

Lastly, the patch adds support for different log file locations: log, stdout, and stderr (respectively: --log-file, --stdout, --stderr). By default these are stored in ~/airflow/airflow-<service>.log/.out/.err.

* Resolves ISSUE-852
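As a rough illustration of the per-service defaults described above (~/airflow/airflow-&lt;service&gt;.pid/.log/.out/.err), a helper like the following could compute the file locations and record the PID. The names `service_file_paths` and `write_pid_file` are hypothetical, not from the patch, which wires this up via the CLI flags instead:

```python
import os

# Sketch of the per-service file layout described in the commit message.
def service_file_paths(service, home=os.path.expanduser("~/airflow")):
    base = os.path.join(home, "airflow-%s" % service)
    return {ext: base + "." + ext for ext in ("pid", "log", "out", "err")}

def write_pid_file(path):
    # Record the current process id so init systems can signal the service.
    with open(path, "w") as f:
        f.write(str(os.getpid()))
```

An init system can then stop the service by reading the .pid file and sending SIGTERM to that process.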
|
Exciting. Thanks @bolkedebruin. |
@artwr let's see how exciting it gets ;-) |
This fixes ISSUE-852 by starting gunicorn with a pid file and having airflow exit immediately after that. This makes airflow's webserver behave in a more standard fashion. To retain the old behavior you can specify "-f" or "--foreground".