-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: handle sigterm to allow for graceful shutdown #337
base: main
Are you sure you want to change the base?
Conversation
@@ -22,8 +17,7 @@ def main(): | |||
return | |||
|
|||
try: | |||
doers = args.handler(args) | |||
directing.runController(doers=doers, expire=0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this call because the handlers themselves to not return any doers
, they are running them themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
keria/src/keria/app/cli/commands/start.py
Line 114 in 4981d84
directing.runController(doers=agency, expire=0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur, this makes sense.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #337 +/- ##
==========================================
+ Coverage 93.06% 93.86% +0.79%
==========================================
Files 36 36
Lines 7121 8147 +1026
==========================================
+ Hits 6627 7647 +1020
- Misses 494 500 +6 ☔ View full report in Codecov by Sentry. |
@@ -111,8 +112,15 @@ def launch(args): | |||
bootPassword=args.bootPassword, | |||
bootUsername=args.bootUsername) | |||
|
|||
directing.runController(doers=agency, expire=0.0) | |||
tock = 0.03125 | |||
doist = doing.Doist(limit=0.0, tock=tock, real=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The directing.runController
did not return the doist
, so there was no way to access it to be able to call exit
. However, it was only two lines of code, so I thought we could just inline it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have runController
either return the Doist
or have a parameter to pass a shutdown signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Converting to draft because it does not shutdown the agents, only the agency. So it still needs some adjustments. Will pick up again soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -111,8 +112,15 @@ def launch(args): | |||
bootPassword=args.bootPassword, | |||
bootUsername=args.bootUsername) | |||
|
|||
directing.runController(doers=agency, expire=0.0) | |||
tock = 0.03125 | |||
doist = doing.Doist(limit=0.0, tock=tock, real=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have runController
either return the Doist
or have a parameter to pass a shutdown signal.
@@ -22,8 +17,7 @@ def main(): | |||
return | |||
|
|||
try: | |||
doers = args.handler(args) | |||
directing.runController(doers=doers, expire=0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur, this makes sense.
I just tried this with the vLEI#85 and it worked well. Yes, we need to send the shutdown signal to the agents, though that should be as simple as making a way to call |
I added an initial implementation here that calls |
Looks like the manual class Doist(tyming.Tymist):
def exit(self, deeds=None):
...
if deeds is None:
deeds = self.deeds
while(deeds): # .close each remaining dog in deeds in reverse order
dog, retime, doer = deeds.pop() # pop it off in reverse (right side)
if not dog: # marker deed
continue # skip marker
try:
tock = dog.close() # force GeneratorExit. Maybe log exit tock tyme <---- THIS IS THE CLOSE
except StopIteration:
pass # Hmm? Not supposed to happen!
else: # set done state
try:
doer.done = False # forced close
except AttributeError: # when using bound method for generator function
doer.__func__.done = False # forced close |
And, really, since this is fully cooperative multitasking I believe calling |
For and Doer, any resources that it opens, should be opened inside the Doer's "enter" method and then closed inside the Doer's "exit" method. The Doist, when shutting down i.e exiting will call "exit" on each of its Doers and DoDoers, and DoDoers will call exit on each of their Doers and DoDoers all that way down. The Doist main loop has a try-finally that calls the Doist's exit method. Any break in the main loop will trigger the finally clause, thereby calling exit. This happens on a SigInt (KeyboardInterrupt or Cntl-C, a SystemExit, or any other exception. Some system exits like SigTerm will terminate the process prematurely before the exit method has time to complete. But SigInt will not and should exit cleanly. Therefore In your dev ops process manager you should terminate Hio using SigInt not SigTerm. Then you are reasonably assured that all your resources that are closed in Doer. exit() methods will be closed before the process running the Doist terminates. |
@SmithSamuelM thank you for clarifying the default behavior of the Doist. That makes sense to me now. We need to move some of the startup logic for both the Agency and the individual Agent instances to the enter and close lifecycle functions in order to leverage the lifecycle context management of HIO. Then graceful shutdown will be as simple as calling Doist.exit(). I will go make that change. I do support the addition of the SIGTERM handler since the default behavior of both Docker and Kubernetes is to send a SIGTERM, pause for a period (10s and 30s, respectively), and then send a SIGKILL if the process is still alive. SystemD and other process managers that manage daemon-like processes also use SIGTERM so it seems an appropriate choice to trigger the doist.exit() call like @lenkan has written. @SmithSamuelM would it make sense to add a SIGTERM signal handler directly to the HIO library or would you prefer the HIO library only handle SIGINT and leave it up to application libraries to decide if they want to handle SIGTERM? I imagine the broad adoption of SIGTERM by DevOps tools like Docker and K8s may make handling SIGTERM in HIO attractive. Let me know and I'll open a PR there if you would like. |
Are you not mixing up SIGTERM with, for example SIGKILL? I would say that the SIGTERM signal is specifically meant to ask a process to terminate gracefully. Upon receiving it, the process will decide itself when it is time to exit. https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
Agree. I also vote for implementing graceful shutdown of keria using SIGTERM. I also believe this is the expected behaviour to the vast majority of users. I am not sure I would expect hio to do it for me though. I would rather explicitly tell hio to stop at the keria level, by calling Doist.exit, like you suggested. |
Extensive testing history indicates that SigInt works because unlike SigTerm, the python interpreter at a low lever catches SigInt and propogates it through the exception handling hierarchy. While there are corner cases where this may be problematic, the Doist loop is avoiding most of those. To explicitly handle SegTerm requires using the signals library which is then being run by the interpreter instead of being in the interpreter. This introduces other corner cases. Often the solution to those corner cases looks like, running two processes or threads, one thread Phil ran into problems with SigTerm when using supervisord which by default was shutting down with sigterm. When he switched to sigint then the problems went away. It may very well be true that the problem of doist.exit() not running to completion is the SigKill not the SigTerm and simply calling registring signals for SigTerm on the Doist main loop may be sufficient without worrying about the corner cases. But it was simply enough for Phil to reconfigure SuperviserD to use SigInt to kill the Doist rather than solve the signals corner cases. |
In the use case where you are terminating KERI is it possible to use SigInt instead of SigTerm? |
See #175 for rationale.
This PR implements handling of the SIGTERM, which is sent by runtimes such as docker to the process to allow it to shut down gracefully. I do not know
hio
, but theDoist
class exposes anexit
function that seems to do the trick.Before this change, KERIA would exit with a non-zero exit when receiving a
SIGTERM
signal.Closes #175