Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: handle sigterm to allow for graceful shutdown #337

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lenkan
Copy link
Collaborator

@lenkan lenkan commented Dec 11, 2024

See #175 for rationale.

This PR implements handling of the SIGTERM, which is sent by runtimes such as docker to the process to allow it to shut down gracefully. I do not know hio, but the Doist class exposes an exit function that seems to do the trick.

Before this change, KERIA would exit with a non-zero exit when receiving a SIGTERM signal.

Closes #175

@lenkan lenkan requested a review from kentbull December 11, 2024 13:36
@@ -22,8 +17,7 @@ def main():
return

try:
doers = args.handler(args)
directing.runController(doers=doers, expire=0.0)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this call because the handlers themselves to not return any doers, they are running them themselves.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

directing.runController(doers=agency, expire=0.0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur, this makes sense.

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.86%. Comparing base (18d3ad7) to head (4d68393).
Report is 41 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
+ Coverage   93.06%   93.86%   +0.79%     
==========================================
  Files          36       36              
  Lines        7121     8147    +1026     
==========================================
+ Hits         6627     7647    +1020     
- Misses        494      500       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -111,8 +112,15 @@ def launch(args):
bootPassword=args.bootPassword,
bootUsername=args.bootUsername)

directing.runController(doers=agency, expire=0.0)
tock = 0.03125
doist = doing.Doist(limit=0.0, tock=tock, real=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directing.runController did not return the doist, so there was no way to access it to be able to call exit. However, it was only two lines of code, so I thought we could just inline it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have runController either return the Doist or have a parameter to pass a shutdown signal.

Copy link
Collaborator

@2byrds 2byrds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@lenkan lenkan marked this pull request as draft December 11, 2024 14:13
@lenkan
Copy link
Collaborator Author

lenkan commented Dec 11, 2024

Converting to draft because it does not shutdown the agents, only the agency. So it still needs some adjustments. Will pick up again soon.

Copy link
Contributor

@kentbull kentbull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -111,8 +112,15 @@ def launch(args):
bootPassword=args.bootPassword,
bootUsername=args.bootUsername)

directing.runController(doers=agency, expire=0.0)
tock = 0.03125
doist = doing.Doist(limit=0.0, tock=tock, real=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. This is good enough for now. We could consider making a PR to change KERIpy to have runController either return the Doist or have a parameter to pass a shutdown signal.

@@ -22,8 +17,7 @@ def main():
return

try:
doers = args.handler(args)
directing.runController(doers=doers, expire=0.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur, this makes sense.

@kentbull
Copy link
Contributor

I just tried this with the vLEI#85 and it worked well.

Yes, we need to send the shutdown signal to the agents, though that should be as simple as making a way to call .exit() on their Doer or DoDoer instances.

@kentbull
Copy link
Contributor

I added an initial implementation here that calls Agency.shut on each agent prior to shutting down the main Doist loop. I added a doer.close() to instruct each of the Doers and DoDoers to end their generator loop. If we were to add anything to the shutdown logic to round this out it would be some sort of check on the list of doers that their "done" attribute is "True" which would mean looping in the Agency.shut function while checking that the doers have completed.

@kentbull
Copy link
Contributor

kentbull commented Dec 22, 2024

Looks like the manual doer.close() is not necessary because the Doist does this in the Doist.exit() function in reverse-nested order, so I removed that code. For reference, here is the code from the Doist class

class Doist(tyming.Tymist):
    def exit(self, deeds=None):
        ...
        if deeds is None:
            deeds = self.deeds

        while(deeds):  # .close each remaining dog in deeds in reverse order
            dog, retime, doer = deeds.pop()  # pop it off in reverse (right side)
            if not dog:  # marker deed
                continue  # skip marker
            try:
                tock = dog.close()  # force GeneratorExit. Maybe log exit tock tyme  <---- THIS IS THE CLOSE
            except StopIteration:
                pass  # Hmm? Not supposed to happen!
            else:  # set done state
                try:
                    doer.done = False  # forced close
                except AttributeError:  # when using bound method for generator function
                    doer.__func__.done = False  # forced close

@kentbull
Copy link
Contributor

And, really, since this is fully cooperative multitasking I believe calling Doist.exit() is sufficient to close all of the agents as well. So calling agent.close() per agent should not be necessary because the Doist scheduler should do all of that on its own. I will confirm this with Sam and if true will remove that code from my PR to Daniel L's PR.

@SmithSamuelM
Copy link
Contributor

SmithSamuelM commented Dec 22, 2024

For and Doer, any resources that it opens, should be opened inside the Doer's "enter" method and then closed inside the Doer's "exit" method. The Doist, when shutting down i.e exiting will call "exit" on each of its Doers and DoDoers, and DoDoers will call exit on each of their Doers and DoDoers all that way down.

The Doist main loop has a try-finally that calls the Doist's exit method. Any break in the main loop will trigger the finally clause, thereby calling exit. This happens on a SigInt (KeyboardInterrupt or Cntl-C, a SystemExit, or any other exception. Some system exits like SigTerm will terminate the process prematurely before the exit method has time to complete. But SigInt will not and should exit cleanly. Therefore In your dev ops process manager you should terminate Hio using SigInt not SigTerm. Then you are reasonably assured that all your resources that are closed in Doer. exit() methods will be closed before the process running the Doist terminates.

@kentbull
Copy link
Contributor

kentbull commented Dec 23, 2024

@SmithSamuelM thank you for clarifying the default behavior of the Doist. That makes sense to me now. We need to move some of the startup logic for both the Agency and the individual Agent instances to the enter and close lifecycle functions in order to leverage the lifecycle context management of HIO. Then graceful shutdown will be as simple as calling Doist.exit().

I will go make that change.

I do support the addition of the SIGTERM handler since the default behavior of both Docker and Kubernetes is to send a SIGTERM, pause for a period (10s and 30s, respectively), and then send a SIGKILL if the process is still alive. SystemD and other process managers that manage daemon-like processes also use SIGTERM so it seems an appropriate choice to trigger the doist.exit() call like @lenkan has written.

@SmithSamuelM would it make sense to add a SIGTERM signal handler directly to the HIO library or would you prefer the HIO library only handle SIGINT and leave it up to application libraries to decide if they want to handle SIGTERM?

I imagine the broad adoption of SIGTERM by DevOps tools like Docker and K8s may make handling SIGTERM in HIO attractive. Let me know and I'll open a PR there if you would like.

@lenkan
Copy link
Collaborator Author

lenkan commented Dec 23, 2024

Some system exits like SigTerm will terminate the process prematurely before the exit method has time to complete.

Are you not mixing up SIGTERM with, for example SIGKILL? I would say that the SIGTERM signal is specifically meant to ask a process to terminate gracefully. Upon receiving it, the process will decide itself when it is time to exit. https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

I do support the addition of the SIGTERM handler

Agree. I also vote for implementing graceful shutdown of keria using SIGTERM. I also believe this is the expected behaviour to the vast majority of users. I am not sure I would expect hio to do it for me though. I would rather explicitly tell hio to stop at the keria level, by calling Doist.exit, like you suggested.

@SmithSamuelM
Copy link
Contributor

Extensive testing history indicates that SigInt works because unlike SigTerm, the python interpreter at a low lever catches SigInt and propogates it through the exception handling hierarchy. While there are corner cases where this may be problematic, the Doist loop is avoiding most of those. To explicitly handle SegTerm requires using the signals library which is then being run by the interpreter instead of being in the interpreter. This introduces other corner cases. Often the solution to those corner cases looks like, running two processes or threads, one thread
/process that handles the signals and the other thread/process that runs the code.

Phil ran into problems with SigTerm when using supervisord which by default was shutting down with sigterm. When he switched to sigint then the problems went away.

It may very well be true that the problem of doist.exit() not running to completion is the SigKill not the SigTerm and simply calling registring signals for SigTerm on the Doist main loop may be sufficient without worrying about the corner cases. But it was simply enough for Phil to reconfigure SuperviserD to use SigInt to kill the Doist rather than solve the signals corner cases.

@SmithSamuelM
Copy link
Contributor

In the use case where you are terminating KERI is it possible to use SigInt instead of SigTerm?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to do graceful shutdown of KERIA
4 participants