[WIP] Subdaemon heartbeat #2573
I think the approach is good. The main downside is that child_liveness() currently only attempts an IPC connection, whereas at some point we will likely want to send some sort of ping op as well, ideally using the pcmk_ipc_api_t model. But maybe that should wait until that model is implemented for all the daemons.
The build commit would be fine for 2.1.2
next_child = 0;
}

return true;
G_SOURCE_CONTINUE
@@ -82,6 +82,7 @@ static char *opts_vgrind[] = { NULL, NULL, NULL, NULL, NULL };

crm_trigger_t *shutdown_trigger = NULL;
crm_trigger_t *startup_trigger = NULL;
time_t subdaemon_check_progress;
I like to explicitly initialize globals for readability
Something inside me had refused to initialize it to 0 ;-)
But lacking a nice invalid-value initializer (at least to my knowledge), I'll go that route.
That was the idea: wait until the ping is implemented throughout the subdaemons, so that we don't have to differentiate too much between the daemons here (maybe even try to misuse something that already exists). If we consider keeping a persistent connection between pacemakerd and the subdaemons, using a ping would of course
2.1 PR created
3f4fc68 to 726832a
Do you want to drop the build commit here? 2.1.2 will be pulled back into master, hopefully in less than a week.
superseded by #2588
This sketches the basic idea of tracking subdaemons via IPC to detect their basic liveness.
The idea is to keep actively tracking subdaemons even when they are children of pacemakerd, where
until now we relied on the signals sent to the parent when a subdaemon dies.
Those signals only tell us that a subdaemon died, not that it is stalled by whatever might block its mainloop.
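To illustrate the gap described above, here is a minimal POSIX sketch (not taken from the patch). It forks a child that blocks forever as a stand-in for a subdaemon with a blocked mainloop, and shows that the parent's exit notification path reports nothing for it. The helper names `spawn_stalled_child()` and `parent_sees_exit()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that never makes progress.
 * It stands in for a subdaemon whose mainloop is blocked. */
static pid_t spawn_stalled_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;) {
            pause();    /* child: block until a signal arrives, then block again */
        }
    }
    return pid;         /* parent: child pid, or -1 if fork() failed */
}

/* Hypothetical helper: true only if the child has already exited,
 * which is all that SIGCHLD/waitpid() can tell the parent. */
static bool parent_sees_exit(pid_t pid)
{
    int status = 0;

    return waitpid(pid, &status, WNOHANG) == pid;
}
```

For the stalled child, `parent_sees_exit()` keeps returning false until the process is actually killed, which is why an active probe such as an IPC connection attempt is needed on top of the signal-based tracking.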
pacemakerd updates the timestamp sent to sbd on every successful check of a subdaemon,
regardless of whether the subdaemon was running happily or was successfully recovered.
The periodically started tracking loop is broken up into a single check per timer expiry, in
the hope that this keeps pacemakerd responsive enough that liveness detection
from sbd won't trigger when everything is running fine, even at the default wd-timeout of 5s.
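The one-check-per-expiry idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the patch: `child_t`, `check_next_child()`, `probe_ok()` and `probe_stalled()` are hypothetical names, the probe stands in for whatever the liveness check does (for example an IPC connection attempt), and recovery of a failed subdaemon is left out. Only `next_child` and `subdaemon_check_progress` are names that appear in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a subdaemon entry; not Pacemaker's real struct. */
typedef struct {
    const char *name;
    bool (*probe)(void);    /* hypothetical liveness probe for this child */
} child_t;

static size_t next_child = 0;                /* child to check on the next expiry */
static time_t subdaemon_check_progress = 0;  /* time of the last successful check */

/* Check exactly one child per call, then advance round-robin.
 * Always returns true so that a periodic timer would keep calling it. */
static bool check_next_child(const child_t *children, size_t n_children,
                             time_t now)
{
    assert(n_children > 0);

    if (children[next_child].probe()) {
        /* Only a successful check moves the timestamp forward. */
        subdaemon_check_progress = now;
    }

    next_child++;
    if (next_child >= n_children) {
        next_child = 0;     /* wrap around after the last child */
    }
    return true;
}

/* Hypothetical probes used to exercise the sketch. */
static bool probe_ok(void)      { return true; }
static bool probe_stalled(void) { return false; }
```

Because each call handles only one child, the time spent inside a single timer callback is bounded by one probe rather than by a pass over all subdaemons. A stalled child simply leaves `subdaemon_check_progress` unchanged, so a consumer of that timestamp would see it go stale.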
If we want pacemakerd to attempt subdaemon recovery, wd-timeout has to be relaxed
so that IPC-connect timeouts don't trigger it.
This still requires testing, as I have never yet seen a stalled subdaemon being recovered by pacemakerd.
I haven't tested without sbd so far, so sbd always came first to shoot the node (already set
to 90s wd-timeout).
Special handling of subdaemons that are children of pacemakerd might make sense as well;
we might e.g. skip checking for a pid.
The build fix that sneaked in deals with the case of building multiple times from a dirty git tree.
Reusing an already built tar file from a clean commit is safe, as it has the commit in its name and that defines the content.
If we want to get that into 2.1.2, I can split it out and make a PR against 2.1.