Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Subdaemon heartbeat #2573

Closed
wants to merge 2 commits into from

Conversation

wenningerk
Copy link
Contributor

This depicts the basic idea of using tracking via ipc to detect basic liveness of subdaemons.

Idea is to still keep tracking subdaemons active even if they are children of pacemakerd where we
up to now were relying on signals sent to the parent if subdaemons are dying.
But just if they are dying and not if they are stalled via whatever might block their mainloop.
pacemakerd updates timestamps sent to sbd on every successful check of a subdaemon -
regardless if running happily or successfully recovered.
The tracking loop periodically fired up is broken up into single checks per timer-expiry in
the hope that this will keep reactivity of pacemakerd at a level so that liveness detection
from sbd won't trigger in case everything is running fine - even at default wd-timeout of 5s.
If we want pacemakerd to attempt subdaemon recovery wd-timeout has to be relaxed
so that ipc-connect timeouts don't trigger it.

Still requires testing as I've never seen a stalled subdaemon being recovered by pacemakerd yet.
Haven't tested without sbd so far and thus sbd always came first to shoot the node (already set
to 90s wd-timeout).
Special handling of subdaemons that are children of pacemakerd might make sense as well -
might e.g. skip checking for a pid.

Build-fix sneaked in deals with the case of multiple times building from a dirty git.
Reusing an already built tar-file from a clean commit is safe as it has the commit in the name and that defines the content.
If we want to get that into 2.1.2 I can spawn it out and make a PR against 2.1.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach is good. The main downside is that child_liveness() currently only attempts an IPC connection, whereas at some point we will likely want to send some sort of ping op as well, ideally using the pcmk_ipc_api_t model. But maybe that should wait until that model is implemented for all the daemons.

The build commit would be fine for 2.1.2

next_child = 0;
}

return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

G_SOURCE_CONTINUE

daemons/pacemakerd/pcmkd_subdaemons.c Show resolved Hide resolved
@@ -82,6 +82,7 @@ static char *opts_vgrind[] = { NULL, NULL, NULL, NULL, NULL };

crm_trigger_t *shutdown_trigger = NULL;
crm_trigger_t *startup_trigger = NULL;
time_t subdaemon_check_progress;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to explicitly initialize globals for readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something inside me had refused to initialized to 0 ;-)
But lacking a nice invalid-value initializer - at least to my knowledge - I'll go that route.

@wenningerk
Copy link
Contributor Author

I think the approach is good. The main downside is that child_liveness() currently only attempts an IPC connection, whereas at some point we will likely want to send some sort of ping op as well, ideally using the pcmk_ipc_api_t model. But maybe that should wait until that model is implemented for all the daemons.

That was the idea to wait for the ping implemented throughout the subdaemons so that we don't have to differentiate here between the daemons too much (maybe even try to misuse something existent).
Actually liveness-check via connection-attempts isn't that bad a test as it requires a response and thus detects if the mainloop of the daemon is blocked. Of course priorities are an issue here as well ...

If we consider to keep a persistent connection between pacemakerd and the subdaemons using a ping would of course
be beneficial:

  1. less overhead
  2. priority-wise it would be like other communication between the daemon and its peers - thus a better probe maybe
  3. we could use async messaging via libqb instead of sync connect (no async interface available afaik) so we wouldn't
    have to live with pacemakerd being blocked for a significant time when probing a dead/blocked/overloaded daemon.

The build commit would be fine for 2.1.2

2.1 PR created

@kgaillot
Copy link
Contributor

Do you want to drop the build commit here? 2.1.2 will be pulled back into master hopefully in less than a week

@kgaillot
Copy link
Contributor

superseded by #2588

@kgaillot kgaillot closed this Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants