Restart crashed plugins #1204
Conversation
Force-pushed abb630a to 3e329e1
Very impressive! 🚀 I like the implementation, and I left only minor comments.
I see these TODOs:
- add an option to print the plugin status, or add it as a new column to list executors/sources
- add e2e test cases
- update documentation

Let me know if we should take over those items 👍
P.S. In the PR description you have agentRestartPolicy, but it should be restartPolicy. Also, for the current implementation the types should start with an upper-case letter.
internal/source/scheduler.go
Outdated
// if ok := d.runningProcesses.exists(pluginName); ok {
// 	d.log.Infof("Not starting %q as it was already started.", pluginName)
// 	continue
// }
Why is this commented out? In general, it makes sense to have it 🤔
internal/plugin/health_monitor.go
Outdated
}

// botkube/kubectl
// TODO: if other naming scheme is used, it might be safer to try guess the name from channel bindings
Should we do something about this TODO, or is it more of a note?
internal/plugin/health_monitor.go
Outdated
	return restarts < m.policy.Threshold
case config.RestartAgentWhenThresholdReached:
	if restarts >= m.policy.Threshold {
		m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
In general we shouldn't panic, as it won't run the proper clean-up logic. However, fixing that would require a full refactor of the main func, so it's something to address later 😞
restarts := m.pluginRestartStats[plugin]
m.pluginRestartStats[plugin]++
Should the counter be reset so it starts fresh once the plugin becomes healthy again after e.g. 2 restarts?
In the current approach I can easily deactivate a plugin that is just flaky 🤔, because the counter spans the whole plugin history.
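The reset-on-recovery idea from this comment could look roughly like the sketch below. It is not the PR's actual code: the type, `markRestart`, and `markHealthy` names are illustrative, assuming a map-based counter like `pluginRestartStats`.

```go
package main

import "fmt"

// restartTracker counts restarts per plugin; a successful health check
// clears the counter, so a merely flaky plugin is not deactivated based
// on its whole history. Names are illustrative, not the PR's real API.
type restartTracker struct {
	counts map[string]int
}

func newRestartTracker() *restartTracker {
	return &restartTracker{counts: map[string]int{}}
}

// markRestart records one restart and returns the current count.
func (t *restartTracker) markRestart(plugin string) int {
	t.counts[plugin]++
	return t.counts[plugin]
}

// markHealthy resets the counter once the plugin recovers.
func (t *restartTracker) markHealthy(plugin string) {
	delete(t.counts, plugin)
}

func main() {
	tr := newRestartTracker()
	tr.markRestart("botkube/echo")
	tr.markRestart("botkube/echo")
	tr.markHealthy("botkube/echo") // plugin became healthy again
	fmt.Println(tr.markRestart("botkube/echo")) // counts from fresh: 1
}
```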
restarts := m.pluginRestartStats[plugin]
m.pluginRestartStats[plugin]++

switch m.policy.Type {
Maybe we can normalize it to all lower-case letters, so that even if I type restartAgent instead of RestartAgent, it will work.
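The case-insensitive matching suggested here can be done with `strings.EqualFold`. A minimal sketch, assuming hypothetical policy-type constants (the real ones in the PR, e.g. `config.RestartAgentWhenThresholdReached`, may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// PolicyType mirrors a config enum; the constant values are illustrative.
type PolicyType string

const (
	DeactivatePlugin PolicyType = "DeactivatePlugin"
	RestartAgent     PolicyType = "RestartAgent"
)

// normalizePolicyType maps user input to a known policy type ignoring
// case, so "restartAgent" matches RestartAgent.
func normalizePolicyType(raw string) (PolicyType, bool) {
	for _, t := range []PolicyType{DeactivatePlugin, RestartAgent} {
		if strings.EqualFold(raw, string(t)) {
			return t, true
		}
	}
	return "", false
}

func main() {
	t, ok := normalizePolicyType("restartAgent")
	fmt.Println(t, ok) // RestartAgent true
}
```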
internal/plugin/health_monitor.go
Outdated
	return restarts < m.policy.Threshold
case config.RestartAgentWhenThresholdReached:
	if restarts >= m.policy.Threshold {
		m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
Suggested change:
- m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
+ m.log.Fatalf("Plugin %q has been restarted %d times and selected restartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
internal/plugin/health_monitor.go
Outdated
case <-ctx.Done():
	return
case plugin := <-m.executorSupervisorChan:
	m.log.Infof("Restarting executor plugin %q...", plugin.name)
It would be nice to print the "status" with the number of retries and the max retries, like "(attempt no 2 of max 10)".
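The log line requested here could be formatted along these lines; `restartMsg` and its wording are illustrative, assuming the attempt counter and threshold from the restart policy are in scope:

```go
package main

import "fmt"

// restartMsg formats the restart log with the attempt counter the
// reviewer asked for; the function name and exact wording are illustrative.
func restartMsg(plugin string, attempt, max int) string {
	return fmt.Sprintf("Restarting executor plugin %q (attempt no %d of max %d)...", plugin, attempt, max)
}

func main() {
	fmt.Println(restartMsg("botkube/echo", 2, 10))
	// prints: Restarting executor plugin "botkube/echo" (attempt no 2 of max 10)...
}
```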
Code merged in #1236
Description
Changes proposed in this pull request:
- Ping() API call to monitor the health of source plugins

Testing
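The Ping()-based health monitoring mentioned in the description can be sketched as follows. This is not Botkube's actual monitor: the `pinger` interface stands in for the plugin client (hashicorp/go-plugin's `ClientProtocol` exposes a `Ping() error` method), and `monitor`, `fakePlugin`, and the parameters are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// pinger abstracts the plugin client's Ping method.
type pinger interface {
	Ping() error
}

// monitor pings the plugin until Ping fails or maxChecks is reached,
// and reports whether a restart is needed. Purely illustrative.
func monitor(p pinger, interval time.Duration, maxChecks int) bool {
	for i := 0; i < maxChecks; i++ {
		if err := p.Ping(); err != nil {
			return true // plugin crashed; schedule a restart
		}
		time.Sleep(interval)
	}
	return false
}

// fakePlugin simulates a plugin that crashes after failAfter checks.
type fakePlugin struct{ failAfter, calls int }

func (f *fakePlugin) Ping() error {
	f.calls++
	if f.calls > f.failAfter {
		return fmt.Errorf("connection is shut down")
	}
	return nil
}

func main() {
	p := &fakePlugin{failAfter: 2}
	fmt.Println(monitor(p, time.Millisecond, 10)) // true
}
```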
Add the cm-watcher source and the echo executor to your communication platform.
Add to your config:
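The config fragment was not included in the scraped description. A minimal example might look like the following; the field names are assumed from this PR's review comments (restartPolicy with upper-case type values and a threshold) and should be verified against the released config schema:

```yaml
plugins:
  restartPolicy:
    type: DeactivatePlugin   # assumed value; RestartAgent may also be supported
    threshold: 10            # max restarts before the policy kicks in
```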
Recompile the plugins and start Botkube locally.
Watch the Botkube pod logs.
Executor testing
@Botkube echo @panic will cause the echo plugin to panic and exit. Wait a few seconds and it will be restarted. Check again with @Botkube echo hello.

Source testing
Create a cm with the annotation die: "true". While this cm exists, the cm-watcher plugin will continue to crash. Remove the cm; the plugin should restart.
Create the cm without the annotation - the plugin should send a message to the specified channel.
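For reference, a ConfigMap manifest carrying the crash annotation from the steps above might look like this; the name and namespace are illustrative, only the die: "true" annotation comes from the PR description:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cm        # illustrative name
  namespace: botkube   # illustrative namespace
  annotations:
    die: "true"        # per the PR description, cm-watcher keeps crashing while this exists
data:
  foo: bar
```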
Related issue(s)
#878