Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart crashed plugins #1204

Closed
wants to merge 15 commits into from
Closed

Conversation

josefkarasek
Copy link

@josefkarasek josefkarasek commented Aug 23, 2023

Description

Changes proposed in this pull request:

  • Use built-in hashicorp Ping() API call to monitor health of source plugins
  • Release resources of crashed plugins
  • Restart crashed plugins
  • Define configurable strategies to approach restart policies

Testing

Add source cm-watcher and executor echo to your comm platform.

Add to your config:

plugins:
    agentRestartPolicy:
      # -- Restart policy type. Allowed values: "restartAgent", "deactivatePlugin".
      type: "deactivatePlugin"
      # -- Number of restarts before policy takes into effect.
      threshold: 5

Recompile plugins and start botkube locally

make build-plugins-single gen-plugins-index
go run cmd/botkube-agent/main.go

Watch botkube pod logs.

Executor testing

@Botkube echo @panic will cause the echo plugin to panic and exit. Wait a few seconds and it will be restarted.
Check again with @Botkube echo hello.

Source testing

Create cm with annotation die: "true".

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    die: "true"
  name: watcher
  namespace: botkube
EOF

When this cm exists, cm-watcher plugin will continue to crash. Remove the cm. The plugin should restart.
Create the cm without the annotation - plugin should send message to specified channel.

Related issue(s)

#878

@josefkarasek josefkarasek marked this pull request as ready for review August 29, 2023 15:31
@josefkarasek josefkarasek requested review from PrasadG193 and a team as code owners August 29, 2023 15:31
@mszostok mszostok self-assigned this Aug 30, 2023
@josefkarasek josefkarasek changed the title Restart crashed source plugins Restart crashed plugins Aug 30, 2023
Copy link
Contributor

@mszostok mszostok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive! 🚀 I like the implementation, and I left only minor comments.

I see such todos:

  • add option to print plugin status or add it to list executors/sources as a new column
  • add e2e tests cases
  • update documentation

let me know if we should take over those items 👍

P.S. in the PR desc you have agentRestartPolicy but it should be restartPolicy and also for the current impl types should start with upper case.

Comment on lines 118 to 121
// if ok := d.runningProcesses.exists(pluginName); ok {
// d.log.Infof("Not starting %q as it was already started.", pluginName)
// continue
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why commented? in general, it makes sense to have it 🤔

}

// botkube/kubectl
// TODO: if other naming scheme is used, it might be safer to try guess the name from channel bindings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do sth about this TODO? or it's more a note?

return restarts < m.policy.Threshold
case config.RestartAgentWhenThresholdReached:
if restarts >= m.policy.Threshold {
m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
Copy link
Contributor

@mszostok mszostok Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general we shouldn't panic as it will not run the proper clean-up logic, but this would require a full refactor of the main func, so it's sth to address later 😞

Comment on lines +117 to +118
restarts := m.pluginRestartStats[plugin]
m.pluginRestartStats[plugin]++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be restarted and start from fresh once the plugin after e.g. 2 restarts become healthy?

Because in the current approach I can easily deactivate a plugin that is just flaky 🤔 because it is for the whole plugin history.

restarts := m.pluginRestartStats[plugin]
m.pluginRestartStats[plugin]++

switch m.policy.Type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can normalize it, to all small letters? so even if I type restartAgent instead of RestartAgent it will work.

return restarts < m.policy.Threshold
case config.RestartAgentWhenThresholdReached:
if restarts >= m.policy.Threshold {
m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
m.log.Fatalf("Plugin %q has been restarted %d times and selected restartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)

case <-ctx.Done():
return
case plugin := <-m.executorSupervisorChan:
m.log.Infof("Restarting executor plugin %q...", plugin.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to print the "status" with the number of retries and max retries like (attempt no 2 of max 10)

@josefkarasek josefkarasek added the enhancement New feature or request label Aug 31, 2023
@mszostok mszostok mentioned this pull request Sep 7, 2023
7 tasks
@mszostok
Copy link
Contributor

Code merged in #1236

@mszostok mszostok closed this Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants