Restart crashed plugins #1204

Closed

josefkarasek wants to merge 15 commits into kubeshop:main from josefkarasek:watch-plugins

josefkarasek commented Aug 23, 2023

Description

Changes proposed in this pull request:

Use the built-in HashiCorp go-plugin Ping() API call to monitor the health of source plugins
Release the resources of crashed plugins
Restart crashed plugins
Define configurable restart policies

Testing

Add the cm-watcher source and the echo executor to your communication platform.

Add to your config:

plugins:
    agentRestartPolicy:
      # -- Restart policy type. Allowed values: "restartAgent", "deactivatePlugin".
      type: "deactivatePlugin"
      # -- Number of restarts before policy takes into effect.
      threshold: 5

Recompile plugins and start botkube locally

make build-plugins-single gen-plugins-index
go run cmd/botkube-agent/main.go

Watch botkube pod logs.

Executor testing

@Botkube echo @panic causes the echo plugin to panic and exit. Wait a few seconds and it will be restarted.
Then check again with @Botkube echo hello.

Source testing

Create a ConfigMap (cm) with the annotation die: "true".

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    die: "true"
  name: watcher
  namespace: botkube
EOF

While this cm exists, the cm-watcher plugin will keep crashing. Remove the cm, and the plugin should restart.
Create the cm again without the annotation; the plugin should send a message to the specified channel.
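
For reference, the policy type and threshold from the config in the testing steps could be interpreted roughly as follows. This is a sketch under assumptions, not the PR's code: `restartPolicy`, `action` and `decide` are hypothetical names, the type strings are taken from the config comment above, and treating an unknown type as "deactivatePlugin" is a choice made only for this example.

```go
package main

import "fmt"

// restartPolicy mirrors the two config fields shown above (hypothetical struct).
type restartPolicy struct {
	Type      string
	Threshold int
}

// action is what the supervisor does after a plugin crash.
type action string

const (
	restartPlugin    action = "restart plugin"
	deactivatePlugin action = "deactivate plugin"
	restartAgent     action = "restart agent"
)

// decide takes the number of restarts performed so far. Below the threshold
// the plugin is simply restarted; once the threshold is reached, the
// configured policy type decides what happens.
func decide(p restartPolicy, restarts int) action {
	if restarts < p.Threshold {
		return restartPlugin
	}
	if p.Type == "restartAgent" {
		return restartAgent
	}
	// Assumption for this sketch: anything else behaves like "deactivatePlugin".
	return deactivatePlugin
}

func main() {
	policy := restartPolicy{Type: "deactivatePlugin", Threshold: 5}
	for restarts := 3; restarts <= 5; restarts++ {
		fmt.Printf("restarts=%d -> %s\n", restarts, decide(policy, restarts))
	}
}
```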

Related issue(s)

josefkarasek requested a review from mszostok

August 23, 2023 12:33

josefkarasek marked this pull request as ready for review

August 29, 2023 15:31

josefkarasek requested review from PrasadG193 and a team as code owners

August 29, 2023 15:31

mszostok self-assigned this

Josef Karasek added 12 commits

August 30, 2023 14:27

Restart crashed source plugins (310f1da)
dispatch plugins (40e8d26)
schedule restarted plugins (e191ffd)
fix keys (29bb326)
Add health monitor for executors (8b28fc8)
define restart policies (8646b4d)
re-gen helm docs (dfddc8d)
fix lint (e03fe8d)
fix execute config test (349a8f4)
cleanup (6fb7da9)
process chart (c03522d)
default time interval (3e329e1)

josefkarasek force-pushed the watch-plugins branch from abb630a to 3e329e1

August 30, 2023 13:18

josefkarasek changed the title ~~Restart crashed source plugins~~ Restart crashed plugins


start existing plugins (c071fb1)

mszostok approved these changes

Contributor

mszostok left a comment

Very impressive! 🚀 I like the implementation, and I left only minor comments.

I see the following TODOs:

add an option to print the plugin status, or add it to list executors/sources as a new column
add e2e test cases
update the documentation

Let me know if we should take over those items 👍

P.S. In the PR description you have agentRestartPolicy, but it should be restartPolicy. Also, in the current implementation the types should start with an upper-case letter.

internal/source/scheduler.go Outdated

Comment on lines 118 to 121

+			// if ok := d.runningProcesses.exists(pluginName); ok {
+			// 	d.log.Infof("Not starting %q as it was already started.", pluginName)
+			// 	continue
+			// }

Contributor

mszostok Aug 30, 2023

Why is this commented out? In general, it makes sense to keep it 🤔

internal/plugin/health_monitor.go Outdated

+			}
+			// botkube/kubectl
+			// TODO: if other naming scheme is used, it might be safer to try guess the name from channel bindings

Contributor

mszostok Aug 30, 2023

Should we do something about this TODO, or is it more of a note?

internal/plugin/health_monitor.go Outdated

+		return restarts < m.policy.Threshold
+	case config.RestartAgentWhenThresholdReached:
+		if restarts >= m.policy.Threshold {
+			m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)

Contributor

mszostok Aug 30, 2023

In general, we shouldn't panic here, as that skips the proper clean-up logic. Fixing it would require a full refactor of the main func, though, so it's something to address later 😞

internal/plugin/health_monitor.go

Comment on lines +117 to +118

		restarts := m.pluginRestartStats[plugin]
		m.pluginRestartStats[plugin]++

Contributor

mszostok Aug 30, 2023

Should the counter be reset, so that it starts fresh once the plugin becomes healthy again, e.g. after 2 restarts?

In the current approach, a plugin that is merely flaky can easily be deactivated 🤔, because the counter covers the whole plugin history.

internal/plugin/health_monitor.go

+	restarts := m.pluginRestartStats[plugin]
+	m.pluginRestartStats[plugin]++
+	switch m.policy.Type {

Contributor

mszostok Aug 31, 2023

Maybe we can normalize it to all lower-case letters? Then it will work even if I type restartAgent instead of RestartAgent.
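
A small sketch of one way to do that, using a case-insensitive comparison at parse time rather than lower-casing everything. The `policyType` constants and `parsePolicyType` are hypothetical; the actual constant names and values in the config package may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// policyType and its values are assumptions for this sketch.
type policyType string

const (
	RestartAgent     policyType = "RestartAgent"
	DeactivatePlugin policyType = "DeactivatePlugin"
)

// parsePolicyType maps user input to a known policy type, ignoring case and
// surrounding whitespace.
func parsePolicyType(raw string) (policyType, error) {
	trimmed := strings.TrimSpace(raw)
	for _, known := range []policyType{RestartAgent, DeactivatePlugin} {
		if strings.EqualFold(trimmed, string(known)) {
			return known, nil
		}
	}
	return "", fmt.Errorf("unknown restart policy type %q", raw)
}

func main() {
	for _, raw := range []string{"restartAgent", "DEACTIVATEPLUGIN", "bogus"} {
		parsed, err := parsePolicyType(raw)
		fmt.Printf("%q -> %q, err: %v\n", raw, parsed, err)
	}
}
```

Normalizing once at load time means the switch on the policy type can keep comparing against the canonical constants.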

internal/plugin/health_monitor.go Outdated

+		return restarts < m.policy.Threshold
+	case config.RestartAgentWhenThresholdReached:
+		if restarts >= m.policy.Threshold {
+			m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)

Contributor

mszostok Aug 31, 2023

Suggested change

      
-			m.log.Fatalf("Plugin %q has been restarted %d times and selected agentRestartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)
+			m.log.Fatalf("Plugin %q has been restarted %d times and selected restartPolicy is %q. Exiting...", plugin, restarts, m.policy.Type)

internal/plugin/health_monitor.go Outdated

+		case <-ctx.Done():
+			return
+		case plugin := <-m.executorSupervisorChan:
+			m.log.Infof("Restarting executor plugin %q...", plugin.name)

Contributor

mszostok Aug 31, 2023

It would be nice to print the "status" with the current attempt and the maximum number of retries, e.g. (attempt no 2 of max 10).
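
For example, the message could be built along these lines. This is only a sketch: `restartMessage` and its parameters are hypothetical, and the attempt and maximum would come from the restart stats and the policy threshold.

```go
package main

import "fmt"

// restartMessage (hypothetical helper) formats the restart log line with the
// current attempt and the configured maximum.
func restartMessage(kind, name string, attempt, maxAttempts int) string {
	return fmt.Sprintf("Restarting %s plugin %q, attempt %d/%d...", kind, name, attempt, maxAttempts)
}

func main() {
	fmt.Println(restartMessage("executor", "botkube/echo", 2, 10))
	// prints: Restarting executor plugin "botkube/echo", attempt 2/10...
}
```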


use unique names for running dispatches (2f609ca)

josefkarasek added the enhancement label


add extra index (9dd1d44)

mszostok mentioned this pull request

Restart crashed plugins #1236

Merged

7 tasks

Contributor

mszostok commented Sep 13, 2023

Code merged in #1236

mszostok closed this
