Watch and respawn started plugins that exited #878

Closed
mszostok opened this issue Dec 6, 2022 · 0 comments
Labels
enhancement New feature or request

Comments

@mszostok
Contributor

mszostok commented Dec 6, 2022

Overview

Watch and respawn started plugin sources and executors that exited. Currently, for sources, the Dispatcher service only gets a client and starts the streaming process:

sourceClient, err := d.manager.GetSource(pluginName)
if err != nil {
	return fmt.Errorf("while getting source client for %s: %w", pluginName, err)
}
out, err := sourceClient.Stream(ctx, pluginConfigs)
if err != nil {
	return fmt.Errorf("while opening stream for %s: %w", pluginName, err)
}
// Forward events until the context is cancelled. Nothing here notices that
// the plugin sub-process has exited, so the source silently stops producing events.
go func() {
	for {
		select {
		case event := <-out.Output:
			log.WithField("event", string(event)).Debug("Dispatching received event...")
			d.dispatch(ctx, event, sources)
		case <-ctx.Done():
			return
		}
	}
}()

However, each plugin is a separate binary started as a sub-process. As a result, if a plugin panics, for example, its sub-process exits, and we stop receiving new events from that source until the Botkube process is restarted. Executors should be covered as well.

We should detect such situations and add a retry mechanism:

  • It can be as simple as just starting the binary once again.

  • We can also think about something similar to what Kubernetes does when a Pod's container keeps crashing (CrashLoopBackOff: restarts with an exponentially increasing delay).

  • Related issue in the go-plugin repo: "Handling plugin crashes" (hashicorp/go-plugin#31). We can look at how other projects handle this situation.

Acceptance Criteria

  • detect crashing plugins and restart them using a simple retry mechanism
  • document implemented approach
  • e2e test coverage
  • test executors: update the echo plugin to fail on a specific command
  • consider testing sources as well: update the ConfigMap watcher to fail on a ConfigMap with a given annotation, and test that the source works again after that ConfigMap is deleted
  • have a retry threshold: after X retries, give up restarting the given source/executor and send a message to Slack saying that it was deactivated because it kept crashing.
    • send a message via the bot that a given plugin is crashing and related events may be lost
      • send the message only to the bound channels
    • make it configurable by the user
      • default strategy: keep Botkube running anyway, but still send the message
      • alternative strategy: exit Botkube when a plugin keeps failing, if the user prefers that

Reason

Make the source plugin dispatcher more reliable so that our users do not stop receiving source events because of random plugin crashes.

@mszostok added the enhancement and needs-triage labels Dec 6, 2022
@mszostok mentioned this issue Dec 6, 2022
@mszostok changed the title from "Watch and respawn started plugin source that exited" to "Watch and respawn started plugins that exited" Dec 15, 2022
@mszostok added this to the v0.18.0 milestone Dec 15, 2022
@mszostok removed the needs-triage label Dec 15, 2022
@pkosiec modified the milestone: v0.18.0 Jan 10, 2023
@pkosiec removed this from the v0.18.0 milestone Feb 15, 2023
@pkosiec added this to the v1.4.0 milestone Aug 16, 2023
@josefkarasek self-assigned this Aug 21, 2023
@pkosiec modified the milestones: v1.4.0, v1.5.0 Sep 12, 2023
Projects
Status: Done
Development

No branches or pull requests

3 participants