Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops] Keep task document when enabling/disabling rules #139826

Merged
merged 27 commits into from
Sep 12, 2022

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Aug 31, 2022

Resolves #110096

🚨🚨🚨 A migration has been added to add enabled: true to existing task manager documents with status: 'claiming' | 'idle' | 'running'. It sets enabled: false to tasks with status: 'failed' | 'unrecognized' This might warrant extra review :) 🚨🚨🚨

Summary

This PR adds an enabled field to all task documents. This is a boolean indicating whether the task is currently enabled or disabled. The task claim query has been updated to only claim enabled tasks.

Task manager changes:

  • taskSchedule.schedule and taskSchedule.bulkSchedule have been updated to set enabled: true for newly scheduled tasks (unless enabled: false is explicitly set when scheduling.
  • new function taskSchedule.bulkEnableDisable to update this flag for specific task ids

Rules client changes:

  • updated disable function to stop deleting the underlying task manager document when a rule is disabled (with one exception, left comment on the PR). Instead, it updates the enabled flag on the task document to be false.
  • updated enable function to stop scheduling a new task when a rule is enabled (except when no task exists). Instead, it updates the enabled flag on the task document to be true.

To Verify

  1. Run an older version of Kibana and create some rules with actions. Force a failure in an action in order to get a task doc with status failed. Create some rules and disable at least one of them.
  2. Run this PR with the same data as step1. Verify that all task docs have a new enabled field. Verify that failed tasks have enabled: false but other tasks have enabled: true. Verify that the rules that were enabled before are still running.
  3. Enable the disabled rules from before. Verify that a task doc gets created for these.
  4. Create some new rules. Verify that enabling and disabling both old and new rules does not delete the task doc, it just flips the enabled flag on the task.

Checklist

updatedBy: await this.getUserName(),
updatedAt: new Date().toISOString(),
}),
{ version }
);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This update to the disable logic takes into account that there may be pre 8.0 rules running where the scheduled task ID does not equal the rule ID. In these cases, we still want to remove the task document so that a new one matching the rule id can be created on enable.

If the scheduledTaskId already matches the rule ID, this will set the task to disabled.

@@ -2035,6 +2035,23 @@ export class RulesClient {
} catch (e) {
throw e;
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This update to the enable logic takes into account the fact that there may be already disabled rules with no corresponding task.

If the rule is disabled with no corresponding task, it's scheduledTaskId will be undefined. In this case, we want to schedule a task on enable.

If a rule somehow has a scheduledTaskId defined but it doesn't actually exist, we want to schedule a task on enable.

Finally, if a rule has a scheduledTaskId defined and the task exists, we enable the task.

@ymao1 ymao1 changed the title Task manager/enabled [Response Ops] Keep task document when enabling/disabling rules Sep 1, 2022
@ymao1 ymao1 self-assigned this Sep 1, 2022
@ymao1 ymao1 added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.5.0 labels Sep 1, 2022
@ymao1
Copy link
Contributor Author

ymao1 commented Sep 1, 2022

@elasticmachine merge upstream

@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 6, 2022

@elasticmachine merge upstream

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 6, 2022

@elasticmachine merge upstream

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 7, 2022

@elasticmachine merge upstream

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 8, 2022

@elasticmachine merge upstream

@pmuellr
Copy link
Member

pmuellr commented Sep 8, 2022

I'm not sure of the pressing need for this any more, but we have been using "disable then re-enable" as a way of "resetting" a rule's task state. Feels like we may still need this, probably just for problematic cases. Should we open a follow-on PR, or do we even need this, or are there alternatives?

Copy link
Contributor

@ersin-erdal ersin-erdal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Tested locally, works as expected.

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 8, 2022

I'm not sure of the pressing need for this any more, but we have been using "disable then re-enable" as a way of "resetting" a rule's task state. Feels like we may still need this, probably just for problematic cases. Should we open a follow-on PR, or do we even need this, or are there alternatives?

@pmuellr We discussed this in the issue: #110096 (comment) and it seems like the consensus is that losing task state on disable is considered a bug that we're fixing. How often are we suggesting disable/reenable to clear task state?

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 8, 2022

I'm not sure of the pressing need for this any more, but we have been using "disable then re-enable" as a way of "resetting" a rule's task state. Feels like we may still need this, probably just for problematic cases. Should we open a follow-on PR, or do we even need this, or are there alternatives?

@pmuellr Re-read your question and I think you're asking if we should open a followup issue for being able to reset task state, which I definitely can :)

@pmuellr
Copy link
Member

pmuellr commented Sep 8, 2022

Re-read your question and I think you're asking if we should open a followup issue for being able to reset task state, which I definitely can :)

Yes, basically what I'm asking :-)

I know the typical use case WE use it for is probably just basic sync issues between the rule and task. I'm not sure folks have really used it JUST to reset the task state - for example, if there's some bad rule-specific state, or to "reset all the alerts" or something. I don't think we've really seen a need for that. Which is all a new "reset task" API would do, I would think.

And wasn't sure if there was already some other way of accomplishing that anyway.

Sounds like it's something we should consider though, so opening an issue to track seems appropriate.

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 9, 2022

@pmuellr issue created: #140402

@ymao1
Copy link
Contributor Author

ymao1 commented Sep 9, 2022

@elasticmachine merge upstream

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; left some questions ...

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

API count

id before after diff
taskManager 82 83 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit 0cf0e3d into elastic:main Sep 12, 2022
@ymao1 ymao1 deleted the task-manager/enabled branch September 12, 2022 13:35
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[alerting] stop deleting / re-creating tasks when rules are disabled / re-enabled
6 participants