Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SECURITY SOLUTIONS][Alerts Actions] Fix migration from 7.11.0/7.11.1 to 7.12 #94722

Merged
merged 13 commits into from
Mar 18, 2021

Conversation

XavierM
Copy link
Contributor

@XavierM XavierM commented Mar 16, 2021

Summary

We got this SDH https://github.com/elastic/sdh-kibana/issues/1320, where jira connector failed the migration between 7.11.0 to 7.11.2. We know why this is happening because when we fixed our last blocker for 7.11.2, we did not tag the migration to the right version. We should have used 7.11.0 and not 7.11.2 or we should have added more logic to avoid this problem. Anyway, we figured out a work around for this user https://github.com/elastic/sdh-kibana/issues/1320#issuecomment-799771073.

We are fixing the migration script for 7.12

Checklist

@XavierM XavierM added bug Fixes for quality problems that affect the customer experience v8.0.0 release_note:skip Skip the PR/issue when compiling release notes impact:critical This issue should be addressed immediately due to a critical level of impact on the product. v7.12.0 Team:Threat Hunting Security Solution Threat Hunting Team v7.13.0 labels Mar 16, 2021
@XavierM XavierM requested review from a team as code owners March 16, 2021 16:11
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach seems right to me - I was thinking of the same kind of solution.

I noted that we should add a test for alerts migrated in 7.11.2 with the now empty incident object, since I think we will now have these appearing in the field.

Also, assuming we do have incident: {} actions in the field, these seem like they will be busted - they won't validate so can't be executed. Should we do some kind of 7.12.0 migration to make them executable again? Eg, add the required summary property (for Jira anyway) that would include a messaging that this action had previously been broken but is now enabled, has been recovered, but will need to be edited again. Maybe we don't need that? How obvious is it that the action is busted? I don't believe we provide any kind of feedback on whether an existing alert validates, and the only logging you would see would be an error message in the Kibana log about it failing validation.

@@ -571,6 +571,43 @@ describe('7.11.2', () => {
} as SavedObjectUnsanitizedDoc<RawAlert>;
expect(isAnyActionSupportIncidents(doc)).toBe(false);
});

test('it does not transforms alerts when the right structure connectors is already applied', () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a customer upgrades to 7.11.2 with the current migration, they'll end up with some actions in alerts where incident is {}, right? Given the current schema, these won't validate when they're used, so should not be executed, but would presumably be candidates somehow fixing - if possible - in another migration (to 7.12.0?)?

In any case, whether we try to fix these or not, seems like we should add a test with an alert with a Jira/ServiceNow/IBM resilient action, where the incident is {}, since it seems like we'll have these in the field, in some cases.

Copy link
Contributor Author

@XavierM XavierM Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, what you are saying? Yes I will do that
I think there is two branch logic here:

  1. if a user went from 7.10.x to 7.11.0 -> they lost all their data in the connector that was the first bug that we tried to fix because we forget to write a migration and user will have to write them from scratch
    I think that we should be able to write them a script really familiar that what we did to bring everything back to normal in this situation too
  2. if a user went from 7.11.0/1 to 7.11.2 -> they will lost again their connector at this time they have two solution using the script from the SDH or re-creating them from scratch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm a little curious how well these incident: {} actions will survive the Kibana UI experience. I'm pretty sure on the back-end we'll fail fast because we are always validating these before we use them productively. But wondering if you edit an alert using one of these actions - what happens? Scared of seeing some kind of null dereference somewhere, I guess ...

I think that would be the decision point as to whether we need to do something for these in a 7.12.0 migration. Like add a placeholder summary or other required fields in the incident, to keep other things from breaking.

Copy link
Contributor Author

@XavierM XavierM Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I tried it out to edit an alert when you have this incident: {} and we come back to normal. We know it is also working because our user who creates the SDH were able to get back to normal after editing all their alerts manually.

And yes when you have incident:{} like in the SDH, we know that the alert get trigger but we never create an action in the associated connector.

Copy link
Contributor

@rudolf rudolf Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this bug, but just want to chime in and agree with @pmuellr . It doesn't usually make sense to fix a bug in an already released migration function (unless it's causing data loss) because that buggy code has already run on users's data. A 7.11.2 -> 7.12 upgrade won't get the bug fix applied to the data. So there needs to be a 7.12 migration to guarantee that all the upgrade paths get this bug fix applied.

I think it makes sense to keep the buggy 7.11.2 migration function just like it was (maybe a comment stating this code is buggy), this way the code makes it obvious that there are kibana 7.11.2 instances with buggy data. The 7.12 migration could then fix the bug with a comment "fixing our 7.11.2 migration bug"

Copy link
Contributor Author

@XavierM XavierM Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rudolf we ask user to not migrate from 7.11.0/1 to 7.11.2 because if you do then user will loose the data actions for good. We will have now way to bring it back through migration and you will have to run this script to get it back. So we want to stop the bleeding by changing the 7.11.2 migration so user can move forward from 7.11.0/1 to 7.12.x. And of course let them migrate from earlier version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, if this was a data loss bug then yes, all we can do is fix the data loss bug in the 7.11.2 migration, it's too late for users who already upgraded...

@jonathan-buttner
Copy link
Contributor

@pmuellr

I noted that we should add a test for alerts migrated in 7.11.2 with the now empty incident object, since I think we will now have these appearing in the field.

hmm this PR isn't adding a new migration. So say a user upgraded from 7.11.0 to 7.11.2, the alerts will be migrated in 7.11.2 and now have an empty incident object. If a user then upgrades from 7.11.2 to 7.12.0 the alerts won't get run against these code changes because the saved objects will already be marked with a migration version of 7.11.2 (aka these code changes will be skipped).

We can add the test but I don't think this particular migration would get run against those alerts. Maybe it makes sense to add a test to ensure that the migration isn't run when it is tagged with 7.11.2 🤷 ?

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@XavierM
Copy link
Contributor Author

XavierM commented Mar 17, 2021

@elasticmachine merge upstream

@XavierM
Copy link
Contributor Author

XavierM commented Mar 17, 2021

@elasticmachine merge upstream

@XavierM
Copy link
Contributor Author

XavierM commented Mar 17, 2021

@elasticmachine merge upstream

@XavierM XavierM enabled auto-merge (squash) March 17, 2021 23:52
@XavierM XavierM added the auto-backport Deprecated - use backport:version if exact versions are needed label Mar 18, 2021
@XavierM
Copy link
Contributor Author

XavierM commented Mar 18, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@XavierM XavierM merged commit 051be90 into elastic:master Mar 18, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Mar 18, 2021
… to 7.12 (elastic#94722)

* do not modify connector with the right structure

* review trying to bring back incident to live when we can

* manage custom action

* fix cypress test

Co-authored-by: Kibana Machine <[email protected]>
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Mar 18, 2021
… to 7.12 (elastic#94722)

* do not modify connector with the right structure

* review trying to bring back incident to live when we can

* manage custom action

* fix cypress test

Co-authored-by: Kibana Machine <[email protected]>
@kibanamachine
Copy link
Contributor

💚 Backport successful

7.12 / #94885
7.x / #94886

Successful backport PRs will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Mar 18, 2021
… to 7.12 (#94722) (#94886)

* do not modify connector with the right structure

* review trying to bring back incident to live when we can

* manage custom action

* fix cypress test

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Xavier Mouligneau <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience impact:critical This issue should be addressed immediately due to a critical level of impact on the product. release_note:skip Skip the PR/issue when compiling release notes Team:Threat Hunting Security Solution Threat Hunting Team v7.12.0 v7.13.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants