[EPM] handle failure cases in the package update process #64213

neptunian · 2020-04-22T16:39:48Z

As part of #59910,

If a package fails to install for whatever reason, there needs to be a "rollback", because the install will be partial and the user could be left with a mix of assets from both versions. @ruflin said that for now, in a blog post, they'll have to uninstall the old package and then try to install again (the new version doesn't get written to the saved object unless the install was successful).

elasticmachine · 2020-04-22T16:39:50Z

Pinging @elastic/ingest-management (Feature:EPM)

neptunian · 2020-06-26T18:06:51Z

Because installing is not atomic and updating is a process of overwriting assets with the same names, there is no way to guarantee the installation is going to be successful before assets are installed/updated. "Rolling back" during an unsuccessful update would mean trying to reinstall the previous package. So long as we use the same name for the assets, we can't install two versions at the same time and only switch to the new one when it's successful. So I think our best option is to inform the user that the package installation went wrong and ask them to try installing it again. This means they could be in a state with assets of two package versions until it's fixed.

For beta I am proposing:

Introduce a way to track the progress of install/update and save it. This can exist as a field in the epm-package SO.
If there is an unexpected server error (500), set the status to error, and have the UI show a message for the package saying that there was a problem installing/updating it and that they need to install again to fix it. Currently we tell them there was a problem installing the package with a warning toast (should it be an error toast?), but it doesn't persist, as the state is only in the UI. We need to decide where all this messaging should live. If it's only on the package detail page, they may not discover it. Perhaps a section on the dashboard page.
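A minimal sketch of what such a status field could look like, written as pure state transitions. The type, field, and function names (`PackageInstallRecord`, `installStatus`, `markInstalling`, and so on) are hypothetical and are not the actual epm-package saved object schema. The only behaviour taken from this thread is that the new version is written only after a successful install.

```typescript
// Hypothetical status values; the real field may use different names.
type InstallStatus = 'installing' | 'installed' | 'install_failed';

// Hypothetical shape of the tracking data stored on the package saved object.
interface PackageInstallRecord {
  name: string;
  version: string | null; // last version that installed successfully
  installStatus: InstallStatus;
  installStartedAt: number; // epoch milliseconds
  installError?: string;
}

// Called before any asset is touched. Keeps the previously installed version.
function markInstalling(
  prev: PackageInstallRecord | undefined,
  name: string,
  now: number
): PackageInstallRecord {
  return {
    name,
    version: prev ? prev.version : null,
    installStatus: 'installing',
    installStartedAt: now,
  };
}

// Called only after every asset was installed; this is where the version changes.
function markInstalled(record: PackageInstallRecord, newVersion: string): PackageInstallRecord {
  return {
    name: record.name,
    version: newVersion,
    installStatus: 'installed',
    installStartedAt: record.installStartedAt,
  };
}

// Called from the catch block of the install endpoint (the 500 case).
function markFailed(record: PackageInstallRecord, error: unknown): PackageInstallRecord {
  const installError = error instanceof Error ? error.message : String(error);
  return { ...record, installStatus: 'install_failed', installError };
}
```

Because the failed state is persisted rather than held in the UI, any page that lists packages could read `installStatus` and show the "please install again" message.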

Other scenarios

If Kibana crashes, there isn't a way of knowing that the package install/update failed, other than checking all the packages' statuses when Kibana starts again and seeing that one is still "installing". The package could genuinely still be installing (they clicked install and refreshed Kibana), or the install could have been interrupted unexpectedly with no error handling. Perhaps we could differentiate between the two by checking the elapsed time, as a package should not take very long to install.
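The time-based check could be sketched roughly as follows. The threshold `ASSUMED_MAX_INSTALL_MS`, the record shape, and the function name are assumptions made for illustration; the thread does not say how long an install may reasonably take.

```typescript
// Assumed upper bound for a normal install; the thread gives no number.
const ASSUMED_MAX_INSTALL_MS = 5 * 60 * 1000;

// Hypothetical subset of the tracking data needed for the startup check.
interface InstallRecord {
  name: string;
  installStatus: 'installing' | 'installed' | 'install_failed';
  installStartedAt: number; // epoch milliseconds
}

// Returns the names of packages that are still marked "installing" but
// started longer ago than the allowed window, i.e. were probably interrupted.
function findInterruptedInstalls(
  records: InstallRecord[],
  now: number,
  maxInstallMs: number = ASSUMED_MAX_INSTALL_MS
): string[] {
  return records
    .filter((r) => r.installStatus === 'installing' && now - r.installStartedAt > maxInstallMs)
    .map((r) => r.name);
}
```

A package that is marked "installing" but started within the window is left alone, which covers the "clicked install and refreshed Kibana" case.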

@ruflin @ph would like to hear your thoughts

ph · 2020-06-26T19:19:45Z

I think we have discussed that before, but just to clarify the process here: when you try to update a package from v1 to v2, there are the following scenarios:

Scenario 1: Happy path:

v1 assets are removed?
v2 assets are installed (serially or in parallel)
Everything is 💚

Scenario 2: Transient error (network), Kibana keeps working.

v1 assets are removed.
v2 assets are installing (oops one of them failed)
Set the state to error.
Display an error to the user.

The problem is that after this the system is in an inconsistent state: dashboards are possibly missing, etc.

Scenario 3: Kibana hard crash during installation.

v1 assets are removed.
v2 assets are installing, Kibana crashes during the process.

The problem is that after this the system is in an inconsistent state: dashboards are possibly missing, etc.

I think we might have a little more than just an error state.

Now, I've discussed with you that it would be nice to have an atomic operation. I wonder, could we build one on top of this, based on the above state machine?

Could each step have a rollback method (rollback dashboard, rollback template)? Either they take the "previous" templates or they keep the currently installed artifacts in memory.
Having the rollback would solve the problem if Kibana keeps running.
If Kibana crashes during the install, can we recover using the current state of the package?
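The per-step rollback idea could be sketched like this. `InstallStep` and `runWithRollback` are hypothetical names, and the steps are synchronous only to keep the example short; real install steps would be asynchronous calls to Elasticsearch and Kibana. It illustrates the proposal only and does not settle whether rolling back is the right choice.

```typescript
// Hypothetical step interface: each step knows how to undo itself.
interface InstallStep {
  name: string;
  apply(): void;
  rollback(): void;
}

interface RunResult {
  ok: boolean;
  failedStep?: string;
  error?: string;
  rolledBack: string[]; // names of steps undone, most recent first
}

// Applies steps in order. On the first failure, undoes the steps that
// already succeeded in reverse order and reports what happened.
function runWithRollback(steps: InstallStep[]): RunResult {
  const applied: InstallStep[] = [];
  for (const step of steps) {
    try {
      step.apply();
      applied.push(step);
    } catch (err) {
      const rolledBack: string[] = [];
      for (const done of applied.reverse()) {
        done.rollback();
        rolledBack.push(done.name);
      }
      return { ok: false, failedStep: step.name, error: String(err), rolledBack };
    }
  }
  return { ok: true, rolledBack: [] };
}
```

This only covers the case where Kibana keeps running. After a hard crash the in-memory `applied` list is gone, so recovery would need persisted state.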

neptunian · 2020-06-28T18:41:03Z

I don't think your scenarios cover an unexpected server error that isn't transient but also doesn't cause a crash (all errors in the install endpoint are caught, so such an error should not crash Kibana itself).

To clarify, the assets are not removed before new ones are installed/updated. They are created or updated with a PUT. Since data is incoming, removing all the templates or ingest pipelines would be bad. When we discussed deleting assets first, I was referring to the Kibana saved object assets (dashboards, visualizations, etc.).

> Could each step have a rollback method (rollback dashboard, rollback template)? Either they take the "previous" templates or they keep the currently installed artifacts in memory.

I thought we had discussed we should not try to automatically recover but prompt the user to take some action to fix it.

I wouldn't have the previous assets in memory during an update. I would need to fetch the previous package from the registry and install its assets again. It sounds like you are thinking about some state management that keeps track of each step in the install process and undoes the changes. If we're trying to get back to a working package, I think it might make sense to just attempt to reinstall the previous package, as that would be less complex and would be a similar process of updating assets, this time from the previous version. Of course, this could also error for some unknown reason, and at that point we just need to let the user know about this error state.
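A rough sketch of that "reinstall the previous package" fallback. `installVersion` stands in for "fetch this version from the registry and install its assets"; it and the outcome type are hypothetical, and the function is synchronous only to keep the example short.

```typescript
// Hypothetical outcome of an update attempt.
type UpdateOutcome =
  | { status: 'updated'; version: string }
  | { status: 'reverted'; version: string; updateError: string }
  | { status: 'error'; updateError: string; revertError: string };

// Tries the new version first. If that throws, tries to reinstall the
// previous version. If that also throws, reports both errors so the
// user can be told the package is in an error state.
function updateWithFallback(
  installVersion: (version: string) => void,
  previousVersion: string,
  newVersion: string
): UpdateOutcome {
  try {
    installVersion(newVersion);
    return { status: 'updated', version: newVersion };
  } catch (updateErr) {
    const updateError = String(updateErr);
    try {
      installVersion(previousVersion);
      return { status: 'reverted', version: previousVersion, updateError };
    } catch (revertErr) {
      return { status: 'error', updateError, revertError: String(revertErr) };
    }
  }
}
```

No per-step bookkeeping is needed here, which is the "less complex" part of the argument: the fallback reuses the normal install path.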

> If Kibana crashes during the install, can we recover using the current state of the package?

By state, do you mean where it left off in the installation process? No. There is currently no state management. If you mean whether or not it installed successfully, I tried to address that above:

> If Kibana crashes, there isn't a way of knowing that the package install/update failed, other than checking all the packages' statuses when Kibana starts again and seeing that one is still "installing". The package could genuinely still be installing (they clicked install and refreshed Kibana), or the install could have been interrupted unexpectedly with no error handling. Perhaps we could differentiate between the two by checking the elapsed time, as a package should not take very long to install.

ruflin · 2020-06-29T12:27:52Z

From my perspective there are 2 different ways things can go wrong:

Affects the data
Does not affect the data

For the part that affects the data (ingest-pipeline, mapping, template) we have a predefined order. I think this flow prevents us from bad things happening. But if the upgrade was fully applied, we CAN NOT roll back as otherwise we could have new shippers sending to an old ingest pipeline.

The second part is all the other assets. If a dashboard is not fully loaded, that is not great, but it can just be loaded again. I think we should optimise for rolling forward / reinstalling.

My main concerns are around leaving old assets behind and not providing good error messages to the user.

neptunian · 2020-06-29T16:00:11Z

> For the part that affects the data (ingest-pipeline, mapping, template) we have a predefined order. I think this flow prevents us from bad things happening. But if the upgrade was fully applied, we CAN NOT roll back as otherwise we could have new shippers sending to an old ingest pipeline.

Yes, I was thinking that if we did want a rollback solution, this particular order of events would have to be treated as one rollback step that has to occur in order, starting from the first step of creating the pipelines. That makes things a bit complicated, and it ends up being similar to reinstalling.

> My main concerns are around leaving old assets behind and not providing good error messages to the user.

Most assets are going to be updated, so the only assets left behind would be those for new datasets or Kibana assets that the new package added and that don't exist in the previous version. The new dataset assets would sit there unused, since the agent configuration would not be able to use them. The new Kibana assets would remain. Depending on how #65035 is handled, could we remove all the assets that are associated with this package, based on the package name we are going to associate with each asset?

ruflin · 2020-06-30T19:17:54Z

In any case, I think we must always know all the assets that are installed, and only remove an asset from tracking if we really know we removed it, to ensure we always clean up after "us".
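One way to read this rule, as a small sketch: an asset leaves the tracked list only when its deletion is confirmed. `AssetRef` and `cleanupAssets` are hypothetical names, and `deleteAsset` stands in for whatever call actually removes a dashboard, template, or pipeline.

```typescript
// Hypothetical reference to an installed asset.
interface AssetRef {
  id: string;
  type: string;
}

// Attempts to delete every tracked asset and returns the assets that must
// STAY tracked: those whose deletion failed or could not be confirmed.
function cleanupAssets(
  tracked: AssetRef[],
  deleteAsset: (asset: AssetRef) => boolean // true only when removal is confirmed
): AssetRef[] {
  return tracked.filter((asset) => {
    try {
      return !deleteAsset(asset);
    } catch {
      // A thrown error means we do not know the asset is gone, so keep tracking it.
      return true;
    }
  });
}
```

Whatever the function returns would be written back as the package's asset list, so a later cleanup attempt can retry the leftovers.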

neptunian added the Feature:EPM and Team:Fleet labels Apr 22, 2020

neptunian self-assigned this Apr 22, 2020

ruflin added the Ingest Management:beta1 label Apr 23, 2020

ruflin unassigned neptunian Apr 23, 2020

neptunian self-assigned this Jun 26, 2020

neptunian mentioned this issue Jul 13, 2020

[Ingest Manager] Refactor Package Installation #71521

Merged

neptunian mentioned this issue Aug 11, 2020

[Ingest Manager] Reinstalling packages when Kibana crashes #74792

Closed

neptunian closed this as completed Aug 24, 2020
