[EPM] handle failure cases in the package update process #64213

Closed
neptunian opened this issue Apr 22, 2020 · 7 comments

neptunian commented Apr 22, 2020

As part of #59910:

If a package fails to install for whatever reason, there needs to be a "rollback", because the install will be partial and users could end up with assets from both versions. @ruflin said that for now the guidance, in a blog post, is that users have to uninstall the old package and then try to install again (the new version doesn't get written to the saved object unless the install was successful).

neptunian added the Feature:EPM and Team:Fleet labels and self-assigned this on Apr 22, 2020

neptunian commented

Because installing is not atomic, and updating overwrites assets that have the same names, there is no way to guarantee that an installation will succeed before assets are installed or updated. "Rolling back" during an unsuccessful update would mean trying to reinstall the previous package. As long as we use the same names for the assets, we can't install two versions at the same time and only switch to the new one when it's successful. So I think our best option is to inform the user that the package installation failed and that they should try to install it again. This means they could be in a state with assets of two package versions until it's fixed.

For beta I am proposing:

  • Introduce a way to track the progress of an install/update and save it. This can exist as a field in the epm-package SO.
  • If there is an unexpected server error (500), set the status to error and have the UI show a message for the package saying that there was a problem installing/updating it and that the user needs to install again to fix it. Currently we tell them there was a problem installing the package with a warning toast (should it be an error toast?), but it doesn't persist, as the state is only in the UI. We need to decide where all this messaging should be. If it's only on the package detail page, they may not discover it. Perhaps a section on the dashboard page.
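
A minimal sketch of the status tracking proposed above, in TypeScript. The field names (`installStatus`, `installStartedAt`, `installError`) and both helper functions are hypothetical and are not the actual epm-package saved object schema. The sketch only illustrates two points from the proposal: the installed version advances only on success, and a failure is persisted instead of living only in the UI.

```typescript
// Hypothetical record shape; NOT the real epm-package saved object schema.
type InstallStatus = "installing" | "installed" | "install_failed";

interface PackageRecord {
  name: string;
  // Last version that installed successfully; undefined before the first success.
  version: string | undefined;
  installStatus: InstallStatus;
  installStartedAt: number; // epoch milliseconds
  installError: string | undefined;
}

type InstallResult =
  | { ok: true; version: string }
  | { ok: false; error: string };

// Mark the package as "installing" before any asset is touched.
// The previously installed version (if any) is carried over unchanged.
function beginInstall(name: string, now: number, prev?: PackageRecord): PackageRecord {
  return {
    name,
    version: prev?.version,
    installStatus: "installing",
    installStartedAt: now,
    installError: undefined,
  };
}

// Persist the outcome. The version only advances when the install succeeded.
function finishInstall(record: PackageRecord, result: InstallResult): PackageRecord {
  if (result.ok) {
    return { ...record, version: result.version, installStatus: "installed", installError: undefined };
  }
  return { ...record, installStatus: "install_failed", installError: result.error };
}
```

With a persisted `install_failed` status, any page (package detail or a dashboard section) could render the "please install again" message from the stored record rather than from transient UI state.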

Other scenarios

  • If Kibana crashes, there isn't a way of knowing that the package install/update failed, other than checking all the packages' statuses when Kibana starts again and seeing that one is still installing. The package could genuinely still be installing (they clicked install and refreshed Kibana), or the install could have been interrupted unexpectedly with no error handling. Perhaps we could differentiate between the two by checking the elapsed time, as a package should not take very long to install.
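
The time-based check described above could look roughly like this. It is a sketch under assumptions: the record shape, the `findStalledInstalls` helper, and the one-minute threshold are all hypothetical, and a real threshold would need to be chosen from observed install times.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface InstallingRecord {
  name: string;
  installStatus: "installing" | "installed" | "install_failed";
  installStartedAt: number; // epoch milliseconds
}

// Assumed upper bound for a healthy install; not a measured value.
const MAX_INSTALL_MS = 60 * 1000;

// On startup, return the names of packages that have been "installing" for
// longer than the threshold. These were most likely interrupted by a crash.
function findStalledInstalls(
  records: InstallingRecord[],
  now: number,
  maxInstallMs: number = MAX_INSTALL_MS
): string[] {
  return records
    .filter((r) => r.installStatus === "installing" && now - r.installStartedAt > maxInstallMs)
    .map((r) => r.name);
}
```

A package that is still within the threshold is left alone, which covers the "clicked install and refreshed Kibana" case.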

@ruflin @ph would like to hear your thoughts

ph commented Jun 26, 2020

I think we have discussed this before, but just to clarify the process here: when you try to update a package from v1 to v2, there are the following scenarios:

Scenario 1: Happy path:

  1. v1 assets are removed?
  2. v2 assets are installed (serially or in parallel)
  3. Everything is 💚

Scenario 2: Transient error (network), Kibana keeps working.

  1. v1 assets are removed.
  2. v2 assets are installing (oops, one of them failed).
  3. Set the state to error.
  4. Display an error to the user.

The problem is that after this the system is in an inconsistent state: dashboards are possibly missing, etc.

Scenario 3: Kibana hard crash during installation.

  1. v1 assets are removed.
  2. v2 assets are installing; Kibana crashes during the process.

The problem is that after this the system is in an inconsistent state: dashboards are possibly missing, etc.


I think we might need a little more than just an error state.

Now, I've discussed with you that it would be nice to have an atomic operation. I wonder, could we build one on top of this, based on the above state machine?

  1. Could each step have a rollback method (rollback dashboard, rollback template)? Either they take the "previous" templates or they keep the currently installed artifacts in memory.
  2. Having the rollback would solve the problem if Kibana keeps running.
  3. If Kibana crashes during the install, can we recover with the current state of the package?
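
The per-step rollback idea in the list above can be sketched as follows. This is an illustration only: `InstallStep`, `runWithRollback`, and the step names are hypothetical, and the steps are kept synchronous for brevity, whereas real install steps would be asynchronous calls to Elasticsearch and Kibana.

```typescript
// Hypothetical step interface, for illustration only.
interface InstallStep {
  name: string;
  apply(): void; // install or update one asset type; throws on failure
  rollback(): void; // restore whatever was there before apply() ran
}

interface RunResult {
  ok: boolean;
  failedStep?: string;
  rolledBack: string[]; // names of steps undone, in the order they were undone
}

// Apply the steps in order. If one fails, undo the completed steps in reverse.
function runWithRollback(steps: InstallStep[]): RunResult {
  const completed: InstallStep[] = [];
  for (const step of steps) {
    try {
      step.apply();
      completed.push(step);
    } catch {
      // The failed step itself is not rolled back here; it may have been
      // partially applied, which this sketch does not handle.
      const rolledBack: string[] = [];
      for (const done of completed.reverse()) {
        done.rollback();
        rolledBack.push(done.name);
      }
      return { ok: false, failedStep: step.name, rolledBack };
    }
  }
  return { ok: true, rolledBack: [] };
}
```

Note that this only covers the case where Kibana keeps running: the list of completed steps lives in memory, so a hard crash loses it unless the progress is also persisted. A rollback that itself throws is not handled either.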

neptunian commented

I don't think your scenarios cover an unexpected server error that isn't transient but also doesn't cause a crash (all errors in the install endpoint are caught, so it should not cause Kibana itself to crash).

To clarify, the assets are not removed before new ones are installed/updated. They are created or updated with a PUT. Since data is incoming, removing all the templates or ingest pipelines would be bad. When we were discussing deleting assets first, I was referring to the Kibana saved object assets (dashboards, visualizations, etc.).

> Could each step have a rollback method (rollback dashboard, rollback template)? Either they take the "previous" templates or they keep the currently installed artifacts in memory.

I thought we had discussed that we should not try to recover automatically but prompt the user to take some action to fix it.

I wouldn't have the previous assets in memory during an update. I would need to fetch the package from the registry and install them again. It sounds like you are thinking about some state management that keeps track of each step in the install process and undoes the changes. I think it might make sense to just attempt to reinstall the previous package if we're trying to get back to a working package, as it would be less complex and would be a similar process of updating assets from the previous version. Of course, this could also error for some unknown reason, and at that point we just need to let the user know about this error state.
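
A rough sketch of the "reinstall the previous package" fallback described in the paragraph above. The names `updateWithFallback` and `installVersion` are hypothetical; `installVersion` stands in for "fetch the package from the registry and install its assets" and is synchronous here only to keep the example short.

```typescript
type UpdateOutcome =
  | { status: "updated"; version: string }
  | { status: "reverted"; version: string; error: string }
  | { status: "error"; error: string };

// installVersion is a hypothetical stand-in for fetching a package version
// from the registry and installing its assets; it throws on failure.
function updateWithFallback(
  installVersion: (version: string) => void,
  previousVersion: string,
  newVersion: string
): UpdateOutcome {
  try {
    installVersion(newVersion);
    return { status: "updated", version: newVersion };
  } catch (updateErr) {
    try {
      // Roll back by reinstalling the previous version over the partial update.
      installVersion(previousVersion);
      return { status: "reverted", version: previousVersion, error: String(updateErr) };
    } catch (revertErr) {
      // Both attempts failed: surface an error state for the user to act on.
      return {
        status: "error",
        error: `update failed: ${String(updateErr)}; reinstall failed: ${String(revertErr)}`,
      };
    }
  }
}
```

The `error` outcome is the case where the user has to be told about the broken state, since nothing automatic is left to try.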

> If Kibana crashes during the install, can we recover with the current state of the package?

By state, do you mean where it left off in the installation process? No. There is currently no state management. If you mean whether or not it installed successfully, I tried to address that above:

> If Kibana crashes, there isn't a way of knowing that the package install/update failed, other than checking all the packages' statuses when Kibana starts again and seeing that one is still installing. The package could genuinely still be installing (they clicked install and refreshed Kibana), or the install could have been interrupted unexpectedly with no error handling. Perhaps we could differentiate between the two by checking the elapsed time, as a package should not take very long to install.

ruflin commented Jun 29, 2020

From my perspective there are 2 different ways things can go wrong:

  • Affects the data
  • Does not affect the data

For the part that affects the data (ingest-pipeline, mapping, template) we have a predefined order. I think this flow prevents bad things from happening. But if the upgrade was fully applied, we CAN NOT roll back, as otherwise we could have new shippers sending to an old ingest pipeline.

The second part is all the other assets. If a dashboard is not fully loaded, it is not great, but it can just be loaded again. I think we should optimise for rolling forward / reinstalling.

My main concerns are around leaving old assets behind and not providing good error messages to the user.

neptunian commented

> For the part that affects the data (ingest-pipeline, mapping, template) we have a predefined order. I think this flow prevents bad things from happening. But if the upgrade was fully applied, we CAN NOT roll back, as otherwise we could have new shippers sending to an old ingest pipeline.

Yes, I was thinking that if we did want a rollback solution, this particular order of events would have to be treated as one rollback step that has to occur in order, starting from the first step of creating the pipelines, which makes things a bit complicated and becomes similar to reinstalling.

> My main concerns are around leaving old assets behind and not providing good error messages to the user.

Most assets are going to be updated, so the assets left behind would be those for new datasets or Kibana assets that the new package added and that don't exist in the previous version. The new dataset assets would sit there unused, since the agent configuration would not be able to use them. The new Kibana assets would remain. Depending on how #65035 is handled, could we remove all the assets associated with this package, based on the package name we are going to associate with each asset?
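
To make "assets left behind" concrete, here is a small sketch that computes which assets a failed update may have introduced, assuming each asset can be identified by a type and an id. The `AssetRef` shape and the `findLeftoverAssets` helper are hypothetical and say nothing about how #65035 will actually associate assets with a package.

```typescript
// Hypothetical asset reference; the identification scheme is an assumption.
interface AssetRef {
  id: string;
  type: string;
}

// Assets that the attempted (new) version defines but the previous version
// did not. After a failed update these are the candidates for being left behind.
function findLeftoverAssets(previous: AssetRef[], attempted: AssetRef[]): AssetRef[] {
  return attempted.filter(
    (candidate) => !previous.some((p) => p.type === candidate.type && p.id === candidate.id)
  );
}
```

Assets present in both versions are not returned, since reinstalling the previous version would overwrite them anyway.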

ruflin commented Jun 30, 2020

In any case, I think we must always know all the assets that are installed, and only remove them from tracking if we really know we removed them, to ensure we always clean up after "us".
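
One way to read the tracking rule above is sketched below, under assumptions: `AssetRef`, `removeTrackedAssets`, and the `deleteAsset` callback are hypothetical names, and the callback is assumed to return `true` only when the deletion is confirmed.

```typescript
// Hypothetical asset reference kept in the package's tracking list.
interface AssetRef {
  id: string;
  type: string;
}

// Try to delete every tracked asset. An asset is dropped from tracking only
// when its deletion is confirmed; anything else stays tracked for a later retry.
function removeTrackedAssets(
  tracked: AssetRef[],
  deleteAsset: (ref: AssetRef) => boolean
): AssetRef[] {
  const stillTracked: AssetRef[] = [];
  for (const ref of tracked) {
    let removed = false;
    try {
      removed = deleteAsset(ref);
    } catch {
      // A thrown error means we cannot be sure the asset is gone.
      removed = false;
    }
    if (!removed) {
      stillTracked.push(ref);
    }
  }
  return stillTracked;
}
```

The returned list is what would be written back as the package's tracked assets, so a failed or uncertain deletion never causes an asset to be forgotten.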
