[EPM] handle failure cases in the package update process #64213
Comments
Pinging @elastic/ingest-management (Feature:EPM)
Because installing is not atomic and updating is a process of overwriting assets with the same names, there is no way to guarantee the installation will succeed before assets are installed/updated. "Rolling back" during an unsuccessful update would mean trying to reinstall the previous package. As long as we use the same names for the assets, we can't install two versions at the same time and only switch to the new one when it's successful. So I think our best option is to inform the user that the package installation failed and ask them to try installing it again. This means they could be in a state with assets of two package versions until it's fixed. For beta I am proposing:
Other scenarios
|
I think we have discussed this before, but just to clarify the process here: when you try to update a package from v1 to v2, there are the following scenarios.
Scenario 1: Happy path:
Scenario 2: Transient error (networks) Kibana works.
The problem is that after this the system is in an inconsistent state: dashboards are possibly missing, etc.
Scenario 3: Kibana hard crash during installation.
The problem is that after this the system is in an inconsistent state: dashboards are possibly missing, etc. I think we might have a little more than just an error state. Now, I've discussed with you that it would be nice to have an atomic operation. I wonder whether we could build one on top of it, based on the above state machine?
|
I don't think your scenarios cover an unexpected server error that isn't transient but also doesn't cause a crash (all errors in the install endpoint are caught, so it should not cause Kibana itself to crash). To clarify, the assets are not removed before new ones are installed/updated. They are created or updated with a PUT. Since data is incoming, removing all the templates or ingest pipelines would be bad. When we were discussing deleting assets first, I was referring to the Kibana saved object assets (dashboards, visualizations, etc.).
I thought we had agreed that we should not try to recover automatically but instead prompt the user to take some action to fix it. I wouldn't have the previous assets in memory during an update; I would need to fetch the package from the registry and install them again. It sounds like you are thinking about some state management that keeps track of each step in the install process and undoes the changes. If we're trying to get back to a working package, I think it might make sense to just attempt to reinstall the previous package, as it would be less complex and would be a similar process to updating assets from the previous version. Of course, this could also error for some unknown reason, and at that point we just need to let the user know about this error state.
By state, do you mean where it left off in the installation process? No. There is currently no state management. If you mean whether or not it installed successfully, I tried to address that above:
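The "just reinstall the previous package" fallback described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `InstallFn`, `InstallOutcome`, and `updateWithFallback` are hypothetical names, not the EPM implementation, and the install function is injected so the example stays self-contained.

```typescript
// Hypothetical sketch; none of these names come from the actual EPM code.
type InstallOutcome =
  | { status: "installed"; version: string }
  | { status: "reverted"; version: string; error: string }
  | { status: "failed"; error: string; revertError: string };

// Stand-in for "fetch this version from the registry and create or
// overwrite its assets". Assumed to throw (reject) on any failure.
type InstallFn = (pkgName: string, version: string) => Promise<void>;

const toMessage = (err: unknown): string =>
  err instanceof Error ? err.message : String(err);

async function updateWithFallback(
  installPackage: InstallFn,
  pkgName: string,
  previousVersion: string,
  newVersion: string
): Promise<InstallOutcome> {
  try {
    await installPackage(pkgName, newVersion);
    return { status: "installed", version: newVersion };
  } catch (err) {
    const error = toMessage(err);
    try {
      // No undo log is kept: the previous version is simply installed
      // again over whatever the partial update left behind.
      await installPackage(pkgName, previousVersion);
      return { status: "reverted", version: previousVersion, error };
    } catch (revertErr) {
      // Both attempts failed; all that is left is to report the error
      // state so the user can take action.
      return { status: "failed", error, revertError: toMessage(revertErr) };
    }
  }
}
```

The three outcomes map onto what the UI would need to tell the user: the update worked, the update failed but the previous version was restored, or the package is in an error state that needs manual attention.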
|
From my perspective there are 2 different ways things can go wrong:
For the part that affects the data (ingest-pipeline, mapping, template) we have a predefined order. I think this flow prevents bad things from happening. But if the upgrade was fully applied, we CANNOT roll back, as otherwise we could have new shippers sending to an old ingest pipeline. The second part is all the other assets. If a dashboard is not fully loaded, it is not great, but it can just be loaded again. I think we should optimise for rolling forward / reinstalling. My main concerns are around leaving old assets behind and not providing good error messages to the user.
Yes, I was thinking that if we did want a rollback solution, this particular order of events would have to be treated as one rollback step that has to occur in order, starting from the first step of creating the pipelines. That makes things a bit complicated and ends up being similar to reinstalling.
Most assets are going to be updated, so the only assets left behind would be ones the new package added: new datasets or Kibana assets that don't exist in the previous version. The new dataset assets would sit there unused, since the agent configuration would not be able to use them. The new Kibana assets would remain. Depending on how #65035 is handled, could we remove all the assets associated with this package, based on the package name we are going to associate with each asset?
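One way to picture "predefined order plus roll forward" is a runner that applies the steps in a fixed order and stops at the first failure, reporting where it stopped. This is a minimal sketch with hypothetical names (`InstallStep`, `runStepsInOrder`); the real step order is whatever the package manager defines and is simply passed in here.

```typescript
// Hypothetical sketch; not the actual EPM install code.
interface InstallStep {
  name: string;
  // Assumed to be idempotent (create or overwrite), so it is safe to rerun.
  apply: () => Promise<void>;
}

interface RunResult {
  completed: string[];
  failedStep?: string;
  error?: string;
}

async function runStepsInOrder(steps: InstallStep[]): Promise<RunResult> {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      await step.apply();
      completed.push(step.name);
    } catch (err) {
      // Stop immediately: later steps may depend on this one.
      return {
        completed,
        failedStep: step.name,
        error: err instanceof Error ? err.message : String(err),
      };
    }
  }
  return { completed };
}
```

Because every step is assumed to overwrite rather than append, recovering from a failure means running the same list again (rolling forward), and `failedStep` gives the user-facing error message something concrete to point at.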
In any case, I think we must always know all the assets that are installed, and only remove an asset from tracking once we really know we removed it, to ensure we always clean up after "us".
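The tracking rule above (stop tracking an asset only once its removal is confirmed) can be sketched like this. The names `AssetRef`, `DeleteFn`, and `removeTrackedAssets` are assumptions for illustration, and the delete function is injected rather than tied to any real Kibana or Elasticsearch client.

```typescript
// Hypothetical sketch; not the actual EPM asset bookkeeping.
interface AssetRef {
  id: string;
  type: string;
}

// Assumed to reject if the asset could not be deleted.
type DeleteFn = (asset: AssetRef) => Promise<void>;

async function removeTrackedAssets(
  tracked: AssetRef[],
  deleteAsset: DeleteFn
): Promise<{ removed: AssetRef[]; stillTracked: AssetRef[] }> {
  const removed: AssetRef[] = [];
  const stillTracked: AssetRef[] = [];
  for (const asset of tracked) {
    try {
      await deleteAsset(asset);
      removed.push(asset);
    } catch {
      // Deletion is unconfirmed, so the asset stays in the tracked list
      // and can be cleaned up on a later attempt.
      stillTracked.push(asset);
    }
  }
  return { removed, stillTracked };
}
```

Persisting `stillTracked` back as the package's asset list would keep the record honest: anything that might still exist remains listed.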
As part of #59910,
If a package fails to install for whatever reason, there needs to be a "rollback" as the install will be partial and they could be in a state where they have different assets of both versions. @ruflin said that for now, in a blog post, they’ll have to uninstall the old package and then try to install again (the new version doesn't get written to the saved object unless it was successful)