-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create tests to ensure that Elastic Agent's upgrade are working as expected #2176
Comments
I think there is an issue with writing this test. Currently the built version of Elastic Agent stores that hash of the running elastic-agent, an upgrade within the same repository would result in it trying to upgrade to the same hash value. That will not work. We need a way to have upgrade use a different hash value, but possibly it is the same build, hmm... |
Reopening this issue as #2618 only covers the first test scenario mentioned in this issue:
|
There is a test listed in this issue's description that goes like so:
I've been working on this test for the past couple of weeks and have found a few bugs in the process. Fixing these bugs will unblock this test. However, speaking with @pierrehilbert today, he clarified that the intent of the test is to ensure that we have a graceful rollback process in place, not so much how the upgraded Agent fails to cause this rollback. As such, I've left this test as is but added another test to the list that is similar:
Both these tests will ensure that we have a graceful rollback process. The difference in them is in the setup of the test scenario - in the first one, the rollback should be initiated due to the Agent process crashing whereas in the second one, the rollback should be initiated due to the Agent process remaining up but reporting as unhealthy. |
@pierrehilbert @cmacknz @jlind23 Not sure who added this item in the list of tests so pinging all of you. Thanks! I'd like to get some clarity on what we mean by "current version" and "interrupted" exactly. For example, let's say the user requests an upgrade, and while the Agent is downloading the new artifact, we kill the Agent process. This would've been the old (pre-upgrade) version of the Agent that we killed. In this case, do we want the test to verify that the Agent is restarted reliably with the old version or (eventually) the new version? Similarly, let's say the new version (post-upgrade) of the Agent is running. But the Upgrade Watcher is also running. What does it mean to interrupt the upgrade in this scenario? Should the test try killing the Upgrade Watcher process and verify that it eventually restarted? Or should the test try killing the new version of the Agent (only once, otherwise the Upgrade Watcher will roll back the upgrade thinking the new Agent is unhealthy — and we already have an integration test for this scenario) and verify that the new version of the Agent is running again and does not get rolled back by the Upgrade Watcher? Perhaps we want to test all of these different scenarios? In that case, I think we should replace this one test in the list with multiple ones, one for each specific scenario. |
I'm not the one who wrote it, but I believe that in such case we want the Agent to restart reliably with the old version as we don't know if what state the new one ended up.
I lean towards that option, we should probably list all specific scenarii and prioritise the work. |
Ideally we test all three scenarios to know the result, but the most likely reason for the upgrade process to be interrupted is the host system shutting down or hibernating mid-upgrade. This means we are unlikely to see the watcher and new version of the agent be interrupted individually, it is more likely they will both be terminated together. The overarching goal with the upgrade process is eventual consistency. If the user instructed the agent to upgrade it should perform the upgrade, eventually, regardless of any disruption to any process or system involved in the upgrade. Note that "perform the upgrade" means progressing through every stage of the upgrade process, and it is valid for the end state to be a successful upgrade to a new version or roll back to the previous version if the new version does not work as expected. This starts to overlap with https://github.com/elastic/ingest-dev/issues/1889 (fault tolerant action processing scenarios), I suspect to get this to work properly we will possibly need to make changes to Fleet as well as the agent since it will be directly tied to how and when we acknowledge actions and where they are persisted. I would prefer we prioritize making the upgrade process observable before we do this type of testing, both because doing so will make these problems easier to debug and because making this work correctly under all conditions may take significant time. |
@ycombinator is this issue still required even with all the work we achieved on upgrades lately? |
@jlind23 @ycombinator The integration testing framework covers all of this, even the checkboxes that where not checked. I would say this one is done. |
Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:
Could you please point me to the tests that cover each of these scenarios? |
This one I do believe is missing. Was thinking we had it, but was unable to find it.
See above. |
Actually, this test was written to cover the following scenario:
We need to write another test that interrupts the upgrade process itself, e.g. restarting the Agent while downloading or right before switching the symlink and ensuring that the pre-upgrade Agent version is the one that comes up. I'll clarify in the task list item to remove the ambiguity. |
Actually, this test was written to cover the following scenario:
We need to write another test that deliberately builds a corrupted Agent package, tries to upgrade to it, and ensures that the upgrade is rolled back to the pre-upgrade Agent version. |
Closing in favor of individual issues for pending tests:
|
We need to create a basic integration test in the Agent repository (because we are testing Agent capability) to ensure that the upgrade are working as expected:
TestFleetManagedUpgrade
test + cleanup #2809This integration test should cover upgrade from the Agent itself and not test upgrade triggered by Fleet Server.
Upgrade versions should be configurable but by default it should be current_version to next_version.
This test is crucial: upgrade is a really impacting feature and we faced some issue in the past.
For reference the majority of the upgrade process is defined in https://github.com/elastic/elastic-agent/tree/main/internal/pkg/agent/application/upgrade.
Blocked By:
InstallChecker
incorrectly detects systemd Agent service as not installed #3188systemd
prematurely kills Upgrade Watcher when upgraded Agent fails to start #3123The text was updated successfully, but these errors were encountered: