Recovering from 8.0 upgrade failure due to unmigrated 6.x indices #81326
Pinging @elastic/es-core-infra (Team:Core/Infra)
Yes, I agree that this should be a blocker. I believe our best option is for 8.x installations of Elasticsearch to check for 6.x indices as early as possible, before making any irreversible changes.
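A minimal sketch of that ordering, with hypothetical names (this is not the actual Elasticsearch startup code): the index-version scan runs before the keystore rewrite, so a failure leaves the node's on-disk state untouched.

```python
# Hypothetical pre-flight check: validate on-disk index versions before any
# irreversible change such as rewriting the keystore to the 8.x format.
# All names and structures here are illustrative, not Elasticsearch's API.

CURRENT_MAJOR = 8

def check_index_compatibility(index_created_majors):
    """Abort startup if any index was created before the previous major."""
    too_old = [name for name, major in index_created_majors.items()
               if major < CURRENT_MAJOR - 1]
    if too_old:
        raise RuntimeError(
            f"unmigrated indices created before {CURRENT_MAJOR - 1}.0: {too_old}")

def start_node(index_created_majors, keystore):
    check_index_compatibility(index_created_majors)  # fail fast: nothing modified yet
    keystore["format_major"] = CURRENT_MAJOR         # the irreversible step
    return keystore
```

With this ordering, a node holding a 6.x index refuses to start and its keystore is left in the 7.x format, so a downgrade remains possible.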
Agreed that this should be a blocker for 8.0. The initial ticket only discusses system indices, but we don't handle system indices in any special way at this stage of startup - I would expect to see the same thing if we tried to upgrade a cluster with un-upgraded regular (non-system) indices as well. Have we tried that? Just trying to pin down the scope of this issue.
I looked at the code. The keystore can be upgraded from the command line. So it seems we need to delay upgrading the keystore, but... how long? Do we have to wait until the node has loaded whatever index information it has on disk? Or do we need to wait until the node is part of a healthy cluster? The difficulty from the security perspective is that we are trying to keep all the code that needs the keystore password in one place, so that we can avoid holding it in memory for a long time.
I'm not sure we can delay the rewrite of the keystore - as Ryan notes here, we want to do this before installing the security manager because -- apart from at startup -- we don't want anything to be able to write to it. However, I don't think that's a big deal; users can re-create the keystore from scratch if needed with the tools that they have today. The big problem is that we effectively run
EDIT: This approach has been superseded. We talked about this in today's Core/Infra team meeting. At a high level, there are three potential approaches in 8.x:
(1) and (3) would involve massive changes to the codebase and an extremely large risk of serious bugs, so by process of elimination we are left with (2). We expect this issue to be rare: to encounter it, a user would have to have a 7.x cluster containing indices created in 6.x, have ignored every warning in our documentation and deprecation endpoints, and have upgraded in specific circumstances. For example, in a large cluster with good replication and a rolling upgrade strategy, one node might get into a bad state, but it can be discarded and recreated without much trouble. However, in a smaller cluster, or a cluster that is upgraded with a full restart, discarding and recreating may not be an option.

The most difficult part of solving this problem will probably be creating "n-2" upgrade tests, which create a 6.x cluster and add test data, then upgrade to 7.x, and finally test different upgrade scenarios to 8.x. Right now we only test upgrades going back one version. A proposed task list:
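The compatibility rule driving those test scenarios can be modeled in a few lines. The helper below is a toy illustration, not Elasticsearch code, under the assumption that a node of major M can only open indices created in major M-1 or later, and that reindexing rewrites an index at the current major.

```python
# Toy model of the "n-2" upgrade scenarios described above. Purely
# illustrative: real version handling in Elasticsearch is more involved.

def can_open(node_major, index_created_major):
    # Assumption: a node reads indices created in its own or the previous major.
    return index_created_major >= node_major - 1

def upgrade_path(path, index_created_major, reindex_at=None):
    """Walk a cluster through the majors in `path`, optionally reindexing."""
    for major in path:
        if not can_open(major, index_created_major):
            return f"startup fails at {major}.x"
        if major == reindex_at:
            index_created_major = major  # reindex rewrites the index
    return "ok"
```

For example, `upgrade_path([7, 8], 6)` fails at 8.x, while `upgrade_path([7, 8], 6, reindex_at=7)` succeeds - the latter is the migration step that Upgrade Assistant is supposed to drive before the jump to 8.x.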
#81865 will begin to enforce that every upgrade to 8.x goes through 7.17.x. This should make the "unmigrated indices" failure even rarer. It could happen in two cases:
It is possible that we could now address these failures by building the rollback logic into 7.17.x, so that a failed upgrade to 8.x could always be addressed by installing 7.17.x over whatever version failed to upgrade.
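The gate that #81865 introduces could be modeled roughly as follows. This is a sketch with an invented function name; the real check lives in Elasticsearch's startup and join validation.

```python
def upgrade_allowed(from_version, to_major=8):
    """Assumed rule: 8.x only accepts data written by 7.17.x or a later 7.x."""
    from_major, from_minor = from_version
    if from_major == to_major:
        return True  # same-major restart is always fine
    return from_major == to_major - 1 and from_minor >= 17
```

Under this rule an upgrade from 7.16 (the case Lee found) is rejected up front rather than failing partway through startup.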
We have decided on a different approach to this problem, which will also solve #81865. There is a PR for part of it here: #82321
The Stack Management team removed the system index migration feature from Upgrade Assistant in 7.16 (elastic/kibana#119798). During upgrade testing, @LeeDr discovered that users who upgrade from 6.8 to 7.16 will use Upgrade Assistant to prepare their deployment for upgrade, skipping the system indices migration step. It's reasonable to assume that some portion of these users might believe that they can upgrade to 8.0 at this point, unaware that we expect them to upgrade to 7.17 first.
After upgrading to 8.0, Elasticsearch will fail to start for these users because of the unmigrated 6.8 system indices. This startup failure occurs after ES has updated the keystore to be 8.x compatible. A user who attempts to fix this problem by downgrading to 7.16 will be blocked from doing so due to keystore incompatibility with 7.x (this is my assumption and needs verification). The user is now stuck, unable to complete their upgrade.
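The suspected downgrade trap can be illustrated with a small sketch. The function name and format-version numbers are made up; the real keystore uses its own internal format versioning.

```python
# Illustrative model of why downgrading is blocked: an older node refuses a
# keystore written in a newer format than it understands.

def load_keystore(node_format_version, keystore_format_version):
    if keystore_format_version > node_format_version:
        raise RuntimeError(
            f"keystore format v{keystore_format_version} is newer than this "
            f"node supports (v{node_format_version}); downgrade is blocked")
    return "loaded"
```

This is why the ordering matters: if the keystore has already been rewritten when startup fails on the unmigrated 6.8 indices, the 7.16 binaries can no longer open it.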
We have many users still on 6.x and prior, increasing the risk of users encountering this scenario. @DaveCTurner suggested we file this issue as an 8.0 blocker.
Some solutions suggested by David:
CC @elastic/kibana-stack-management