Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent downgrades from 8.x to 7.x #78586

Merged

Conversation

DaveCTurner
Copy link
Contributor

Users sometimes attempt to downgrade a node in place, but downgrades are
totally untested and unsupported and generally don't work. We protect
against this by recording the node version in the metadata and refusing
to start if we encounter metadata written by a future version.

However in 8.0 (#42489) we changed the directory layout so that a 7.x
node won't find the upgraded metadata, or indeed any other data, and
will proceed as if it's a fresh node. That's almost certainly not what
the user wants, so with this commit we create a file at
${path.data}/nodes at each startup, preventing an older node from
starting.

Closes #52414

Users sometimes attempt to downgrade a node in place, but downgrades are
totally untested and unsupported and generally don't work. We protect
against this by recording the node version in the metadata and refusing
to start if we encounter metadata written by a future version.

However in 8.0 (elastic#42489) we changed the directory layout so that a 7.x
node won't find the upgraded metadata, or indeed any other data, and
will proceed as if it's a fresh node. That's almost certainly not what
the user wants, so with this commit we create a file at
`${path.data}/nodes` at each startup, preventing an older node from
starting.

Closes elastic#52414
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 labels Oct 2, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 2, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Contributor Author

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening this for discussion, but won't merge it before the relevant bits of #78525 are done since it'll just add more conflicts.

This approach is super-simple but doesn't generate the nicest message on an attempted downgrade:

[2021-10-02T14:48:42,219][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-0] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: FileSystemException[/Users/davidturner/src/elasticsearch-master/elasticsearch-7.15.0/data-0/nodes/0: Not a directory];
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:171) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:158) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:75) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:114) ~[elasticsearch-cli-7.15.0.jar:7.15.0]
	at org.elasticsearch.cli.Command.main(Command.java:79) ~[elasticsearch-cli-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:123) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:81) ~[elasticsearch-7.15.0.jar:7.15.0]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
	at org.elasticsearch.node.Node.<init>(Node.java:825) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.node.Node.<init>(Node.java:288) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:219) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:219) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:399) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:167) ~[elasticsearch-7.15.0.jar:7.15.0]
	... 6 more
Caused by: java.nio.file.FileSystemException: /Users/davidturner/src/elasticsearch-master/elasticsearch-7.15.0/data-0/nodes/0: Not a directory
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:396) ~[?:?]
	at java.nio.file.Files.createDirectory(Files.java:694) ~[?:?]
	at java.nio.file.Files.createAndCheckIsDirectory(Files.java:801) ~[?:?]
	at java.nio.file.Files.createDirectories(Files.java:787) ~[?:?]
	at org.elasticsearch.env.NodeEnvironment.lambda$new$0(NodeEnvironment.java:265) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:202) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:262) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.node.Node.<init>(Node.java:383) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.node.Node.<init>(Node.java:288) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:219) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:219) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:399) ~[elasticsearch-7.15.0.jar:7.15.0]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:167) ~[elasticsearch-7.15.0.jar:7.15.0]
	... 6 more

I also experimented with leaving something that looks like node metadata under ${path.data}/nodes/0/_state which would yield a message indicating more clearly that the problem was a downgrade, but this messes with the logic that detects whether an upgrade is needed and I'd rather keep that logic as simple as possible.

It also doesn't address the question of what to do if we reorganise the data path and then discover that the cluster metadata is somehow invalid and requires a downgrade to fix it. A downgrade in that situation is technically ok since we fail before writing the cluster metadata in a new version, but won't work since we reorganise the data path before loading the cluster metadata for the first time.

Not being able to read the older metadata is clearly a bug but I don't think we've seen any such bugs for quite a while, and it would be much messier to reorganise the data path later on in node startup, so IMO we should just accept this. If we do encounter a bug that requires a downgrade then the user can recover by manually moving the data back.

Also NB we delete the nodes directory and then create the file in its place; if we crashed in between those two actions then a downgrade would not be blocked and wouldn't find any data. IMO we should also just accept this.

/**
* Converts an array of Strings to an array of Paths, adding an additional child if specified
*/
private Path[] stringsToPaths(String[] strings, String additional) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, but unused.

@@ -408,12 +419,11 @@ private static boolean upgradeLegacyNodeFolders(Logger logger, Settings settings
IOUtils.fsync(nodePath.path, true);
});

// now do the actual upgrade. start by upgrading the node metadata file before moving anything, since a downgrade in an
// intermediate state would be pretty disastrous
loadNodeMetadata(settings, logger, legacyNodeLock.getNodePath());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually do anything any more, the node metadata is in the persisted cluster state which doesn't get written until much later.

@DaveCTurner
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

// versions 7.x and earlier put their data under ${path.data}/nodes/; leave a file at that location to prevent downgrades
final Path legacyNodesPath = environment.dataFile().resolve("nodes");
if (Files.isRegularFile(legacyNodesPath) == false) {
final String content = "written by Elasticsearch v" + Version.CURRENT +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should follow-up with a 7.16 change to log the first 256 bytes of the file if nodes is a file and not a directory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh yeah I suppose we could at least help 7.16 users here :)

@DaveCTurner DaveCTurner merged commit 02b7b17 into elastic:master Oct 4, 2021
@DaveCTurner DaveCTurner deleted the 2021-10-02-prevent-downgrade-to-7x branch October 4, 2021 18:29
@DaveCTurner
Copy link
Contributor Author

Bah sorry I forgot that I was holding off on merging this until the relevant bits of #78525 are done. I'll revert and reopen.

DaveCTurner added a commit that referenced this pull request Oct 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New directory layout does not play nicely with upgrades/downgrades
4 participants