Startup check for security implicit behavior change #76879
Pinging @elastic/es-security (Team:Security)
cc @colings86
LGTM
@albertzaharovits, @DaveCTurner can you folks please take a look?
server/src/main/java/org/elasticsearch/env/NodeEnvironment.java
I would wait for David to give his advice.
Besides that, I would expect at least some upgrade QA tests to fail, no? It's odd that they don't.
The check logic looks good but I think it'll always get skipped when running a real node - see inline comments.
server/src/main/java/org/elasticsearch/env/NodeEnvironment.java
x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/Security.java
        }
    }

    public boolean alwaysEnforce() {
I don't love this always-enabled bootstrap check, but this is currently the only way for us to run a check on node startup that has a view (albeit limited) of the restored cluster state (via the BootstrapContext).
I'd forgotten that we expose the metadata read from disk like this, but I think this is fine - at least it's no worse than any of the other places that make decisions based on the contents of the on-disk cluster state despite the fact that this could be stale or even uncommitted.
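To make the trade-off discussed here concrete, the following is a minimal, self-contained sketch of how an always-enforced check differs from a regular one. The names `AlwaysEnforceSketch`, `Check`, `OnDiskState` and `run` are hypothetical stand-ins invented for illustration; they do not reproduce the Elasticsearch `BootstrapCheck` API. The sketch only assumes that regular checks are skipped unless enforcement is switched on, while a check whose `alwaysEnforce()` returns `true` runs regardless.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the real Elasticsearch types.
class AlwaysEnforceSketch {

    /** Minimal view of what was read from disk at startup (it may be stale). */
    static final class OnDiskState {
        final int previousMajorVersion;

        OnDiskState(int previousMajorVersion) {
            this.previousMajorVersion = previousMajorVersion;
        }
    }

    interface Check {
        /** Returns null on success, or a failure message. */
        String check(OnDiskState state);

        /** Regular checks run only when enforcement is on; override to always run. */
        default boolean alwaysEnforce() {
            return false;
        }
    }

    /** Runs every applicable check and collects the failure messages. */
    static List<String> run(List<Check> checks, OnDiskState state, boolean enforce) {
        List<String> failures = new ArrayList<>();
        for (Check c : checks) {
            if (enforce || c.alwaysEnforce()) {
                String failure = c.check(state);
                if (failure != null) {
                    failures.add(failure);
                }
            }
        }
        return failures;
    }

    // A regular check that always fails, so it is easy to see when it was skipped.
    static final Check REGULAR = state -> "regular check failed";

    // A check that inspects the on-disk state and is never skipped.
    static final Check UPGRADE_CHECK = new Check() {
        @Override
        public String check(OnDiskState state) {
            return state.previousMajorVersion < 8
                ? "upgraded from " + state.previousMajorVersion + ".x"
                : null;
        }

        @Override
        public boolean alwaysEnforce() {
            return true;
        }
    };

    public static void main(String[] args) {
        List<Check> checks = Arrays.asList(REGULAR, UPGRADE_CHECK);
        // Enforcement off: only the always-enforced check reports a failure.
        System.out.println(run(checks, new OnDiskState(7), false));
        // Enforcement on: both checks report a failure.
        System.out.println(run(checks, new OnDiskState(7), true));
    }
}
```

The point of the design is that the upgrade check cannot be silently skipped on a development-style node, which is exactly the kind of deployment the PR description is worried about.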
The overall approach LGTM, but I have a few minor comments
@@ -68,10 +76,14 @@ public Version nodeVersion() {
        return nodeVersion;
    }

    public Version previousNodeVersion() {
        return previousNodeVersion;
    }
I know I'm fighting against the existing conventions of this class, but is it possible to get some sort of javadoc here?
What does `previous` mean exactly? I think it's "last time the node started" (or more accurately "the version of the metadata that was read from disk") ... but I'm sure there could be all sorts of nuance in rolling upgrades, master elections, etc, and I'd like to be able to consult javadocs so I can know how to reason about that.
added in 80f893f, @DaveCTurner can keep me honest or suggest enhancements
Docs LGTM 👍
    + "."
    + Version.CURRENT.minor
    + "/security-minimal-setup.html to enable security, or explicitly disable security by "
    + "setting [xpack.security.enabled] to \"false\" in elasticsearch.yml before restarting the node"
I'm coming in late, so maybe this has been discussed, but this message feels a bit lacking.
People who get this message don't necessarily realise why they're getting it now, and why it's a fatal error.
I think we can come up with something a bit more helpful that tells them that we've detected that this node was previously running in a configuration that did not have security, and the behaviour has changed so they need to explicitly opt in to the new or old behaviour.
I'm happy to help work on that message if needed.
I will take another attempt at it, I'll ask @lockewritesdocs to weigh-in on the wording too
I've rephrased it, let me know what you think. Open to suggestions
public class SecurityImplicitBehaviorBootstrapCheckTests extends AbstractBootstrapCheckTestCase {

    public void testFailureUpgradeFrom7xWithImplicitSecuritySettings() throws Exception {
I think we need 2 methods:
testFailureUpgradeFrom7xWithImplicitSecuritySettingsOnTrialOrBasic
testSuccessfulUpgradeFrom7xWithImplicitSecuritySettingsOnGoldPlus
The 2nd one seems to be missing.
makes sense, will add now
added in 2acbd38
@DaveCTurner would you be able to take another look please? 🙏
LGTM
I left one important comment about not writing the new field to disk and one other comment. Otherwise LGTM.
    public NodeMetadata(final String nodeId, final Version nodeVersion) {
    private final Version previousNodeVersion;

    public NodeMetadata(final String nodeId, final Version nodeVersion, final Version previousNodeVersion) {
Nit: could we make this private, and construct the instances needed in `SecurityImplicitBehaviorBootstrapCheckTests` by calling `upgradeToCurrentVersion()` instead?
@@ -125,6 +151,7 @@ public NodeMetadata build() {
        objectParser = new ObjectParser<>("node_meta_data", ignoreUnknownFields, Builder::new);
        objectParser.declareString(Builder::setNodeId, new ParseField(NODE_ID_KEY));
        objectParser.declareInt(Builder::setNodeVersionId, new ParseField(NODE_VERSION_KEY));
        objectParser.declareInt(Builder::setPreviousNodeVersionId, new ParseField(PREVIOUS_NODE_VERSION_KEY));
Do we need to write this field to disk? I think we just overwrite it before ever using it.
You're right David, we don't need to. I'm amending
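A small sketch may help show why the previous version does not have to be persisted. It assumes, as the review comments suggest, that the value is derived in memory from whatever version was found on disk at the moment the metadata is upgraded. The class `NodeMetadataSketch`, its plain `int` versions, and the `readFromDisk` and `toDiskFormat` helpers are hypothetical simplifications, not the real `NodeMetadata` code; the version ids are illustrative values.

```java
// Hypothetical simplification for illustration; not the real NodeMetadata class.
final class NodeMetadataSketch {
    private final String nodeId;
    private final int nodeVersion;         // version stored on disk, or the current one after upgrade
    private final int previousNodeVersion; // in-memory only: the version found on disk at startup

    // Private so that callers go through readFromDisk() and upgradeToCurrentVersion().
    private NodeMetadataSketch(String nodeId, int nodeVersion, int previousNodeVersion) {
        this.nodeId = nodeId;
        this.nodeVersion = nodeVersion;
        this.previousNodeVersion = previousNodeVersion;
    }

    /** Built from the on-disk file, which (in this sketch) stores only the id and node version. */
    static NodeMetadataSketch readFromDisk(String nodeId, int nodeVersionOnDisk) {
        return new NodeMetadataSketch(nodeId, nodeVersionOnDisk, nodeVersionOnDisk);
    }

    /** Remembers what was on disk, then stamps the running version. */
    NodeMetadataSketch upgradeToCurrentVersion(int currentVersion) {
        return new NodeMetadataSketch(nodeId, currentVersion, nodeVersion);
    }

    int nodeVersion() {
        return nodeVersion;
    }

    int previousNodeVersion() {
        return previousNodeVersion;
    }

    /** What would be written back: previousNodeVersion is deliberately absent. */
    String toDiskFormat() {
        return "{\"node_id\":\"" + nodeId + "\",\"node_version\":" + nodeVersion + "}";
    }

    public static void main(String[] args) {
        NodeMetadataSketch onDisk = readFromDisk("node-1", 7170099); // illustrative 7.x id
        NodeMetadataSketch upgraded = onDisk.upgradeToCurrentVersion(8000099); // illustrative 8.x id
        System.out.println(upgraded.previousNodeVersion());
        System.out.println(upgraded.toDiskFormat());
    }
}
```

Because the previous version is always recomputed from the stored node version before anything reads it, writing it to disk would add a field that is overwritten on every startup.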
Thanks @DaveCTurner, I've addressed your comments
@elasticmachine update branch
Thanks @jkakavas, LGTM
👍 with the latest changes
In the security ON by default project, we introduced a breaking change for the `xpack.security.enabled` setting. While we do expose the necessary deprecation warnings and release notes, there might still be a case where a deployment that is

- on a basic or trial license with `xpack.security.enabled` not set
- a single-node cluster or a multi-node, single-host cluster

gets upgraded to 8.x in place without using the Upgrade Assistant or consulting the release notes. In this case, we elect to stop the node from starting via a newly introduced BootstrapCheck, so that we can notify the user that the implicit behavior for security has changed. If we don't do that, the upgrade can seemingly succeed but the user will have no way to interact with the upgraded cluster, as security is enabled and they have no credentials.

This is a best-effort check in the sense that:

- LicenseState might not be correct that early in the node lifecycle, so we might not be able to know if this node was on basic/trial
- A grow-and-shrink upgrade would bypass this check, since new nodes start with empty state on disk
- A user might change the configuration and remove the explicit `xpack.security.enabled` configuration _while_ upgrading the node to 8.x
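
The conditions in this description can be summarised as a small decision function. The sketch below is a hedged illustration, not the code in this PR: `ImplicitSecurityCheckSketch`, the `License` enum, and the failure text are all hypothetical, and the real check reads its inputs from the node metadata, the license state, and the node settings rather than from plain parameters.

```java
import java.util.Optional;

// Hypothetical illustration of the decision described above; not the PR's implementation.
class ImplicitSecurityCheckSketch {

    enum License { BASIC, TRIAL, GOLD, PLATINUM, ENTERPRISE }

    /**
     * Returns a failure message when the node should refuse to start, or empty otherwise.
     *
     * @param previousMajor          major version found on disk before this startup
     * @param license                license level as known at startup (may be inaccurate this early)
     * @param securityEnabledSetting value of xpack.security.enabled, or null when it is not set
     */
    static Optional<String> check(int previousMajor, License license, Boolean securityEnabledSetting) {
        boolean upgradedFrom7x = previousMajor == 7;
        boolean settingIsImplicit = securityEnabledSetting == null;
        boolean basicOrTrial = license == License.BASIC || license == License.TRIAL;

        if (upgradedFrom7x && settingIsImplicit && basicOrTrial) {
            // Placeholder wording, not the message used in the PR.
            return Optional.of("implicit security behavior changed; set xpack.security.enabled explicitly");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(check(7, License.BASIC, null).isPresent());          // fails the check
        System.out.println(check(7, License.GOLD, null).isPresent());           // passes
        System.out.println(check(7, License.TRIAL, Boolean.FALSE).isPresent()); // passes: explicit setting
        System.out.println(check(8, License.BASIC, null).isPresent());          // passes: no 7.x state on disk
    }
}
```

The last case mirrors the grow-and-shrink caveat: a node that starts with empty state has no 7.x version on disk, so nothing triggers the check.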