Startup check for security implicit behavior change #76879

Merged
20 commits merged on Oct 25, 2021

jkakavas
Member

In the security ON by default project, we introduced a breaking change
for the xpack.security.enabled setting. While we do expose necessary
deprecation warnings and release notes, there might still be a case
where a deployment that is

  • On basic or trial license with xpack.security.enabled not set
  • A single node cluster or a multi-node, single-host cluster

gets upgraded to 8.x in place without using the Upgrade Assistant or
consulting the release notes. In this case, we elect to stop the node
from starting so that we can notify the user that the implicit behavior
for security has changed. If we don't do that, the upgrade can
seemingly succeed but the user will have no way to interact with the
upgraded cluster as security is enabled and they have no credentials.

This is a best effort check in the sense that:

  • LicenseState might not be available and/or correct that early in the
    node lifecycle, so we might not be able to know if this node was on
    basic/trial
  • A grow-and-shrink upgrade would bypass this check since new
    nodes start with empty state on disk
  • A user might change the configuration and remove the explicit
    xpack.security.enabled configuration while upgrading the node
    to 8.x
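
The conditions listed above can be sketched as a small standalone Java model. This is only an illustration, not the code from this PR: `ImplicitSecurityCheckSketch`, `License`, and `check` are hypothetical names, a plain `int` major version stands in for the real version type, and the failure message wording is made up. The real check runs as a bootstrap check and reads the previous node version and license from on-disk state.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of the startup check described above.
// These are NOT the real Elasticsearch classes; all names are illustrative.
class ImplicitSecurityCheckSketch {

    enum License { BASIC, TRIAL, GOLD, PLATINUM, ENTERPRISE }

    static final String SECURITY_ENABLED = "xpack.security.enabled";

    /**
     * Returns a failure message when the node should refuse to start,
     * or an empty Optional when startup may proceed.
     *
     * previousMajor is the major version recorded on disk before the upgrade
     * (assumption: a plain int stands in for the real version type).
     */
    static Optional<String> check(int previousMajor, License license, Map<String, String> settings) {
        boolean upgradedFromPre8 = previousMajor < 8;
        boolean basicOrTrial = license == License.BASIC || license == License.TRIAL;
        boolean explicitlyConfigured = settings.containsKey(SECURITY_ENABLED);

        if (upgradedFromPre8 && basicOrTrial && !explicitlyConfigured) {
            // Illustrative wording only, not the message used in the PR.
            return Optional.of("[" + SECURITY_ENABLED + "] is not set and its implicit value changed in 8.x; "
                + "set it explicitly before restarting the node");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // 7.x node on a basic license with the setting left implicit: startup is blocked.
        System.out.println(check(7, License.BASIC, Map.of()).isPresent());
        // Same node, but the operator set the value explicitly: startup proceeds.
        System.out.println(check(7, License.BASIC, Map.of(SECURITY_ENABLED, "false")).isPresent());
        // Gold license: the check does not apply.
        System.out.println(check(7, License.GOLD, Map.of()).isPresent());
    }
}
```

The sketch deliberately ignores the best-effort caveats listed above, such as license state that is unavailable early in the node lifecycle.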

@jkakavas jkakavas added >enhancement :Security/Security Security issues without another label v8.0.0 labels Aug 24, 2021
@elasticmachine elasticmachine added the Team:Security Meta label for security team label Aug 24, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@jkakavas
Member Author

cc @colings86

Contributor

@BigPandaToo BigPandaToo left a comment

LGTM

@jkakavas
Member Author

jkakavas commented Sep 1, 2021

@albertzaharovits, @DaveCTurner can you folks please take a look?

Contributor

@albertzaharovits albertzaharovits left a comment

I would wait for David to give his advice.

Besides that, I would expect at least some upgrade QA tests to fail, no? It's odd that they don't.

Contributor

@DaveCTurner DaveCTurner left a comment

The check logic looks good but I think it'll always get skipped when running a real node - see inline comments.

}
}

public boolean alwaysEnforce() {
Member Author

I don't love this always-enabled bootstrap check, but this is currently the only way for us to make a check on node startup that has a view (albeit limited) of the restored cluster state (via the BootstrapContext).

Contributor

@DaveCTurner DaveCTurner Oct 25, 2021

I'd forgotten that we expose the metadata read from disk like this, but I think this is fine - at least it's no worse than any of the other places that make decisions based on the contents of the on-disk cluster state despite the fact that this could be stale or even uncommitted.
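
To illustrate why `alwaysEnforce()` matters in this thread, here is a minimal standalone sketch. It assumes, as Elasticsearch's documentation describes, that ordinary bootstrap checks are enforced only in production mode, which is presumably why a check aimed at single-node and single-host clusters has to be always enforced. `BootstrapEnforcementSketch`, `Check`, `failing`, and `fatalErrors` are hypothetical names, not the real Elasticsearch types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that mirror the shape of a bootstrap check.
// These are NOT the real Elasticsearch types.
class BootstrapEnforcementSketch {

    interface Check {
        // Returns null on success, or an error message on failure.
        String run();

        // Assumption: most checks are enforced only in production mode.
        default boolean alwaysEnforce() {
            return false;
        }
    }

    /** Builds a check that always fails with the given message. */
    static Check failing(String message, boolean always) {
        return new Check() {
            @Override
            public String run() {
                return message;
            }

            @Override
            public boolean alwaysEnforce() {
                return always;
            }
        };
    }

    /** Collects the failures that should stop the node from starting. */
    static List<String> fatalErrors(boolean productionMode, List<Check> checks) {
        List<String> fatal = new ArrayList<>();
        for (Check check : checks) {
            String error = check.run();
            // A failure is fatal in production mode, or whenever the check opts in.
            if (error != null && (productionMode || check.alwaysEnforce())) {
                fatal.add(error);
            }
        }
        return fatal;
    }

    public static void main(String[] args) {
        List<Check> checks = List.of(
            failing("ordinary check failed", false),
            failing("security default changed", true));
        // Development mode: only the always-enforced failure is fatal.
        System.out.println(fatalErrors(false, checks));
        // Production mode: both failures are fatal.
        System.out.println(fatalErrors(true, checks));
    }
}
```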

Contributor

@tvernum tvernum left a comment

The overall approach LGTM, but I have a few minor comments

@@ -68,10 +76,14 @@ public Version nodeVersion() {
return nodeVersion;
}

public Version previousNodeVersion() {
return previousNodeVersion;
}
Contributor

I know I'm fighting against the existing conventions of this class, but is it possible to get some sort of javadoc here?
What does previous mean exactly? I think it's "last time the node started" (or more accurately "the version of the metadata that was read from disk") ... but I'm sure there could be all sorts of nuance in rolling upgrades, master elections, etc, and I'd like to be able to consult javadocs so I can know how to reason about that.

Member Author

@jkakavas jkakavas Oct 25, 2021

added in 80f893f, @DaveCTurner can keep me honest or suggest enhancements

Contributor

Docs LGTM 👍

+ "."
+ Version.CURRENT.minor
+ "/security-minimal-setup.html to enable security, or explicitly disable security by "
+ "setting [xpack.security.enabled] to \"false\" in elasticsearch.yml before restarting the node"
Contributor

I'm coming in late, so maybe this has been discussed, but this message feels a bit lacking.

People who get this message don't necessarily realise why they're getting it now, and why it's a fatal error.
I think we can come up with something a bit more helpful that tells them that we've detected that this node was previously running in a configuration that did not have security, and the behaviour has changed so they need to explicitly opt in to the new or old behaviour.
I'm happy to help work on that message if needed.

Member Author

I will take another attempt at it, and I'll ask @lockewritesdocs to weigh in on the wording too.

Member Author

I've rephrased it, let me know what you think. Open to suggestions


public class SecurityImplicitBehaviorBootstrapCheckTests extends AbstractBootstrapCheckTestCase {

public void testFailureUpgradeFrom7xWithImplicitSecuritySettings() throws Exception {
Contributor

I think we need 2 methods:

  • testFailureUpgradeFrom7xWithImplicitSecuritySettingsOnTrialOrBasic
  • testSuccessfulUpgradeFrom7xWithImplicitSecuritySettingsOnGoldPlus

The 2nd one seems to be missing.

Member Author

makes sense, will add now

Member Author

added in 2acbd38

@jkakavas jkakavas requested a review from tvernum October 25, 2021 07:54
@jkakavas
Member Author

@DaveCTurner would you be able to take another look please? 🙏

Contributor

@tvernum tvernum left a comment

LGTM

Contributor

@DaveCTurner DaveCTurner left a comment

I left one important comment about not writing the new field to disk and one other comment. Otherwise LGTM.

public NodeMetadata(final String nodeId, final Version nodeVersion) {
private final Version previousNodeVersion;

public NodeMetadata(final String nodeId, final Version nodeVersion, final Version previousNodeVersion) {
Contributor

Nit: could we make this private, and construct the instances needed in SecurityImplicitBehaviorBootstrapCheckTests by calling upgradeToCurrentVersion() instead?

@@ -125,6 +151,7 @@ public NodeMetadata build() {
objectParser = new ObjectParser<>("node_meta_data", ignoreUnknownFields, Builder::new);
objectParser.declareString(Builder::setNodeId, new ParseField(NODE_ID_KEY));
objectParser.declareInt(Builder::setNodeVersionId, new ParseField(NODE_VERSION_KEY));
objectParser.declareInt(Builder::setPreviousNodeVersionId, new ParseField(PREVIOUS_NODE_VERSION_KEY));
Contributor

Do we need to write this field to disk? I think we just overwrite it before ever using it.

Member Author

You're right David, we don't need to. I'm amending
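
The outcome of this thread, together with the earlier nit about a private constructor, can be sketched as follows. This is a simplified stand-in, not the real `NodeMetadata`: `NodeMetadataSketch`, `fromDisk`, and `toDisk` are hypothetical, versions are plain `int` values, and `upgradeToCurrentVersion` takes the current version as a parameter purely to keep the example self-contained. The point is that the previous version is derived when upgrading and never written out.

```java
// Hypothetical, simplified stand-in for NodeMetadata; NOT the real Elasticsearch class.
final class NodeMetadataSketch {
    private final String nodeId;
    private final int nodeVersion;         // persisted to disk
    private final int previousNodeVersion; // kept in memory only

    // Private, as suggested in review: callers use fromDisk and upgradeToCurrentVersion.
    private NodeMetadataSketch(String nodeId, int nodeVersion, int previousNodeVersion) {
        this.nodeId = nodeId;
        this.nodeVersion = nodeVersion;
        this.previousNodeVersion = previousNodeVersion;
    }

    /** Models loading from disk, where only the id and version are stored. */
    static NodeMetadataSketch fromDisk(String nodeId, int nodeVersionOnDisk) {
        // Assumption: with nothing else known, "previous" starts equal to the on-disk version.
        return new NodeMetadataSketch(nodeId, nodeVersionOnDisk, nodeVersionOnDisk);
    }

    /** Returns metadata for the running version, remembering what was on disk. */
    NodeMetadataSketch upgradeToCurrentVersion(int currentVersion) {
        if (nodeVersion == currentVersion) {
            return this;
        }
        return new NodeMetadataSketch(nodeId, currentVersion, nodeVersion);
    }

    int nodeVersion() {
        return nodeVersion;
    }

    int previousNodeVersion() {
        return previousNodeVersion;
    }

    /** Models serialization: previousNodeVersion is deliberately left out. */
    String toDisk() {
        return "node_id=" + nodeId + ";node_version=" + nodeVersion;
    }

    public static void main(String[] args) {
        NodeMetadataSketch upgraded = fromDisk("node-1", 7).upgradeToCurrentVersion(8);
        System.out.println(upgraded.previousNodeVersion()); // 7
        System.out.println(upgraded.nodeVersion());         // 8
        System.out.println(upgraded.toDisk());              // node_id=node-1;node_version=8
    }
}
```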

@jkakavas
Member Author

Thanks @DaveCTurner, I've addressed your comments

@jkakavas jkakavas requested a review from DaveCTurner October 25, 2021 16:47
@jkakavas
Member Author

@elasticmachine update branch

Contributor

@DaveCTurner DaveCTurner left a comment

Thanks @jkakavas, LGTM

Contributor

@lockewritesdocs lockewritesdocs left a comment

👍 with the latest changes

@jkakavas jkakavas merged commit 8b3a615 into elastic:master Oct 25, 2021
lockewritesdocs pushed a commit to lockewritesdocs/elasticsearch that referenced this pull request Oct 28, 2021
In the security ON by default project, we introduced a breaking change 
for the xpack.security.enabled setting. While we do expose necessary
deprecation warnings and release notes, there might still be a case
where a deployment that is

- On basic or trial license with `xpack.security.enabled` not set
- A single node cluster or a multi-node, single-host cluster

gets upgraded to 8.x in place without using the Upgrade Assistant or 
consulting the release notes. In this case, we elect to stop the node
from starting via a newly introduced BootstrapCheck, so that we can
notify the user that the implicit behavior for security has changed.
If we don't do that, the upgrade can seemingly succeed but the user 
will have no way to interact with the upgraded cluster as security is 
enabled and they have no credentials. 

This is a best effort check in the sense that: 

- LicenseState might not be correct that early in the 
node lifecycle, so we might not be able to know if this node was on
basic/trial
- A grow-and-shrink upgrade would bypass this check since new
nodes start with empty state on disk
- A user might change the configuration and remove the explicit
xpack.security.enabled configuration _while_ upgrading the node
to 8.x
Labels
>enhancement :Security/Security Security issues without another label Team:Security Meta label for security team v8.0.0-beta1
8 participants