
Only allow x-pack metadata if all nodes are ready #30743

Merged

Conversation

@ywelsch (Contributor) commented May 19, 2018

This PR enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing x-pack code from installing custom metadata into the cluster state until all nodes are capable of deserializing this metadata. The changes in this PR are all local to the x-pack code. It's still missing some tests, but can and should already be reviewed. I've backported this PR to 6.3, where it passed the rolling upgrade tests introduced in #30728 (more of these tests will be required) going from the 6.2 OSS distribution to the 6.3 default distribution.

The changes to the ML, Watcher and License components were pretty straightforward; the TokenService changes are a bit more involved, and I would like another pair of eyes from the @elastic/es-security team on those.

@droberts195 this PR includes the custom x-pack node attribute. I'm not sure it makes sense to have a separate PR for that, as this is the PR that actually makes use of it.

Relates to #30731
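
In sketch form, the mechanism is roughly the following (assembled from the method names visible in the diff — checkReadyForXPackCustomMetadata and nodesNotReadyForXPackCustomMetadata — and the xpack.installed node attribute; the hasXPackCustomMetadata helper is a hypothetical stand-in for the actual per-component checks):

import java.util.ArrayList;
import java.util.List;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNode;

// A cluster state that already contains x-pack custom metadata is always considered ready.
public static boolean isReadyForXPackCustomMetadata(ClusterState state) {
    return hasXPackCustomMetadata(state) || nodesNotReadyForXPackCustomMetadata(state).isEmpty();
}

// Nodes advertise x-pack support through the "xpack.installed" node attribute.
public static List<DiscoveryNode> nodesNotReadyForXPackCustomMetadata(ClusterState state) {
    List<DiscoveryNode> notReady = new ArrayList<>();
    for (DiscoveryNode node : state.nodes()) {
        if (Boolean.parseBoolean(node.getAttributes().get("xpack.installed")) == false) {
            notReady.add(node);
        }
    }
    return notReady;
}

// Called from cluster state update tasks before any x-pack custom metadata is installed.
public static void checkReadyForXPackCustomMetadata(ClusterState state) {
    if (isReadyForXPackCustomMetadata(state) == false) {
        throw new IllegalStateException("The following nodes are not ready yet for enabling x-pack custom metadata: "
            + nodesNotReadyForXPackCustomMetadata(state));
    }
}

The ML, Watcher, License and TokenService components then call checkReadyForXPackCustomMetadata from their cluster state update tasks before installing their respective customs.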

@ywelsch ywelsch added >enhancement blocker v7.0.0 v6.3.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. :ml Machine learning :Security/Security Security issues without another label :Data Management/Watcher v6.4.0 labels May 19, 2018
@ywelsch ywelsch requested review from s1monw, bleskes and jasontedor May 19, 2018 20:26
@elasticmachine (Collaborator): Pinging @elastic/es-distributed

@elasticmachine (Collaborator): Pinging @elastic/es-core-infra

@elasticmachine (Collaborator): Pinging @elastic/ml-core

        });
    }
} else {
    installTokenMetadataCheck.set(false);
Contributor:

I don't follow why this gets set to false if the custom metadata already exists.
It seems like it could introduce an extraordinarily unlikely race condition (probably impossible in practice):

  1. Thread 1: custom metadata is null
  2. Thread 1: set installTokenMetadataCheck to true
  3. Thread 2: custom metadata is null
  4. Thread 1: install metadata
  5. Thread 3: custom metadata is non-null
  6. Thread 3: set installTokenMetadataCheck to false
  7. Thread 2: set installTokenMetadataCheck to true
  8. Thread 2: try to install duplicate metadata

@droberts195 (Contributor):

I think the pattern has been copied from other sections of code that want to be defensive against the possibility of their cluster state changes somehow being deleted, and that want to reinstall them if so. It's not really a problem if the unlikely race condition you describe occurs: entering the execute() method twice doesn't hurt correctness, only performance.

The history is that there are similar bits of code in ML and Watcher that didn't have the atomic guard variable when first written, so originally they might update the cluster state 100 times with the same change as the cluster started up. After observing this, we added the atomic guard, and it stopped enough of the duplicate metadata additions that it's no longer a noticeable problem.

@jaymode (Member):

Also, this happens inside a cluster state listener, and IIRC these are all executed on the cluster state update thread, which prevents multi-threading issues here.

@ywelsch (Author):

As @droberts195 and @jaymode explained, installTokenMetadata is called by a single thread, but possibly repeatedly while the cluster state update task has not taken effect yet. I initially copied the code from a similar task in ML, but have made a small simplification in 87ad80f that should make it clearer why and when the flag is reset.
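
In sketch form, the simplified logic in TokenService is roughly the following (names taken from the snippet above; the final code may differ in details). The flag only gates task submission, execute() re-checks the current state, and the flag is reset on failure so installation can be retried:

private final AtomicBoolean installTokenMetadataCheck = new AtomicBoolean(false);

private void installTokenMetadata(ClusterState state) {
    if (state.custom(TokenMetaData.TYPE) == null) {
        // allow only one in-flight update task; a duplicate would be harmless but wasteful
        if (installTokenMetadataCheck.compareAndSet(false, true)) {
            clusterService.submitStateUpdateTask("install-token-metadata", new ClusterStateUpdateTask(Priority.URGENT) {
                @Override
                public ClusterState execute(ClusterState currentState) {
                    XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
                    if (currentState.custom(TokenMetaData.TYPE) == null) {
                        return ClusterState.builder(currentState)
                            .putCustom(TokenMetaData.TYPE, getTokenMetaData()).build();
                    }
                    return currentState; // re-check: the metadata may have been installed meanwhile
                }

                @Override
                public void onFailure(String source, Exception e) {
                    // reset the flag so a later cluster state event can retry the installation
                    installTokenMetadataCheck.set(false);
                    logger.error("unable to install token metadata", e);
                }
            });
        }
    }
}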

@jaymode (Member) left a comment:

The security changes LGTM. Left a minor comment about cleaning up a method override.

} else {
    return Collections.emptyMap();
}
// TODO: Remove this whole concept of InitialClusterStateCustomSupplier
@jaymode (Member):

just remove the method override here?

@ywelsch (Author):

Sure, I've pushed e1437e8.

@bleskes (Contributor) left a comment:

This looks good. I left some questions and a nit.


    return super.additionalSettings();
} else {
    if (settings.get(xpackInstalledNodeAttrSetting) != null &&
Contributor:

why do we allow this setting to be already set in this case? shouldn't we lock it down?

Contributor:

I think it's because the internal cluster integration test framework restarts nodes with settings copied from the node immediately before it was stopped. There's a comment here for the same problem that could be copied over to prevent confusion.

@ywelsch (Author):

Exactly. I wanted to fix that behavior, but did not want to make it part of this PR.

Contributor:

I see. Makes sense. Thanks.

@ywelsch (Author):

I've added a comment in ba94bdd

@ywelsch (Author):

I've opened #30780, which would allow me to make this check more strict.

@ywelsch (Author):

I've merged that PR and made the check more strict in 1b6ae1d.
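
For reference, a sketch of the resulting override (based on the diff context above; the exact constants and error message may differ):

@Override
public Settings additionalSettings() {
    final String xpackInstalledNodeAttrSetting = "node.attr.xpack.installed";
    if (transportClientMode) {
        return super.additionalSettings();
    }
    if (settings.get(xpackInstalledNodeAttrSetting) != null) {
        // with #30780 merged, nodes restart with their original settings, so the
        // attribute can no longer be legitimately pre-set; reject it outright
        throw new IllegalArgumentException("Directly setting [" + xpackInstalledNodeAttrSetting + "] is not permitted");
    }
    return Settings.builder()
        .put(super.additionalSettings())
        .put(xpackInstalledNodeAttrSetting, "true") // marks this node as x-pack-aware
        .build();
}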

@@ -138,6 +152,78 @@ protected Clock getClock() {
public static LicenseService getSharedLicenseService() { return licenseService.get(); }
public static XPackLicenseState getSharedLicenseState() { return licenseState.get(); }

/**
* Checks if the cluster state allows this node to add x-pack metadata to the cluster state,
Contributor:

can you document that if the cluster state already contains xpack metadata it is always considered "ready"?

@ywelsch (Author):

I've pushed 5f25ea3
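
The added documentation presumably reads along these lines (paraphrased; not the exact wording of 5f25ea3):

/**
 * Checks if the cluster state allows this node to add x-pack metadata to the cluster state.
 * This is always true if the cluster state already contains x-pack custom metadata, because
 * in that case every node must already be able to deserialize it.
 */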

@@ -57,6 +58,7 @@ protected void masterOperation(FinalizeJobExecutionAction.Request request, Clust
clusterService.submitStateUpdateTask(source, new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState currentState) throws Exception {
        XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
Contributor:

@droberts195 this feels strange given the name of the method. If we get so far that TransportFinalizeJobExecutionAction is called, shouldn't the cluster state already have the ML type? Can you please double check?

@droberts195 (Contributor):

Yes, I think the only endpoints that need protecting are the ones that put jobs and datafeeds. Until the user has created an ML entity the endpoints that operate on them should be no-ops.

(There is actually a bug in this action: if you passed it a non-existent job ID, it would currently fail with an NPE. That's not a disaster, as it's an undocumented action intended for internal use, but I will make it more defensive in another PR.)

logger.debug("cannot add ML metadata to cluster as the following nodes might not understand the ML metadata: {}",
    () -> XPackPlugin.nodesNotReadyForXPackCustomMetadata(event.state()));
return;
}
@droberts195 (Contributor):

Following the changes I made in #30751 there's no need to change this file in this PR. We no longer eagerly install the ML metadata on startup.

@droberts195 (Contributor) left a comment:

Thanks for adding the xpack.installed attribute. I'll update the meta issue to say it's been done in this PR.

@@ -188,6 +189,9 @@ public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegist
DEPRECATION_LOGGER.deprecated("Creating jobs with delimited data format is deprecated. Please use xcontent instead.");
}

// pre-flight check, not necessarily required, but avoids figuring this out while on the CS update thread
@droberts195 (Contributor):

I think it would be best to remove the "not necessarily required" bit and make this the primary check protecting the cluster state against ML jobs. Then the check can be removed from buildNewClusterState() at the bottom of the file.
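
Concretely, the fail-fast variant would look roughly like this (a sketch; the signature is abbreviated from the hunk above, and the helper is the one added in this PR):

public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegistry,
                   ClusterState state, ActionListener<PutJobAction.Response> actionListener) {
    // pre-flight check on the calling thread: reject new ML jobs while any node
    // might not understand ML custom metadata, instead of discovering the problem
    // later on the cluster state update thread
    XPackPlugin.checkReadyForXPackCustomMetadata(state);
    // ... existing validation and cluster state update task submission ...
}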

@@ -565,6 +569,7 @@ public ClusterState execute(ClusterState currentState) {
}

private static ClusterState buildNewClusterState(ClusterState currentState, MlMetadata.Builder builder) {
    XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
@droberts195 (Contributor):

I think if creation of ML jobs and datafeeds is prevented elsewhere then this is not necessary. Any other updates to the ML custom cluster state imply that it already exists.

@ywelsch (Author):

@droberts195 and I discussed this. We will keep the extra checks for now, but will investigate which ones can be dropped in a follow-up, adding more assertions to ensure we have all places covered.

@bleskes (Contributor) left a comment:

My questions were answered. LGTM.

@ywelsch ywelsch merged commit 8145a82 into elastic:master May 23, 2018
ywelsch added a commit that referenced this pull request May 23, 2018
Enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing
x-pack code from installing custom metadata into the cluster state until all nodes are capable of
deserializing this metadata.
ywelsch added a commit that referenced this pull request May 23, 2018
(Backport; same commit message as above.)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 23, 2018
* master: (25 commits)
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (elastic#30780)
  Only ack cluster state updates successfully applied on all nodes (elastic#30672)
  ...
dnhatn added a commit that referenced this pull request May 24, 2018
* 6.x:
  [DOCS] Fixes typos in security settings
  Add support for indexed shape routing in geo_shape query (#30760)
  [DOCS] Splits auditing.asciidoc into smaller files
  Painless: Types Section Clean Up (#30283)
  [test] java tests for archive packaging (#30734)
  Deprecate http.pipelining setting (#30786)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  Change serialization version of doc-value fields.
  Add a `format` option to `docvalue_fields`. (#29639)
  [TEST] Don't expect acks when isolating nodes
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Revert "Add more yaml tests for get alias API (#29513)"
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  QA: Add xpack tests to rolling upgrade (#30795)
  Add support for search templates to the high-level REST client. (#30473)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561)
  Add more yaml tests for get alias API (#29513)
  [Docs] Fix script-fields snippet execution (#30693)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182)
  Remove assert statements from field caps documentation. (#30601)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181)
  Add support for field capabilities to the high-level REST client. (#29664)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [DOCS] fixed incorrect default
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Enable installing plugins from snapshots.elastic.co (#30765)
  Ignore empty completion input (#30713)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Add Delete Repository High Level REST API (#30666)
  Reduce CLI scripts to one-liners (#30759)
dnhatn added a commit that referenced this pull request May 24, 2018
* master:
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (#30820)
  Painless: Types Section Clean Up (#30283)
  Add support for indexed shape routing in geo_shape query (#30760)
  [test] java tests for archive packaging (#30734)
  Revert "Make http pipelining support mandatory (#30695)" (#30813)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Expose Lucene's FeatureField. (#30618)
  Fix a grammatical error in the 'search types' documentation.
  Remove http pipelining from integration test case (#30788)
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 25, 2018
* es/ccr: (55 commits)
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Mute CorruptedFileIT in CCR
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  ...
ywelsch added a commit that referenced this pull request Jun 1, 2018
Otherwise we could end up with persistent tasks metadata in the cluster that some of the nodes
might not understand, in case the cluster is in the middle of a rolling upgrade from the default
6.2 to the default 6.3 distribution.

Follow-up to #30743
ywelsch added a commit that referenced this pull request Jun 1, 2018
(Backport; same commit message as above.)

ywelsch added a commit that referenced this pull request Jun 1, 2018
(Backport; same commit message as above.)
jaymode added a commit to jaymode/elasticsearch that referenced this pull request Jun 11, 2018
This commit fixes a backwards compatibility bug in the token service
that causes token decoding to fail when there is a pre 6.0.0-beta2 node
in the cluster. The token encoding is actually the culprit as a version
check is missing around the serialization of the key hash bytes. This
value was added in 6.0.0-beta2 and cannot be sent to nodes that do not
know about this value. The version check has been added and the token
service unit tests have been enhanced to randomly run with some 5.6.x
nodes in the cluster service.

Additionally, a small change was made to the way we check to see if the
token metadata needs to be installed. Previously we would pass the
metadata to the install method and check that the token metadata is
null. This null check is now done prior to checking if the metadata can
be installed.

Relates elastic#30743
Closes elastic#31195
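
The fix is a standard wire-compatibility guard; in sketch form (field names are assumptions — only the pattern comes from the commit message above):

// inside the token encoding, before writing version-gated values:
if (out.getVersion().onOrAfter(Version.V_6_0_0_beta2)) {
    out.writeByteArray(keyHashBytes); // added in 6.0.0-beta2; must not be sent to older nodes
}

// the decoding side mirrors the same check:
final byte[] keyHash = in.getVersion().onOrAfter(Version.V_6_0_0_beta2) ? in.readByteArray() : null;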
jaymode added a commit that referenced this pull request Jun 13, 2018
(Backport; same commit message as above.)
jaymode added a commit that referenced this pull request Jun 13, 2018
(Backport; same commit message as above.)
ywelsch added a commit that referenced this pull request Aug 2, 2018
This infrastructure was introduced in #26144 and made obsolete by #30743
@jpountz jpountz removed :Data Management/Watcher :Security/Security Security issues without another label :ml Machine learning labels Jan 29, 2019
Labels
blocker :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v6.3.0 v6.4.0 v7.0.0-beta1