Only allow x-pack metadata if all nodes are ready #30743
Conversation
Pinging @elastic/es-distributed
Pinging @elastic/es-core-infra
Pinging @elastic/ml-core
                });
            }
        } else {
            installTokenMetadataCheck.set(false);
I don't follow why this gets set to `false` if the custom metadata already exists. It seems like it could introduce an extraordinarily unlikely race condition (probably impossible in practice):

- Thread 1: custom metadata is null
- Thread 1: set installTokenMetadataCheck to true
- Thread 2: custom metadata is null
- Thread 1: install metadata
- Thread 3: custom metadata is non-null
- Thread 3: set installTokenMetadataCheck to false
- Thread 2: set installTokenMetadataCheck to true
- Thread 2: try to install duplicate metadata
I think the pattern has been copied from other sections of code that want to be defensive against the possibility of their cluster state changes being deleted somehow, and want to reinstall them. It's not really a problem if the unlikely race condition you describe occurs, as it doesn't hurt correctness to get into the `execute()` method twice, only performance.
The history is that there are similar bits of code in ML and Watcher that didn't have the atomic guard variable when first written. So originally they might update the cluster state 100 times with the same change as the cluster started up. After observing this happening we added the atomic guard and it stopped enough of the duplicate metadata additions that it's no longer a noticeable problem.
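To make the idiom concrete, here is a minimal sketch of the atomic-guard pattern under discussion, assuming the shape shown in the diff above; the class scaffolding, the metadata key, and the body of `execute()` are illustrative stand-ins, not the actual `TokenService` code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateUpdateTask;
import org.elasticsearch.cluster.metadata.MetaData;
import org.elasticsearch.cluster.service.ClusterService;

class TokenMetadataInstaller {
    private final ClusterService clusterService;
    // The atomic guard: prevents submitting the same cluster state update on
    // every cluster state event while an earlier submission is still pending.
    private final AtomicBoolean installTokenMetadataCheck = new AtomicBoolean(false);

    TokenMetadataInstaller(ClusterService clusterService) {
        this.clusterService = clusterService;
    }

    void installTokenMetadata(MetaData metaData) {
        if (metaData.custom("security_tokens") == null) { // stand-in for the real custom type key
            // compareAndSet ensures only one update task is in flight at a time.
            if (installTokenMetadataCheck.compareAndSet(false, true)) {
                clusterService.submitStateUpdateTask("install-token-metadata", new ClusterStateUpdateTask() {
                    @Override
                    public ClusterState execute(ClusterState currentState) {
                        // build and return a state that includes the token metadata (elided)
                        return currentState;
                    }

                    @Override
                    public void onFailure(String source, Exception e) {
                        // Reset the guard so a later cluster state event can retry.
                        installTokenMetadataCheck.set(false);
                    }
                });
            }
        }
    }
}
```

In this shape the worst case of the race described above is one redundant `execute()` call, which matches the "performance, not correctness" point.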
Also this happens inside of a cluster state listener and IIRC these are all executed on the cluster state update thread, which prevents multi-threading issues around this.
As @droberts195 and @jaymode explained, `installTokenMetadata` is called by a single thread, but possibly repeatedly while the cluster state update task has not taken effect yet. Initially I copied the code from a similar task in ML, but have made a small simplification in 87ad80f that should make it clearer why and when the flag is reset.
The security changes LGTM. Left a minor comment about cleaning up a method override.
        } else {
            return Collections.emptyMap();
        }
        // TODO: Remove this whole concept of InitialClusterStateCustomSupplier
just remove the method override here?
sure, I've pushed e1437e8
This looks good. I left some questions and a nit.
            return super.additionalSettings();
        } else {
            if (settings.get(xpackInstalledNodeAttrSetting) != null &&
why do we allow this setting to be already set in this case? shouldn't we lock it down?
I think it's because the internal cluster integration test framework will restart nodes with settings copied from the node immediately before it was stopped. There's a comment here for the same problem that could be copied over to prevent confusion.
exactly. I wanted to fix that behavior, but did not want to make it part of this PR.
I see. Makes sense. Thanks.
I've added a comment in ba94bdd
I've opened #30780 which would allow me to make this check more strict.
I've merged that PR and made the check more strict in 1b6ae1d
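For context, a hedged sketch of what the stricter `additionalSettings()` override could look like once #30780 guarantees that nodes restart with their original settings. The setting name follows this thread; the exact error message, the `settings` field, and the surrounding plugin class are assumptions, not the literal code from 1b6ae1d:

```java
// Sketch of a plugin method; assumes a `settings` field holding the node settings.
@Override
public Settings additionalSettings() {
    final String xpackInstalledNodeAttrSetting = "node.attr.xpack.installed";
    if (settings.get(xpackInstalledNodeAttrSetting) != null) {
        // With #30780 merged, nodes restart with their original settings, so a
        // pre-populated attribute can only mean the user set it directly.
        throw new IllegalArgumentException(
            "Directly setting [" + xpackInstalledNodeAttrSetting + "] is not permitted");
    }
    return Settings.builder()
        .put(super.additionalSettings())
        .put(xpackInstalledNodeAttrSetting, "true")
        .build();
}
```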
@@ -138,6 +152,78 @@ protected Clock getClock() {
    public static LicenseService getSharedLicenseService() { return licenseService.get(); }
    public static XPackLicenseState getSharedLicenseState() { return licenseState.get(); }

    /**
     * Checks if the cluster state allows this node to add x-pack metadata to the cluster state,
can you document that if the cluster state already contains xpack metadata it is always considered "ready"?
I've pushed 5f25ea3
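As a rough sketch of the documented contract (the method names follow this PR, but the attribute handling and exception wording are assumptions): the readiness check treats the cluster as ready either when every node advertises the `xpack.installed` attribute or when x-pack custom metadata is already present in the cluster state:

```java
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNode;

final class XPackReadiness {
    static final String XPACK_INSTALLED_NODE_ATTR = "xpack.installed";

    // "Ready" means every node can deserialize x-pack custom metadata; a state
    // that already contains such metadata is always considered ready.
    static boolean isReadyForXPackCustomMetadata(ClusterState state, boolean alreadyContainsXPackMetadata) {
        return alreadyContainsXPackMetadata || nodesNotReadyForXPackCustomMetadata(state).isEmpty();
    }

    static List<DiscoveryNode> nodesNotReadyForXPackCustomMetadata(ClusterState state) {
        List<DiscoveryNode> notReady = new ArrayList<>();
        for (DiscoveryNode node : state.nodes()) {
            // Boolean.parseBoolean(null) is false, so nodes without the attribute count as not ready.
            if (Boolean.parseBoolean(node.getAttributes().get(XPACK_INSTALLED_NODE_ATTR)) == false) {
                notReady.add(node);
            }
        }
        return notReady;
    }

    static void checkReadyForXPackCustomMetadata(ClusterState state, boolean alreadyContainsXPackMetadata) {
        if (isReadyForXPackCustomMetadata(state, alreadyContainsXPackMetadata) == false) {
            throw new IllegalStateException("The following nodes are not ready yet for x-pack custom metadata: "
                + nodesNotReadyForXPackCustomMetadata(state));
        }
    }
}
```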
@@ -57,6 +58,7 @@ protected void masterOperation(FinalizeJobExecutionAction.Request request, Clust
    clusterService.submitStateUpdateTask(source, new ClusterStateUpdateTask() {
        @Override
        public ClusterState execute(ClusterState currentState) throws Exception {
            XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
@droberts195 this feels strange given the name of the method. If we get so far that `TransportFinalizeJobExecutionAction` is called, shouldn't the cluster state already have the ML type? Can you please double check?
Yes, I think the only endpoints that need protecting are the ones that put jobs and datafeeds. Until the user has created an ML entity the endpoints that operate on them should be no-ops.
(There is actually a bug in this action in that if you passed it a non-existent job ID it would currently fail with an NPE. That's not a disaster as it's an undocumented action intended for internal use, but I will make it more defensive in another PR.)
    logger.debug("cannot add ML metadata to cluster as the following nodes might not understand the ML metadata: {}",
        () -> XPackPlugin.nodesNotReadyForXPackCustomMetadata(event.state()));
    return;
}
Following the changes I made in #30751 there's no need to change this file in this PR. We no longer eagerly install the ML metadata on startup.
Thanks for adding the `xpack.installed` attribute. I'll update the meta issue to say it's been done in this PR.
@@ -188,6 +189,9 @@ public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegist
    DEPRECATION_LOGGER.deprecated("Creating jobs with delimited data format is deprecated. Please use xcontent instead.");
}

// pre-flight check, not necessarily required, but avoids figuring this out while on the CS update thread
I think it would be best to remove the `not necessarily required` bit, and make this the primary check for protecting the cluster state against ML jobs. Then the check can be removed from `buildNewClusterState()` at the bottom of the file.
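A minimal sketch of that pre-flight idea (the `putJob` signature and listener wiring here are assumptions, not the actual method): fail fast on the calling thread so the error surfaces to the user rather than on the cluster state update thread:

```java
// Hedged sketch: validate readiness before submitting the update task.
public void putJob(PutJobAction.Request request, ClusterState state, ActionListener<PutJobAction.Response> listener) {
    try {
        XPackPlugin.checkReadyForXPackCustomMetadata(state); // throws if some node lacks xpack.installed
    } catch (IllegalStateException e) {
        listener.onFailure(e);
        return;
    }
    // ... proceed to submit the cluster state update task that adds the job
}
```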
@@ -565,6 +569,7 @@ public ClusterState execute(ClusterState currentState) {
    }

    private static ClusterState buildNewClusterState(ClusterState currentState, MlMetadata.Builder builder) {
        XPackPlugin.checkReadyForXPackCustomMetadata(currentState);
I think if creation of ML jobs and datafeeds is prevented elsewhere then this is not necessary. Any other updates to the ML custom cluster state imply that it already exists.
@droberts195 and I discussed this. We will keep the extra checks for now, but will investigate which ones can be dropped in a follow-up, adding more assertions to ensure we have all places covered.
My questions were answered. LGTM.
Enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing x-pack code from installing custom metadata into the cluster state until all nodes are capable of deserializing this metadata.
Otherwise we could end up with persistent tasks metadata in the cluster that some of the nodes might not understand, in the case where the cluster is in the middle of a rolling upgrade from the default 6.2 to the default 6.3 distribution. Follow-up to #30743
This commit fixes a backwards compatibility bug in the token service that causes token decoding to fail when there is a pre 6.0.0-beta2 node in the cluster. The token encoding is actually the culprit as a version check is missing around the serialization of the key hash bytes. This value was added in 6.0.0-beta2 and cannot be sent to nodes that do not know about this value. The version check has been added and the token service unit tests have been enhanced to randomly run with some 5.6.x nodes in the cluster service. Additionally, a small change was made to the way we check to see if the token metadata needs to be installed. Previously we would pass the metadata to the install method and check that the token metadata is null. This null check is now done prior to checking if the metadata can be installed. Relates #30743 Closes #31195
This PR enables a rolling restart from the OSS distribution to the x-pack based distribution by preventing x-pack code from installing custom metadata into the cluster state until all nodes are capable of deserializing this metadata. Changes in this PR are all local to the x-pack code. It's still missing some tests, but can and should already be reviewed. I've done a backport of this PR to 6.3 and it passed the rolling upgrade tests introduced in #30728 (more of these tests will be required) from 6.2 OSS to 6.3 default distribution.
The changes to the ML, Watcher and License components were pretty straightforward; the `TokenService` changes are a bit more involved, and I would like to get a pair of eyes from the @elastic/es-security team on this.

@droberts195 this PR includes the custom x-pack node attribute. I'm not sure it makes sense to have a separate PR for that as this is the PR that's actually making use of it.
Relates to #30731