-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster local health call to throw exception if node is decommissioned or weighed away #6198
Cluster local health call to throw exception if node is decommissioned or weighed away #6198
Conversation
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
c66aba1
to
024ff6a
Compare
Gradle Check (Jenkins) Run Completed with:
|
@@ -135,7 +135,7 @@ public void writeTo(StreamOutput out) throws IOException { | |||
out.writeEnum(level); | |||
} | |||
if (out.getVersion().onOrAfter(Version.V_2_6_0)) { | |||
out.writeBoolean(ensureNodeCommissioned); | |||
out.writeBoolean(ensureNodeWeighedIn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this cause BWC failures? If yes we can retain both flags and deprecate the former
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 2.6.0
is not released yet, do you this BWC is required? I am looking out for test failures in gradle build, will that be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assert that when node is decommissioned the weights are set to zero
clusterHealthRequest.ensureNodeCommissioned(false); | ||
clusterHealthRequest.ensureNodeWeighedIn(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please also add an integ tests to cover all combinations of healthy/unhealthy/weighed in and decommissioned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
Decommission test is captured here
Line 308 in 9dd3632
// Check cluster health API for decommissioned and active node |
There are validations in place and decommission fail in case weight is not zero https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/cluster/decommission/DecommissionService.java#L458 |
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #6198 +/- ##
============================================
- Coverage 70.90% 70.84% -0.06%
+ Complexity 58904 58899 -5
============================================
Files 4778 4780 +2
Lines 281149 281166 +17
Branches 40622 40625 +3
============================================
- Hits 199346 199195 -151
- Misses 65480 65686 +206
+ Partials 16323 16285 -38
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
listener.onResponse(getResponse(request, currentState, waitCount, TimeoutState.OK)); | ||
ClusterHealthResponse clusterHealthResponse = getResponse(request, currentState, waitCount, TimeoutState.OK); | ||
if (request.ensureNodeWeighedIn() && clusterHealthResponse.hasDiscoveredClusterManager()) { | ||
DiscoveryNode localNode = clusterService.state().getNodes().getLocalNode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assert that local node is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
Gradle Check (Jenkins) Run Completed with:
|
...r/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java
Outdated
Show resolved
Hide resolved
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
4abfd6e
to
3ee469b
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
…d or weighed away Signed-off-by: Anshu Agarwal <[email protected]>
3ee469b
to
c6a7aa0
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
555fd27
to
6300a4e
Compare
Gradle Check (Jenkins) Run Completed with:
|
listener.onResponse(getResponse(request, currentState, waitCount, TimeoutState.OK)); | ||
ClusterHealthResponse clusterHealthResponse = getResponse(request, currentState, waitCount, TimeoutState.OK); | ||
if (request.ensureNodeWeighedIn() && clusterHealthResponse.hasDiscoveredClusterManager()) { | ||
DiscoveryNode localNode = clusterService.state().getNodes().getLocalNode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we picking up the new state of the cluster and not the one where the observer actually succeeded. IMO, this should be currentState.getNodes().getLocalNode()
. Because, what if something has changed in the state and entire health action was ran on a previous state. This might give misleading results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't understand this completely. Do you mean clusterService.state()
may provide new uncommitted state? If that is case, I understand healthcheck wait for cluster state change events to complete and cluster service also has committed cluster state. Let me know if I am missing anything?
* @return true if the node has attribute value with shard routing weight set to zero, else false | ||
*/ | ||
public static boolean isWeighedAway(String nodeId, ClusterState clusterState) { | ||
DiscoveryNode node = clusterState.nodes().get(nodeId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the node has dropped during this check? Line 45 will throw NullPointerException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense to put a null check. I think in this case we need to consider node as weigh away since we can't explicitly check for this. What are your thoughts?
/** | ||
* This function checks if the node is weighed away ie weighted routing weight is set to 0, | ||
* | ||
* @param nodeId the node | ||
* @return true if the node has attribute value with shard routing weight set to zero, else false | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand, we would need this check in the method below -
if (node.getRoles().contains(DiscoveryNodeRole.DATA_ROLE) == false) {
return false;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather -
if (node == null || node.getRoles().contains(DiscoveryNodeRole.DATA_ROLE) == false) {
return false;
}
Signed-off-by: Anshu Agarwal [email protected]
Description
This PR adds a param to cluster health local call to check if a node is commissioned or weighed in.
Example Request/Response-
Weights are set using PUT api-
For a node that is not weighed away ie weighting routing weight is not 0
For a node in zone-3 with weighted routing weight set to 0
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.