Cluster local health call to throw exception if node is decommissioned or weighed away #6198

Merged on Feb 8, 2023 (2 commits)

@anshu1106 anshu1106 commented Feb 6, 2023

Signed-off-by: Anshu Agarwal [email protected]

Description

This PR adds a parameter (ensure_node_weighed_in in the examples below) to the local cluster health call that checks whether the node is commissioned and weighed in.

Example request and response:

Weights are set using the PUT API:

curl -X PUT "localhost:9200/_cluster/routing/awareness/zone/weights" -H 'Content-Type: application/json' -d '{"weights":{"zone-1" : "2", "zone-2":"1", "zone-3":"0"}, "_version":-1}'
{"acknowledged":true}
curl "localhost:9200/_cluster/routing/awareness/zone/weights"
{"weights":{"zone-3":"0.0","zone-1":"2.0","zone-2":"1.0"},"_version":0,"discovered_cluster_manager":true}

For a node that is not weighed away, i.e. its weighted routing weight is not 0:

curl "localhost:9200/_cluster/health?pretty&local&ensure_node_weighed_in"
{
  "cluster_name" : "runTask",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

For a node in zone-3 with its weighted routing weight set to 0:

curl "localhost:9202/_cluster/health?pretty&local&ensure_node_weighed_in"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "node_weighed_away_exception",
        "reason" : "local node is weighed away"
      }
    ],
    "type" : "node_weighed_away_exception",
    "reason" : "local node is weighed away"
  },
  "status" : 421
}
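
A caller such as a load-balancer health probe could branch on the status code alone. The following is a minimal sketch, not part of this PR: the class, enum, and method names are hypothetical, and it assumes only what the two responses above show, namely HTTP 200 for a weighed-in node and HTTP 421 for a weighed-away one.

```java
/** Hypothetical helper: maps the HTTP status of a local health call to a probe result. */
final class LocalHealthProbe {

    /** Outcomes a probe might distinguish (names are illustrative). */
    enum Result { IN_SERVICE, WEIGHED_AWAY, UNKNOWN }

    // 421 is the status shown in the node_weighed_away_exception response above.
    static final int WEIGHED_AWAY_STATUS = 421;

    static Result classify(int httpStatus) {
        if (httpStatus == 200) {
            return Result.IN_SERVICE;
        }
        if (httpStatus == WEIGHED_AWAY_STATUS) {
            return Result.WEIGHED_AWAY;
        }
        // Any other status (timeouts, 5xx, ...) is left for the caller to interpret.
        return Result.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(200));
        System.out.println(classify(421));
        System.out.println(classify(503));
    }
}
```

Keeping UNKNOWN separate from WEIGHED_AWAY lets the caller decide whether an unreachable node should be treated the same way as one that was deliberately weighed away.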

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@anshu1106 anshu1106 changed the title Cluster health call to throw exception for local weighed away node Cluster local health call to throw exception node is decommissioned or weighed away Feb 6, 2023
@anshu1106 anshu1106 changed the title Cluster local health call to throw exception node is decommissioned or weighed away Cluster local health call to throw exception if node is decommissioned or weighed away Feb 6, 2023

@@ -135,7 +135,7 @@ public void writeTo(StreamOutput out) throws IOException {
             out.writeEnum(level);
         }
         if (out.getVersion().onOrAfter(Version.V_2_6_0)) {
-            out.writeBoolean(ensureNodeCommissioned);
+            out.writeBoolean(ensureNodeWeighedIn);
Collaborator

Will this cause BWC (backward-compatibility) failures? If so, we can retain both flags and deprecate the former.

Contributor Author

Since 2.6.0 is not released yet, do you think BWC handling is required? I am watching for test failures in the Gradle build; will that be sufficient?
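
The pattern under discussion, writing a field only when the peer is on or after a given version, can be illustrated with a minimal, self-contained sketch. Everything here is a stand-in: VersionGatedRequest, the integer version constants, and the use of java.io data streams are assumptions for illustration, not OpenSearch's StreamOutput/Version API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Stand-in for a request whose newest field is only sent to new-enough peers. */
final class VersionGatedRequest {
    // Hypothetical numeric wire versions, used here in place of real Version objects.
    static final int V_2_5_0 = 20500;
    static final int V_2_6_0 = 20600;

    boolean local;
    boolean ensureNodeWeighedIn;

    /** Serializes for a peer on peerVersion; the gated flag is skipped for older peers. */
    byte[] writeTo(int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(local);
            if (peerVersion >= V_2_6_0) {
                out.writeBoolean(ensureNodeWeighedIn);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Mirrors writeTo: reads the flag under the same check, otherwise keeps the default (false). */
    static VersionGatedRequest readFrom(byte[] data, int peerVersion) {
        VersionGatedRequest request = new VersionGatedRequest();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            request.local = in.readBoolean();
            if (peerVersion >= V_2_6_0) {
                request.ensureNodeWeighedIn = in.readBoolean();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return request;
    }

    public static void main(String[] args) {
        VersionGatedRequest request = new VersionGatedRequest();
        request.local = true;
        request.ensureNodeWeighedIn = true;
        System.out.println("bytes for 2.6.0 peer: " + request.writeTo(V_2_6_0).length);
        System.out.println("bytes for 2.5.0 peer: " + request.writeTo(V_2_5_0).length);
    }
}
```

In this sketch the flag travels as a single boolean under one gate, so renaming the Java field alone would not change the bytes exchanged; a mismatch would arise only if reader and writer disagreed on the gate or on the number of gated fields.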

Collaborator

@Bukhtawar Bukhtawar left a comment

Can we assert that when a node is decommissioned, the weights are set to zero?

- clusterHealthRequest.ensureNodeCommissioned(false);
+ clusterHealthRequest.ensureNodeWeighedIn(false);
Collaborator

@Bukhtawar Bukhtawar Feb 6, 2023

Can we please also add integration tests covering all combinations of healthy/unhealthy, weighed in, and decommissioned?

Contributor Author

Added

Decommission test is captured here

@anshu1106
Contributor Author

> Can we assert that when node is decommissioned the weights are set to zero

Validations are already in place: decommissioning fails if the weight is not zero. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/cluster/decommission/DecommissionService.java#L458

github-actions bot commented Feb 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.repositories.s3.S3BlobContainerRetriesTests.testReadRangeBlobWithRetries

codecov-commenter commented Feb 7, 2023

Codecov Report

Merging #6198 (3ee469b) into main (54ce423) will decrease coverage by 0.06%.
The diff coverage is 54.54%.

@@             Coverage Diff              @@
##               main    #6198      +/-   ##
============================================
- Coverage     70.90%   70.84%   -0.06%     
+ Complexity    58904    58899       -5     
============================================
  Files          4778     4780       +2     
  Lines        281149   281166      +17     
  Branches      40622    40625       +3     
============================================
- Hits         199346   199195     -151     
- Misses        65480    65686     +206     
+ Partials      16323    16285      -38     
Impacted Files Coverage Δ
...in/cluster/health/ClusterHealthRequestBuilder.java 27.77% <0.00%> (ø)
...arch/cluster/routing/NodeWeighedAwayException.java 0.00% <0.00%> (ø)
...n/cluster/health/TransportClusterHealthAction.java 44.22% <20.00%> (-1.05%) ⬇️
...arch/cluster/routing/FailAwareWeightedRouting.java 77.77% <50.00%> (-0.23%) ⬇️
...ensearch/cluster/routing/WeightedRoutingUtils.java 73.33% <73.33%> (ø)
...ion/admin/cluster/health/ClusterHealthRequest.java 80.16% <85.71%> (+0.82%) ⬆️
.../main/java/org/opensearch/OpenSearchException.java 92.41% <100.00%> (+0.01%) ⬆️
.../src/main/java/org/opensearch/rest/RestStatus.java 89.18% <100.00%> (+0.14%) ⬆️
.../action/admin/cluster/RestClusterHealthAction.java 65.85% <100.00%> (ø)
...adonly/AddIndexBlockClusterStateUpdateRequest.java 0.00% <0.00%> (-75.00%) ⬇️
... and 494 more

Comment on lines -277 to +283
- listener.onResponse(getResponse(request, currentState, waitCount, TimeoutState.OK));
+ ClusterHealthResponse clusterHealthResponse = getResponse(request, currentState, waitCount, TimeoutState.OK);
+ if (request.ensureNodeWeighedIn() && clusterHealthResponse.hasDiscoveredClusterManager()) {
+     DiscoveryNode localNode = clusterService.state().getNodes().getLocalNode();
Collaborator

Can we assert that local node is true?

Contributor Author

added

@anshu1106 anshu1106 force-pushed the healthcheck-weighed-in branch from 4abfd6e to 3ee469b Compare February 8, 2023 06:56
github-actions bot commented Feb 8, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=nodes.stats/11_indices_metrics/Metric - _all}

@anshu1106 anshu1106 force-pushed the healthcheck-weighed-in branch from 3ee469b to c6a7aa0 Compare February 8, 2023 11:05
@anshu1106 anshu1106 force-pushed the healthcheck-weighed-in branch from 555fd27 to 6300a4e Compare February 8, 2023 13:40
@Bukhtawar Bukhtawar merged commit 1eae021 into opensearch-project:main Feb 8, 2023
- listener.onResponse(getResponse(request, currentState, waitCount, TimeoutState.OK));
+ ClusterHealthResponse clusterHealthResponse = getResponse(request, currentState, waitCount, TimeoutState.OK);
+ if (request.ensureNodeWeighedIn() && clusterHealthResponse.hasDiscoveredClusterManager()) {
+     DiscoveryNode localNode = clusterService.state().getNodes().getLocalNode();
Member

Why are we reading the latest cluster state here rather than the one on which the observer actually succeeded? IMO, this should be currentState.getNodes().getLocalNode(). If something has changed in the state while the rest of the health action ran on a previous state, this might give misleading results.

Contributor Author

I didn't understand this completely. Do you mean clusterService.state() may return a new, uncommitted state? If so, my understanding is that the health check waits for cluster state change events to complete, and the cluster service also holds the committed cluster state. Let me know if I am missing anything.
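
The concern in this thread, that a response built from one state and a check run against a later state can disagree, can be shown with a small stand-alone sketch. The types below are hypothetical stand-ins (an AtomicReference plays the role of a service returning the latest state); they are not OpenSearch classes and say nothing about whether the real states are committed.

```java
import java.util.concurrent.atomic.AtomicReference;

final class StateSnapshotDemo {

    /** Immutable stand-in for a cluster state: a version plus the local node's weight. */
    static final class State {
        final long version;
        final double localNodeWeight;

        State(long version, double localNodeWeight) {
            this.version = version;
            this.localNodeWeight = localNodeWeight;
        }
    }

    /** Stand-in for a service whose getter returns whatever state is latest at call time. */
    static final AtomicReference<State> LATEST = new AtomicReference<>(new State(1, 1.0));

    /** Evaluates the check against the one snapshot the caller passes in. */
    static boolean weighedAwayIn(State snapshot) {
        return snapshot.localNodeWeight == 0.0;
    }

    public static void main(String[] args) {
        State observed = LATEST.get();      // the state the health response is built from
        LATEST.set(new State(2, 0.0));      // a later update weighs the node away

        // Same question, two answers, depending on which state is consulted.
        System.out.println("observed v" + observed.version + ": " + weighedAwayIn(observed));
        System.out.println("latest   v" + LATEST.get().version + ": " + weighedAwayIn(LATEST.get()));
    }
}
```

Passing the already-captured snapshot into every check keeps the health fields and the weighed-away decision consistent with each other, at the cost of possibly reflecting a slightly older state.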

* @return true if the node has attribute value with shard routing weight set to zero, else false
*/
public static boolean isWeighedAway(String nodeId, ClusterState clusterState) {
DiscoveryNode node = clusterState.nodes().get(nodeId);
Member

@imRishN imRishN Feb 11, 2023

What if the node has dropped out during this check? Line 45 will throw a NullPointerException.

Contributor Author

Makes sense to add a null check. I think in this case we need to consider the node as weighed away, since we can't explicitly check for this. What are your thoughts?

Comment on lines +25 to +30
/**
* This function checks if the node is weighed away ie weighted routing weight is set to 0,
*
* @param nodeId the node
* @return true if the node has attribute value with shard routing weight set to zero, else false
*/
Member

As I understand, we would need this check in the method below -

if (node.getRoles().contains(DiscoveryNodeRole.DATA_ROLE) == false) {
    return false;
}

Member

Or rather -

if (node == null || node.getRoles().contains(DiscoveryNodeRole.DATA_ROLE) == false) {
    return false;
}
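
Put together, the guard suggested above might look like the following self-contained sketch. It uses plain maps and a hypothetical Node class in place of ClusterState and DiscoveryNode, and the "data" role string, the parameter list, and the weight lookup are assumptions for illustration. It follows the reviewer's variant, returning false for a missing node, rather than the alternative of treating a missing node as weighed away.

```java
import java.util.Map;
import java.util.Set;

final class WeighedAwayCheck {

    /** Minimal stand-in for a discovery node: its roles and its awareness attributes. */
    static final class Node {
        final Set<String> roles;
        final Map<String, String> attributes;

        Node(Set<String> roles, Map<String, String> attributes) {
            this.roles = roles;
            this.attributes = attributes;
        }
    }

    /**
     * Returns true only if the node exists, has the data role, and the weight
     * configured for its awareness attribute value is zero.
     */
    static boolean isWeighedAway(String nodeId, Map<String, Node> nodes,
                                 String awarenessAttribute, Map<String, Double> weights) {
        Node node = nodes.get(nodeId);
        // A node that has left the cluster, or a non-data node, is reported as not weighed away.
        if (node == null || node.roles.contains("data") == false) {
            return false;
        }
        String attributeValue = node.attributes.get(awarenessAttribute);
        Double weight = attributeValue == null ? null : weights.get(attributeValue);
        return weight != null && weight == 0.0;
    }

    public static void main(String[] args) {
        Map<String, Node> nodes = Map.of(
            "n1", new Node(Set.of("data"), Map.of("zone", "zone-1")),
            "n3", new Node(Set.of("data"), Map.of("zone", "zone-3")));
        Map<String, Double> weights = Map.of("zone-1", 2.0, "zone-2", 1.0, "zone-3", 0.0);

        System.out.println(isWeighedAway("n1", nodes, "zone", weights));
        System.out.println(isWeighedAway("n3", nodes, "zone", weights));
        System.out.println(isWeighedAway("gone", nodes, "zone", weights));
    }
}
```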

4 participants