Integration with Ultrawarm #95

kaituo · 2020-04-28T22:46:22Z

Issue #, if available:

Description of changes:

Ultrawarm introduces warm nodes into the ES cluster. Currently, we distribute model partitions randomly across all data nodes in the cluster, which could degrade model performance once warm nodes are throttled due to resource limitations. This PR excludes warm nodes from model partition placement.

Since index shards are hosted on hot nodes, AD's coordinating nodes are on hot nodes as well, so we no longer need to send HourlyCron job and stats requests to warm nodes. This PR implements those changes.

Testing done:

1. Verified AD runs only on hot nodes.
2. stats API and HourlyCron still work.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

wnbts

Question for transparency. Why are Ultrawarm nodes not eligible for anomaly detection?

wnbts · 2020-04-28T22:56:14Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+import com.amazon.opendistroforelasticsearch.ad.constant.CommonName;
+import com.carrotsearch.hppc.cursors.ObjectObjectCursor;
+
+public class ClusterStateUtils {


Minor. Documentation is missing for public classes and methods.

wnbts · 2020-04-28T22:58:49Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+public class ClusterStateUtils {
+    private static final Logger LOG = LogManager.getLogger(ClusterStateUtils.class);
+    private final ClusterService clusterService;
+    private final Map<String, String> ignoredAttributes = new HashMap<String, String>();


Minor. The code will be more robust if this state is injected through the constructor rather than hardcoded, so that when the config changes, the constructor and unit tests do not need to change.
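A minimal sketch of the constructor-injection suggestion, using only JDK types. `NodeAttributeFilter`, `isIgnored`, and the literal values `"box_type"` and `"warm"` are hypothetical stand-ins for illustration, not the PR's actual code; a node is modeled only by its attribute map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ClusterStateUtils; a node is modeled only by its attribute map.
final class NodeAttributeFilter {
    private final Map<String, String> ignoredAttributes;

    // The attribute/value pairs to ignore are supplied by the caller instead of being
    // hardcoded, so a config change does not require editing this class.
    NodeAttributeFilter(Map<String, String> ignoredAttributes) {
        this.ignoredAttributes = Collections.unmodifiableMap(new HashMap<>(ignoredAttributes));
    }

    // Returns true when any configured attribute matches the node's value for that key.
    // Configured values are assumed to be non-null.
    boolean isIgnored(Map<String, String> nodeAttributes) {
        for (Map.Entry<String, String> entry : ignoredAttributes.entrySet()) {
            if (entry.getValue().equals(nodeAttributes.get(entry.getKey()))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "box_type" and "warm" are assumed values used only for this example.
        NodeAttributeFilter filter = new NodeAttributeFilter(Collections.singletonMap("box_type", "warm"));
        System.out.println(filter.isIgnored(Collections.singletonMap("box_type", "warm")));
        System.out.println(filter.isIgnored(Collections.singletonMap("box_type", "hot")));
    }
}
```

With this shape, a unit test can pass any attribute map it likes without touching production configuration.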

wnbts · 2020-04-28T23:02:49Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+    private final ClusterService clusterService;
+    private final Map<String, String> ignoredAttributes = new HashMap<String, String>();
+
+    @Inject


Question. Is this needed?

Yes, because a transport action needs this class. The transport action constructor uses Guice to resolve injected dependencies. Dependency classes must have either exactly one constructor annotated with @Inject or a zero-argument constructor.

Oh, I found it. It's the RestStatsAnomalyDetectorAction, right? Thanks!

It is StopDetectorTransportAction.

Do dependencies (to be injected) require the @Inject annotation? It makes sense for dependents to be annotated as entry points. For example, RCFResultTransportAction (a dependent) is annotated, but its dependencies such as ADCircuitBreakerService and ModelManager are not.

ClusterStateUtils is a dependency not a dependent, right?

We need the injection because no implementation for java.util.HashMap was bound. ModelManager does not use @Inject because all of its parameters are bound using AnomalyDetectorPlugin.createComponents. We no longer need @Inject now that I have changed the implementation. Please see the new PR.

wnbts · 2020-04-28T23:22:24Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+        ignoredAttributes.put(CommonName.BOX_TYPE_KEY, CommonName.WARM_BOX_TYPE);
+    }
+
+    public ImmutableOpenMap<String, DiscoveryNode> getEligibleDataNodes() {


Question. Why not use Map or unmodifiableMap so the specific class doesn't leak into client code?

That's what clusterService.state().nodes().getDataNodes() returns. ImmutableOpenMap is not a Map or unmodifiableMap. ES defined it in their own way.

Since the new code is a wrapper over the API and creates a new map, the wrapper can hide the implementation class, particularly when Map is much friendlier to clients.

Changed to return an array instead. Please see the context from Sorabh's comments.

wnbts · 2020-04-28T23:25:13Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+
+        for (Iterator<ObjectObjectCursor<String, DiscoveryNode>> it = dataNodes.iterator(); it.hasNext();) {
+            ObjectObjectCursor<String, DiscoveryNode> cursor = it.next();
+            if (!isIgnoredNode(cursor.value)) {


Minor. isEligibleNode is easier to use than double negative.

This is one place where isIgnoredNode gets used. In some other places like ADClusterEventListener, we don't use double negative.

I just did a count. There are three double negatives in this PR, two from ADClusterEventListener and one from ClusterStateUtils, and only one use without a second negation. This reduces readability unnecessarily.

changed to isEligibleNode with a different implementation. Please see it in another PR.
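To illustrate the readability point in this thread, here is a small sketch comparing the two names at a call site. The class `EligibilityNaming`, the helper `selectEligible`, and the `"box_type"`/`"warm"` literals are assumptions made for the example; nodes are modeled as plain attribute maps rather than DiscoveryNode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class EligibilityNaming {
    // Negative predicate: callers that want usable nodes must write !isIgnoredNode(...).
    static boolean isIgnoredNode(Map<String, String> attributes) {
        return "warm".equals(attributes.get("box_type")); // assumed key and value
    }

    // Positive predicate: reads directly at the call site, with no double negative.
    static boolean isEligibleNode(Map<String, String> attributes) {
        return !isIgnoredNode(attributes);
    }

    // Hypothetical helper showing the positive form in a filtering loop.
    static List<Map<String, String>> selectEligible(List<Map<String, String>> nodes) {
        List<Map<String, String>> eligible = new ArrayList<>();
        for (Map<String, String> attributes : nodes) {
            if (isEligibleNode(attributes)) {
                eligible.add(attributes);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        Map<String, String> warm = Collections.singletonMap("box_type", "warm");
        Map<String, String> hot = Collections.singletonMap("box_type", "hot");
        System.out.println(selectEligible(Arrays.asList(warm, hot)).size());
    }
}
```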

wnbts · 2020-04-28T23:29:06Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+        }
+        for (Map.Entry<String, String> entry : ignoredAttributes.entrySet()) {
+            String attribute = node.getAttributes().get(entry.getKey());
+            if (attribute != null && attribute.equals(entry.getValue())) {


Minor. Or if (entry.getValue().equals(attribute)) {...}

good point. Changed.
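The suggested form works because `String.equals` returns false for a null argument, so calling it on the expected value removes the need for an explicit null check. A short sketch, with hypothetical method names and an assumed `"box_type"` key, and assuming the expected value itself is never null:

```java
import java.util.HashMap;
import java.util.Map;

final class NullSafeEqualsDemo {
    // Original form: explicit null check on the node attribute before comparing.
    static boolean matchesWithNullCheck(String attribute, String expected) {
        return attribute != null && attribute.equals(expected);
    }

    // Suggested form: call equals on the expected value (assumed non-null).
    // equals(null) returns false, so a missing attribute needs no separate check.
    static boolean matchesExpectedFirst(String attribute, String expected) {
        return expected.equals(attribute);
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new HashMap<>();
        String missing = attributes.get("box_type"); // null: the attribute is not set
        System.out.println(matchesWithNullCheck(missing, "warm"));
        System.out.println(matchesExpectedFirst(missing, "warm"));
    }
}
```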

wnbts · 2020-04-29T00:25:01Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/cluster/HourlyCron.java

    private Client client;

-    public HourlyCron(ClusterService clusterService, Client client) {
-        this.clusterService = clusterService;
+    public HourlyCron(Client client, ClusterStateUtils clientStateUtils) {


Minor. clusterStateUtils

kaituo · 2020-04-29T16:01:43Z

Question for transparency. Why are Ultrawarm nodes not eligible for anomaly detection?

Warm nodes are performance sensitive. We don't want to interfere with them.

sohami · 2020-04-29T23:10:21Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/rest/RestStatsAnomalyDetectorAction.java

+            String[] nodeIdsArr = nodesIdsStr.split(",");
+            adStatsRequest = new ADStatsRequest(nodeIdsArr);
+        } else {
+            DiscoveryNode[] dataNodes = clusterStateUtils.getEligibleDataNodes().values().toArray(DiscoveryNode.class);


It looks like the map returned by getEligibleDataNodes() is converted to an array everywhere it is used. It would be cleaner to return an array from getEligibleDataNodes() itself.

yes, changed to return an array.

sohami · 2020-04-29T23:16:09Z

...ain/java/com/amazon/opendistroforelasticsearch/ad/transport/StopDetectorTransportAction.java


    @Inject
    public StopDetectorTransportAction(
        TransportService transportService,
-        ClusterService clusterService,
+        ClusterStateUtils clientStateUtils,


clientStateUtils -> clusterStateUtils ?

This is changed to your DiscoveryNodeFilterer.

sohami · 2020-04-29T23:54:36Z

src/main/java/com/amazon/opendistroforelasticsearch/ad/util/ClusterStateUtils.java

+import com.amazon.opendistroforelasticsearch.ad.constant.CommonName;
+import com.carrotsearch.hppc.cursors.ObjectObjectCursor;
+
+public class ClusterStateUtils {


Since it's a util class, I would say you can just have static methods in it so the caller doesn't have to create an instance of it. Something like this:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    import org.elasticsearch.cluster.ClusterState;
    import org.elasticsearch.cluster.node.DiscoveryNode;

    import com.amazon.opendistroforelasticsearch.ad.constant.CommonName;

    public final class DiscoveryNodeFilterer {

        private DiscoveryNodeFilterer() { }

        // Collects the data nodes whose box type is hot (or unset) into an array.
        public static DiscoveryNode[] getHotDataNodes(ClusterState state) {
            final List<DiscoveryNode> eligibleNodes = new ArrayList<>();
            final HotDataNodePredicate eligibleNodeFilter = new HotDataNodePredicate();
            for (DiscoveryNode node : state.nodes()) {
                if (eligibleNodeFilter.test(node)) {
                    eligibleNodes.add(node);
                }
            }
            return eligibleNodes.toArray(new DiscoveryNode[0]);
        }

        static class HotDataNodePredicate implements Predicate<DiscoveryNode> {
            @Override
            public boolean test(DiscoveryNode discoveryNode) {
                // A node with no box type attribute is treated as hot.
                return discoveryNode.isDataNode()
                    && discoveryNode
                        .getAttributes()
                        .getOrDefault(CommonName.BOX_TYPE_KEY, CommonName.HOT_BOX_TYPE)
                        .equals(CommonName.HOT_BOX_TYPE);
            }
        }
    }

Or, if you want to support multiple attributes, you can use DiscoveryNodeFilters. The catch with that implementation is that it doesn't consider a DiscoveryNode with a null value for the attribute to be an eligible node, whereas here a hot data node can have a null value for box_type.

In my implementation, I ignore a node if its box type equals warm. If a node has a null value for the box type attribute, we don't ignore it.

Will use your version since we don't have a use case to support multiple attributes now.

I am using instance methods since they are easier to test than static methods. And since we have dependency injection, we only have one instance of the class.
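A rough sketch of how the instance-method variant discussed above might look and why it is straightforward to test. `HotNodeFilterSketch` and its nested `Node` class are hypothetical stand-ins for the filterer and DiscoveryNode, and the `"box_type"` and `"hot"` literals are assumed values for the CommonName constants; none of this is taken from the follow-up PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class HotNodeFilterSketch {
    // Hypothetical stand-in for DiscoveryNode, reduced to what the filter needs.
    static final class Node {
        final String id;
        final boolean dataNode;
        final Map<String, String> attributes;

        Node(String id, boolean dataNode, Map<String, String> attributes) {
            this.id = id;
            this.dataNode = dataNode;
            this.attributes = attributes;
        }
    }

    // Assumed values standing in for CommonName.BOX_TYPE_KEY and CommonName.HOT_BOX_TYPE.
    static final String BOX_TYPE_KEY = "box_type";
    static final String HOT_BOX_TYPE = "hot";

    // A data node is eligible when its box type is hot or not set at all.
    private final Predicate<Node> eligibleNodeFilter = node ->
        node.dataNode && HOT_BOX_TYPE.equals(node.attributes.getOrDefault(BOX_TYPE_KEY, HOT_BOX_TYPE));

    // Instance method: a test can construct the filter directly and pass in plain nodes.
    Node[] getEligibleDataNodes(Iterable<Node> nodes) {
        List<Node> eligible = new ArrayList<>();
        for (Node node : nodes) {
            if (eligibleNodeFilter.test(node)) {
                eligible.add(node);
            }
        }
        return eligible.toArray(new Node[0]);
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new Node("hot-1", true, Collections.singletonMap(BOX_TYPE_KEY, "hot")),
            new Node("unset-1", true, Collections.<String, String>emptyMap()),
            new Node("warm-1", true, Collections.singletonMap(BOX_TYPE_KEY, "warm"))
        );
        for (Node node : new HotNodeFilterSketch().getEligibleDataNodes(nodes)) {
            System.out.println(node.id);
        }
    }
}
```

The node with no box type attribute is kept, which matches the behavior described in this thread for hot data nodes that have a null box_type.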

* Integration with Ultrawarm (#95)

* Integration with Ultrawarm - Follow up (#97)

  This is a follow-up PR to address comments.

  Testing done:
  1. gradle build passes.
  2. Verified AD runs only in hot nodes.
  3. stats API and HourlyCron still work.

kaituo requested review from jmazanec15 and jngz-es April 28, 2020 22:46

kaituo self-assigned this Apr 28, 2020

kaituo force-pushed the kraken_1.4 branch from c324ced to 7664382 Compare April 28, 2020 22:49

wnbts approved these changes Apr 29, 2020

jngz-es approved these changes Apr 29, 2020

Address Lai's comments

4e74ce3

kaituo merged commit 80997fd into opendistro-for-elasticsearch:opendistro-1.4 Apr 29, 2020

sohami reviewed Apr 30, 2020

kaituo mentioned this pull request May 1, 2020

Integration with Ultrawarm - Follow up #97

Merged
