A method to reduce the time cost to update cluster state #46941

kkewwei · 2019-09-20T17:14:44Z

ES_VERSION: 5.6.8
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
As it's known, Updating cluster state on master node will cost too much time, which seriously affects the size and stability of the cluster. In out product, updating cluster state will cost 15s+ with the cluster of 50 nodes and 3,000 indices, 60,000 shard, the experience is very poor when we want to create index and delete index.
To find out why it cost so much time on updating cluste state, I get the thread stack about updateTask, such that:

"elasticsearch[node1][clusterService#updateTask][T#1]" #32 daemon prio=5 os_prio=0 tid=0x00007f5d703a2800 nid=0x8252 runnable [0x00007f5c22b71000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1041)
        at org.elasticsearch.cluster.routing.RoutingNode.shardsWithState(RoutingNode.java:148)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:90)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.getDiskUsage(DiskThresholdDecider.java:320)
        at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canRemain(DiskThresholdDecider.java:265)
        at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(AllocationDeciders.java:105)
        at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.decideMove(BalancedShardsAllocator.java:687)
        at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.moveShards(BalancedShardsAllocator.java:648)
        at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:123)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:329)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:100)

I try several times and get the same thread stack. it seems that DiskThresholdDecider.sizeOfRelocatingShards will cost too much time, the code is as follow:

 static long sizeOfRelocatingShards(RoutingNode node, RoutingAllocation allocation,
                                       boolean subtractShardsMovingAway, String dataPath) {
      ClusterInfo clusterInfo = allocation.clusterInfo();
      long totalSize = 0;
      for (ShardRouting routing : node.shardsWithState(ShardRoutingState.RELOCATING, 
     ShardRoutingState.INITIALIZING)) {
             ......
      }
      ......
}

It says that: to test whether the shard can remain stay on the node or not ,we will get the size of relocating shards, then we will get all the shards(about 6,000 shards on one node) of the node, check the shards if is be RELOCATING or INITIALIZING. This is only one shard, there have 60,000 shard need to be test, and will be 60,000 * 6,000 times checkout, which will cost too much times.
I find that we can use the settings to avoid this check: "cluster.routing.allocation.disk.include_relocations":"false". when i set it to be false, the time to update cluster state decreases from 15s to 3s which has achives better result.
if we could set the cluster.routing.allocation.disk.include_relocations to be false by default, most of us will ignore the default setting. or if we could reserve the shard state of relocating and initializing about every node in cluster state, so we will not find out the shards every time by checking every time when updating cluster state.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-09-20T19:19:40Z

Pinging @elastic/es-distributed

DaveCTurner · 2019-09-22T17:16:40Z

        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:329)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:100)

Since #44433 (i.e. 7.4.0) applyStartedShards no longer calls reroute.

cluster of 50 nodes and 3,000 indices, 60,000 shard ... about 6,000 shards on one node

I don't fully understand how these numbers add up, but I can say that this is an unreasonably large number of shards per node. Since #34892 (i.e. 7.0.0) you are forbidden from having so many shards in a cluster of this size.

if we could set the cluster.routing.allocation.disk.include_relocations to be false by default

I can't recommend this and don't think it will be acceptable as a default. If you set this setting to false you will continue to experience the kinds of overshoot fixed by #46079.

There may be improvements to make in this area, but we need to investigate whether the issue described here still exists in clusters running a more recent version and with a configuration that's within recommended limits.

DaveCTurner · 2019-09-23T06:59:50Z

The time spent finding the initialising and relocating shards when computing relocations is heavily dependent on the number of shards per node. Measuring no-op reroutes in which there are no shards moving (the common case) and the disk threshold decider is in play:

Shards	Nodes	Shards per node	Reroute time without relocations	Reroute time with relocations
60000	10	6000	~250ms	~15000ms
60000	60	1000	~250ms	~4000ms
10000	10	1000	~60ms	~250ms

This shows that the savings are significant when you have too many shards per node, but much less dramatic in clusters that respect the 1000-shards-per-node limit.

We could for instance precompute the INITIALIZING and RELOCATING shards on each RoutingNode rather than scanning for them afresh each time. Given the data above I'm not sure the extra complexity is worth it, but I'm marking this for team discussion to get others' inputs.

kkewwei · 2019-09-25T06:07:41Z

It's my pleasure to wait for your reply.

DaveCTurner · 2019-10-02T13:59:36Z

@kkewwei we discussed this today and think it could be worthwhile to explore adding some precomputation to RoutingNode to track the INITIALIZING and RELOCATING shards for easy access by the disk-based shard allocator. We were not convinced that it is definitely worth pursuing, it depends how simple the change is. Would you be interested in opening a PR for that?

DaveCTurner · 2019-10-02T14:08:49Z

We also agreed that setting include_relocations to false is a bad idea, so I've opened #47443 to deprecate this setting and will remove it in a future release.

kkewwei · 2019-10-03T13:37:32Z

@DaveCTurner, I would like to do it.

DaveCTurner · 2019-10-08T07:12:27Z

Thanks for volunteering @kkewwei, looking forward to seeing your PR.

Today a couple of allocation deciders iterate through all the shards on a node to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster state updates in clusters with very high-density nodes holding many thousands of shards even if those shards belong to closed or frozen indices. This commit pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up this search. Closes #46941 Relates #48579 Co-authored-by: "hongju.xhj" <[email protected]>

DaveCTurner added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Sep 20, 2019

DaveCTurner added the team-discuss label Sep 23, 2019

ywelsch removed the team-discuss label Oct 3, 2019

kkewwei mentioned this issue Oct 9, 2019

Faster access to INITIALIZING/RELOCATING shards #47817

Merged

DaveCTurner closed this as completed in #47817 Oct 31, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A method to reduce the time cost to update cluster state #46941

A method to reduce the time cost to update cluster state #46941

kkewwei commented Sep 20, 2019 •

edited

Loading

elasticmachine commented Sep 20, 2019

DaveCTurner commented Sep 22, 2019 •

edited

Loading

DaveCTurner commented Sep 23, 2019

kkewwei commented Sep 25, 2019

DaveCTurner commented Oct 2, 2019

DaveCTurner commented Oct 2, 2019

kkewwei commented Oct 3, 2019

DaveCTurner commented Oct 8, 2019

A method to reduce the time cost to update cluster state #46941

A method to reduce the time cost to update cluster state #46941

Comments

kkewwei commented Sep 20, 2019 • edited Loading

elasticmachine commented Sep 20, 2019

DaveCTurner commented Sep 22, 2019 • edited Loading

DaveCTurner commented Sep 23, 2019

kkewwei commented Sep 25, 2019

DaveCTurner commented Oct 2, 2019

DaveCTurner commented Oct 2, 2019

kkewwei commented Oct 3, 2019

DaveCTurner commented Oct 8, 2019

kkewwei commented Sep 20, 2019 •

edited

Loading

DaveCTurner commented Sep 22, 2019 •

edited

Loading