
very uneven distribution of empty shards on multiple data paths if the free spaces are different #16763

Closed
Dieken opened this issue Feb 22, 2016 · 9 comments
Labels
discuss :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement

Comments

@Dieken

Dieken commented Feb 22, 2016

I hit this issue on ES 2.2.0. All ES data nodes have multiple hard disks and no RAID scheme is used; they are just a bunch of disks. The usable space of those disks on the same host differs due to some implementation details in XFS (even when two XFS filesystems have the same capacity and hold directories and files of the same sizes, they don't always have the same free space; I guess this comes from file creations and deletions).

I pre-create many indices with replicas set to 0 and 50 shards for 50 data nodes. An initial empty shard is very small, only several KB of disk usage, yet ES surprisingly creates a lot of shards on the least used disk: for example, if one of the ten disks on a data node has 10KB more free space, it ends up with about 8 more shards than the other disks. The least used disk eventually runs out of space because I use an hourly index pattern, so all shards for a continuous range of hours go to the same disk, while the other disks on that host still have a lot of free space.

I looked at the code a little: at https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/IndexService.java#L327, the IndexService instance is per index, not a singleton, so "dataPathToShardCount" is always empty because this index has zero replicas. As a result, at https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/shard/ShardPath.java#L231, "count" is always null and "ShardPath" always selects the least used disk to allocate a shard. Unluckily the empty shard is very small, so a lot of shards end up allocated to the least used disk.
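For illustration, here is a minimal, self-contained sketch of the selection behaviour described above (this is not the actual `ShardPath.selectNewPathForShard` code, and all names and numbers are made up): because the per-index map stays empty, the size adjustment is always zero and the path with the most free space wins for every single shard.

```java
// Illustrative sketch only, not Elasticsearch source. `dataPathToShardCount`
// plays the role of the per-index map mentioned above: for a brand-new
// zero-replica index it stays empty, so the "reserved" adjustment is always
// zero and the path with the most free space wins for every shard.
import java.util.HashMap;
import java.util.Map;

public class PathSelectionSketch {

    static String selectPath(Map<String, Long> usableBytesPerPath,
                             Map<String, Integer> dataPathToShardCount,
                             long estimatedShardSizeBytes) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : usableBytesPerPath.entrySet()) {
            Integer count = dataPathToShardCount.get(e.getKey()); // always null for a new index
            long reserved = (count == null ? 0 : count) * estimatedShardSizeBytes;
            long score = e.getValue() - reserved; // free space minus space "reserved" by shards already assigned there
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> free = new HashMap<>();
        free.put("/disk1", 1_000_000_010L); // only 10 bytes more free space than the other disk
        free.put("/disk2", 1_000_000_000L);
        Map<String, Integer> perIndexCounts = new HashMap<>(); // stays empty for a zero-replica index
        for (int shard = 0; shard < 5; shard++) {
            // every shard lands on /disk1 because the empty map never penalises it
            System.out.println("shard " + shard + " -> " + selectPath(free, perIndexCounts, 1_000_000L));
        }
    }
}
```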

This issue isn't limited to the zero-replica use case: the current ES logic always prefers the least used disk, and because a new empty shard is so small, this always produces an uneven distribution.

I feel the "dataPathToShardCount" map should be global to all indices on a node, not local to a single index, but maybe there is a better solution.
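Just to illustrate the idea (this is not a proposed patch; the on-disk layout `.../nodes/0/indices/<index>/<shardId>`, the cluster name, and the data paths below are assumptions based on 2.x defaults), a node-wide count could in principle be derived from every data path instead of only from the index currently being allocated:

```java
// Rough illustration of a node-wide shard count per data path, spanning all
// indices. The real fix would live inside IndexService/ShardPath rather than
// re-scanning the filesystem like this; the layout used here is an assumption.
import java.io.IOException;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;

public class NodeWideShardCount {

    static long countShardDirs(Path indicesDir) throws IOException {
        long shards = 0;
        if (!Files.isDirectory(indicesDir)) {
            return 0;
        }
        try (DirectoryStream<Path> indices = Files.newDirectoryStream(indicesDir)) {
            for (Path index : indices) {
                if (!Files.isDirectory(index)) {
                    continue;
                }
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(index)) {
                    for (Path entry : entries) {
                        // shard directories are named by their numeric shard id
                        if (Files.isDirectory(entry) && entry.getFileName().toString().matches("\\d+")) {
                            shards++;
                        }
                    }
                }
            }
        }
        return shards;
    }

    public static void main(String[] args) throws IOException {
        // hypothetical data paths; in ES these come from the path.data setting
        String[] dataPaths = {"/data1/elasticsearch", "/data2/elasticsearch"};
        Map<String, Long> nodeWideCounts = new HashMap<>();
        for (String p : dataPaths) {
            nodeWideCounts.put(p, countShardDirs(Paths.get(p, "mycluster", "nodes", "0", "indices")));
        }
        System.out.println(nodeWideCounts);
    }
}
```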

My workaround is to create a RAID 0 array from the disks on each node, but RAID 0 is too risky when a host has about 10 disks; maybe I should choose RAID 6 instead.

@dakrone
Member

dakrone commented Feb 22, 2016

I suspect that pre-allocating the indices is not working well with the prediction of shard sizes added in #11185. @mikemccand does that sound like a good culprit to you?

@dakrone
Member

dakrone commented Feb 22, 2016

#12947 may also be at work here.

@mikemccand
Contributor

> @mikemccand does that sound like a good culprit to you?

Yes.

> the IndexService instance is per index

> I feel the "dataPathToShardCount" map should be global to all indices on a node, not local to a single index, but maybe there is a better solution.

Oh, this is bad. I agree: the dataPathToShardCount should be across all indices, not just this one index (and I intended it to be that way originally, in #11185!). Hrmph.

However, I think another fix by @s1monw allowed ES shard balancing to "see" the individual path.data directories on a single node, and to move shards off one of a node's path.data directories that was filling up even if the other path.data directories on the node had plenty of free space?

@Dieken
Author

Dieken commented Feb 23, 2016

@mikemccand, the ability to move shards among different disks on a node is good to have; we can't always ensure that different indices have similar sizes, and in fact they usually vary a lot due to hourly or daily traffic patterns.

But an even initial distribution is still very nice, since it avoids as much shard relocation as possible; I wouldn't like to see an ES node suddenly competing with itself for disk access between shard relocation, segment refreshes, and segment merges.

Oh, maybe I should just go to RAID :-)

@mikemccand
Contributor

@Dieken Yeah I understand ... if you have ideas on how to fix the dataPathToShardCount to be across all indices instead, that would be great too ;)

@danopia

danopia commented May 18, 2016

Any updates on this? I'm working with a write-heavy cluster and I value distributing disk throughput over balancing file size. New indexes were just allocated with all shards on a single path.data out of the 4, due to imbalance in disk usage.

An API to relocate shards between path.data directories would be enough to deal with this. Right now it looks like I have to shut down nodes and manually move folders.

Thanks

dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 14, 2017
When a node has multiple data paths configured and is assigned all of the
shards for a particular index, it is currently possible that all of those shards will be
assigned to the same path (see elastic#16763).

This change keeps the same behavior around determining the "best" path for a
shard based on space, however, it enforces limits for the number of shards on a
path for an index from the single-node perspective. For example:

Assume you had a node with 4 data paths, where `/path1` has a tremendously high
amount of disk space available compared to the other paths. If you create an
index with 5 primary shards, the previous behavior would be to assign all 5
shards to `/path1`.

This change would enforce a limit of 2 shards to each data path for that
particular node, so you would end up with the following distribution:

- `/path1` - 2 shards (because it has the most usable space)
- `/path2` - 1 shard
- `/path3` - 1 shard
- `/path4` - 1 shard

Note, however, that this limit is only enforced at the local node level for
simplicity of implementation; if you had multiple nodes, the "limit" for each
node would still be 2, so assuming you had enough nodes that only 2 shards of
this index were assigned to this node, they would still both be assigned to
`/path1`.
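For illustration, here is a small sketch of the distribution this change aims for (it is not the actual ShardPath implementation from the commit): the per-path limit is assumed here to be the ceiling of shards over data paths, which matches the 5-shards-over-4-paths example above, and the path with the most usable space is still preferred until it reaches that limit.

```java
// Illustrative sketch only. The limit formula and the free-space bookkeeping
// are assumptions; the real implementation lives in ShardPath/NodeEnvironment.
import java.util.*;

public class PerPathLimitSketch {

    static Map<String, Integer> distribute(Map<String, Long> usableBytesPerPath,
                                           int shardCount, long estimatedShardSizeBytes) {
        // assumed limit: ceil(shards for this index on this node / number of data paths)
        int limit = (int) Math.ceil((double) shardCount / usableBytesPerPath.size());
        Map<String, Long> remaining = new HashMap<>(usableBytesPerPath);
        Map<String, Integer> assigned = new TreeMap<>();
        for (int shard = 0; shard < shardCount; shard++) {
            // prefer the path with the most usable space that has not hit the limit yet
            String best = remaining.entrySet().stream()
                    .filter(e -> assigned.getOrDefault(e.getKey(), 0) < limit)
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElseThrow(IllegalStateException::new);
            assigned.merge(best, 1, Integer::sum);
            remaining.merge(best, -estimatedShardSizeBytes, Long::sum);
        }
        return assigned;
    }

    public static void main(String[] args) {
        Map<String, Long> free = new LinkedHashMap<>();
        free.put("/path1", 8_000_000_000_000L); // far more usable space than the others
        free.put("/path2", 500_000_000_000L);
        free.put("/path3", 500_000_000_000L);
        free.put("/path4", 500_000_000_000L);
        // prints {/path1=2, /path2=1, /path3=1, /path4=1}
        System.out.println(distribute(free, 5, 1_000_000_000L));
    }
}
```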
dakrone added a commit that referenced this issue Oct 17, 2017
…26654)

* Balance shards for an index more evenly across multiple data paths

When a node has multiple data paths configured and is assigned all of the
shards for a particular index, it is currently possible that all of those shards will be
assigned to the same path (see #16763).

This change keeps the same behavior around determining the "best" path for a
shard based on space, however, it enforces limits for the number of shards on a
path for an index from the single-node perspective. For example:

Assume you had a node with 4 data paths, where `/path1` has a tremendously high
amount of disk space available compared to the other paths. If you create an
index with 5 primary shards, the previous behavior would be to assign all 5
shards to `/path1`.

This change would enforce a limit of 2 shards to each data path for that
particular node, so you would end up with the following distribution:

- `/path1` - 2 shards (because it has the most usable space)
- `/path2` - 1 shard
- `/path3` - 1 shard
- `/path4` - 1 shard

Note, however, that this limit is only enforced at the local node level for
simplicity of implementation; if you had multiple nodes, the "limit" for each
node would still be 2, so assuming you had enough nodes that only 2 shards of
this index were assigned to this node, they would still both be assigned to
`/path1`.

* Switch from ObjectLongHashMap to regular HashMap

* Remove unneeded Files.isDirectory check

* Skip iterating directories when not necessary

* Add message to assert

* Implement different (better) ranking for node paths

This is the method we discussed

* Remove unused pathHasEnoughSpace method

* Use findFirst instead of .get(0);

* Update for master merge to fix compilation

Settings.putArray -> Settings.putList
@dakrone
Member

dakrone commented Oct 19, 2017

With #26654 and #27039, the distribution problem should be fixed now.

This doesn't add an API to relocate shards between path.data directories, though; that would be a separate change.

@lcawl lcawl added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 14, 2018
@DaveCTurner
Contributor

@dakrone I'm tempted to close this as there hasn't been overwhelming interest in it and I don't think we are planning on working on sub-node allocation decisions in the foreseeable future. WDYT?

@dakrone
Member

dakrone commented Mar 15, 2018

@DaveCTurner I agree, I think this can be closed for now as we've addressed a portion of it.

If anyone disagrees let us know and we can revisit!

@dakrone dakrone closed this as completed Mar 15, 2018