
very uneven distribution of empty shards on multiple data paths if the free spaces are different #16763

Closed
Dieken opened this issue Feb 22, 2016 · 9 comments
Labels
discuss :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement

Comments

@Dieken

Dieken commented Feb 22, 2016

I hit this issue on ES 2.2.0. All ES data nodes have multiple hard disks and no RAID scheme is used; they are just a bunch of disks. The usable space of those disks on the same host differs due to some implementation details in XFS (even when two XFS filesystems have the same capacity and hold directories and files of the same sizes, they don't always have the same free space; I guess this comes from file creations and deletions).

I pre-create many indices with replicas set to 0 and 50 shards for 50 data nodes. An initial empty shard is very small, only several KB of disk usage, yet ES surprisingly creates a lot of shards on the least used disk: for example, if one of the ten disks on a data node has 10KB more free space, it ends up with about 8 more shards than the other disks. The least used disk eventually runs out of space because I use an hourly index pattern, so all shards for a continuous range of hours go to the same disk, while the other disks on that host still have a lot of free space.

I looked at the code a little: at https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/IndexService.java#L327, the IndexService instance is per index, not a singleton, so "dataPathToShardCount" is always empty because this index has zero replicas. As a result, at https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/shard/ShardPath.java#L231, "count" is always null and "ShardPath" always selects the least used disk to allocate a shard. Unluckily the empty shard is very small, so a lot of shards end up allocated to the least used disk.
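For illustration, here is a minimal, self-contained sketch of the selection behaviour described above (this is not the actual `ShardPath.selectNewPathForShard` code, and all names and numbers are made up): because the per-index map stays empty, the size adjustment is always zero and the path with the most free space wins for every single shard.

```java
// Illustrative sketch only, not Elasticsearch source. `dataPathToShardCount`
// plays the role of the per-index map mentioned above: for a brand-new
// zero-replica index it stays empty, so the "reserved" adjustment is always
// zero and the path with the most free space wins for every shard.
import java.util.HashMap;
import java.util.Map;

public class PathSelectionSketch {

    static String selectPath(Map<String, Long> usableBytesPerPath,
                             Map<String, Integer> dataPathToShardCount,
                             long estimatedShardSizeBytes) {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : usableBytesPerPath.entrySet()) {
            Integer count = dataPathToShardCount.get(e.getKey()); // always null for a new index
            long reserved = (count == null ? 0 : count) * estimatedShardSizeBytes;
            long score = e.getValue() - reserved; // free space minus space "reserved" by shards already assigned there
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> free = new HashMap<>();
        free.put("/disk1", 1_000_000_010L); // only 10 bytes more free space than the other disk
        free.put("/disk2", 1_000_000_000L);
        Map<String, Integer> perIndexCounts = new HashMap<>(); // stays empty for a zero-replica index
        for (int shard = 0; shard < 5; shard++) {
            // every shard lands on /disk1 because the empty map never penalises it
            System.out.println("shard " + shard + " -> " + selectPath(free, perIndexCounts, 1_000_000L));
        }
    }
}
```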

This issue isn't limited to the zero-replica use case: the current ES logic always prefers the least used disk, and because a new empty shard is so small, this always produces an uneven distribution.

I feel the "dataPathToShardCount" map should be global to all indices on a node, not local to a single index, but maybe there is a better solution.
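Just to illustrate the idea (this is not a proposed patch; the on-disk layout `.../nodes/0/indices/<index>/<shardId>`, the cluster name, and the data paths below are assumptions based on 2.x defaults), a node-wide count could in principle be derived from every data path instead of only from the index currently being allocated:

```java
// Rough illustration of a node-wide shard count per data path, spanning all
// indices. The real fix would live inside IndexService/ShardPath rather than
// re-scanning the filesystem like this; the layout used here is an assumption.
import java.io.IOException;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;

public class NodeWideShardCount {

    static long countShardDirs(Path indicesDir) throws IOException {
        long shards = 0;
        if (!Files.isDirectory(indicesDir)) {
            return 0;
        }
        try (DirectoryStream<Path> indices = Files.newDirectoryStream(indicesDir)) {
            for (Path index : indices) {
                if (!Files.isDirectory(index)) {
                    continue;
                }
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(index)) {
                    for (Path entry : entries) {
                        // shard directories are named by their numeric shard id
                        if (Files.isDirectory(entry) && entry.getFileName().toString().matches("\\d+")) {
                            shards++;
                        }
                    }
                }
            }
        }
        return shards;
    }

    public static void main(String[] args) throws IOException {
        // hypothetical data paths; in ES these come from the path.data setting
        String[] dataPaths = {"/data1/elasticsearch", "/data2/elasticsearch"};
        Map<String, Long> nodeWideCounts = new HashMap<>();
        for (String p : dataPaths) {
            nodeWideCounts.put(p, countShardDirs(Paths.get(p, "mycluster", "nodes", "0", "indices")));
        }
        System.out.println(nodeWideCounts);
    }
}
```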

My workaround is to create a RAID 0 array from the disks on each node, but RAID 0 is too risky when a host has about 10 disks; maybe I should choose RAID 6 instead.

@dakrone
Member

dakrone commented Feb 22, 2016

I suspect that pre-allocating the indices is not working well with the prediction of shard sizes added in #11185. @mikemccand does that sound like a good culprit to you?

@dakrone
Member

dakrone commented Feb 22, 2016

#12947 may also be at work here.

@mikemccand
Contributor

> @mikemccand does that sound like a good culprit to you?

Yes.

> the IndexService instance is per index

> I feel the "dataPathToShardCount" map should be global to all indices on a node, not local to a single index, but maybe there is a better solution.

Oh, this is bad. I agree: the dataPathToShardCount should be across all indices, not just this one index (and I intended it to be that way originally, in #11185!). Hrmph.

However, I think another fix by @s1monw allowed ES shard balancing to "see" the individual path.data directories on a single node, and to move shards off one of a node's path.data directories that was filling up even if the other path.data directories on the node had plenty of free space?

@Dieken
Author

Dieken commented Feb 23, 2016

@mikemccand, the ability to move shards among different disks on a node is good to have; we can't always ensure that different indices have similar sizes, and in fact they usually vary a lot due to hourly or daily traffic patterns.

But an even initial distribution is still very nice, since it avoids as much shard relocation as possible; I wouldn't like to see an ES node suddenly competing with itself for disk access between shard relocation, segment refreshes, and segment merges.

Oh, maybe I should just go to RAID :-)

@mikemccand
Contributor

@Dieken Yeah I understand ... if you have ideas on how to fix the dataPathToShardCount to be across all indices instead, that would be great too ;)

@danopia

danopia commented May 18, 2016

Any updates on this? I'm working with a write-heavy cluster and I value distributing disk throughput over balancing file size. New indexes were just allocated with all shards on a single path.data out of the 4, due to imbalance in disk usage.

An API to relocate shards between path.data directories would be enough to deal with this. Right now it looks like I have to shut down nodes and manually move folders.

Thanks

dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 14, 2017
When a node has multiple data paths configured and is assigned all of the
shards for a particular index, it is currently possible that all of those shards will be
assigned to the same path (see elastic#16763).

This change keeps the same behavior around determining the "best" path for a
shard based on space, however, it enforces limits for the number of shards on a
path for an index from the single-node perspective. For example:

Assume you had a node with 4 data paths, where `/path1` has a tremendously high
amount of disk space available compared to the other paths. If you create an
index with 5 primary shards, the previous behavior would be to assign all 5
shards to `/path1`.

This change would enforce a limit of 2 shards to each data path for that
particular node, so you would end up with the following distribution:

- `/path1` - 2 shards (because it has the most usable space)
- `/path2` - 1 shard
- `/path3` - 1 shard
- `/path4` - 1 shard

Note, however, that this limit is only enforced at the local node level for
simplicity of implementation; if you had multiple nodes, the "limit" for each
node would still be 2, so assuming you had enough nodes that only 2 shards of
this index were assigned to this node, they would still both be assigned to
`/path1`.
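For illustration, here is a small sketch of the distribution this change aims for (it is not the actual ShardPath implementation from the commit): the per-path limit is assumed here to be the ceiling of shards over data paths, which matches the 5-shards-over-4-paths example above, and the path with the most usable space is still preferred until it reaches that limit.

```java
// Illustrative sketch only. The limit formula and the free-space bookkeeping
// are assumptions; the real implementation lives in ShardPath/NodeEnvironment.
import java.util.*;

public class PerPathLimitSketch {

    static Map<String, Integer> distribute(Map<String, Long> usableBytesPerPath,
                                           int shardCount, long estimatedShardSizeBytes) {
        // assumed limit: ceil(shards for this index on this node / number of data paths)
        int limit = (int) Math.ceil((double) shardCount / usableBytesPerPath.size());
        Map<String, Long> remaining = new HashMap<>(usableBytesPerPath);
        Map<String, Integer> assigned = new TreeMap<>();
        for (int shard = 0; shard < shardCount; shard++) {
            // prefer the path with the most usable space that has not hit the limit yet
            String best = remaining.entrySet().stream()
                    .filter(e -> assigned.getOrDefault(e.getKey(), 0) < limit)
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElseThrow(IllegalStateException::new);
            assigned.merge(best, 1, Integer::sum);
            remaining.merge(best, -estimatedShardSizeBytes, Long::sum);
        }
        return assigned;
    }

    public static void main(String[] args) {
        Map<String, Long> free = new LinkedHashMap<>();
        free.put("/path1", 8_000_000_000_000L); // far more usable space than the others
        free.put("/path2", 500_000_000_000L);
        free.put("/path3", 500_000_000_000L);
        free.put("/path4", 500_000_000_000L);
        // prints {/path1=2, /path2=1, /path3=1, /path4=1}
        System.out.println(distribute(free, 5, 1_000_000_000L));
    }
}
```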
dakrone added a commit that referenced this issue Oct 17, 2017
…26654)

* Balance shards for an index more evenly across multiple data paths

When a node has multiple data paths configured and is assigned all of the
shards for a particular index, it is currently possible that all of those shards will be
assigned to the same path (see #16763).

This change keeps the same behavior around determining the "best" path for a
shard based on space, however, it enforces limits for the number of shards on a
path for an index from the single-node perspective. For example:

Assume you had a node with 4 data paths, where `/path1` has a tremendously high
amount of disk space available compared to the other paths. If you create an
index with 5 primary shards, the previous behavior would be to assign all 5
shards to `/path1`.

This change would enforce a limit of 2 shards to each data path for that
particular node, so you would end up with the following distribution:

- `/path1` - 2 shards (because it has the most usable space)
- `/path2` - 1 shard
- `/path3` - 1 shard
- `/path4` - 1 shard

Note, however, that this limit is only enforced at the local node level for
simplicity of implementation; if you had multiple nodes, the "limit" for each
node would still be 2, so assuming you had enough nodes that only 2 shards of
this index were assigned to this node, they would still both be assigned to
`/path1`.

* Switch from ObjectLongHashMap to regular HashMap

* Remove unneeded Files.isDirectory check

* Skip iterating directories when not necessary

* Add message to assert

* Implement different (better) ranking for node paths

This is the method we discussed

* Remove unused pathHasEnoughSpace method

* Use findFirst instead of .get(0);

* Update for master merge to fix compilation

Settings.putArray -> Settings.putList
@dakrone
Member

dakrone commented Oct 19, 2017

With #26654 and #27039, the distribution problem should be fixed now.

This doesn't add an API to relocate shards between path.data directories, though; that would be a separate change.

@lcawl lcawl added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 14, 2018
@DaveCTurner
Contributor

@dakrone I'm tempted to close this as there hasn't been overwhelming interest in it and I don't think we are planning on working on sub-node allocation decisions in the foreseeable future. WDYT?

@dakrone
Member

dakrone commented Mar 15, 2018

@DaveCTurner I agree, I think this can be closed for now as we've addressed a portion of it.

If anyone disagrees let us know and we can revisit!

@dakrone dakrone closed this as completed Mar 15, 2018