very uneven distribution of empty shards on multiple data paths if the free spaces are different #16763
I suspect that pre-allocating the indices is not working well with the prediction of shard sizes added in #11185. @mikemccand, does that sound like a likely culprit to you?
#12947 may also be at work here.
Yes.
Oh, this is bad. I agree. However, I think another fix by @s1monw allowed ES shard balancing to "see" individual data paths.
@mikemccand, the ability to move shards among different disks on a node is good to have; we can't always ensure different indices have similar sizes, and in fact they usually vary a lot due to hourly or daily traffic patterns. But an initial even distribution is still very valuable: it avoids as much shard relocation as possible. I wouldn't like to see an ES node suddenly competing with itself for disk access between shard relocation, segment refreshes, and segment merges. Oh, maybe I should just go to RAID :-)
@Dieken Yeah I understand ... if you have ideas on how to fix the
Any updates on this? I'm working with a write-heavy cluster, and I value distributing disk throughput over file size. New indices were just allocated with all shards on a single path.data out of the 4, due to imbalance in disk usage. An API to relocate shards between path.datas would be enough to deal with this. Right now it looks like I have to shut down nodes and manually move folders. Thanks
When a node has multiple data paths configured, and is assigned all of the shards for a particular index, it's possible now that all shards will be assigned to the same path (see elastic#16763). This change keeps the same behavior around determining the "best" path for a shard based on space; however, it enforces limits for the number of shards on a path for an index from the single-node perspective. For example:

Assume you had a node with 4 data paths, where `/path1` has a tremendously high amount of disk space available compared to the other paths. If you create an index with 5 primary shards, the previous behavior would be to assign all 5 shards to `/path1`. This change would enforce a limit of 2 shards to each data path for that particular node, so you would end up with the following distribution:

- `/path1` - 2 shards (because it has the most usable space)
- `/path2` - 1 shard
- `/path3` - 1 shard
- `/path4` - 1 shard

Note, however, that this limit is only enforced at the local node level for simplicity in implementation, so if you had multiple nodes, the "limit" for the node is still 2; assuming you had enough nodes that only 2 shards for this index were assigned to this node, they would still both be assigned to `/path1`.
…26654)

* Balance shards for an index more evenly across multiple data paths

  When a node has multiple data paths configured, and is assigned all of the shards for a particular index, it's possible now that all shards will be assigned to the same path (see #16763). This change keeps the same behavior around determining the "best" path for a shard based on space; however, it enforces limits for the number of shards on a path for an index from the single-node perspective. For example:

  Assume you had a node with 4 data paths, where `/path1` has a tremendously high amount of disk space available compared to the other paths. If you create an index with 5 primary shards, the previous behavior would be to assign all 5 shards to `/path1`. This change would enforce a limit of 2 shards to each data path for that particular node, so you would end up with the following distribution:

  - `/path1` - 2 shards (because it has the most usable space)
  - `/path2` - 1 shard
  - `/path3` - 1 shard
  - `/path4` - 1 shard

  Note, however, that this limit is only enforced at the local node level for simplicity in implementation, so if you had multiple nodes, the "limit" for the node is still 2; assuming you had enough nodes that only 2 shards for this index were assigned to this node, they would still both be assigned to `/path1`.

* Switch from ObjectLongHashMap to regular HashMap
* Remove unneeded Files.isDirectory check
* Skip iterating directories when not necessary
* Add message to assert
* Implement different (better) ranking for node paths

  This is the method we discussed

* Remove unused pathHasEnoughSpace method
* Use findFirst instead of .get(0)
* Update for master merge to fix compilation: Settings.putArray -> Settings.putList
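To make the described distribution concrete, here is a minimal, hypothetical sketch (not the actual Elasticsearch code; the names `DataPath` and `distribute` and the exact ranking rule are assumptions) of a ranking that prefers the data path holding the fewest shards of the index so far and breaks ties by usable space. For 5 shards over 4 paths it reproduces the 2/1/1/1 split from the example above and never puts more than ceil(5/4) = 2 shards of the index on one path.

```java
import java.util.*;

// Hypothetical sketch (not the real Elasticsearch implementation) of the
// per-node ranking described above: for each new shard of an index, prefer
// the data path holding the fewest shards of that index so far, breaking
// ties by the most usable space.
public class PerPathRankingSketch {

    record DataPath(String path, long usableBytes) {}

    static Map<String, Integer> distribute(List<DataPath> paths, int shardCount) {
        Map<String, Integer> assigned = new LinkedHashMap<>();
        paths.forEach(p -> assigned.put(p.path(), 0));

        for (int shard = 0; shard < shardCount; shard++) {
            DataPath chosen = paths.stream()
                    .min(Comparator
                            // fewest shards of this index already on the path ...
                            .comparingInt((DataPath p) -> assigned.get(p.path()))
                            // ... then the most usable space
                            .thenComparing(Comparator.comparingLong(DataPath::usableBytes).reversed()))
                    .orElseThrow();
            assigned.merge(chosen.path(), 1, Integer::sum);
        }
        return assigned;
    }

    public static void main(String[] args) {
        // /path1 has far more free space than the other paths.
        List<DataPath> paths = List.of(
                new DataPath("/path1", 10_000_000_000L),
                new DataPath("/path2", 1_000_000_000L),
                new DataPath("/path3", 900_000_000L),
                new DataPath("/path4", 800_000_000L));
        // Prints {/path1=2, /path2=1, /path3=1, /path4=1}
        System.out.println(distribute(paths, 5));
    }
}
```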
@dakrone I'm tempted to close this as there hasn't been overwhelming interest in it and I don't think we are planning on working on sub-node allocation decisions in the foreseeable future. WDYT?
@DaveCTurner I agree, I think this can be closed for now as we've addressed a portion of it. If anyone disagrees let us know and we can revisit!
I hit this issue on ES 2.2.0. All ES data nodes have multiple hard disks with no RAID; they are just a bunch of disks. The usable space of those disks on the same host differs due to some strange implementation details in XFS (even when two XFS filesystems have the same capacity and hold directories and files of the same sizes, they don't always have the same free space; I guess this comes from the history of file creations and deletions).
I pre-create many indices with replicas set to 0 and 50 shards for 50 data nodes. The initial empty shard is very small, just a few KB of disk usage, yet ES surprisingly creates a lot of shards on the least-used disks: for example, if one of the ten disks on a data node has 10 KB more free space, it ends up with about 8 more shards than the other disks. The least-used disks will run out of space because I use an hourly index pattern, so all shards for a contiguous range of hours go to the same disk, while the other disks on that host still have a lot of free space.
I looked at the code a little. At https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/IndexService.java#L327, the IndexService instance is per index, not a singleton, so "dataPathToShardCount" will always be empty because this index has zero replicas. As a result, at https://github.com/elastic/elasticsearch/blob/v2.2.0/core/src/main/java/org/elasticsearch/index/shard/ShardPath.java#L231, "count" will always be null, and class "ShardPath" always selects the least-used disk to allocate a shard. Unluckily, empty shards are very small, so a lot of shards get allocated to the least-used disk.
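To illustrate the failure mode, here is a minimal, hypothetical sketch (not the real `ShardPath` code; the scoring formula and all names are assumptions) where each path is scored by its usable space minus a penalty for shards already counted on it. With an always-empty `dataPathToShardCount` map the penalty is always zero, so every tiny empty shard lands on whichever disk happens to report a few more bytes of free space.

```java
import java.util.*;

// Hypothetical illustration of the behaviour described above: with an empty
// per-path shard-count map, selection degenerates to "most free space wins",
// and every small empty shard is placed on the same disk.
public class PathSelectionSketch {

    record DataPath(String path, long usableBytes) {}

    static DataPath selectBestPath(List<DataPath> paths,
                                   Map<String, Long> dataPathToShardCount,
                                   long estimatedShardSizeBytes) {
        DataPath best = null;
        long bestScore = Long.MIN_VALUE;
        for (DataPath p : paths) {
            Long count = dataPathToShardCount.get(p.path()); // null for a new zero-replica index
            long penalty = (count == null) ? 0 : count * estimatedShardSizeBytes;
            long score = p.usableBytes() - penalty;          // penalty is always 0 here
            if (score > bestScore) {
                bestScore = score;
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<DataPath> disks = List.of(
                new DataPath("/data1", 1_000_000_010L), // only 10 bytes "freer"
                new DataPath("/data2", 1_000_000_000L));
        Map<String, Long> counts = new HashMap<>();     // never populated, as in the bug

        // Ten tiny empty shards barely change free space and the count map is
        // never updated, so every shard is placed on /data1.
        for (int shard = 0; shard < 10; shard++) {
            System.out.println("shard " + shard + " -> "
                    + selectBestPath(disks, counts, 5L * 1024 * 1024 * 1024).path());
        }
    }
}
```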
This issue isn't limited to the zero-replica use case: the current ES logic always prefers the least-used disk, and because empty shards are so small this consistently produces an uneven distribution.
I feel the "dataPathToShardCount" map should be global across all indices on a node, not local to a single index, but maybe there is a better solution.
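As a rough sketch of that idea (purely illustrative, with made-up names, not a proposed patch), a node-wide map updated on every shard creation would let even brand-new zero-replica indices see how crowded each path already is:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the suggestion above: one shard-count map per node,
// shared by all indices on that node, instead of one per IndexService.
public class NodeWideShardCounts {

    // path.data directory -> number of shards (of any index) currently on it
    private final Map<String, Integer> shardsPerPath = new ConcurrentHashMap<>();

    String pickPath(Collection<String> dataPaths) {
        // Prefer the path with the fewest shards overall; a real version
        // would also weigh free space.
        return dataPaths.stream()
                .min(Comparator.comparingInt(p -> shardsPerPath.getOrDefault(p, 0)))
                .orElseThrow();
    }

    void recordShard(String path) {
        shardsPerPath.merge(path, 1, Integer::sum);
    }

    public static void main(String[] args) {
        NodeWideShardCounts node = new NodeWideShardCounts();
        List<String> paths = List.of("/data1", "/data2", "/data3");
        // Shards from different hourly indices now spread across paths
        // instead of all landing on the single emptiest disk.
        for (int i = 0; i < 6; i++) {
            String p = node.pickPath(paths);
            node.recordShard(p);
            System.out.println("shard " + i + " -> " + p);
        }
    }
}
```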
My workaround is to create a RAID 0 array over the disks on each node, but RAID 0 is too risky when a host has about 10 disks; maybe I should choose RAID 6 instead.