Allocation explain API is stuck in AWAITING_INFO for some shards even when the deciders are returning NO
In this case, allocation explain should return a NO decision because we already know the deciders are returning NO. Instead, the API ends up calling AsyncShardFetch.asyncFetch(), which it should not do.
This also impacts manual reroute flows, leading to an unnecessary AsyncShardFetch.asyncFetch() call.
Describe the bug
Both these issues are happening because the ShardsBatchGatewayAllocator.InternalReplicaBatchShardAllocator.hasInitiatedFetching() function is not correctly implemented and always returns true.
Why does hasInitiatedFetching always return true?
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Lines 518 to 521 in 1305002
The hasInitiatedFetching() function checks whether a batchId exists for a shard.
However, this will always be true, because the batchId is defined for the shard before this function is ever called.
Flow for allocation explain API
Here we will explain the flow, show why hasInitiatedFetching() always returns true, and how that causes the allocation explain API to return AWAITING_INFO instead of NO.
Assume that we have a replica shard for which the deciders are returning a NO decision (for example, due to low disk space or too many replicas), and the allocation explain API is called for this shard.
Begin at this code path:
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Lines 384 to 398 in 1305002
Here we see that createAndUpdateBatches() is called first:
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Line 224 in 1305002
In this function, we assign a batchId for all unassigned shards, and create a new one if required.
At this point, every unassigned shard would have a valid batch id.
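The ordering problem can be illustrated with a toy model. This is a minimal sketch, not the OpenSearch code: the class `BatchIdSketch`, its map, and both method bodies are hypothetical stand-ins, and only the method names are borrowed from the issue. It assumes the check amounts to "does a batch id exist for this shard?", as described above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: names mirror the issue, bodies are hypothetical.
final class BatchIdSketch {
    private final Map<String, String> shardToBatchId = new HashMap<>();
    private int nextBatch = 0;

    // Stand-in for createAndUpdateBatches(): every unassigned shard gets a batch id.
    void createAndUpdateBatches(List<String> unassignedShards) {
        for (String shard : unassignedShards) {
            shardToBatchId.computeIfAbsent(shard, s -> "batch-" + nextBatch++);
        }
    }

    // Stand-in for the current check: true as soon as a batch id exists.
    boolean hasInitiatedFetching(String shard) {
        return shardToBatchId.containsKey(shard);
    }

    public static void main(String[] args) {
        BatchIdSketch allocator = new BatchIdSketch();
        allocator.createAndUpdateBatches(List.of("shard-0", "shard-1"));
        // No fetch has been started, yet the check already reports true.
        System.out.println(allocator.hasInitiatedFetching("shard-0")); // prints "true"
    }
}
```

Because batching always runs before the check, the check carries no information about whether a fetch was ever started.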
Then we call makeAllocationDecision():
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Line 396 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 104 to 112 in 1305002
This creates a fetchData supplier and passes it to getUnassignedShardAllocationDecision():
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 171 to 197 in 1305002
Line 183 checks hasInitiatedFetching(), since explain=true and the deciders are returning NO:
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Line 183 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Lines 518 to 521 in 1305002
hasInitiatedFetching() checks if a shard has a valid batch ID, which will always be true at this point, as we have previously established.
So we go into this block and call fetchData():
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 192 to 195 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java
Line 180 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java
Lines 257 to 274 in 1305002
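To make the branch taken on line 183 concrete, here is a small sketch that evaluates the guard the issue quotes there, `allocationDecision.type() != Decision.Type.YES && (!explain || !hasInitiatedFetching(shardRouting))`, for the relevant input combinations. The enum `DecisionType` and the method `returnsDecisionWithoutFetching` are hypothetical names introduced for illustration; only the boolean expression is taken from the issue.

```java
// Toy model of the guard quoted in the issue; the enum and method names are hypothetical.
final class EarlyReturnGuardSketch {
    enum DecisionType { YES, NO, THROTTLE }

    // Mirrors: decision != YES && (!explain || !hasInitiatedFetching)
    static boolean returnsDecisionWithoutFetching(DecisionType type, boolean explain, boolean hasInitiatedFetching) {
        return type != DecisionType.YES && (!explain || !hasInitiatedFetching);
    }

    public static void main(String[] args) {
        // explain=true with the check stuck at true: the guard is false, so the code falls through to fetchData().
        System.out.println(returnsDecisionWithoutFetching(DecisionType.NO, true, true));  // prints "false"
        // explain=false: the guard is true, so NO is returned without fetching.
        System.out.println(returnsDecisionWithoutFetching(DecisionType.NO, false, true)); // prints "true"
        // explain=true with a check that reports "no fetch yet": NO is returned without fetching.
        System.out.println(returnsDecisionWithoutFetching(DecisionType.NO, true, false)); // prints "true"
    }
}
```

The first case is the one hit by allocation explain today; the third case is what a check that reflects real fetch state would produce.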
AsyncShardFetch.fetchData() triggers an asyncFetch() and returns the fetcher object. At this point, the fetch has only just started asynchronously, so we do not have any data from the nodes yet and nodeShardStores is null.
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Line 194 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardAllocator.java
Lines 231 to 247 in 1305002
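The "no data on the first call" behaviour can be sketched with a synchronous stand-in for the asynchronous fetch. Everything here is an assumption made for illustration: `AsyncFetchSketch`, its fields, and `explainResult` are hypothetical, and the real AsyncShardFetch is more involved. The sketch only shows why a caller that has just started a fetch has nothing to base a decision on.

```java
import java.util.Map;

// Toy model only: a synchronous stand-in for the asynchronous fetch described in the issue.
final class AsyncFetchSketch {
    private Map<String, Long> nodeShardStores; // stays null until the nodes have responded
    private boolean fetchInFlight;
    int asyncFetchCalls = 0;

    // Stand-in for fetchData(): starts a fetch if nothing is cached, then returns whatever is cached.
    Map<String, Long> fetchData() {
        if (nodeShardStores == null && !fetchInFlight) {
            fetchInFlight = true;
            asyncFetchCalls++;
        }
        return nodeShardStores;
    }

    // Stand-in for the onResponse() callback that fires once the nodes have answered.
    void onResponse(Map<String, Long> responses) {
        nodeShardStores = responses;
        fetchInFlight = false;
    }

    // A null result means the caller can only report AWAITING_INFO.
    static String explainResult(Map<String, Long> stores) {
        return stores == null ? "AWAITING_INFO" : "NO";
    }

    public static void main(String[] args) {
        AsyncFetchSketch fetcher = new AsyncFetchSketch();
        System.out.println(explainResult(fetcher.fetchData())); // prints "AWAITING_INFO"
        fetcher.onResponse(Map.of("node-1", 1024L));
        System.out.println(explainResult(fetcher.fetchData())); // prints "NO"
    }
}
```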
Since nodeShardStores is null, we return FETCHING_SHARD_DATA as the final result, which internally maps to AWAITING_INFO. This is why we see the AWAITING_INFO result instead of a NO result in the allocation explain API the first time.
Eventually, when all the nodes respond and the async fetch finally completes (this could be seconds or minutes later), onResponse() of asyncFetch() is called, which calls processAsyncFetch().
Here we trigger a new reroute, this time with explain=false:
OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java
Lines 220 to 235 in 1305002
This eventually calls allocateUnassignedBatch(), which decides which shards are eligible or ineligible and calls fetchData accordingly.
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 120 to 138 in 1305002
To decide if the shard is eligible, it calls getUnassignedShardAllocationDecision():
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 171 to 197 in 1305002
This time, this expression evaluates to true because explain=false, so the function returns NO, and the shard is marked ineligible.
This is expected behaviour: when deciders return NO, the shard should be ineligible, so this is fine.
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Line 183 in 1305002
After this loop completes, we call InternalReplicaBatchShardAllocator.fetchData(), with this shard being marked as ineligible:
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Line 138 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Lines 484 to 499 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Line 524 in 1305002
In the above code, we remove all the ineligible shards from the batch, which also wipes the shard entry from the batch cache:
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Lines 634 to 640 in 1305002
So any information about this shard is now removed from the batch cache. So all the work from the previous fetchData also goes to waste.
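The whole cycle described so far, and what it means for repeated allocation explain calls, can be simulated in a few lines. This is a sketch under stated assumptions, not the real allocator: `ExplainLoopSketch`, `allocationExplain`, and `onFetchCompleted` are hypothetical names, the cache is a plain map, and the follow-up reroute is collapsed into the completion callback.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy simulation of the cycle described in the issue; all names and data structures are hypothetical.
final class ExplainLoopSketch {
    private final Set<String> shardsWithBatchId = new HashSet<>();
    private final Map<String, Map<String, Long>> batchCache = new HashMap<>();
    int asyncFetches = 0;

    // Allocation explain for a shard whose deciders say NO.
    String allocationExplain(String shard) {
        shardsWithBatchId.add(shard);                                     // batch id is assigned first
        boolean hasInitiatedFetching = shardsWithBatchId.contains(shard); // therefore always true
        if (!hasInitiatedFetching) {
            return "NO";                                                  // never reached with the current check
        }
        if (!batchCache.containsKey(shard)) {
            asyncFetches++;                                               // a new async fetch is started
            return "AWAITING_INFO";
        }
        return "NO";
    }

    // Fetch completes, then the follow-up reroute (explain=false) marks the shard ineligible
    // and removes it from the batch, which wipes its cache entry.
    void onFetchCompleted(String shard, Map<String, Long> storeData) {
        batchCache.put(shard, storeData);
        batchCache.remove(shard);
        shardsWithBatchId.remove(shard);
    }

    public static void main(String[] args) {
        ExplainLoopSketch cluster = new ExplainLoopSketch();
        for (int call = 1; call <= 3; call++) {
            System.out.println("call " + call + ": " + cluster.allocationExplain("shard-0"));
            cluster.onFetchCompleted("shard-0", Map.of("node-1", 1024L));
        }
        System.out.println("async fetches started: " + cluster.asyncFetches);
    }
}
```

In this model every call reports AWAITING_INFO and starts another fetch, because the data gathered by the previous fetch is discarded before the next call can use it.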
So when we call allocation explain a second time, the cache will be empty, and we will restart the whole flow and return AWAITING_INFO again, forever.
Flow for manual reroute
The flow for manual reroute is affected in a very similar way.
When we call the manual reroute API, the first API call is made with explain=true. This is hardcoded in the logic for manual reroutes.
Assume again that the deciders are returning a NO decision for our replica shard for any reason.
We will start with this entry point for the flow:
OpenSearch/server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Lines 197 to 221 in 1305002
We check for replica shard eligibility here:
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 126 to 135 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Lines 183 to 191 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Line 183 in 1305002
if (allocationDecision.type() != Decision.Type.YES && (!explain || !hasInitiatedFetching(shardRouting))) {
Now because hasInitiatedFetching returns true, the entire expression above evaluates to false and the function getUnassignedShardAllocationDecision() returns null.
So we mark the shard as eligible.
**This is wrong. At this point, we already know the deciders are returning NO, so the shard should not have been marked eligible.**
Because we incorrectly mark the shard as eligible, we trigger fetchData() here, which eventually calls AsyncShardFetch.fetchData():
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
Line 138 in 1305002
This again triggers a second reroute, this time with explain=false (hardcoded).
Now in the second reroute, because explain=false, the shard is correctly marked as ineligible and is wiped from the cache.
So essentially we triggered the first fetchData and the second reroute for no reason. The first reroute should have returned a NO decision without ever calling fetchData, and we should never have called a second reroute.
FIX
We need to fix the hasInitiatedFetching() function to correctly check whether fetching has happened at least once.
The implementation in non-batch mode is correct, so we do not see these issues there.
The only case where we SHOULD call AsyncShardFetch.fetchData() for decision=NO is when we already have the data from all nodes available in the cache. This way we know that when we call AsyncShardFetch.fetchData(), it won't trigger a new AsyncShardFetch.asyncFetch(), because all the nodes in the cache will have an entry:
OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java
Line 146 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java
Lines 172 to 181 in 1305002
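One possible shape for a corrected check is sketched below. This is purely illustrative and is not the fix proposed or merged for this issue: `FixedCheckSketch`, `assignBatch`, `markFetchStarted`, and the set of started batches are all hypothetical. The idea it demonstrates is only that the check should depend on whether a fetch was ever started, not on whether a batch id exists.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a check that tracks fetch state separately from batch membership.
final class FixedCheckSketch {
    private final Map<String, String> shardToBatchId = new HashMap<>();
    private final Set<String> batchesWithFetchStarted = new HashSet<>();

    // Batching alone no longer makes the check return true.
    void assignBatch(String shard, String batchId) {
        shardToBatchId.put(shard, batchId);
    }

    // Called at the point where a fetch is actually started for a batch.
    void markFetchStarted(String batchId) {
        batchesWithFetchStarted.add(batchId);
    }

    // True only if the shard's batch has had a fetch started at least once.
    boolean hasInitiatedFetching(String shard) {
        String batchId = shardToBatchId.get(shard);
        return batchId != null && batchesWithFetchStarted.contains(batchId);
    }

    public static void main(String[] args) {
        FixedCheckSketch allocator = new FixedCheckSketch();
        allocator.assignBatch("shard-0", "batch-0");
        System.out.println(allocator.hasInitiatedFetching("shard-0")); // prints "false"
        allocator.markFetchStarted("batch-0");
        System.out.println(allocator.hasInitiatedFetching("shard-0")); // prints "true"
    }
}
```

With a check like this, the explain=true path for a NO decision would return NO directly unless a fetch had already been started for the shard's batch.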
Essentially, in the allocation explain or reroute flow for shards with decision=NO, we should only call AsyncShardFetch.fetchData() when we know that the cache already has the data, so we can guarantee that a new asyncFetch would not get triggered for this ineligible shard.
The only point of using the data from the node cache in this case is that we can also populate shard store info for each node along with the NO decision.
When we call getAllocationDecision() with this non-empty cache, the function augments the decision with the number of matching bytes present on each node for the specific shard. This gets appended to the NO decision here:
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardAllocator.java
Line 231 in 1305002
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardAllocator.java
Lines 271 to 287 in 1305002
Check augmentExplanationsWithStoreInfo() for more details about this:
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardAllocator.java
Lines 377 to 393 in 1305002
Related component
Cluster Manager
To Reproduce
Expected behavior
Allocation explain API should return NO and not get stuck in AWAITING_INFO