[BUG] Async shard fetches taking up GBs of memory causing ClusterManager JVM to spike with large no. of shards (> 50k) #5098
I tried reproducing the issue and found that approximately 82-83% of the heap is being used by GatewayAllocator. Repro steps:
On analysis, I found that DiscoveryNode is part of the AsyncShardFetch response and is consuming a considerable amount of memory.
There is another improvement area around the cache strategy used in AsyncShardFetch. It occupies a significant chunk of the master's memory (about 60%) during large cluster restarts. The cache currently stores two types of data, Repro steps were the same as here: #5098 (comment)
Today, responses from all the nodes are cached for each unassigned shard so that GatewayAllocator can build the complete picture. Empty responses are cached as well, which could be replaced by a dummy object. This would reduce the DiscoveryNode overhead significantly from n (nodes) * m (unassigned shards) to m only. OpenSearch/server/src/main/java/org/opensearch/gateway/TransportNodesListGatewayStartedShards.java Line 218 in 197086f
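A minimal, self-contained sketch of that dummy-object idea, using simplified stand-in types rather than the real OpenSearch classes (all names here are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the real per-node response; in the real code this carries
// a full DiscoveryNode plus the shard's on-disk state.
class NodeShardResponse {
    final String nodeId;
    final String allocationId;    // null means "this node has no copy of the shard"
    NodeShardResponse(String nodeId, String allocationId) {
        this.nodeId = nodeId;
        this.allocationId = allocationId;
    }
}

class ShardFetchCache {
    // One shared sentinel replaces the per-node "empty" responses, so the heavyweight
    // per-node objects are retained only for nodes that actually hold shard data.
    private static final NodeShardResponse EMPTY = new NodeShardResponse("", null);

    private final Map<String, NodeShardResponse> responsesByNode = new ConcurrentHashMap<>();

    void put(String nodeId, NodeShardResponse response) {
        // Cache the shared sentinel instead of a distinct empty response per node.
        responsesByNode.put(nodeId, response.allocationId == null ? EMPTY : response);
    }

    boolean hasCopyOnNode(String nodeId) {
        NodeShardResponse r = responsesByNode.get(nodeId);
        return r != null && r != EMPTY;
    }
}
```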
Overall Proposal
Context
During node-left and node-join we do a reroute to assign unassigned shards. For this assignment, the Cluster Manager fetches the shard metadata asynchronously on a per-shard basis. This metadata contains information about the shard and its allocation, which we need in order to take a new allocation decision. reroute is the crucial method of AllocationService that takes care of this flow for the whole cluster.
Once we receive the data from all the nodes (all nodes relevant for the shardId), we build the allocation decision and perform a reroute. [I'll explain the code flow in detail in the upcoming section]
Problem
If there are 50K/100K shards, during cluster restart scenarios we end up making transport calls for each and every shard. All transport threads are doing this same work (the same code flow for different shardIds), which chokes the transport layer and makes the cluster unstable. OOM kills are one result of this situation, as the calls come back to back.
Acronyms:
Current flow
Let's understand the code flow a bit better.
Node left scenario
Coordinator.java has a method OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java Lines 318 to 320 in 90678c2
The execute method of this particular executor triggers AllocationService: Line 123 in 90678c2
The above method unassigns shards that are associated with a node which is no longer part of the cluster. The executor always passes the reroute flag as true. This method first fails all the shards present on the bad nodes, then triggers the reroute method of AllocationService. OpenSearch/server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java Lines 317 to 318 in 90678c2
reroute calls the internal private reroute method, which triggers the allocation of existing unassigned shards. OpenSearch/server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java Line 539 in 90678c2
OpenSearch/server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java Lines 544 to 570 in 90678c2
Now for each primary and replica shard, the allocateUnassigned method of the GatewayAllocator class is called, which triggers innerAllocate.
innerAllocate calls the allocateUnassigned method of BaseGatewayShardAllocator, which tries to make the allocation decision: OpenSearch/server/src/main/java/org/opensearch/gateway/BaseGatewayShardAllocator.java Line 76 in 90678c2
The makeAllocationDecision method is written separately for primary and replica in PSA (PrimaryShardAllocator) and RSA (ReplicaShardAllocator) respectively. PSA calls the fetchData method of IPSA (written in the GatewayAllocator file) to fetch the metadata about this shard from the data nodes using a transport call: OpenSearch/server/src/main/java/org/opensearch/gateway/PrimaryShardAllocator.java Line 114 in 90678c2
GA (GatewayAllocator) maintains a map where the key is the shardId and the value is a fetcher object; this fetcher object is used to actually make the transport call and fetch the data. OpenSearch/server/src/main/java/org/opensearch/gateway/GatewayAllocator.java Lines 284 to 297 in 90678c2
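As a rough illustration of that bookkeeping, here is a minimal sketch with simplified stand-in types (the real map keys on ShardId and holds full AsyncShardFetch instances, as described above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class GatewayAllocatorSketch {
    // Stand-in fetcher: the real one issues the transport call and caches per-node responses.
    static class AsyncShardFetcher {
        final String shardId;
        AsyncShardFetcher(String shardId) { this.shardId = shardId; }
        void fetchData() { /* trigger the async transport call for this shard */ }
    }

    // Key: shard id, value: the fetcher that owns the transport call and its cache.
    private final Map<String, AsyncShardFetcher> asyncFetches = new ConcurrentHashMap<>();

    void allocateUnassigned(String shardId) {
        // Reuse an existing fetcher if one is already in flight for this shard.
        AsyncShardFetcher fetcher = asyncFetches.computeIfAbsent(shardId, AsyncShardFetcher::new);
        fetcher.fetchData();
    }
}
```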
The fetchData method is written in the AsyncShardFetch class. As this is an async method, we intelligently trigger a reroute in the async response of this method. Once the data is returned, the reroute will trigger the flow again but won't call the data nodes, as it will see that the required data has already been received in the cache. This is because we keep looking in the cache and checking which nodes are still pending a fetch, using the following call within the fetchData method:
Based on the result of this data - AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShards.NodeGatewayStartedShards> shardState OpenSearch/server/src/main/java/org/opensearch/gateway/BaseGatewayShardAllocator.java Lines 83 to 91 in 90678c2
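For intuition, here is a condensed, hypothetical sketch of this check-cache / fetch-missing / reroute-on-response pattern; it uses plain Java placeholders, not the actual AsyncShardFetch implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class AsyncFetchSketch {
    // nodeId -> response; a null value marks "fetch in progress" for that node.
    private final Map<String, Object> cache = new HashMap<>();

    synchronized FetchResult fetchData(Set<String> relevantNodes) {
        Set<String> nodesToFetch = new HashSet<>();
        for (String nodeId : relevantNodes) {
            if (!cache.containsKey(nodeId)) {
                cache.put(nodeId, null);      // mark as fetching so we don't ask the same node twice
                nodesToFetch.add(nodeId);
            }
        }
        if (!nodesToFetch.isEmpty()) {
            // Fire the async transport call; the callback fills the cache and schedules a
            // reroute, whose next pass will find the data already cached.
            asyncFetch(nodesToFetch, this::onResponse);
        }
        boolean allDataPresent = cache.values().stream().noneMatch(v -> v == null);
        return new FetchResult(allDataPresent);
    }

    private void onResponse(Map<String, Object> responses) {
        synchronized (this) { cache.putAll(responses); }
        triggerReroute();   // placeholder for the follow-up reroute
    }

    private void asyncFetch(Set<String> nodes, java.util.function.Consumer<Map<String, Object>> listener) { }
    private void triggerReroute() { }

    record FetchResult(boolean hasData) { }
}
```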
Node join scenario
JoinTaskExecutor has the code for the actual task which gets executed on node-join; it triggers a reroute as part of its flow in case the cluster manager changes or a new node joins. If the cluster state is updated with existing nodes, then as part of applying the cluster state, IndicesClusterStateService triggers a shard action - shard/started or shard/failed. Both the started and failed executors (ShardStateAction class) have a handler for when the cluster state is published from the cluster manager, and they trigger a reroute: OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/JoinTaskExecutor.java Lines 206 to 209 in 90678c2
Shard Started: OpenSearch/server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java Lines 829 to 836 in 90678c2
Shard Failed: OpenSearch/server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java Lines 546 to 559 in 90678c2
So we know that ultimately it will trigger the whole flow of assigning unassigned shards.
Solution
The key bottleneck we found by profiling is too many calls to the same fetchData transport action (too much serialization/deserialization), as well as the in-memory load of too many DiscoveryNode objects from the moment a response is received until it is garbage collected. *More analysis of the heap dump needs to be done to understand more data points on the memory side.*
Phase 1: Batch calls for multiple shards
The first solution is to batch the call for multiple shards, which is done once per shard today. This will require a different request type, as we take multiple data structures similar to what we send today for one shard. The response structure will also change accordingly. This batching can be done at the node level or with a fixed batch count, depending on performance or failure scenarios.
Phase 2: Reduce DiscoveryNode objects
We must reduce the DiscoveryNode objects, as they are created via new when we receive the response, as part of this code:
Code inside super
Part 1: We reduce the number of these objects based on the number of calls we make - for example, we make batches of 100 and send 1000 calls, so only 1000 DiscoveryNode objects will be present, depending on which node each call is sent to. This way we reduce DiscoveryNode objects as well as save on serialization/deserialization work. We also need to keep a version check (> 3.0) for the destination node so that BWC is maintained (as we know older-version nodes won't have this handler).
Changes required
We need to change the flow starting from AllocationService all the way down to the transport call (fetchData).
Surrounding changes:
Core changes required for the new transport call:
will change to a different type like
Changes for failure handling or correctness
OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java Lines 261 to 270 in 90678c2
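To make the Phase 1 shapes concrete, here is a hedged sketch of what a per-node batched request/response pair could look like; all names and fields are hypothetical, not the actual types proposed in the change:

```java
import java.util.List;
import java.util.Map;

// Hypothetical batched request: one transport call per node carries many shard ids,
// instead of one call per shard per node.
record BatchedShardFetchRequest(String targetNodeId, List<String> shardIds) { }

// Hypothetical per-shard state entry (allocation id, primary flag, store exception, ...).
record ShardStartedState(String allocationId, boolean primary, String storeException) { }

// Hypothetical batched response: the node identity appears once, and per-shard
// metadata is keyed by shard id instead of being wrapped per response.
record BatchedShardFetchResponse(String nodeId, Map<String, ShardStartedState> shardStates) { }
```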
Part 2: We keep the DiscoveryNode objects limited to exactly the number of nodes. As we know that only a fixed number of nodes should exist in, join, or drop from the cluster, we keep a common cache for this information and refer to it (instead of doing new) whenever needed, based on the ephemeralId or some other relevant id. This way we can save on memory. As the first solution already gives some relief in the number of DiscoveryNode objects, we can debate whether this is truly required based on the gains of part 1 versus part 2.
Impact
A benchmark was run to verify that we can scale well with the batching approach. The cluster can scale up to 100-120K shards without facing any OOM kill, whereas the current code faces an OOM kill with 45K shards. [more details on benchmarking will be attached below]
Note: the current POC code also contains changes where we don't rely on the BaseNodeResponse class, so the number of DiscoveryNode objects is reduced to the number of calls only, and right now that is per node, so part 2 of the phase 2 solution is automatically covered. But based on the discussion on these approaches, I'll write the actual code.
Pending Items:
Note 2: Integration tests are failing; need to look at those.
AI: Also need to explore whether we need to move the GatewayAllocator async fetch call to the generic threadpool instead of using the clusterManager updateTask thread.
@sachinpkale @Bukhtawar @andrross @shwetathareja looking for feedback on the proposal.
@sachinpkale @Bukhtawar @andrross @shwetathareja gentle reminder. |
@reta - would love some feedback. |
Thanks @amkhar for explaining the problem in detail. A couple of points on the Phase 1 solution for batching (more around implementation):
Regarding Phase 2 for reducing the number of DiscoveryNode objects: be careful when changing
This also means you will need to keep per-shard fetching marked as deprecated in 2.x; it can be removed only in 3.0.
will it fetch only for failed shards or all the shards in that batch? What is the current retry mechanism?
Today, the cache can get the per-shard view easily. Now it would have to iterate over all entries in the top-level map with NodeId as the key to check the shard status. Right?
How would it impact
Thanks Aman, per-node calls make sense. I am assuming this is only being done for assigning primary shards first and then replica shards subsequently. It would be easier to review the code once you have the draft ready.
Yes.
We have this cache object today, it'll be utilized to know which shards are already being fetched for a particular node.
Yes
No and No
We have this piece of code per shard which is executed on the data node. OpenSearch/server/src/main/java/org/opensearch/gateway/TransportNodesListGatewayStartedShards.java Lines 153 to 218 in 03bc192
A for loop is required to run it for multiple shards; I don't think running this in parallel on new threads is a good idea. So, yes, we can add a timeout for per-shard execution: if any one shard's fetch (one iteration of the loop) takes too long, we can mark that shard as failed with exception type OpenSearchTimeout and return the response accordingly.
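A minimal sketch of that per-shard time budget inside the data-node loop, using placeholder types rather than the real TransportNodesListGatewayStartedShards code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchedShardListerSketch {
    record ShardResult(String allocationId, String failure) { }

    Map<String, ShardResult> listShards(List<String> shardIds, long perShardTimeoutMillis) {
        Map<String, ShardResult> results = new HashMap<>();
        for (String shardId : shardIds) {
            long start = System.nanoTime();
            ShardResult result = loadShardStateFromDisk(shardId);   // one iteration of the loop
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMillis > perShardTimeoutMillis) {
                // Mark just this shard as failed with a timeout-style failure,
                // and keep going so the rest of the batch still returns data.
                results.put(shardId, new ShardResult(null, "timed out after " + elapsedMillis + " ms"));
            } else {
                results.put(shardId, result);
            }
        }
        return results;
    }

    private ShardResult loadShardStateFromDisk(String shardId) {
        return new ShardResult("alloc-" + shardId, null);   // placeholder for the real on-disk lookup
    }
}
```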
Even today, we don't assume any timeouts. The listener gets a response in
In the upcoming change: as we'll have a shardEntry in the cache (mostly concurrent), if a request times out or we receive a full failure from a node, then on the leader side we'll reset the shardEntry so it can be re-fetched by the next set of calls (until then, new calls won't be made for those shards - only after we update this cache, which is our source of truth).
Not possible, as discussed above. If the shardId is the same, we'll send the new request only after marking the previous one as failed or updating the common cache object as restartFetching.
They will be discarded; the current logic does this for nodeEntries, and we'll do the same thing for shardEntries. The only change is that today, in the per-shard cache, we used to mark one entry of the map as fetching=false; now we'll need to do it for a batch of entries, one per shard, in our newly created per-node cache.
As the cache contains different entries for every shard, two different requests for different sets of shards can easily update their respective sets of entries in the map, and we'll update the response only after all the shards have been fetched from that particular node.
Yes, partially - the overall count of DiscoveryNode objects in memory will be automatically reduced as the number of responses is reduced. Basically, if we're returning the responses from one node only, we don't need to keep the BaseNodeResponse class as the parent class of the response object, because we know it's the same DiscoveryNode object for all shards (which are sent in one single request). As I mentioned previously: #5098 (comment)
Phase 2 will only be applicable in the case where we set the batchSize as 100 and there are 1000 shards on the node. Then we may end up calling the new transport action 10 times and getting 10 DiscoveryNode objects for the same node. So, to optimize even that, we could see if we can keep just one DiscoveryNode object per node and not store every response's DiscoveryNode object in memory if the ephemeralId or uniqueId of the node is the same (then it can be referred from, or updated in, some cache maintained on the cluster manager side).
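A minimal sketch of that deduplication idea, assuming a cluster-manager-side cache keyed by the node's ephemeralId (the NodeIdentity record is a simplified stand-in for DiscoveryNode):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DiscoveryNodeInterner {
    // Simplified stand-in for DiscoveryNode; the real class carries addresses, roles, attributes, etc.
    record NodeIdentity(String ephemeralId, String nodeName, String address) { }

    private final Map<String, NodeIdentity> byEphemeralId = new ConcurrentHashMap<>();

    // Return the cached instance when the ephemeral id matches, instead of keeping
    // every response's freshly deserialized copy alive in memory.
    NodeIdentity intern(NodeIdentity fromResponse) {
        return byEphemeralId.merge(fromResponse.ephemeralId(), fromResponse,
            (cached, incoming) -> cached);   // keep the existing instance on a hit
    }
}
```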
For now I only see one implementation, and this method is not overridden there. I think if a new ExistingShardsAllocator implementation is provided by a new plugin, it would need to add the extra methods which we're planning to add now (allocation of shards in a batch per node).
Handling this would be difficult and would make the code complicated (imagine all the bookkeeping done for both types of requests). So it'll be easier to enable batched requests only when all nodes have the handler for the batch request. Otherwise we'd need to write more code just to handle the different types of responses and merge them to get a final view :)
The current retry mechanism marks which node returned a failure; based on that, in the next round only those nodes are picked up. For failures, we'll mark only the specific shards as failed in the cache if only some of them failed while fetching from a node. We won't fail the whole batch. The whole batch will be failed only if the node itself returns a failure or gets disconnected.
The cache is only being used for bookkeeping purposes today, to know which nodes are fetching or whether fetching is done/failed, etc. We also store the result in the cache, which is a map where the key is the nodeId and the value is a NodeEntry, so for each node where we're sending the request we keep the result in the cache. Then the overall decision is made based on this map. Now, for making an actual allocation decision, we'll need a view of
So, we do need to traverse the outer node-level map and build an internal shard-level view to make the final decision (as it's done today in PSA). OpenSearch/server/src/main/java/org/opensearch/gateway/PrimaryShardAllocator.java Lines 132 to 182 in 87a833f
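A compact sketch of that pivot from the node-keyed cache to the per-shard view the allocators need (all types are simplified placeholders):

```java
import java.util.HashMap;
import java.util.Map;

class NodeToShardViewSketch {
    // nodeId -> (shardId -> fetched shard state), mirroring a per-node cache entry.
    static Map<String, Map<String, String>> nodeLevelCache = new HashMap<>();

    // Build the per-shard view PSA/RSA needs: shardId -> (nodeId -> fetched shard state).
    static Map<String, Map<String, String>> buildShardLevelView() {
        Map<String, Map<String, String>> shardView = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> nodeEntry : nodeLevelCache.entrySet()) {
            String nodeId = nodeEntry.getKey();
            for (Map.Entry<String, String> shardEntry : nodeEntry.getValue().entrySet()) {
                shardView.computeIfAbsent(shardEntry.getKey(), k -> new HashMap<>())
                         .put(nodeId, shardEntry.getValue());
            }
        }
        return shardView;
    }
}
```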
Thanks @amkhar for collating the knowledge and attaching my draft PR for it. Here are a few points that I would like to clarify:
I think you mean that the diff should be between the atomic set you maintain where a fetch is in progress and the unassigned shards list maintained by RoutingNodes. So once the fetch completes for the current batch, we clear it from our atomic set. And if the allocators (PSA/RSA) are unable to make a decision with the currently available data, those shards will still be in an unassigned state, and the next round of batching (reroute call) will pick them up as a diff.
I think you are confusing this with the use of the cache here. The cache stores interim responses from nodes as and when they are received over transport; the key will still be the NodeId (String). What you can do to maintain the diff is use the current batchOfShards request object that is being sent to the AsyncShardFetch class to get the currently fetching shards - that can be the atomic set.
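A small sketch of that diff, assuming an atomic set of shard ids whose batched fetch is in flight; the names are illustrative only:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class BatchDiffSketch {
    // Shard ids for which a batched fetch is currently in flight.
    private final Set<String> fetchingShards = ConcurrentHashMap.newKeySet();

    // Next batch = unassigned shards (from RoutingNodes) minus shards already being fetched.
    List<String> nextBatch(List<String> unassignedShards) {
        Set<String> remaining = new HashSet<>(unassignedShards);
        remaining.removeAll(fetchingShards);
        List<String> batch = List.copyOf(remaining);
        fetchingShards.addAll(batch);    // mark them as in flight
        return batch;
    }

    // Called once the batched response arrives; shards that still could not be assigned
    // stay unassigned and are picked up again by the next reroute's diff.
    void onBatchCompleted(List<String> completedShards) {
        completedShards.forEach(fetchingShards::remove);
    }
}
```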
In the current scenario: 1 - The first type of failure happens when a catastrophic event occurs during the transport call. These fetches are restarted by doing another round of reroute. OpenSearch/server/src/main/java/org/opensearch/gateway/AsyncShardFetch.java Lines 184 to 189 in 17ee1ce
2 - For the second type of failure, the response NodeGatewayStartedShard itself stores the exception. OpenSearch/server/src/main/java/org/opensearch/gateway/TransportNodesListGatewayStartedShards.java Lines 349 to 353 in 17ee1ce
These exceptions are handled by PSA/RSA accordingly. OpenSearch/server/src/main/java/org/opensearch/gateway/PrimaryShardAllocator.java Lines 344 to 361 in 17ee1ce
We should not change this existing logic: even with batching, in the case of catastrophic physical failures (category 1) we need to restart the entire batch, because that failure applies to the node as a whole. Category 2 failures will be handled by PSA/RSA as they are today.
@shwetathareja @amkhar I still think that our main culprit here is the GatewayAllocator fetching part, which eats memory and has the overhead of a larger number of transport calls. The current solution from @amkhar is trying to revamp the assignment part, which is not our culprit. We could apply separation of concerns here by separating the fetching part and abstracting it out into a different, new interface. We build the view of the data by fetching (with the way currently proposed by @amkhar, which handles failures and has retries, i.e. a better version of my draft PR) and send the data to PSA/RSA per shard as is done today. The downside is we have to loop over data from all nodes (which will be finite in number and in general fewer than the number of shards) to get the data for the required shard that PSA/RSA wants, like I have demonstrated in a crude way in my POC code: OpenSearch/server/src/main/java/org/opensearch/gateway/GatewayAllocator.java Lines 419 to 439 in aedd103
If you look at it at a high level, the data collection part is still abstracted from PSA/RSA. With my idea we are changing the abstraction layer of fetching from a single shard to a batch of shards. This way we save ourselves from making any change in the current interfaces and abstract classes and avoid any breaking changes.
@amkhar There could be custom plugin implementations that customers might be using in their private repos. And you can't introduce a breaking change in 2.x where they would need to implement a new method in the interface.
Today, it runs in parallel for all unassigned shards by virtue of the per-shard calls. What is the latency overhead of fetching this detail on the data node per shard?
One suggestion: you can run a loop over the existing allocateUnassigned method from the current interface as the default implementation for the new batch method in
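A hedged sketch of that suggestion - a batch method whose default implementation just loops over the existing per-shard method, so custom implementations in private plugins keep compiling; the method names and parameter types below are placeholders, not the final interface:

```java
import java.util.List;

interface ExistingShardsAllocatorSketch {
    // The existing per-shard contract (simplified signature).
    void allocateUnassigned(String shardRouting);

    // New batch entry point; the default just delegates shard by shard, so custom
    // implementations that only know the old method are not broken in 2.x.
    default void allocateUnassignedBatch(List<String> shardRoutings) {
        for (String shardRouting : shardRoutings) {
            allocateUnassigned(shardRouting);
        }
    }
}
```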
Benchmarking result of POC code[#7269 ]
@Bukhtawar thanks for your response. As this is a bigger change, we're raising small sub-PRs to make the reviews easy. This is the order of PRs which I think we'll follow for merging into main.
@shwetathareja |
@amkhar What is the backport plan? Will each of these PRs be safe to backport to 2.x immediately upon merging to main? Also, can you tag these issues/PRs with the intended version (like v2.10.0), or let me know if you don't have permissions to add labels. |
@andrross
I've added all the subtasks of this bigger project here: I don't have permissions to add labels; yes, the intended version is 2.10.0. Please attach that label or add permissions for me to attach labels.
@vikasvb90 @dhwanilpatel - requesting feedback on the PRs.
Describe the bug
GatewayAllocator performs async fetches for unassigned shards to check whether any node has the shard on its local disk. These async fetches are done per shard. When nodes are dropped from the cluster, it results in unassigned shards, and GatewayAllocator performs async fetches when the nodes try to join back. In a cluster which has a large number of shards, in the tens of thousands, this results in too many async fetches, causing the cluster manager JVM to spike.
In one of the clusters, which had > 50K shards, async fetches were taking 13 GB of memory.
Expected behavior
These async fetches should be batched per node instead of being executed per shard per node.
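As a bare-bones illustration of what "batched per node" means for the request fan-out (simplified types, hypothetical helper):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FetchPlanSketch {
    // Before: one request per (shard, node) pair. After: group shard ids by node
    // so each node receives a single request covering all shards it may hold.
    static Map<String, List<String>> groupByNode(Map<String, List<String>> nodesPerShard) {
        Map<String, List<String>> shardsPerNode = new HashMap<>();
        nodesPerShard.forEach((shardId, nodeIds) ->
            nodeIds.forEach(nodeId ->
                shardsPerNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(shardId)));
        return shardsPerNode;
    }
}
```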
OpenSearch Version: 1.1
Additional context
Also, these async fetches were discussed earlier in the context of taking up a lot of CPU while sending these requests during the reroute operation: elastic/elasticsearch#57498. It should also be evaluated whether these requests should be sent from a different threadpool instead of blocking the
masterService#updateTask
which is single-threaded and needed for processing cluster state updates quickly.