-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Remote Translog] Phase 1 - Fail requests on isolated shard if no successful leader checker call in last x minutes to enable auto restore #6737
Comments
This is a good proposal.
Is checking for cluster block is sufficient? What if the block is removed by the time the response is returned?
How will you achieve this? Your proposal assumes that leader checker is also a strict leasing system, in addition to fault detection. By design, it seems so. But I am not sure if implementation was done with this guarantee in mind. Is there any use-case where we use leader checker as a leasing system? |
This check is different and is using the latest leader check time to determine whether to acknowledge the request or fail it. This is not the same cluster block. Cluster block trigger is reactive in nature. GC pauses can lead to cluster block not even getting applied. Hence we need to rely on latest successful leader check time.
We will put setting validator to ensure that the auto restore time is greater than the custom leader checker time threshold. Please not this leader checker time threshold is the new threshold that is applicable for remote translog backed indexes only.
Fault detection constitutes of Leader check and Follower checker. Both in tandem currently handles addition and removal of node and consequent shard reallocation. Currently, the global level cluster blocks on master are triggered on account of Leader checker failing or Follower checker informing the follower nodes about it going down. With this change, the idea is that we rely on leader checker call's timestamp to determine if we should now start failing the requests and subsequently fail the local shard as it has been considerable time since it last talked to the cluster manager. Code references of leader checker getting used to change global level cluster blocks - OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/LeaderChecker.java Lines 293 to 324 in 5f81930
OnLeaderFailure - OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java Lines 304 to 312 in 5f81930
OpenSearch/server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java Lines 687 to 689 in 5f81930
I have raised the PR for the first phase change - #6680. Pls do review the same. |
I have shared specific feedback on the PR. A few issues that I see:
|
Thanks for these points and reply -
Responded #6680 (comment)
This makes sense. I think we can make this as a function of following 3 settings - leader checker interval, leader checker timeout and leader checker retry count. That way, to a user, this should be implicit always and user does not need to configure on their end. I had kept this configurable to have a knob around auto restore time as well. However, it would be simpler if we could have both computed automatically using existing leader checker values.
Makes sense, will make the change.
Responded #6680 (comment)
This is a good point and there could be chances that this can happen. However, this change would not cause any correctness issues. In cases like this, we will fail the request (i.e. not acknowledge), however we might have the data present. We are making sure that an acknowledged request does not go missing. Also, as a follow up we can enhance follower checker to trigger leader checker to make sure that leader checker gets revived if the background thread dies silently. In addition to the trailing points, in the follow up phase the plan is to fail the shard. This will lead to no more acknowledgement of requests for the failing shard and also if the node can reach the cluster manager, it will lead to either another replica becoming primary or the timer for auto restore to kick in.
Given we are defensively failing the requests for more cases similar to above, I think it should be safe even when considering network isolation. I have also considered asymmetric network partitions (cc @shwetathareja), and they are getting handled with this. |
Thanks @ashking94 for the proposal. It looks interesting and thinking more to see if it can give correctness guarantees for durability. Couple of points to discuss
So in nutshell, i see bunch of challenges with this approach to provide correctness guarantees for durability. |
Leasing checks are pretty weak(leader checks just checks on terms, version can be stale) While for GC and N/W failure most of the leader check can serve as a leasing system but it is not guaranteed. Bugs in the system would tend to break guarantees. Would suggest, we invest in hardening leases rather than focusing on what is guaranteed by the existing system(It makes no leasing guarantees today). |
Thanks everyone for the comments. After thinking more about this, for the problems mentioned above, we would need to build a strict leasing system which works by acquiring lease (or locks). Currently cluster manager is responsible for communication cluster state changes to everyone in the cluster. From there on primary term validation helps to identify stale primaries and fail them. This might need investment in making the cluster manager also act as a distributed lock/lease system and then every primary shard will need to acquire/renew/release locks from the elected active cluster manager. We will circle back on this as without this auto restore might be tough. |
[Remote Translog] Fail requests on isolated shard
Problem statement
Document type replication
For document replication backed index, if a shard is configured with one or more replicas, an isolated primary can realise that it is no more the current primary when it fans out the incoming request to all the replicas and the replica does primary term validation. If no replicas are configured, the isolated primary can continue to accept the incoming writes depending upon the
cluster.no_cluster_manager_block
setting. Ifcluster.no_cluster_manager_block
iswrite
, then indexing requests would be returned with HTTP status code 5xx. Ifcluster.no_cluster_manager_block
ismetadata_write
, then metadata writes would be rejected but indexing would continue to happen forever. Since there is no auto recovery of the failed primary (without replicas), this would require manual intervention for joining the isolated primary back to the cluster.Remote backed index
For an index backed with remote store, if a shard is configured with one or more replicas, an isolated primary can still realise that it is no more a primary while doing the primary term validation. This will ensure that the indexing requests are not acknowledged when cluster manager promotes an in-sync replica to the primary. However, when a shard is configured with no replicas, if
cluster.no_cluster_manager_block
ismetadata_write
, then the isolated primary can continue to accept indexing requests forever. In remote store vision, we also plan to have auto restore of red index from remote store as remote-backed indexes have request level durability. However, a shard with no replica hinders the auto restore of red index as the isolated primary can continue to accept the writes and acknowledge the same while the new primary shard also restores from remote store and starts accepting writes. The isolated primary continues to accept the writes when it is acting as the coordinator. If the request falls on a node which is not isolated, then it would forward the request to the right primary shard.If the master block setting
cluster.no_cluster_manager_block
is set towrite
then it would start failing the writes after LeaderChecker on the concerned node has exhausted all retries and has applied global level cluster block. This ensures that there are no new requests that would be accepted. However, if there were requests accepted before the global level cluster block was applied and then there are long GC pauses, it is possible that auto restore kicks in before the request is acknowledged. This can lead to data loss.Note - Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check even at the end of the call so that if auto restore has happened, it does not acknowledge any upload that started long back.
Approach
cluster.no_cluster_manager_block
towrite
for remote translog backed indexes.The above approach has most benefit when there are no more active replicas or the replica configured is 0.
Execution
Planning to implement this in 2 phases -
Usecases
Considerations
References
The text was updated successfully, but these errors were encountered: