Uses auto generated timestamp with soft-deletes #33656
Comments
Pinging @elastic/es-distributed
I don't think we can do that. We don't know if the original request had such a timestamp or not. If we assign it to the wrong value we might miss an update. I think we should rather not try to optimize the peer-recovery case and make sure we propagate the max auto-generated timestamp.
I wonder if we can be smarter here. Today we can only optimize for append-only if the ID was autogenerated. Yet, for the following engine that's a different game. At the moment we fetch docs from the leader, we could (if we had the information what was the highest seqID that actually did a delete or update) use sequence numbers to optimize indexing on the follower.
Yes, I am good with this since we don't have to worry about the correctness.
Wow, superb idea. Thanks so much @s1monw.
We will implement an optimization using sequence numbers in the following engine with these sub-tasks:
This change adds a "contains" method to LocalCheckpointTracker. One of the use cases is to check whether a given operation has been processed in an engine by looking up its seq_no in the LocalCheckpointTracker. Relates #33656
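A minimal standalone sketch of what such a "contains" check can look like, assuming a bitset of processed seq_nos above the checkpoint. The class and field names below are made up for illustration; this is not the actual LocalCheckpointTracker implementation, which is considerably more elaborate.

```java
import java.util.BitSet;

// Sketch only: a checkpoint tracker that can answer "has seq_no X been processed?".
final class SimpleCheckpointTracker {
    // highest seq_no such that every seq_no at or below it has been processed
    private long checkpoint = -1;
    // processed seq_nos above the checkpoint, indexed as (seqNo - checkpoint - 1)
    private final BitSet processedAboveCheckpoint = new BitSet();

    synchronized void markSeqNoAsProcessed(long seqNo) {
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        processedAboveCheckpoint.set(Math.toIntExact(seqNo - checkpoint - 1));
        // advance the checkpoint over the contiguous run of processed seq_nos, if any
        int advance = 0;
        while (processedAboveCheckpoint.get(advance)) {
            advance++;
        }
        if (advance > 0) {
            checkpoint += advance;
            BitSet remaining = processedAboveCheckpoint.get(advance, processedAboveCheckpoint.length());
            processedAboveCheckpoint.clear();
            processedAboveCheckpoint.or(remaining);
        }
    }

    // the "contains" check: has the operation with this seq_no been processed?
    synchronized boolean contains(long seqNo) {
        return seqNo <= checkpoint
            || processedAboveCheckpoint.get(Math.toIntExact(seqNo - checkpoint - 1));
    }
}
```

The engine-side use is then a simple `tracker.contains(op.seqNo())` check before deciding whether an operation needs to be processed again.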
Today we don't store the auto-generated timestamp of append-only operations in Lucene and assign -1 to every index operation constructed from LuceneChangesSnapshot. This looks innocent, but it generates duplicate documents on a replica if a retry append-only arrives first via peer recovery and then the original append-only arrives via replication. Since the retry append-only (delivered via recovery) does not have a timestamp, the replica will happily optimize the original request when it should not. This change transmits the max auto-generated timestamp from the primary to replicas before the translog phase in peer recovery. This timestamp prevents replicas from optimizing append-only requests whose retry counterparts have already been processed. Relates #33656 Relates #33222
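The replica-side guard this commit describes can be pictured with the following sketch. The class and method names are hypothetical, not the actual InternalEngine code: the primary hands over its max auto-generated timestamp before the translog phase, and the replica refuses the append-only fast path for any operation whose timestamp is not strictly newer than that value.

```java
// Sketch only: replica-side bookkeeping for the append-only optimization.
final class AppendOnlySupport {
    // highest timestamp of any operation that may already have been indexed here
    private long maxUnsafeAutoIdTimestamp = -1;

    // called with the primary's max auto-generated timestamp before the translog phase of recovery
    synchronized void handoffFromPrimary(long maxTimestampOnPrimary) {
        maxUnsafeAutoIdTimestamp = Math.max(maxUnsafeAutoIdTimestamp, maxTimestampOnPrimary);
    }

    // true => the op may already be in the index, so the addDocument fast path must not be used
    synchronized boolean mayHaveBeenIndexedBefore(long autoGeneratedIdTimestamp, boolean isRetry) {
        if (isRetry) {
            // a retry can always race with its original request
            maxUnsafeAutoIdTimestamp = Math.max(maxUnsafeAutoIdTimestamp, autoGeneratedIdTimestamp);
            return true;
        }
        // an original request is unsafe if a retry with the same or newer timestamp was already seen
        return autoGeneratedIdTimestamp <= maxUnsafeAutoIdTimestamp;
    }
}
```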
This PR is the first step in using seq_no to optimize indexing operations. The idea is to track the max seq_no of update or delete ops on a primary, transfer this information to replicas, and have replicas use it to optimize the indexing plan for index operations that carry an assigned seq_no. The max_seq_no_of_updates on the primary is initialized once, when a primary finishes its local recovery, peer recovery during relocation, or promotion. After that, the max_seq_no_of_updates is only advanced internally inside an engine when processing update or delete operations. Relates #33656
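As a rough sketch of the indexing-plan decision described here (a hypothetical helper, not the engine's actual planner): an operation whose seq_no lies above the replicated max_seq_no_of_updates cannot be superseded by any update or delete, so it can be appended without a per-ID lookup.

```java
// Sketch only: a seq_no-based indexing plan. The real logic lives in
// InternalEngine/FollowingEngine and handles many more cases.
enum IndexingStrategy { OPTIMIZED_APPEND, LOOKUP_THEN_INDEX }

final class SeqNoBasedPlanner {
    // max seq_no of any update or delete, replicated from the primary (or CCR leader)
    private long maxSeqNoOfUpdatesOrDeletes = -1;

    synchronized void advanceMaxSeqNoOfUpdatesOrDeletes(long seqNoFromPrimary) {
        maxSeqNoOfUpdatesOrDeletes = Math.max(maxSeqNoOfUpdatesOrDeletes, seqNoFromPrimary);
    }

    synchronized IndexingStrategy plan(long opSeqNo, boolean hasBeenProcessedBefore) {
        if (hasBeenProcessedBefore) {
            // e.g. a retried delivery; fall back to the safe path
            return IndexingStrategy.LOOKUP_THEN_INDEX;
        }
        // above the MSU no update/delete with a higher seq_no existed on the primary
        // when this operation was executed, so a plain append is safe
        return opSeqNo > maxSeqNoOfUpdatesOrDeletes
            ? IndexingStrategy.OPTIMIZED_APPEND
            : IndexingStrategy.LOOKUP_THEN_INDEX;
    }
}
```

The `hasBeenProcessedBefore` flag is where a "contains"-style check on the LocalCheckpointTracker (the first sub-task above) plugs in.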
We started tracking max_seq_no_of_updates on the primary in elastic#33842. This commit replicates that value from a primary to its replicas in replication requests and in the translog phase of peer recovery. With this change, we guarantee that the value of max_seq_no_of_updates on a replica, when any index/delete operation is performed, is at least the max_seq_no_of_updates on the primary when that operation was executed. Relates elastic#33656
This commit replicates the max_seq_no_of_updates on the leader index to the primaries of the follower index via ShardFollowNodeTask. The max_seq_no_of_updates is then transmitted to the replicas of the follower via replication requests (that is, BulkShardOperationsRequest). Relates elastic#33656
This change introduces the indexing optimization using sequence numbers in the FollowingEngine. This optimization uses the max_seq_no_of_updates, which is tracked on the primary of the leader and replicated to replicas and followers. Relates #33656
All subtasks have been integrated.
A CCR test failure shows that the approach in #34474 is flawed. Restoring the LocalCheckpointTracker from an index commit can cause both FollowingEngine and InternalEngine to incorrectly ignore some deletes. Here is a small scenario illustrating the problem:
1. Delete a doc with seq=1 => the engine adds a delete tombstone to Lucene.
2. Flush a commit consisting of only the delete tombstone.
3. Index a doc with seq=0 => the engine adds that doc to Lucene, but soft-deleted.
4. Restart the engine from the commit (step 2); the engine fills its LocalCheckpointTracker from the delete tombstone in the commit.
5. Replay the local translog in reverse order: index#0 then delete#1.
6. When processing index#0, the engine adds it to Lucene as a live doc and advances the local checkpoint to 1 (seq#1 was restored from the commit in step 4).
7. When processing delete#1, the engine skips it because seq_no=1 is less than or equal to the local checkpoint.
We should have zero documents after recovering from the translog, but here we have one. Since all operations after the local checkpoint of the safe commit are retained, we can find them if the lookup also considers soft-deleted documents. This PR closes the disparity between the version map and the local checkpoint tracker by taking soft-deleted documents into account while resolving the strategy for engine operations. Relates #34474 Relates #33656
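A toy, in-memory sketch of the resolution change described in this commit (a HashMap standing in for the combination of version map and soft-delete-aware Lucene lookup, using a Java record for brevity; nothing here is the actual engine code): staleness is decided by comparing the operation's seq_no against whatever the per-ID lookup returns, including tombstones, rather than against the local checkpoint alone.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: resolve whether an incoming operation is stale by consulting a per-ID
// lookup that also sees soft-deleted documents (tombstones).
final class SoftDeleteAwareResolver {
    record DocEntry(long seqNo, boolean softDeleted) {}

    // stand-in for the version map plus a Lucene lookup that includes soft-deleted docs
    private final Map<String, DocEntry> perIdLookup = new HashMap<>();

    // true => the operation is older than what we already hold for this ID and must not
    // become the live copy; false => apply it normally
    boolean isStale(String id, long opSeqNo) {
        DocEntry existing = perIdLookup.get(id);   // includes tombstones
        return existing != null && opSeqNo <= existing.seqNo();
    }

    void applyIndex(String id, long seqNo) {
        if (isStale(id, seqNo) == false) {
            perIdLookup.put(id, new DocEntry(seqNo, false));
        }
    }

    void applyDelete(String id, long seqNo) {
        if (isStale(id, seqNo) == false) {
            perIdLookup.put(id, new DocEntry(seqNo, true));   // keep the tombstone visible to lookups
        }
    }
}
```

In the scenario above, the restored commit already contains the tombstone for seq=1, so a lookup that sees soft-deleted documents classifies index#0 as stale and the document is not resurrected.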
1. Peer-recovery
Today we don't store the auto-generated timestamp of indexing operations in Lucene and always assign -1 to all index operations from LuceneChangesSnapshot. This looks innocent, but it generates duplicate documents on a replica when a retry append-only operation arrives via peer recovery before the original arrives via replication.
To fix this, we need to assign to each index operation reconstructed from LuceneChangesSnapshot a value that is at least the original timestamp of the corresponding index request. Here we can use the latest auto-generated timestamp of the Engine.
2. Optimize indexing on a FollowingEngine in CCR
We currently disable this optimization for index requests whose origin is recovery (retry is always true). To enable this optimization in CCR:
We need to make sure that a FollowingEngine processes an append-only operation only once. This can be done using the LocalCheckpointTracker.
We need to store the retry flag in the Lucene index and extend Translog#Index to include this flag. This should be fast with a single-valued doc values field (see the sketch below).
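A sketch of that second point, assuming a made-up field name; `NumericDocValuesField` is the standard Lucene field type for single-valued numeric doc values, which is why the read path stays cheap. This illustrates the proposal and is not code from the repository.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;

// Sketch only: persist the retry flag of an append-only operation as a single-valued
// numeric doc values field so it can be read back when ops are replayed from Lucene.
// The field name "__retry" is made up for this example.
final class RetryFlagField {
    static final String NAME = "__retry";

    static void addTo(Document doc, boolean isRetry) {
        doc.add(new NumericDocValuesField(NAME, isRetry ? 1L : 0L));
    }
}
```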
@s1monw WDYT? /cc @bleskes
This is a subtask of #30086.