[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633
Pinging @elastic/es-distributed
I muted this test method with #38634 on master, 7.x and 7.0.
There were two documents (seq=2 and seq=103) missing on the follower in one of the failures of `testFailOverOnFollower`. I spent several hours on that failure but could not figure out the reason. I adjusted the logging and unmuted this test so we can collect more information. Relates #38633
@bleskes @dnhatn @DaveCTurner @martijnvg @ywelsch I think I can explain what is occurring in this test, and it is a blocker. The issue is a result of adding recovery from remote. With recovery from remote, we copy over the index files and there is no translog replay phase. If the leader flushes before the follower initiates recovery from remote, then after recovery from remote the follower will be fully caught up and all primary shards of the follower will have empty translogs. When a replica shard of the follower attempts to recover from the primary shard of the follower, we want to replay the translog from the local checkpoint of the commit, to bake in a history of operations for it. Since the primary shards of the follower have empty translogs, this replay cannot happen and recovery will fail. That explains the following:
Immediately, I see two options:
I am going to mute this test for now and open a new issue marked as a blocker.
I have marked this test as awaiting a fix in 6.7, 7.0, 7.x, and master.
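The failure mode described above (a recovery source with an empty translog cannot supply the operations a replica needs replayed) can be illustrated with a toy model. This is a minimal sketch and not Elasticsearch code: the class `TranslogReplaySketch`, the method `missingOps`, and the idea of modelling a translog as a set of sequence numbers are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a "translog" here is just the set of sequence numbers it holds.
// None of these names exist in Elasticsearch; they are hypothetical.
final class TranslogReplaySketch {

    /**
     * Returns the sequence numbers in [startSeqNo, endSeqNo] that the recovery
     * source cannot supply from its translog. A non-empty result stands in for
     * "replay cannot happen and recovery will fail".
     */
    static Set<Long> missingOps(Set<Long> sourceTranslogSeqNos, long startSeqNo, long endSeqNo) {
        Set<Long> missing = new TreeSet<>();
        for (long seqNo = startSeqNo; seqNo <= endSeqNo; seqNo++) {
            if (!sourceTranslogSeqNos.contains(seqNo)) {
                missing.add(seqNo);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A primary that indexed operations itself still holds them in its translog.
        Set<Long> normalPrimary = new TreeSet<>(List.of(0L, 1L, 2L, 3L, 4L));
        // A follower primary bootstrapped by copying index files has nothing to replay.
        Set<Long> afterRecoveryFromRemote = new TreeSet<>();

        System.out.println("normal primary missing: " + missingOps(normalPrimary, 3, 4));
        System.out.println("follower primary missing: " + missingOps(afterRecoveryFromRemote, 3, 4));
    }
}
```

In this model the first call reports nothing missing, while the second reports every requested sequence number as missing, which is the situation the replica recovery runs into.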
@jasontedor Thanks for digging into this. @martijnvg and I found this issue in #38949. I am working on the fix for this; it's pretty contained. We can use the local checkpoint of the recovery target to validate the operations being sent instead of tracking them locally on the recovery source.
Is this caused (or exacerbated) by #38904? Prior to #38904, if soft deletes were enabled, we did read historical operations from Lucene during peer recovery, and I think the follower primary would have retained that history in Lucene for the duration of the recovery. I ask because the first report of this failure comes from before #38904 was merged.
Yes, that exacerbates it.
This was once mentioned in #33337, but since it's a different method and it's not muted yet, I'll open a separate issue to track it.
Latest failures, e.g.: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/780/console
Doesn't reproduce locally, but this is the reproduction line:
Errors in the log: