
Fix trimUnsafeCommits for indices created before 6.2 #57187

Merged: 3 commits, May 27, 2020

Conversation

dnhatn (Member) commented May 27, 2020

If an upgraded node is restarted multiple times without flushing a new index commit, then we will wrongly exclude all commits from the starting commits. This bug is reproducible with these minimal steps: (1) create an empty index on 6.1.4 with translog retention disabled, (2) upgrade the cluster to 7.7.0, (3) restart the upgraded cluster. The problem is that the new translog policy can trim the translog without a new index commit, while the existing commit still refers to the previous translog generation.

Closes #57091
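To make the failure mode concrete, here is a minimal sketch in plain Java. This is not Elasticsearch code; the method name `startingCommits` and the generation numbers are illustrative. It shows how a filtering step that only keeps commits whose referenced translog generation is still on disk ends up excluding every commit when the translog was trimmed without a new commit being written:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TrimUnsafeCommitsSketch {

    // Keep only index commits whose referenced translog generation is still on disk.
    // If the translog policy trimmed generations without a new commit being flushed,
    // the surviving commit can reference a generation below minTranslogGenOnDisk.
    static List<Long> startingCommits(List<Long> commitTranslogGens, long minTranslogGenOnDisk) {
        return commitTranslogGens.stream()
                .filter(gen -> gen >= minTranslogGenOnDisk)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The only commit (carried over from the pre-6.2 index) references generation 1,
        // but repeated restarts trimmed the translog up to generation 3 without flushing
        // a new commit, so every commit is wrongly excluded.
        List<Long> kept = startingCommits(List.of(1L), 3L);
        System.out.println(kept.isEmpty()); // prints "true"
    }
}
```

An empty starting-commit set is what makes the shard unrecoverable here: there is no commit left to open the engine from, even though the existing commit is perfectly usable.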

@dnhatn dnhatn added >bug blocker :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.8.0 v7.7.1 v7.9.0 labels May 27, 2020
@dnhatn dnhatn requested a review from ywelsch May 27, 2020 05:51
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/Recovery)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 27, 2020
try {
maybeCheckIndex(); // check index here and won't do it again if ops-based recovery occurs
recoveryState.setStage(RecoveryState.Stage.TRANSLOG);
if (safeCommit.isPresent() == false) {
assert globalCheckpoint == UNASSIGNED_SEQ_NO || indexSettings.getIndexVersionCreated().before(Version.V_6_2_0) :
dnhatn (Member Author):

The new test found this issue where the index has a synced flush, but the global checkpoint is still unassigned.

ywelsch (Contributor):

I'm not sure I understand what issue is being addressed here. How is moving this condition further down (after maybeCheckIndex) helping?
Is the issue that we have not properly moved to the translog stage?

ywelsch (Contributor):

Also, which part of the new tests show this issue, and is it something that can also be triggered with a single restart?

dnhatn (Member Author):

Sorry, I should have explained better.

Is the issue that we have not properly moved to the translog stage?

That's correct. Previously, we did not move the recovery stage from INDEX to TRANSLOG if we did not have a safe commit, which can be the case if the index was created before 6.2 or the global checkpoint is still unassigned. In that case we expect a file-based recovery to happen, and we move the recovery stage to TRANSLOG in the clean-files step. However, if the shard has a synced flush, we won't execute the clean-files step and will trip the assertion.

Also, which part of the new tests show this issue, and is it something that can also be triggered with a single restart?

Yes, I will add it to the full cluster restart suite.
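The stage-ordering problem described above can be sketched as a toy model (not actual Elasticsearch code; the method names `oldOrdering` and `fixedOrdering` are illustrative). The old code only advanced the stage in the clean-files step of a file-based recovery, which a synced-flush recovery skips; the fix advances the stage unconditionally, before the safe-commit check:

```java
public class RecoveryStageSketch {
    enum Stage { INDEX, TRANSLOG }

    // Old ordering: without a safe commit, the stage was only advanced in the
    // clean-files step of a file-based recovery, which a synced flush skips.
    static Stage oldOrdering(boolean hasSafeCommit, boolean syncedFlush) {
        Stage stage = Stage.INDEX;
        if (hasSafeCommit) {
            stage = Stage.TRANSLOG;       // ops-based path advances the stage
        } else if (syncedFlush == false) {
            stage = Stage.TRANSLOG;       // clean-files step of file-based recovery
        }
        return stage;                     // synced flush + no safe commit: stuck at INDEX
    }

    // Fixed ordering (this PR): advance the stage before the safe-commit check,
    // so the TRANSLOG-stage assertion holds on every recovery path.
    static Stage fixedOrdering() {
        return Stage.TRANSLOG;
    }

    public static void main(String[] args) {
        System.out.println(oldOrdering(false, true)); // prints "INDEX"
        System.out.println(fixedOrdering());          // prints "TRANSLOG"
    }
}
```

In the `oldOrdering(false, true)` case the shard stays at INDEX, which is exactly the state that tripped the assertion.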

ywelsch (Contributor) left a review:

I've left one comment, o.w. looking good.


ywelsch (Contributor) commented May 27, 2020

Test failure here also looks relevant

@dnhatn dnhatn requested a review from ywelsch May 27, 2020 12:20
dnhatn (Member Author) commented May 27, 2020

@ywelsch It's ready again.

ywelsch (Contributor) left a review:

LGTM. Thanks Nhat!

dnhatn (Member Author) commented May 27, 2020

Thanks Yannick.

@dnhatn dnhatn merged commit ba5a085 into elastic:7.7 May 27, 2020
@dnhatn dnhatn deleted the 7.7-translog-policy branch May 27, 2020 15:05
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request May 27, 2020
Closes elastic#57091
dnhatn added a commit that referenced this pull request May 27, 2020
Closes #57091
dnhatn added a commit that referenced this pull request May 27, 2020
If the previous peer recovery failed after copying segment files, then
the safe commit invariant won't hold in the next recovery.

Relates #57187
dnhatn added a commit that referenced this pull request May 27, 2020
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request May 27, 2020
dnhatn added a commit that referenced this pull request May 27, 2020
dnhatn added a commit that referenced this pull request Jun 15, 2020
If the recovery source is on an old node (before 7.2), then the recovery
target won't have the safe commit after phase1 because the recovery
source does not send the global checkpoint in the clean_files step. And
if the recovery fails and retries, then the recovery stage won't
transition properly. If a sync_id is used in peer recovery, then the
clean_files step won't be executed to move the stage to TRANSLOG.

This issue was addressed in #57187, but not forward-ported to 8.0.

Closes #57708
seut added a commit to crate/crate that referenced this pull request Jan 6, 2022
seut added a commit to crate/crate that referenced this pull request Jan 6, 2022
mergify bot pushed a commit to crate/crate that referenced this pull request Jan 10, 2022
mergify bot pushed a commit to crate/crate that referenced this pull request Jan 10, 2022
Backport of elastic/elasticsearch#57187

Fixes #11756.

(cherry picked from commit c05620a)

# Conflicts:
#	docs/appendices/release-notes/unreleased.rst
#	server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
seut added a commit to crate/crate that referenced this pull request Jan 10, 2022