Migrate peer recovery from translog to retention lease #49448

dnhatn · 2019-11-21T16:04:36Z

Since 7.4, we have switched from the translog to Lucene as the source of history for peer recoveries. However, this reduces the likelihood of operation-based recoveries when performing a full cluster restart from pre-7.4, because existing copies do not have peer recovery retention leases (PRRLs).

To remedy this issue, we fall back to using the translog in peer recoveries if the recovering replica does not have a PRRL and the replication group hasn't fully migrated to PRRLs.

Relates #45136
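
For readers skimming the thread, the fallback rule in the description can be sketched roughly as below. This is an illustration only, not the code in this PR: the class `HistorySourceSketch`, the helper `chooseHistorySource`, and its three boolean parameters are hypothetical names, and the soft-deletes check is an assumption based on the "check for soft-deletes" commits.

```java
// Hypothetical sketch of the fallback rule; names are illustrative, not the PR's API.
final class HistorySourceSketch {

    /** Where the recovery source reads operation history from. */
    enum HistorySource { TRANSLOG, INDEX }

    /**
     * Picks the history source for an operation-based peer recovery.
     * Assumption: without soft-deletes, Lucene keeps no history, so the translog is the only option.
     */
    static HistorySource chooseHistorySource(boolean softDeletesEnabled,
                                             boolean replicaHasRetentionLease,
                                             boolean groupFullyMigrated) {
        if (!softDeletesEnabled) {
            return HistorySource.TRANSLOG;
        }
        // Fall back to the translog only when the replica has no lease
        // AND the replication group has not fully migrated to PRRLs.
        if (!replicaHasRetentionLease && !groupFullyMigrated) {
            return HistorySource.TRANSLOG;
        }
        return HistorySource.INDEX;
    }

    public static void main(String[] args) {
        // Copy restarted from pre-7.4: no lease yet, group not migrated.
        System.out.println(chooseHistorySource(true, false, false)); // prints TRANSLOG
        // Replica already holds a lease.
        System.out.println(chooseHistorySource(true, true, false));  // prints INDEX
    }
}
```

The point of the two-part condition is that a missing lease alone is not enough to use the translog: once the whole group has migrated, translog retention may already be turned off.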

elasticmachine · 2019-11-21T16:04:38Z

Pinging @elastic/es-distributed (:Distributed/Recovery)

dnhatn · 2019-11-21T22:35:00Z

Hmm, a new test is failing. I am looking at it.

qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java

dnhatn · 2019-11-27T03:13:34Z

I have an implementation that falls back to the translog if an index was created before 7.4 and the recovering replica does not have a PRRL. I think we should disable translog retention after every copy has established its PRRL. However, this would require coordination. Another option is to make this decision locally. We also need to persist this decision so that we won't re-enable translog retention in a full cluster restart. WDYT?

ywelsch · 2019-11-27T09:24:38Z

ReplicationTracker already has this field hasAllPeerRecoveryRetentionLeases. Maybe we can use that to make this decision locally?
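
A minimal sketch of what "deciding locally" could look like, assuming a sticky flag that is restored from persisted state on restart. Only the field name `hasAllPeerRecoveryRetentionLeases` comes from this thread; the class `LeaseMigrationSketch` and its methods are hypothetical and much simpler than the real `ReplicationTracker`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model; not the real ReplicationTracker.
final class LeaseMigrationSketch {
    private final Set<String> trackedCopies = new HashSet<>();
    private final Set<String> copiesWithLease = new HashSet<>();

    // Sticky: once true it never flips back. Passing the persisted value in
    // on construction models surviving a full cluster restart.
    private boolean hasAllPeerRecoveryRetentionLeases;

    LeaseMigrationSketch(boolean persistedFlag) {
        this.hasAllPeerRecoveryRetentionLeases = persistedFlag;
    }

    void addCopy(String allocationId) {
        trackedCopies.add(allocationId);
    }

    void addLease(String allocationId) {
        copiesWithLease.add(allocationId);
        if (copiesWithLease.containsAll(trackedCopies)) {
            hasAllPeerRecoveryRetentionLeases = true;
        }
    }

    /** Local decision: keep retaining translog until every copy has a lease. */
    boolean shouldRetainTranslog() {
        return !hasAllPeerRecoveryRetentionLeases;
    }

    public static void main(String[] args) {
        LeaseMigrationSketch tracker = new LeaseMigrationSketch(false);
        tracker.addCopy("primary");
        tracker.addCopy("replica");
        tracker.addLease("primary");
        System.out.println(tracker.shouldRetainTranslog()); // prints true
        tracker.addLease("replica");
        System.out.println(tracker.shouldRetainTranslog()); // prints false
    }
}
```

Because the flag only ever moves from false to true, each copy can act on it without extra coordination, and restoring it from persisted state avoids re-enabling translog retention after a restart.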

dnhatn · 2019-12-02T04:07:01Z

Please hold off on the review, as the test failure is related to this change. I will ping you once I have resolved it.

dnhatn · 2019-12-02T16:24:53Z

run elasticsearch-ci/packaging-sample-matrix

ywelsch

Great work, Nhat! Overall looking very good already. I've left some minor comments.

server/src/main/java/org/elasticsearch/index/IndexSettings.java

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

ywelsch

LGTM

dnhatn · 2019-12-13T18:56:03Z

@ywelsch Thanks for reviewing.

We turn off the translog retention policy asynchronously using the generic threadpool; hence, we need to assert busily here. Relates #49448

We need to make sure that the global checkpoints and peer recovery retention leases were advanced to the max_seq_no and synced; otherwise, we risk expiring some peer recovery retention leases because of the file-based recovery threshold. Relates #49448

Allow ops-based recovery without existing retention lease

b4410f7

dnhatn added >bug :Distributed Indexing/Recovery v8.0.0 v7.6.0 v7.4.3 v7.5.1 labels Nov 21, 2019

dnhatn requested review from ywelsch and DaveCTurner November 21, 2019 16:04

Merge branch 'master' into migrate-to-prrl

e3d51d6

ywelsch reviewed Nov 22, 2019

qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java

Merge branch 'master' into migrate-to-prrl

a4d5d35

dnhatn mentioned this pull request Nov 24, 2019

Use retention lease in peer recovery of closed indices #48430

Merged

dnhatn added 3 commits November 26, 2019 10:25

undo

7deb9aa

introduce history source

5af5e22

add TODO

026e829

dnhatn added 3 commits November 28, 2019 10:52

Merge branch 'master' into migrate-to-prrl

52615d8

Merge branch 'master' into migrate-to-prrl

848ed8d

keep track use retention leases in IndexShard

1299d59

dnhatn changed the title ~~Allow ops-based recovery without existing retention lease~~ Migrate peer recovery from translog to retention lease Dec 1, 2019

check for soft-deletes

e2be375

one more soft-deletes condition

ed77c88

dnhatn requested a review from ywelsch December 2, 2019 16:25

ywelsch suggested changes Dec 3, 2019

dnhatn added 2 commits December 12, 2019 21:11

Merge branch 'master' into migrate-to-prrl

8bd19a5

Only disable translog when all copy has PRRL

d747590

dnhatn requested a review from ywelsch December 13, 2019 06:07

include relocation target

5581923

ywelsch approved these changes Dec 13, 2019

dnhatn merged commit b9fbc8d into elastic:master Dec 13, 2019

dnhatn deleted the migrate-to-prrl branch December 13, 2019 18:56

dnhatn added backport pending v7.5.2 and removed v7.4.3 v7.5.1 labels Dec 13, 2019

dnhatn mentioned this pull request Dec 14, 2019

Add 7.5.1 release notes. #50196

Merged

dnhatn added a commit that referenced this pull request Dec 15, 2019

Fix testTurnOffTranslogRetentionAfterAllShardStarted

0de7464

We turn off the translog retention policy asynchronously using the generic threadpool; hence, we need to assert busily here. Relates #49448
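
To illustrate why a busy assertion is needed when the state changes on another thread, here is a self-contained sketch. It does not use the Elasticsearch test framework: the `assertBusy` helper below is a hypothetical stand-in written for this example, and the single-thread executor only plays the role of the generic threadpool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a retrying assertion; not the test framework's implementation.
final class AssertBusySketch {

    /** Re-runs the assertion with capped exponential backoff until it passes or the timeout elapses. */
    static void assertBusy(Runnable assertion, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    /** Simulates turning translog retention off asynchronously, then asserting busily. */
    static boolean demo() {
        AtomicBoolean translogRetentionEnabled = new AtomicBoolean(true);
        ExecutorService generic = Executors.newSingleThreadExecutor();
        try {
            generic.execute(() -> {
                try {
                    Thread.sleep(50); // the change lands some time after the shard has started
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                translogRetentionEnabled.set(false);
            });
            // A plain assertion here would race with the background task.
            assertBusy(() -> {
                if (translogRetentionEnabled.get()) {
                    throw new AssertionError("translog retention is still enabled");
                }
            }, 5_000);
            return !translogRetentionEnabled.get();
        } finally {
            generic.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```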

dnhatn mentioned this pull request Dec 16, 2019

Migrate peer recovery from translog to retention lease #50211

Merged

dnhatn removed the backport pending label Dec 16, 2019

jasontedor added v7.5.1 and removed v7.5.2 labels Dec 16, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
