Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover peers from translog, ignoring soft deletes #38904

Merged

Conversation

DaveCTurner
Copy link
Contributor

Today if soft deletes are enabled then we read the operations needed for peer
recovery from Lucene. However we do not currently make any attempt to retain
history in Lucene specifically for peer recoveries so we may discard it and
fall back to a more expensive file-based recovery. Yet we still retain
sufficient history in the translog to perform an operations-based peer
recovery.

In the long run we would like to fix this by retaining more history in Lucene,
possibly using shard history retention leases (#37165). For now, however, this
commit reverts to performing peer recoveries using the history retained in the
translog regardless of whether soft deletes are enabled or not.

Today if soft deletes are enabled then we read the operations needed for peer
recovery from Lucene. However we do not currently make any attempt to retain
history in Lucene specifically for peer recoveries so we may discard it and
fall back to a more expensive file-based recovery. Yet we still retain
sufficient history in the translog to perform an operations-based peer
recovery.

In the long run we would like to fix this by retaining more history in Lucene,
possibly using shard history retention leases (elastic#37165). For now, however, this
commit reverts to performing peer recoveries using the history retained in the
translog regardless of whether soft deletes are enabled or not.
@DaveCTurner DaveCTurner added >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v8.0.0 v7.2.0 labels Feb 14, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but LGTM. You need to remove or mute testRecoveryWithOutOfOrderDeleteWithSoftDeletes.

return softDeletesPolicy.acquireRetentionLock();
final Closeable softDeletesRetentionLock = softDeletesPolicy.acquireRetentionLock();
final Closeable translogRetentionLock = translog.acquireRetentionLock();
return () -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use () -> IOUtils.close(softDeletesRetentionLock, translogRetentionLock) so we always close both of them.

@@ -2575,7 +2566,12 @@ public final long getMinRetainedSeqNo() {
@Override
public Closeable acquireRetentionLock() {
if (softDeleteEnabled) {
return softDeletesPolicy.acquireRetentionLock();
final Closeable softDeletesRetentionLock = softDeletesPolicy.acquireRetentionLock();
final Closeable translogRetentionLock = translog.acquireRetentionLock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should release softDeletesRetentionLock if we hit an exception while acquiring translog lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, and similarly on release. I pushed 7b6c2a0.

@@ -786,4 +788,55 @@ public void sendRequest(Transport.Connection connection, long requestId, String
assertHitCount(client().prepareSearch(indexName).get(), numDocs);
}
}

@TestLogging("org.elasticsearch.indices.recovery:TRACE")
public void testHistoryRetention() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test a left-over?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this test motivates this change, in the sense that it fails on master but passes on this branch.

} catch (Exception e) {
try {
softDeletesRetentionLock.close();
} catch (IOException ioexception) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make softDeletesRetentionLock Releasable so we won't have to catch here.

@DaveCTurner DaveCTurner merged commit 3f1125f into elastic:master Feb 15, 2019
@DaveCTurner DaveCTurner deleted the 2019-02-14-translog-based-peer-recovery branch February 15, 2019 08:43
DaveCTurner added a commit that referenced this pull request Feb 15, 2019
Today if soft deletes are enabled then we read the operations needed for peer
recovery from Lucene. However we do not currently make any attempt to retain
history in Lucene specifically for peer recoveries so we may discard it and
fall back to a more expensive file-based recovery. Yet we still retain
sufficient history in the translog to perform an operations-based peer
recovery.

In the long run we would like to fix this by retaining more history in Lucene,
possibly using shard history retention leases (#37165). For now, however, this
commit reverts to performing peer recoveries using the history retained in the
translog regardless of whether soft deletes are enabled or not.
DaveCTurner added a commit that referenced this pull request Feb 15, 2019
Today if soft deletes are enabled then we read the operations needed for peer
recovery from Lucene. However we do not currently make any attempt to retain
history in Lucene specifically for peer recoveries so we may discard it and
fall back to a more expensive file-based recovery. Yet we still retain
sufficient history in the translog to perform an operations-based peer
recovery.

In the long run we would like to fix this by retaining more history in Lucene,
possibly using shard history retention leases (#37165). For now, however, this
commit reverts to performing peer recoveries using the history retained in the
translog regardless of whether soft deletes are enabled or not.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 15, 2019
* elastic/master:
  Avoid double term construction in DfsPhase (elastic#38716)
  Fix typo in DateRange docs (yyy → yyyy) (elastic#38883)
  Introduced class reuses follow parameter code between ShardFollowTasks (elastic#38910)
  Ensure random timestamps are within search boundary (elastic#38753)
  [CI] Muting  method testFollowIndex in IndexFollowingIT
  Update Lucene snapshot repo for 7.0.0-beta1 (elastic#38946)
  SQL: Doc on syntax (identifiers in particular) (elastic#38662)
  Upgrade to Gradle 5.2.1 (elastic#38880)
  Tie break search shard iterator comparisons on cluster alias (elastic#38853)
  Also mmap cfs files for hybridfs (elastic#38940)
  Build: Fix issue with test status logging (elastic#38799)
  Adapt FullClusterRestartIT on master (elastic#38856)
  Fix testAutoFollowing test to use createLeaderIndex() helper method.
  Migrate muted auto follow rolling upgrade test and unmute this test (elastic#38900)
  ShardBulkAction ignore primary response on primary (elastic#38901)
  Recover peers from translog, ignoring soft deletes (elastic#38904)
  Fix NPE on Stale Index in IndicesService (elastic#38891)
  Smarter CCR concurrent file chunk fetching (elastic#38841)
  Fix intermittent failure in ApiKeyIntegTests (elastic#38627)
  re-enable SmokeTestWatcherWithSecurityIT (elastic#38814)
dnhatn added a commit that referenced this pull request May 21, 2019
Currently, we ignore soft-deletes in peer recovery, thus
estimateNumberOfHistoryOperations should always use translog.

Relates #38904
dnhatn added a commit that referenced this pull request May 22, 2019
Currently, we ignore soft-deletes in peer recovery, thus
estimateNumberOfHistoryOperations should always use translog.

Relates #38904
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
Currently, we ignore soft-deletes in peer recovery, thus
estimateNumberOfHistoryOperations should always use translog.

Relates elastic#38904
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 25, 2019
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts elastic#38904 and elastic#42211
Relates elastic#41536
DaveCTurner added a commit that referenced this pull request Aug 1, 2019
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts #38904 and #42211
Relates #41536
DaveCTurner added a commit that referenced this pull request Aug 1, 2019
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts #38904 and #42211
Relates #41536
DaveCTurner added a commit that referenced this pull request Aug 2, 2019
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.

Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants