
Reset replica engine to global checkpoint on promotion #33473

Merged
merged 14 commits into elastic:master from rollback-replica-on-promotion
Sep 12, 2018

Conversation

dnhatn
Member

@dnhatn dnhatn commented Sep 6, 2018

When a replica starts following a newly promoted primary, it may have
some operations which do not exist on the new primary. We therefore need to
discard those operations to align the replica with the new primary. This can
be done by first resetting the engine from the safe commit, then replaying
the local translog up to the global checkpoint.

Relates #32867
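To make the mechanism concrete, here is a minimal sketch of the idea. The types below are simplified, hypothetical stand-ins rather than the real IndexShard/Engine APIs: on promotion the replica closes its current engine, trims the unsafe commits so it reopens from the safe commit, and then replays only the local translog operations at or below the global checkpoint, discarding everything above it.

```java
// Illustrative only: simplified, hypothetical stand-in types, not the real
// org.elasticsearch.index.shard.IndexShard / Engine classes.
final class ReplicaResetSketch {

    interface Engine {
        void close();
    }

    interface TranslogOp {
        long seqNo();
    }

    interface Store {
        void trimUnsafeCommits(long globalCheckpoint); // drop commits above the safe commit

        Engine openEngineFromSafeCommit();
    }

    interface Translog {
        Iterable<TranslogOp> readOps(); // the replica's local translog operations
    }

    interface OpApplier {
        void apply(Engine engine, TranslogOp op); // replay one index / delete / no-op
    }

    /**
     * Reset a replica's engine to the global checkpoint:
     * 1) close the current engine,
     * 2) trim unsafe commits so the new engine opens from the safe commit,
     * 3) replay the local translog up to and including the global checkpoint.
     * Operations with seqNo greater than the global checkpoint are skipped, i.e. discarded.
     */
    static Engine resetEngineToGlobalCheckpoint(Engine current, Store store, Translog translog,
                                                long globalCheckpoint, OpApplier applier) {
        current.close();
        store.trimUnsafeCommits(globalCheckpoint);
        Engine fresh = store.openEngineFromSafeCommit();
        for (TranslogOp op : translog.readOps()) {
            if (op.seqNo() <= globalCheckpoint) {
                applier.apply(fresh, op);
            }
        }
        return fresh;
    }
}
```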

@dnhatn dnhatn added >enhancement :Distributed Indexing/Recovery v7.0.0 v6.5.0 labels Sep 6, 2018
@dnhatn dnhatn requested review from s1monw, bleskes and ywelsch September 6, 2018 16:44
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Sep 6, 2018

This is the first part of #32867 (comment).

@dnhatn dnhatn added the review label Sep 6, 2018
Contributor

@s1monw s1monw left a comment


left some comments. This looks much simpler than the overall one.


public boolean isRecovery() {
return this == PEER_RECOVERY || this == LOCAL_TRANSLOG_RECOVERY;
}

boolean isLocal() {
Contributor


maybe we call it isRemote() and then we don't need to invert the if statements in the Engine?

@@ -1163,11 +1157,16 @@ public Operation(Term uid, long seqNo, long primaryTerm, long version, VersionType
             PRIMARY,
             REPLICA,
             PEER_RECOVERY,
-            LOCAL_TRANSLOG_RECOVERY;
+            LOCAL_TRANSLOG_RECOVERY,
+            LOCAL_RESETTING;
Contributor


should we call this LOCAL_RESET so it's consistent with RECOVERY and not a continuous form?

}

private int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
Runnable onPerOperationRecovered) throws IOException {
Contributor


just call it onOperationRecovered?

}
shard.close("test", false);
} finally {
IOUtils.close(shard.store());
Contributor


maybe put this into a try-with-resources block?
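For reference, a hedged sketch of the try-with-resources shape suggested here; TestShard and ShardStore are hypothetical stand-ins for the test's shard and store, not the actual test helpers.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-ins for the shard and store used in the test above.
interface ShardStore extends Closeable {}

interface TestShard {
    ShardStore store();

    void close(String reason, boolean flush) throws IOException;
}

final class TryWithResourcesSketch {
    static void closeShardAndStore(TestShard shard) throws IOException {
        // The store is closed even if shard.close(...) throws, replacing the
        // explicit finally { IOUtils.close(shard.store()); } block.
        try (ShardStore ignored = shard.store()) {
            shard.close("test", false);
        }
    }
}
```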

translogRecoveryStats::incrementRecoveredOperations);
}

private int runTranslogRecoveryAfterResetting(Engine engine, Translog.Snapshot snapshot) throws IOException {
Contributor


can we add a javadoc comment explaining what this does vs. the ordinary recovery? I also wonder if we should maybe only have this version, int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin, Runnable onPerOperationRecovered) throws IOException, and pass in closures where we actually call it, since we really only call it from a single place. It would make this class a little less complex.
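A sketch of what this suggestion could look like, using simplified stand-in types rather than the real IndexShard/Engine code: keep only the generalized runTranslogRecovery and let each call site pass its origin and per-operation callback as a closure.

```java
import java.io.IOException;

// Simplified stand-ins; the real method lives on IndexShard and works on Engine / Translog.Snapshot.
final class TranslogRecoverySketch {

    enum Origin { LOCAL_TRANSLOG_RECOVERY, LOCAL_RESET }

    interface Op {}

    interface Snapshot {
        Op next() throws IOException; // null once the snapshot is exhausted
    }

    interface Engine {
        void apply(Op op, Origin origin) throws IOException;
    }

    /** The single generalized entry point: callers choose the origin and the per-operation callback. */
    static int runTranslogRecovery(Engine engine, Snapshot snapshot, Origin origin,
                                   Runnable onOperationRecovered) throws IOException {
        int recovered = 0;
        for (Op op = snapshot.next(); op != null; op = snapshot.next()) {
            engine.apply(op, origin);
            recovered++;
            onOperationRecovered.run(); // e.g. bump recovery stats
        }
        return recovered;
    }

    // Call sites pass closures instead of having dedicated wrapper methods:
    static int recoverFromLocalTranslog(Engine engine, Snapshot snapshot, Runnable bumpStats) throws IOException {
        return runTranslogRecovery(engine, snapshot, Origin.LOCAL_TRANSLOG_RECOVERY, bumpStats);
    }

    static int replayAfterReset(Engine engine, Snapshot snapshot) throws IOException {
        return runTranslogRecovery(engine, snapshot, Origin.LOCAL_RESET, () -> {});
    }
}
```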

Contributor

@bleskes bleskes left a comment


Thanks @dnhatn. I left a bunch of comments. Looking good.

@@ -833,7 +834,7 @@ public IndexResult index(Index index) throws IOException {
                 indexResult = new IndexResult(
                     plan.versionForIndexing, getPrimaryTerm(), plan.seqNoForIndexing, plan.currentNotFoundOrDeleted);
             }
-            if (index.origin() != Operation.Origin.LOCAL_TRANSLOG_RECOVERY) {
+            if (index.origin().isRemote()) {
Contributor


maybe rename this to isTranslog? then it will tie directly to what's happening in this code.
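A sketch of where the naming discussion points (hypothetical names, not necessarily the final Elasticsearch code): a helper on the origin enum that states directly whether an operation comes from translog replay, so the write path can check it without enumerating origins or inverting conditions.

```java
// Hypothetical sketch, not the exact Elasticsearch enum: the helper names directly
// whether an operation originates from replaying a translog (either ordinary local
// recovery or the reset performed on promotion).
enum OriginSketch {
    PRIMARY,
    REPLICA,
    PEER_RECOVERY,
    LOCAL_TRANSLOG_RECOVERY,
    LOCAL_RESET;

    boolean isFromTranslog() {
        return this == LOCAL_TRANSLOG_RECOVERY || this == LOCAL_RESET;
    }

    boolean isRecovery() {
        return this == PEER_RECOVERY || this == LOCAL_TRANSLOG_RECOVERY;
    }
}
```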

@@ -109,6 +109,7 @@ public synchronized void markSeqNoAsCompleted(final long seqNo) {
      * @param checkpoint the local checkpoint to reset this tracker to
      */
     public synchronized void resetCheckpoint(final long checkpoint) {
+        // TODO: remove this method as we no longer need it.
Contributor


what are we waiting on?

Member Author


We have tests which verify that we restore the local checkpoint after resetting it to the global checkpoint. I decided to leave this method in this PR to minimize the changes; I will remove it in the next PR.
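For context, a much simplified checkpoint-tracker sketch showing the reset-then-restore behaviour those tests exercise; it is not the real LocalCheckpointTracker and assumes small, non-negative sequence numbers that fit a BitSet.

```java
import java.util.BitSet;

// A much simplified checkpoint tracker, only to illustrate reset-then-restore;
// the real class is org.elasticsearch.index.seqno.LocalCheckpointTracker.
final class CheckpointTrackerSketch {

    private final BitSet completed = new BitSet();
    private long checkpoint = -1; // highest seqNo such that every seqNo <= it has completed

    synchronized void markSeqNoAsCompleted(long seqNo) {
        completed.set((int) seqNo);
        // advance the checkpoint over the contiguous prefix of completed operations
        while (completed.get((int) (checkpoint + 1))) {
            checkpoint++;
        }
    }

    /** Reset the checkpoint (e.g. to the global checkpoint) before replaying the local translog. */
    synchronized void resetCheckpoint(long newCheckpoint) {
        assert newCheckpoint <= checkpoint;
        completed.clear((int) newCheckpoint + 1, completed.length()); // forget ops above the new checkpoint
        checkpoint = newCheckpoint;
    }

    synchronized long getCheckpoint() {
        return checkpoint;
    }
}
```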

int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot) throws IOException {
recoveryState.getTranslog().totalOperations(snapshot.totalOperations());
recoveryState.getTranslog().totalOperationsOnStart(snapshot.totalOperations());
int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
Contributor


can we add some javadocs explaining what onOperationRecovered means?

final String translogUUID = store.readLastCommittedSegmentsInfo().getUserData().get(Translog.TRANSLOG_UUID_KEY);
final long globalCheckpoint = Translog.readGlobalCheckpoint(translogConfig.getTranslogPath(), translogUUID);
final long minRetainedTranslogGen = Translog.readMinTranslogGeneration(translogConfig.getTranslogPath(), translogUUID);
store.trimUnsafeCommits(globalCheckpoint, minRetainedTranslogGen, config.getIndexSettings().getIndexVersionCreated());
Contributor


it feels weird to do these things here - this method now only creates an engine and doesn't change the IndexShard fields - imo it shouldn't touch the store (because it doesn't know anything about any current engine running against it)

final Engine engine = engineFactory.newReadWriteEngine(config);
onNewEngine(engine);
engine.onSettingsChanged();
active.set(true);
Contributor


same comment - it's weird that this changes the IndexShard active state without actually exposing the engine.

// - replica1 has {doc1}
// - replica2 has {doc1, doc2}
// - replica3 can have either {doc2} only if operation-based recovery or {doc1, doc2} if file-based recovery
shards.assertAllEqual(initDocs + 1);
Contributor


w00t

final List<Translog.Operation> expectedOps = new ArrayList<>(initOperations);
expectedOps.add(op2);
assertThat(snapshot, containsOperationsInAnyOrder(expectedOps));
List<Translog.Operation> operations = TestTranslog.drainAll(snapshot);
Contributor


we lost the check that initOperations are also part of the snapshot?

@@ -1879,13 +1873,16 @@ public void testRecoverFromStoreRemoveStaleOperations() throws Exception {
SourceToParse.source(indexName, "_doc", "doc-1", new BytesArray("{}"), XContentType.JSON));
flushShard(shard);
assertThat(getShardDocUIDs(shard), containsInAnyOrder("doc-0", "doc-1"));
// Simulate resync (without rollback): Noop #1, index #2
acquireReplicaOperationPermitBlockingly(shard, shard.pendingPrimaryTerm + 1);
// Here we try to simulate the primary fail-over without rollback which is no longer the case.
Contributor


I don't follow this comment. Can you clarify please?

.indexServiceSafe(replicaShardRouting.index()).getShard(replicaShardRouting.id());
final Set<String> docsOnReplica;
try {
docsOnReplica = IndexShardTestCase.getShardDocUIDs(replicaShard);
Contributor


will it be a lot of work to check that the source, primary terms and seq# are also identical?

@dnhatn
Member Author

dnhatn commented Sep 11, 2018

@s1monw and @bleskes This is ready for another round. Can you please have a look?

@dnhatn dnhatn requested review from s1monw and bleskes September 11, 2018 00:07
verifyNotClosed();
IOUtils.close(currentEngineReference.getAndSet(null));
trimUnsafeCommits();
newEngine = createNewEngine(newEngineConfig());
Contributor


just wondering. Would it make sense to do the trimUnsafeCommits as part of the new engine creation?

Member Author


It was before, but we prefer not to modify the Store implicitly (#33473 (comment)).
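A small sketch of the ordering this settles on, again with stand-in types rather than the real IndexShard code: the caller mutates the store explicitly before building the new engine, and engine creation itself stays free of side effects.

```java
import java.io.IOException;

// A sketch of the ordering settled on here, with stand-in types rather than the real
// IndexShard / Store / EngineFactory.
final class EngineResetOrderingSketch {

    interface Engine {
        void close() throws IOException;
    }

    interface Store {
        void trimUnsafeCommits() throws IOException; // drop commits above the safe commit
    }

    interface EngineFactory {
        Engine newReadWriteEngine() throws IOException; // builds an engine, never touches the store
    }

    static Engine resetEngine(Engine current, Store store, EngineFactory factory) throws IOException {
        current.close();                     // 1. close and drop the old engine
        store.trimUnsafeCommits();           // 2. explicit, visible store mutation at the call site
        return factory.newReadWriteEngine(); // 3. side-effect-free engine creation
    }
}
```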

Contributor

@s1monw s1monw left a comment


LGTM

@dnhatn
Member Author

dnhatn commented Sep 12, 2018

@bleskes, @s1monw and @ywelsch Thanks for reviewing :).

@dnhatn dnhatn merged commit 743327e into elastic:master Sep 12, 2018
@dnhatn dnhatn deleted the rollback-replica-on-promotion branch September 12, 2018 02:09
dnhatn added a commit that referenced this pull request Sep 12, 2018
If a shard is empty, it won't roll back its engine on promotion.
This commit adjusts the expectation in the rollback test.

Relates #33473
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request Sep 12, 2018
* master: (43 commits)
  [HLRC][ML] Add ML put datafeed API to HLRC (elastic#33603)
  Update AWS SDK to 1.11.406  in repository-s3 (elastic#30723)
  Expose CCR stats to monitoring (elastic#33617)
  [Docs] Update match-query.asciidoc (elastic#33610)
  TEST: Adjust rollback condition when shard is empty
  [CCR] Improve shard follow task's retryable error handling (elastic#33371)
  Forbid negative `weight` in Function Score Query (elastic#33390)
  Clarify context suggestions filtering and boosting (elastic#33601)
  Disable CCR REST endpoints if CCR disabled (elastic#33619)
  Lower version on full cluster restart settings test
  Upgrade remote cluster settings (elastic#33537)
  NETWORKING: http.publish_host Should Contain CNAME (elastic#32806)
  Add test coverage for global checkpoint listeners
  Reset replica engine to global checkpoint on promotion (elastic#33473)
  HLRC: ML Delete Forecast API (elastic#33526)
  Remove debug logging in full cluster restart tests (elastic#33612)
  Expose CCR to the transport client (elastic#33608)
  Mute testIndexDeletionWhenNodeRejoins
  SQL: Make Literal a NamedExpression (elastic#33583)
  [DOCS] Adds missing built-in user information (elastic#33585)
  ...
dnhatn added a commit that referenced this pull request Sep 12, 2018
When a replica starts following a newly promoted primary, it may have
some operations which do not exist on the new primary. We therefore need to
discard those operations to align the replica with the new primary. This can
be done by first resetting the engine from the safe commit, then replaying
the local translog up to the global checkpoint.

Relates #32867
dnhatn added a commit that referenced this pull request Sep 12, 2018
If a shard is empty, it won't roll back its engine on promotion.
This commit adjusts the expectation in the rollback test.

Relates #33473
dnhatn added a commit that referenced this pull request Sep 13, 2018
dnhatn added a commit that referenced this pull request Sep 13, 2018
dnhatn added a commit that referenced this pull request Sep 20, 2018
If a shard was serving as a replica when another shard was promoted to
primary, then its Lucene index was reset to the global checkpoint.
However, if the new primary fails before the primary/replica resync
completes and we are now being promoted, we have to restore the reverted
operations by replaying the translog to avoid losing acknowledged writes.

Relates #33473
Relates #32867
dnhatn added a commit that referenced this pull request Sep 20, 2018
If a shard was serving as a replica when another shard was promoted to
primary, then its Lucene index was reset to the global checkpoint.
However, if the new primary fails before the primary/replica resync
completes and we are now being promoted, we have to restore the reverted
operations by replaying the translog to avoid losing acknowledged writes.

Relates #33473
Relates #32867
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Oct 1, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made into IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since elastic#33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
elastic#27650, we rollback the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader. This is possible because these two readers come
from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes elastic#27650
Relates elastic#33473
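A hedged illustration of the fix described in this commit message: key the cache on the reader's cache key rather than its version. IndexReader.getReaderCacheHelper() and CacheHelper.getKey() are real Lucene APIs; the cache class, the request string, and the byte[] values are simplified stand-ins for IndicesRequestCache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

import org.apache.lucene.index.IndexReader;

// Simplified stand-in for IndicesRequestCache, showing only the key change.
final class ReaderKeyedCacheSketch {

    // The reader cache key is unique per reader instance and is never reused, unlike the
    // reader version, which can repeat across different IndexWriters after a rollback.
    record Key(IndexReader.CacheKey readerKey, String request) {}

    private final Map<Key, byte[]> cache = new ConcurrentHashMap<>();

    byte[] getOrCompute(IndexReader reader, String request, Supplier<byte[]> compute) {
        IndexReader.CacheHelper helper = reader.getReaderCacheHelper(); // real Lucene API
        if (helper == null) {
            return compute.get(); // this reader does not support caching
        }
        return cache.computeIfAbsent(new Key(helper.getKey(), request), k -> compute.get());
    }
}
```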
dnhatn added a commit that referenced this pull request Oct 4, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made into IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we rollback the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
dnhatn added a commit that referenced this pull request Oct 4, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made into IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we rollback the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
kcm pushed a commit that referenced this pull request Oct 30, 2018
If a shard was serving as a replica when another shard was promoted to
primary, then its Lucene index was reset to the global checkpoint.
However, if the new primary fails before the primary/replica resync
completes and we are now being promoted, we have to restore the reverted
operations by replaying the translog to avoid losing acknowledged writes.

Relates #33473
Relates #32867
kcm pushed a commit that referenced this pull request Oct 30, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made into IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we rollback the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
dnhatn added a commit that referenced this pull request Jul 2, 2019
With Lucene rollback (#33473), we should never have more than one
primary term for each sequence number. Therefore we don't have to sort
by the primary term when reading soft-deletes.
dnhatn added a commit that referenced this pull request Jul 2, 2019
With Lucene rollback (#33473), we should never have more than one
primary term for each sequence number. Therefore we don't have to sort
by the primary term when reading soft-deletes.
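To make the simplification concrete, a hedged sketch (not the actual Elasticsearch change): once there is at most one primary term per sequence number, operations can be read back sorted on the sequence-number field alone, without a tie-breaking sort on the primary term. The "_seq_no" field name and the match-all query are assumptions used only to keep the sketch self-contained.

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sketch: read operations back in sequence-number order only. Without the rollback guarantee,
// a secondary sort on the primary-term field would be needed to order duplicate seqNos.
final class SeqNoSortedSnapshotSketch {

    static TopDocs operationsBySeqNo(DirectoryReader reader, int size) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Assumes a numeric doc-values field named "_seq_no" on every operation, as Elasticsearch indexes.
        Sort bySeqNoOnly = new Sort(new SortField("_seq_no", SortField.Type.LONG));
        // A real snapshot would query the relevant range of operations (including soft-deleted ones);
        // MatchAllDocsQuery keeps the sketch self-contained.
        return searcher.search(new MatchAllDocsQuery(), size, bySeqNoOnly);
    }
}
```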