Fix race between replica reset and primary promotion #32442
Conversation
Pinging @elastic/es-distributed
LGTM. Thanks for picking this up. I left some comments that I think will improve it but I'm happy with the current solution too.
@@ -473,6 +473,8 @@ public void updateShardState(final ShardRouting newRouting,
    TimeUnit.MINUTES,
    () -> {
        shardStateUpdated.await();
        assert primaryTerm == newPrimaryTerm :
            "shard term changed on primary. expected [" + newPrimaryTerm + "] but was [" + primaryTerm + "]";
💯. Can you please add the shard routing so we're sure to know where it came from?
if (operationPrimaryTerm > primaryTerm) {
    synchronized (primaryTermMutex) {
        if (operationPrimaryTerm > primaryTerm) {
            verifyNotClosed();
I'm wondering - why did you have to add this?
I did not have to (i.e. no failing test). I just saw that we were not rechecking this condition after possibly waiting for a while on primaryTermMutex. The next check two lines below will also fail this with an IndexShardNotStartedException. I found it nicer though to throw the IndexShardClosedException if possible.
On second thought, this is less of an issue after I converted blockOperations to asyncBlockOperations in acquireReplicaOperationPermit. I'm going to revert this.
termUpdated.await();
// a primary promotion, or another primary term transition, might have been triggered concurrently to this
// recheck under the operation permit if we can skip doing this work
if (operationPrimaryTerm == primaryTerm) {
We can assert here that the operationPrimaryTerm is always <= the primary term.
        }
    }
} else {
    globalCheckpointUpdated = false;
+1 to removing this.
@@ -182,10 +179,14 @@ private void delayOperations() {
private void releaseDelayedOperations() {
    final List<DelayedOperation> queuedActions;
    synchronized (this) {
I think we can pull this up to the method. I don't see a reason to drain the queue and then release the queue lock, and it will simplify the reasoning a bit.
I'm not sure what you mean. What would you change?
I updated my comment. I mean make this entire method synchronized.
I find the reasoning simpler here if we don't extend the mutex to a section of the code which it does not need to cover. Are you ok keeping it as is?
yes, I'm ok. It's subjective.
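For context, the pattern that was kept here is a common one: hold the lock only long enough to drain the queue, then run the queued callbacks outside the lock. A minimal sketch, using hypothetical names rather than the actual IndexShardOperationPermits code:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (not the Elasticsearch implementation) of the trade-off discussed above:
// the lock covers only the drain of the queued operations, and the callbacks run outside
// the lock so that a slow callback cannot block threads queueing new delayed operations.
class DelayedOperationQueue {
    private final List<Runnable> queuedOperations = new ArrayList<>();

    void delay(Runnable operation) {
        synchronized (this) {
            queuedOperations.add(operation);
        }
    }

    void releaseDelayedOperations() {
        final List<Runnable> drained;
        synchronized (this) {
            // drain under the lock ...
            drained = new ArrayList<>(queuedOperations);
            queuedOperations.clear();
        }
        // ... but run the callbacks outside of it
        for (Runnable operation : drained) {
            operation.run();
        }
    }
}
```

Making the whole release method synchronized, as suggested above, would also be correct; the trade-off is purely about how much code sits under the lock and how easy that is to reason about.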
@@ -2216,10 +2218,11 @@ public void acquireReplicaOperationPermit(final long operationPrimaryTerm, final
                                          final Object debugInfo) {
    verifyNotClosed();
    verifyReplicationTarget();
    final boolean globalCheckpointUpdated;
    if (operationPrimaryTerm > primaryTerm) {
        synchronized (primaryTermMutex) {
I wonder if we can remove this and only lock mutex on this level (it's always good to avoid multiple locks if possible).
I'm fine relaxing it and using mutex, as we now use asyncBlockOperations. This means that in practice we will have at most the number of indexing threads blocked on this (while possibly a concurrent cluster state update comes in, trying to acquire the mutex as well). The first indexing thread will increase pendingPrimaryTerm, and all the other ones that are blocked on the mutex will just acquire the mutex and do a quick noop. All subsequent writes will not acquire the mutex anymore as they will bypass the pre-flight check.
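A minimal sketch of the double-checked pre-flight pattern described in this comment, with hypothetical names and a volatile term field standing in for the real IndexShard state:

```java
// Hypothetical sketch of the double-checked pre-flight term check described above: the
// volatile read lets most operations skip the mutex entirely; only the first thread that
// observes a newer term does the bump, and threads that raced with it re-check under the
// mutex and become a no-op.
class TermTracker {
    private final Object mutex = new Object();
    private volatile long pendingPrimaryTerm;

    void maybeBumpTerm(long operationPrimaryTerm, Runnable bumpAction) {
        if (operationPrimaryTerm > pendingPrimaryTerm) {         // cheap volatile pre-flight check
            synchronized (mutex) {
                if (operationPrimaryTerm > pendingPrimaryTerm) { // re-check under the mutex
                    pendingPrimaryTerm = operationPrimaryTerm;
                    bumpAction.run();                            // only the first thread gets here
                }
            }
        }
    }
}
```

Only the first thread that sees a newer operation term performs the bump; threads that raced with it fall through the inner check as a no-op, and later operations skip the mutex entirely because the pre-flight check no longer matches.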
"shard term already update. op term [" + operationPrimaryTerm + "], shardTerm [" + primaryTerm + "]"; | ||
|
||
synchronized (mutex) { | ||
final CountDownLatch termUpdated = new CountDownLatch(1); |
I wonder if we should have a method called "setPrimaryTerm" which gets a primary term + a runnable to run under the async block. That method would be called both from here and from updateShardState, making sure that the semantics of exposing the primary term (after submitting the async block, and asserting we're under a mutex via assertions) are the same.
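For illustration, a rough sketch of the kind of shared helper being suggested (class and field names are made up, a plain ExecutorService stands in for asyncBlockOperations, and this is not the code that ended up in the PR):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;

// Hypothetical sketch of a shared "bump the term" helper that both the replica permit path
// and updateShardState could call: the block is submitted first, the new term is published
// while still under the caller's mutex, and the latch guarantees that the blocked section
// only runs once the new term is visible.
class PrimaryTermBumper {
    private final ExecutorService genericExecutor;
    private volatile long pendingPrimaryTerm;

    PrimaryTermBumper(ExecutorService genericExecutor, long initialTerm) {
        this.genericExecutor = genericExecutor;
        this.pendingPrimaryTerm = initialTerm;
    }

    // the caller is expected to hold its own mutex, mirroring the assertion-based contract discussed above
    void bumpPrimaryTerm(long newPrimaryTerm, Runnable onBlocked) {
        assert newPrimaryTerm > pendingPrimaryTerm;
        final CountDownLatch termUpdated = new CountDownLatch(1);
        genericExecutor.execute(() -> {  // stands in for asyncBlockOperations
            try {
                termUpdated.await();     // wait until the new term has been published
                onBlocked.run();         // run the transition "under the operation block"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pendingPrimaryTerm = newPrimaryTerm;
        termUpdated.countDown();
    }
}
```

Submitting the block first and only counting down the latch after publishing the new term is what gives both call sites the same "term is visible before the blocked code runs" semantics.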
I've given this a try in 70262d7
The CI failure on this highlighted another issue. When a replica is promoted to primary while there is still an ongoing replica operation on the shard, the operation can incorrectly use the primary term of the new shard. In the typical case, this is fortunately caught by the translog due to an extra term check there, but in the presence of a generation rollover at the wrong moment, even that might not hold. The issue is that the term is not incremented under the operation block. To solve this, I've pushed a102ef9 which distinguishes between the primary term that the shard is supposed to have because of the cluster state, and the primary term that is used by operations and by the underlying engine. The former is updated right away, whereas the latter is only updated under the operation block, ensuring that each operation with a permit always sees the correct term. Let me know what you think.
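To make the two-term model concrete, here is a hedged, self-contained sketch (illustrative names only, not the actual IndexShard fields or methods):

```java
// Sketch of the two-term model described above: pendingPrimaryTerm reflects what the
// cluster state says the term should be and is advanced right away, while
// operationPrimaryTerm is what in-flight operations and the engine see, and it is only
// advanced inside the operation block.
class ShardTerms {
    private volatile long pendingPrimaryTerm;   // what the cluster state mandates
    private volatile long operationPrimaryTerm; // what operations under a permit use

    // called on the cluster-state / replica-permit path
    synchronized void onNewTerm(long newTerm, Runnable submitOperationBlock) {
        if (newTerm > pendingPrimaryTerm) {
            pendingPrimaryTerm = newTerm;       // visible immediately
            submitOperationBlock.run();         // schedules the code below
        }
    }

    // intended to run only once all permits have been drained (i.e. under the operation block)
    void underOperationBlock(long newTerm) {
        operationPrimaryTerm = newTerm;         // now safe: no operations are in flight
    }

    long termForIndexing() {
        return operationPrimaryTerm;            // engine/translog writes use this term
    }
}
```

This mirrors the description above: the cluster-state-facing term moves forward immediately, while the term used by operations and the engine only moves forward once no operations are in flight, so an operation holding a permit always sees the term it started under.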
@elasticmachine retest this please.
…ode to be activated on the replication tracker)
@bleskes this is ready for another review. The main change (beside addressing your comments) I've made after the first review iteration was to better handle the primary activation of the replication tracker. Before the last iteration, the activation was done on the cluster state update thread, not when the shard actually became primary (i.e. under the operation block). This led to some test failures which are, fingers crossed, all fixed by the latest iteration.
I left some questions to better understand the change.
@@ -192,7 +193,8 @@

protected volatile ShardRouting shardRouting;
protected volatile IndexShardState state;
protected volatile long primaryTerm;
protected volatile long pendingPrimaryTerm;
Can we have a comment pointing people to the javadocs of getPendingPrimaryTerm for an explanation of what it means?
fixed in 4b82ca7
bumpPrimaryTerm(opPrimaryTerm, () -> {
    // a primary promotion, or another primary term transition, might have been triggered concurrently to this
    // recheck under the operation permit if we can skip doing this work
    if (opPrimaryTerm == pendingPrimaryTerm) {
how is that possible? shouldn't the pending primary term update under mutex prevent this?
I'm not sure how it can "prevent" this. Assume you have a replica term bump followed by a promotion to primary. The replica term bump will call asyncBlockOperations to run the above code. Assume that acquireReplicaOperationPermit leaves the mutex before the code in the operation block gets to execute. Then a primary promotion comes in from the cluster state, updating the pendingPrimaryTerm again. Note that there's a test for this (testReplicaTermIncrementWithConcurrentPrimaryPromotion).
+1 to the test.
So say the increase of the term comes in first. That acquires the mutex, bumps the pendingPrimaryTerm, and submits an async block under the mutex.
Then the replica gets promoted. That means the term is higher. When the cluster state comes in, it bumps the pendingPrimaryTerm again and submits an async block. We know that the first async block's code will run first, followed by the second, and I think that's OK.
If the reverse happens - i.e., the updateShardState comes in first - it will bump the pendingPrimaryTerm to a higher number than the replication operation, which will prevent the replication operation from submitting its async block, so we're good here too.
What am I missing?
"We know that the first async block code will run first?"
No, we don't know that. See the implementation of asyncBlockOperations. Both just submit a task to the generic threadpool. Both potentially race to the call of doBlockOperations.
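The racing-submissions point can be reproduced with a tiny standalone example (plain java.util.concurrent, nothing Elasticsearch-specific):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Small self-contained illustration of the point above: two tasks submitted to a
// multi-threaded pool (like the generic thread pool) are not guaranteed to start or
// finish in submission order, so the two queued "blocks" can race with each other.
public class SubmissionOrderDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> System.out.println("first submitted block"));
        pool.execute(() -> System.out.println("second submitted block"));
        // output order is nondeterministic: either line may be printed first
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Since both blocks are handed to a multi-threaded pool, neither start order nor completion order is guaranteed, which is why the code running under each block rechecks the term instead of relying on submission order.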
store.associateIndexWithNewTranslog(translogUUID);
assert indexShard.shardRouting.primary() : "only primary shards can recover from store";
indexShard.openEngineAndRecoverFromTranslog();
indexShard.getEngine().fillSeqNoGaps(indexShard.getPrimaryTerm());
indexShard.getEngine().fillSeqNoGaps(indexShard.getPendingPrimaryTerm());
This is another wart... I wonder if we should fold it into openEngineAndRecoverFromTranslog (another change)
try {
    synchronized (mutex) {
        assert shardRouting.primary();
        // do these updates under the mutex as this otherwise races with subsequent calls of updateShardState
what race conditions do you refer to?
I think this is obsolete and does not need to be done under the mutex anymore. It came from an earlier iteration where I had not introduced the relocated state in ReplicationTracker yet, and where I was using both replicationTracker.isPrimaryMode() and comparing operationPrimaryTerm to pendingPrimaryTerm to figure out if a shard had possibly relocated, and wanted this to be an atomic thing.
One problematic place I see is the assertion at the end of updateShardState which checks both isPrimaryMode and the operationTerm + pendingPrimaryTerm. If we don't update the replication tracker + the operationPrimaryTerm atomically under the mutex, this invariant might be violated.
Thanks @bleskes
We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues linked below), leading to the discovery of a race between resetting a replica when it learns about a higher term and when the same replica is promoted to primary. This commit fixes the race by distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level operation term. The former is set during the cluster state update or when a replica learns about a new primary. The latter is only incremented under the operation block, which can happen in a delayed fashion. It also solves the issue where a replica that's still adjusting to the new term receives a cluster state update that promotes it to primary, which can happen in the situation of multiple nodes being shut down in short succession. In that case, the cluster state update thread would call `asyncBlockOperations` in `updateShardState`, which in turn would throw an exception as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard. This commit therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks (which will all take precedence over operations acquiring permits). Finally, it also moves the primary activation of the replication tracker under the operation block, so that the actual transition to primary only happens under the operation block. Relates to #32431, #32304 and #32118
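The "queue multiple blocks" extension can be pictured with a deliberately simplified, hypothetical permit class; the real IndexShardOperationPermits is asynchronous and handles failures far more carefully, so treat this only as a sketch of the two properties mentioned above (blocks can queue, and they take precedence over new operations):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

// Simplified sketch of a permit system in which several "block all operations" requests
// can be queued: while any block is pending or running, new operations are delayed, and
// the queued blocks are executed one after another with no operations in flight.
class SimpleOperationPermits {
    private static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(TOTAL_PERMITS, true);
    private final Deque<Runnable> queuedBlocks = new ArrayDeque<>();
    private boolean draining; // true while a thread is working through queued blocks

    // a single permit for an ordinary operation; false means the caller must delay it
    synchronized boolean tryAcquireOperationPermit() {
        if (draining || queuedBlocks.isEmpty() == false) {
            return false; // a block is pending or active: blocks take precedence
        }
        return semaphore.tryAcquire();
    }

    synchronized void releaseOperationPermit() {
        semaphore.release();
    }

    void blockOperations(Runnable onAllOperationsDrained) {
        synchronized (this) {
            queuedBlocks.addLast(onAllOperationsDrained);
            if (draining) {
                return; // the active drainer will pick this block up as well
            }
            draining = true;
        }
        drainBlocks();
    }

    private void drainBlocks() {
        while (true) {
            final Runnable nextBlock;
            synchronized (this) {
                nextBlock = queuedBlocks.pollFirst();
                if (nextBlock == null) {
                    draining = false;
                    return;
                }
            }
            try {
                semaphore.acquire(TOTAL_PERMITS); // wait for in-flight operations to complete
                try {
                    nextBlock.run(); // no operation permits are out while this runs
                } finally {
                    semaphore.release(TOTAL_PERMITS);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                synchronized (this) {
                    draining = false;
                }
                return;
            }
        }
    }
}
```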
If the shard is already closed while bumping the primary term, this can result in an AlreadyClosedException to be thrown. As we use asyncBlockOperations, the exception will be thrown on a thread from the generic thread pool and end up in the uncaught exception handler, failing our tests. Relates to #32442
* 6.x:
  [Kerberos] Use canonical host name (#32588)
  Cross-cluster search: preserve cluster alias in shard failures (#32608)
  [TEST] Allow to run in FIPS JVM (#32607)
  Handle AlreadyClosedException when bumping primary term
  [Test] Add ckb to the list of unsupported languages (#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (#32068) (#32629)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (#32613)
  [Rollup] Improve ID scheme for rollup documents (#32558)
  Mutes failing SQL string function tests due to #32589
  Suppress Wildfly test in FIPS JVMs (#32543)
  Add cluster UUID to Cluster Stats API response (#32206)
  [ML] Add some ML config classes to protocol library (#32502)
  [TEST]Split transport verification mode none tests (#32488)
  [Rollup] Remove builders from DateHistogramGroupConfig (#32555)
  [ML] Add Detector config classes to protocol library (#32495)
  [Rollup] Remove builders from MetricConfig (#32536)
  Fix race between replica reset and primary promotion (#32442)
  HLRC: Move commercial clients from XPackClient (#32596)
  Security: move User to protocol project (#32367)
  Minor fix for javadoc (applicable for java 11). (#32573)
  Painless: Move Some Lookup Logic to PainlessLookup (#32565)
  Core: Minor size reduction for AbstractComponent (#32509)
  INGEST: Enable default pipelines (#32286) (#32591)
  TEST: Avoid merges in testSeqNoAndCheckpoints
  [Rollup] Remove builders from HistoGroupConfig (#32533)
  fixed elements in array of produced terms (#32519)
  Mutes ReindexFailureTests.searchFailure dues to #28053
  Mutes LicensingDocumentationIT due to #32580
  Remove the SATA controller from OpenSUSE box
  [ML] Rename JobProvider to JobResultsProvider (#32551)
* master:
  Cross-cluster search: preserve cluster alias in shard failures (#32608)
  Handle AlreadyClosedException when bumping primary term
  [TEST] Allow to run in FIPS JVM (#32607)
  [Test] Add ckb to the list of unsupported languages (#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (#32068)
  Painless: Use LocalMethod Map For Lookup at Runtime (#32599)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (#32613)
  [Rollup] Update wire version check after backport
  Suppress Wildfly test in FIPS JVMs (#32543)
  [Rollup] Improve ID scheme for rollup documents (#32558)
  ingest: doc: move Dot Expander Processor doc to correct position (#31743)
  [ML] Add some ML config classes to protocol library (#32502)
  [TEST]Split transport verification mode none tests (#32488)
  Core: Move helper date formatters over to java time (#32504)
  [Rollup] Remove builders from DateHistogramGroupConfig (#32555)
  [TEST} unmutes SearchAsyncActionTests and adds debugging info
  [ML] Add Detector config classes to protocol library (#32495)
  [Rollup] Remove builders from MetricConfig (#32536)
  Tests: Add rolling upgrade tests for watcher (#32428)
  Fix race between replica reset and primary promotion (#32442)
Primary terms were introduced as part of the sequence-number effort (#10708) and added in ES 5.0. Subsequent work introduced the replication tracker which lets the primary own its replication group (#25692) to coordinate recovery and replication. The replication tracker explicitly exposes whether it is operating in primary mode or replica mode, independent of the ShardRouting object that's associated with a shard. During a primary relocation, for example, the primary mode is transferred between the primary relocation source and the primary relocation target. After transferring this so-called primary context, the old primary becomes a replication target and the new primary the replication source, reflected in the replication tracker on both nodes. With the most recent PR in this area (#32442), we finally have a clean transition between a shard that's operating as a primary and issuing sequence numbers and a shard that's serving as a replication target. The transition from one state to the other is enforced through the operation-permit system, where we block permit acquisition during such changes and perform the transition under this operation block, ensuring that there are no operations in progress while the transition is being performed. This finally allows us to turn the best-effort checks that were put in place to prevent shards from being used in the wrong way (i.e. primary as replica, or replica as primary) into hard assertions, making it easier to catch any bugs in this area.
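A hedged illustration of that last point, with hypothetical names and a bare Semaphore in place of the operation-permit machinery: because the mode switch only happens while every permit is held, code that holds a permit can use a hard assertion rather than a best-effort check.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: the primary/replica mode switch happens only while all operation
// permits are held, so any operation that holds a permit can hard-assert the mode it
// expects instead of tolerating a transient mismatch.
class ModeGuardedShard {
    private static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore permits = new Semaphore(TOTAL_PERMITS, true);
    private volatile boolean primaryMode;

    ModeGuardedShard(boolean primaryMode) {
        this.primaryMode = primaryMode;
    }

    // the transition runs "under the operation block": no permits are out while it executes
    void switchMode(boolean newPrimaryMode) throws InterruptedException {
        permits.acquire(TOTAL_PERMITS);
        try {
            primaryMode = newPrimaryMode;
        } finally {
            permits.release(TOTAL_PERMITS);
        }
    }

    void runPrimaryOperation(Runnable operation) throws InterruptedException {
        permits.acquire();
        try {
            // a hard assertion, not a best-effort check: the mode cannot change while we hold a permit
            assert primaryMode : "primary operation on a shard that is in replica mode";
            operation.run();
        } finally {
            permits.release();
        }
    }
}
```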
When acquireReplicaOperationPermit calls bumpPrimaryTerm, is it possible that newPrimaryTerm < pendingPrimaryTerm, leading bumpPrimaryTerm to throw an assertion failure?
() -> {
    shardStateUpdated.await();
    assert pendingPrimaryTerm == newPrimaryTerm :
@ywelsch
1. In the method updateShardState, because newPrimaryTerm != pendingPrimaryTerm, the code runs into the bumpPrimaryTerm method.
2. In bumpPrimaryTerm, it will set pendingPrimaryTerm = newPrimaryTerm.
3. But before bumpPrimaryTerm sets pendingPrimaryTerm = newPrimaryTerm, onBlocked will run first, which asserts pendingPrimaryTerm == newPrimaryTerm.
Is this reasonable? It seems that the assertion "pendingPrimaryTerm == newPrimaryTerm" would always be false?
You said that: Can you tell me more about this? How does it avoid the problem of failing the shard?
assert pendingPrimaryTerm == newPrimaryTerm :
    "shard term changed on primary. expected [" + newPrimaryTerm + "] but was [" + pendingPrimaryTerm + "]" +
    ", current routing: " + currentRouting + ", new routing: " + newRouting;
assert operationPrimaryTerm == newPrimaryTerm;
@liaoyanyunde can you provide some context for these questions? Are you studying this code to understand how it works? To which goal?
No, primary terms are always non-decreasing, which is guaranteed by the cluster coordination subsystem.
The CountDownLatch (termUpdated) is what guarantees the ordering here: the code submitted to run under the operation block first awaits the latch, and the latch is only counted down after pendingPrimaryTerm has been set to the new term.
We run all our tests with assertions enabled. If this did not hold true, we would quickly learn about it.
I'm not sure on what you seek clarification. Also, if this is purely about understanding the current shape of the code, perhaps it's not necessary to understand the full history on how we arrived here.
I'm not sure what you mean by "unnecessary". Assertions are there to validate our assumptions about the code. They should obviously always hold true.
We are incorporating the fix for this issue into version 6.3.2, so I am studying this code to understand how it works. In the process of reading it, though, several questions have come up.
We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues linked below), leading to the discovery of a race between resetting a replica when it learns about a higher term and when the same replica is promoted to primary. This PR fixes the race by distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level operation term. The former is set during the cluster state update or when a replica learns about a new primary. The latter is only incremented under the operation block, which can happen in a delayed fashion. It also solves the issue where a replica that's still adjusting to the new term receives a cluster state update that promotes it to primary, which can happen in the situation of multiple nodes being shut down in short succession. In that case, the cluster state update thread would call asyncBlockOperations in updateShardState, which in turn would throw an exception as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard. This PR therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks (which will all take precedence over operations acquiring permits). Finally, it also moves the primary activation of the replication tracker under the operation block, so that the actual transition to primary only happens under the operation block. Relates to #32431, #32304 and #32118