Harden periodically check to avoid endless flush loop #29125
Conversation
In #28350, we fixed an endless flushing loop which can happen on replicas by tightening the relation between the flush action and the periodically flush condition.

1. The periodically flush condition is enabled only if it will be disabled after a flush.
2. If the periodically flush condition is enabled then a flush will actually happen regardless of Lucene state.

(1) and (2) guarantee that a flushing loop will terminate. Sadly, condition 1 can be violated in edge cases because we used two different algorithms to evaluate the current and future uncommitted size.

- We use the method `uncommittedSizeInBytes` to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno); we pick the continuous range of translogs from the minGen onwards.
- We use the method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo; here we don't pick a range but select translogs one by one.

Suppose we have three translogs `gen1={#1,#2}, gen2={}, gen3={#3}` and `seqno=#1`: `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3, while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3 only. Gen2 is excluded because its maxSeqNo is still -1.

This commit ensures `sizeOfGensAboveSeqNoInBytes` uses the same algorithm as `uncommittedSizeInBytes`.

Closes #29097
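The discrepancy between the two size computations can be modeled in a few lines. This is a minimal sketch under a simplified translog model; `Gen`, `sizeByMinGen`, and `sizeAboveSeqNo` are illustrative names, not the actual Elasticsearch API.

```java
import java.util.List;

public class FlushSizeDemo {

    // A translog generation: its id, file size, and highest contained seq# (-1 if empty).
    public record Gen(long generation, long sizeInBytes, long maxSeqNo) {}

    // Range-based algorithm: find the minimum generation still holding ops
    // with seqNo >= the given seqNo, then sum every generation from there on,
    // including empty ones that fall inside the range.
    public static long sizeByMinGen(List<Gen> gens, long seqNo) {
        long minGen = gens.stream()
                .filter(g -> g.maxSeqNo() >= seqNo)
                .mapToLong(Gen::generation)
                .min().orElse(Long.MAX_VALUE);
        return gens.stream()
                .filter(g -> g.generation() >= minGen)
                .mapToLong(Gen::sizeInBytes).sum();
    }

    // One-by-one algorithm: sum only the generations whose maxSeqNo >= seqNo,
    // so an empty generation (maxSeqNo == -1) in the middle is skipped.
    public static long sizeAboveSeqNo(List<Gen> gens, long seqNo) {
        return gens.stream()
                .filter(g -> g.maxSeqNo() >= seqNo)
                .mapToLong(Gen::sizeInBytes).sum();
    }

    public static void main(String[] args) {
        // gen1={#1,#2}, gen2={} (maxSeqNo -1), gen3={#3}; 100 bytes each
        List<Gen> gens = List.of(new Gen(1, 100, 2), new Gen(2, 100, -1), new Gen(3, 100, 3));
        System.out.println(sizeByMinGen(gens, 1));   // gen1 + gen2 + gen3 = 300
        System.out.println(sizeAboveSeqNo(gens, 1)); // gen1 + gen3 = 200
    }
}
```

With the sizes disagreeing by exactly the empty gen2, the "future" size can stay below the "current" size forever, which is how condition 1 breaks.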
Pinging @elastic/es-distributed
This is a great catch. The source of this is Too-Many-Ways-To-Do-The-Same-Thing, and I don't think we quite removed it. I.e., the translog still exposes different methods that do slightly different things, and it's highly likely this will happen again. This was caused by the attempt to transition into a sequence-number-based translog recovery API. I don't think that will happen soon, and I don't feel happy leaving the hybrid state as is (we can do a full transition when and if we're ready).
I suggest removing the translog's `uncommittedOps` and `uncommittedBytes` and only exposing `sizeInBytesByMinGen` and a new equivalent `totalOpsByMinGen`. The engine will have to do the sequence number juggling using the `getMinGenerationForSeqNo` method. The API will feel less friendly, but at least the logic will not be spread out across multiple methods.
I also think we can have a stronger test: set the flush threshold to something smallish, then randomly index and flush (both based on the periodic check and just because). After each flush we should check that `shouldPeriodicallyFlush` returns false.
WDYT?
Yes, I will definitely give this a try.
Ok, I will add a new test for this.
@bleskes I've removed both `uncommittedOps` and `uncommittedBytes` methods from the translog and added a stress test for `shouldPeriodicallyFlush`. This test failed around 25% of the time without the patch. Please have a look, thank you!
Thanks @dnhatn . I think this is the right approach. Test looks much better. I left more comments.
```diff
@@ -1361,7 +1361,8 @@ final boolean tryRenewSyncCommit() {
         ensureOpen();
         ensureCanFlush();
         String syncId = lastCommittedSegmentInfos.getUserData().get(SYNC_COMMIT_ID);
-        if (syncId != null && translog.uncommittedOperations() == 0 && indexWriter.hasUncommittedChanges()) {
+        if (syncId != null && indexWriter.hasUncommittedChanges()
+            && translog.totalOperationsByMinGen(translog.uncommittedGeneration()) == 0) {
```
Can we extract the uncommitted gen from the `lastCommittedSegmentInfos`? Also, "uncommitted gen" is confusing because the gen's id is in the commit point.
```diff
@@ -1383,19 +1384,20 @@ final boolean tryRenewSyncCommit() {
     @Override
     public boolean shouldPeriodicallyFlush() {
         ensureOpen();
+        final long translogGenerationOfCurrentCommit = translog.uncommittedGeneration();
```
Please get this from the current commit. I don't think it makes sense for the engine to get this from the translog. It can know it on its own!
```diff
-        final long uncommittedSizeOfNewCommit = translog.sizeOfGensAboveSeqNoInBytes(localCheckpointTracker.getCheckpoint() + 1);
-        return uncommittedSizeOfNewCommit < uncommittedSizeOfCurrentCommit;
+        final long translogGenerationOfNewCommit =
+            translog.getMinGenerationForSeqNo(localCheckpointTracker.getCheckpoint() + 1, false).translogFileGeneration;
```
I'm not happy with the extra boolean flag to include/exclude the current generation as a fallback. It's too subtle and error prone. How about doing the following (I think you had it in the past and we moved away from it towards the uncommittedX API, sorry for that):

- If the min gen for the local checkpoint + 1 is greater than the current committed gen, return true.
- If the min gen is equal to the current translog gen, the current gen is not empty (using `totalOperationsByMinGen`), and the local checkpoint is equal to the max seq#, return true.
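The two-clause check proposed above can be sketched as follows. This is a minimal, self-contained model; the parameter names (`committedGen`, `currentGen`, `opsInCurrentGen`) are illustrative stand-ins, not the actual Elasticsearch engine API.

```java
public class FlushCheckSketch {

    public static boolean shouldPeriodicallyFlush(long uncommittedBytes, long flushThreshold,
                                                  long minGenForNewCommit, long committedGen,
                                                  long currentGen, long opsInCurrentGen,
                                                  long localCheckpoint, long maxSeqNo) {
        if (uncommittedBytes < flushThreshold) {
            return false; // uncommitted translog is still under the flush threshold
        }
        // Clause 1: a flush would advance the committed generation, so it
        // provably shrinks the uncommitted range.
        if (minGenForNewCommit > committedGen) {
            return true;
        }
        // Clause 2: the new commit would point at the current generation,
        // that generation is non-empty, and everything is persisted
        // (checkpoint == max seq#), so flush + roll still makes progress.
        return minGenForNewCommit == currentGen
                && opsInCurrentGen > 0
                && localCheckpoint == maxSeqNo;
    }

    public static void main(String[] args) {
        // Over threshold, and the new commit advances past the committed gen.
        System.out.println(shouldPeriodicallyFlush(1024, 512, 3, 2, 3, 10, 99, 99)); // true
        // Over threshold, same gen, but ops are still in flight (checkpoint < max seq#).
        System.out.println(shouldPeriodicallyFlush(1024, 512, 2, 2, 2, 10, 95, 99)); // false
    }
}
```

Both clauses guarantee the flush makes observable progress, which is exactly what terminates the flush loop.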
```java
                assertThat(engine.shouldPeriodicallyFlush(), equalTo(false));
            }
        } catch (EngineException ex) {
            // This happened because the test may have opened too many files (max 2048 fds on test)
```
this is no good :) maybe change the retention policy to clean up?
```java
        indexSettings.updateIndexMetaData(indexMetaData);
        engine.onSettingsChanged();
        final int iterations = scaledRandomIntBetween(100, 1000);
        final List<Long> pendingSeqNo = new ArrayList<>();
```
Maybe instead of pendingSeqNo, randomly take a seq# from the range (localCheckpoint:localCheckpoint+5]?
Boaz, thanks for this hint. This change makes the test fail more than 50% of the time without the patch and also eliminates the file descriptor issue.
@bleskes I've addressed your comments. Can you have another look? Thank you!
Thx @dnhatn . I left some nits. Code looks good.
```diff
-     * This condition will change if the `uncommittedSize` of the new commit is smaller than
-     * the `uncommittedSize` of the current commit. This method is to maintain translog only,
-     * thus the IndexWriter#hasUncommittedChanges condition is not considered.
+     * We should only flush if the shouldFlush condition can become false after flushing. This condition will change if:
```
nit - we also flush if it will reduce the size of the uncommitted gens, but strictly speaking that doesn't mean it will go below the threshold
```diff
-        return uncommittedSizeOfNewCommit < uncommittedSizeOfCurrentCommit;
+        final long translogGenerationOfNewCommit =
+            translog.getMinGenerationForSeqNo(localCheckpointTracker.getCheckpoint() + 1).translogFileGeneration;
+        return translogGenerationOfLastCommit < translogGenerationOfNewCommit
```
Can you add a comment that if `translogGenerationOfLastCommit == translogGenerationOfNewCommit` and `localCheckpointTracker.getCheckpoint() == localCheckpointTracker.getMaxSeqNo()`, we know that the last generation must contain operations, as its size is above the threshold and the threshold is guaranteed to be higher than an empty translog gen by the setting validation. Therefore, flushing will improve things. This is tricky to figure out (which is why I didn't like this approach originally).
PS - as far as I can tell, this can only happen if the translog generation file size limit is close to or above the flush threshold, and we can end up here if one indexes faster than it takes to roll generations. If that's true, can you add this to the comment? This is just too subtle, but I can't see how to avoid it.
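The equal-generation case above can be illustrated numerically. This is a sketch with hypothetical sizes (a 512-byte threshold and a 55-byte empty generation); the point is only the ordering that the setting validation guarantees.

```java
public class EqualGenCase {

    // Returns true when a flush terminates the loop under the stated sizes.
    public static boolean flushTerminates(long thresholdBytes, long emptyGenBytes, long currentGenBytes) {
        // Precondition from the discussion: shouldPeriodicallyFlush fired,
        // i.e. the single uncommitted (current) generation is over the threshold.
        if (currentGenBytes <= thresholdBytes) {
            throw new IllegalArgumentException("flush condition did not fire");
        }
        // Setting validation guarantees the flush threshold is larger than an
        // empty translog generation, so the current generation must hold ops.
        if (thresholdBytes <= emptyGenBytes) {
            throw new IllegalArgumentException("invalid settings");
        }
        // After flush + roll, only a fresh empty generation is uncommitted,
        // which is below the threshold: the flush loop terminates.
        return emptyGenBytes < thresholdBytes;
    }

    public static void main(String[] args) {
        System.out.println(flushTerminates(512, 55, 600)); // true
    }
}
```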
I've updated the comment.
```diff
      */
-    public long uncommittedSizeInBytes() {
-        return sizeInBytesByMinGen(deletionPolicy.getTranslogGenerationOfLastCommit());
+    public long uncommittedGeneration() {
```
can this be removed (preferably) or made package private for testing? I don't want production code to use this - it just adds circular dependencies.
I've removed this from `Translog` and moved it to `TestTranslog`.
Finally, I prefer using translog stats :). All removed.
```java
            final long seqno = randomLongBetween(Math.max(0, localCheckPoint), localCheckPoint + 5);
            final ParsedDocument doc = testParsedDocument(Long.toString(seqno), null, testDocumentWithTextField(), SOURCE, null);
            engine.index(replicaIndexForDoc(doc, 1L, seqno, false));
            if (rarely() || engine.getTranslog().shouldRollGeneration()) {
```
Can we sometimes skip rolling generations? (We should not rely on rolling always happening in the test.)
Done
```java
            if (rarely() || engine.getTranslog().shouldRollGeneration()) {
                engine.rollTranslogGeneration();
            }
            if (engine.shouldPeriodicallyFlush()) {
```
Can we sometimes flush anyway? People may force flush manually and we want to make sure everything works.
Done.
@bleskes Can you take another look?
We can avoid this if the default generation size is not a factor of the flush threshold (e.g. adding an empty translog + N * generations != flush). WDYT?
I'm not sure I follow. Can you please unpack this?
@bleskes What happened may be slightly different from your statement. I think an endless loop may have occurred when the uncommitted size is close to the flush threshold, the current generation is also close to the generation threshold, and a faster operation rolls a new generation, then a slower operation gets into an endless loop. If this is the case, the sizes of the N translog files and an empty translog satisfy these conditions:
Assuming that there was no manual flush, these conditions should not be satisfied at the same time if the generation size is not a factor of the flush threshold.
Discussed with Boaz on another channel. My last comment is not valid as it's based on the old code, while Boaz's is based on the new code. I've updated the comment in 4240241.
@bleskes Thank you very much for your helpful reviews.
In #28350, we fixed an endless flushing loop which may happen on replicas by tightening the relation between the flush action and the periodically flush condition. 1. The periodically flush condition is enabled only if it is disabled after a flush. 2. If the periodically flush condition is enabled then a flush will actually happen regardless of Lucene state. (1) and (2) guarantee that a flushing loop will terminate. Sadly, condition 1 can be violated in edge cases as we used two different algorithms to evaluate the current and future uncommitted translog size. - We use the method `uncommittedSizeInBytes` to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno); we pick the continuous range of translogs from the minGen onwards. - We use the method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo; here we don't pick a range but select translogs one by one. Suppose we have three translogs `gen1={#1,#2}, gen2={}, gen3={#3} and seqno=#1`: `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3, while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is excluded because its maxSeqNo is still -1. This commit removes both the `sizeOfGensAboveSeqNoInBytes` and `uncommittedSizeInBytes` methods, then enforces that an engine uses only the `sizeInBytesByMinGen` method to evaluate the periodically flush condition. Closes #29097 Relates #28350