
Harden periodically check to avoid endless flush loop #29125

Merged
dnhatn merged 21 commits into elastic:master on Mar 22, 2018

Conversation

dnhatn
Member

@dnhatn dnhatn commented Mar 17, 2018

In #28350, we fixed an endless flushing loop which may happen on replicas by tightening the relation between the flush action and the periodically flush condition.

  1. The periodically flush condition is enabled only if it will be disabled after a flush.

  2. If the periodically flush condition is enabled then a flush will actually happen regardless of Lucene state.

(1) and (2) guarantee that a flushing loop will terminate. Sadly, condition (1) can be violated in edge cases because we used two different algorithms to evaluate the current and future uncommitted translog size.

  • We use the method uncommittedSizeInBytes to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno). We pick a continuous range of translogs starting from the minGen to evaluate the current uncommitted size.

  • We use the method sizeOfGensAboveSeqNoInBytes to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo. Here we don't pick a range but select translogs one by one.

Suppose we have 3 translogs gen1={#1,#2}, gen2={}, gen3={#3} and seqno=#1: uncommittedSizeInBytes is the sum of gen1, gen2, and gen3, while sizeOfGensAboveSeqNoInBytes is the sum of gen1 and gen3 only. Gen2 is excluded because its maxSeqNo is still -1.
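The mismatch can be illustrated with a toy model (the `Gen` record, the sizes, and the method names below are invented for illustration; this is not the real Translog API):

```java
import java.util.List;

// Toy model of the two divergent size computations described above.
class FlushSizeDemo {
    record Gen(long generation, long maxSeqNo, long sizeInBytes) {}

    // "Current" uncommitted size: a contiguous range of generations >= minGen.
    static long sizeByMinGen(List<Gen> gens, long minGen) {
        return gens.stream().filter(g -> g.generation() >= minGen)
                   .mapToLong(Gen::sizeInBytes).sum();
    }

    // "Future" uncommitted size: pick generations one by one by maxSeqNo.
    static long sizeOfGensAboveSeqNo(List<Gen> gens, long seqNo) {
        return gens.stream().filter(g -> g.maxSeqNo() >= seqNo)
                   .mapToLong(Gen::sizeInBytes).sum();
    }

    public static void main(String[] args) {
        // gen1={#1,#2}, gen2={} (maxSeqNo stays -1), gen3={#3}; seqno=#1
        List<Gen> gens = List.of(new Gen(1, 2, 100), new Gen(2, -1, 55), new Gen(3, 3, 100));
        long current = sizeByMinGen(gens, 1);          // gen1 + gen2 + gen3 = 255
        long future  = sizeOfGensAboveSeqNo(gens, 1);  // gen1 + gen3 = 200; gen2 skipped
        System.out.println(current + " vs " + future); // the mismatch behind the bug
    }
}
```

Because the "future" size skips the empty gen2, a flush can fail to shrink the "current" size below the threshold even though condition (1) predicted it would.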

This commit removes both the sizeOfGensAboveSeqNoInBytes and uncommittedSizeInBytes methods and enforces the engine to use only the sizeInBytesByMinGen method to evaluate the periodically flush condition.

Closes #29097
Relates #28350

@dnhatn dnhatn added >bug v7.0.0 v6.3.0 :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Mar 17, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@bleskes bleskes left a comment

This is a great catch. The source of this is Too-Many-Ways-To-Do-The-Same-Thing and I don't think we quite removed it. I.e., the translog still exposes different methods that do slightly different things and it's highly likely this will happen again. This is caused by the attempt to transition into a sequence numbers based translog recovery related API. I don't think it will happen soon and I don't feel happy with leaving the hybrid state as is (we can do a full transition when and if we're ready).

I suggest removing the translog's uncommittedOps and uncommittedBytes and only expose the sizeInBytesByMinGen and a new equivalent totalOpsByMinGen. The engine will have to do the sequence number juggling using the getMinGenerationForSeqNo method. The API will feel less friendly but at least the logic is not spread out in multiple methods.

I also think we can have a stronger test - set the flush threshold to something smallish and then randomly index and flush (both based on the periodic check and just because). After each flush we should check that shouldPeriodicallyFlush returns false.

WDYT?
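A minimal sketch of the suggested stress test, using a toy engine model (all names, sizes, and thresholds here are invented; the real test exercises the actual engine and translog):

```java
import java.util.ArrayDeque;
import java.util.Random;

// Toy engine: index random ops, roll generations, flush, and verify the
// invariant that shouldPeriodicallyFlush() is false right after a flush.
class FlushLoopStressDemo {
    static final long FLUSH_THRESHOLD = 120;            // smallish, as suggested
    private final ArrayDeque<Long> oldGens = new ArrayDeque<>();
    private long currentGenSize = 0;

    void index(long opSize) { currentGenSize += opSize; }

    void rollGeneration() { oldGens.add(currentGenSize); currentGenSize = 0; }

    void flush() {                  // a commit trims everything up to the current gen
        rollGeneration();
        oldGens.clear();
    }

    long uncommittedBytes() {
        return oldGens.stream().mapToLong(Long::longValue).sum() + currentGenSize;
    }

    boolean shouldPeriodicallyFlush() { return uncommittedBytes() > FLUSH_THRESHOLD; }

    public static void main(String[] args) {
        FlushLoopStressDemo engine = new FlushLoopStressDemo();
        Random random = new Random();
        for (int i = 0; i < 1000; i++) {
            engine.index(1 + random.nextInt(10));
            if (random.nextInt(10) == 0) {
                engine.rollGeneration();
            }
            if (engine.shouldPeriodicallyFlush() || random.nextInt(20) == 0) {
                engine.flush();                          // "just because" flushes too
                if (engine.shouldPeriodicallyFlush()) {
                    throw new AssertionError("flush loop: still needs flushing after flush");
                }
            }
        }
        System.out.println("invariant held over 1000 iterations");
    }
}
```

In the toy model the invariant holds trivially; the value of the real test is that the actual engine computes the two sides of the condition with independent code paths, which is exactly where the bug hid.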

@dnhatn
Member Author

dnhatn commented Mar 19, 2018

@bleskes

I suggest removing the translog's uncommittedOps and uncommittedBytes and only expose the sizeInBytesByMinGen and a new equivalent totalOpsByMinGen.

Yes, I will definitely give this a try.

I also think we can have a stronger test - set the flush threshold to something smallish and then randomly index and flush (both based on the periodic check and just because). After each flush we should check that shouldPeriodicallyFlush returns false.

Ok, I will add a new test for this.

@dnhatn
Member Author

dnhatn commented Mar 19, 2018

@bleskes I've removed both the uncommittedOps and uncommittedBytes methods from the translog and added a stress test for shouldPeriodicallyFlush. This test failed around 25% of the time without the patch. Please have a look, thank you!

Contributor

@bleskes bleskes left a comment

Thanks @dnhatn . I think this is the right approach. Test looks much better. I left more comments.

@@ -1361,7 +1361,8 @@ final boolean tryRenewSyncCommit() {
ensureOpen();
ensureCanFlush();
String syncId = lastCommittedSegmentInfos.getUserData().get(SYNC_COMMIT_ID);
if (syncId != null && translog.uncommittedOperations() == 0 && indexWriter.hasUncommittedChanges()) {
if (syncId != null && indexWriter.hasUncommittedChanges()
&& translog.totalOperationsByMinGen(translog.uncommittedGeneration()) == 0) {
Contributor

can we extract the uncommitted gen from the lastCommittedSegmentInfos? Also uncommitted gen is confusing because the gen's id is in the commit point.

@@ -1383,19 +1384,20 @@ final boolean tryRenewSyncCommit() {
@Override
public boolean shouldPeriodicallyFlush() {
ensureOpen();
final long translogGenerationOfCurrentCommit = translog.uncommittedGeneration();
Contributor

please get this from the current commit. I don't think it makes sense for the engine to get this from the translog. It can know it on its own!

final long uncommittedSizeOfNewCommit = translog.sizeOfGensAboveSeqNoInBytes(localCheckpointTracker.getCheckpoint() + 1);
return uncommittedSizeOfNewCommit < uncommittedSizeOfCurrentCommit;
final long translogGenerationOfNewCommit =
translog.getMinGenerationForSeqNo(localCheckpointTracker.getCheckpoint() + 1, false).translogFileGeneration;
Contributor

@bleskes bleskes Mar 20, 2018

I'm not happy with the extra boolean flag to include / exclude the current generation as a fallback. It's too subtle and error prone. How about doing the following (I think you had it in the past and we moved away from it towards the uncommittedX API - sorry for that):

  1. If the min gen for the local checkpoint + 1 is > current committed gen , return true.
  2. If the min gen is equal to the current translog gen, the current gen is not empty (using totalOperationsByMinGen) and the local checkpoint is equal to the max seq#, return true.
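The two-step check proposed above can be sketched as follows (the parameter names are invented stand-ins for engine and translog state, not the real method signature):

```java
// Sketch of the proposed two-step shouldPeriodicallyFlush decision,
// evaluated once the size threshold has already been exceeded.
class ShouldFlushSketch {
    static boolean shouldPeriodicallyFlush(long minGenForNextCommit,
                                           long committedGen,
                                           long currentTranslogGen,
                                           long opsInCurrentGen,
                                           long localCheckpoint,
                                           long maxSeqNo) {
        // 1. A new commit would advance past the current committed generation,
        //    so flushing lets us trim at least one older generation.
        if (minGenForNextCommit > committedGen) {
            return true;
        }
        // 2. The new commit lands on the current generation, that generation
        //    is not empty, and every indexed op is below the local checkpoint,
        //    so flushing plus rolling still makes progress.
        return minGenForNextCommit == currentTranslogGen
            && opsInCurrentGen > 0
            && localCheckpoint == maxSeqNo;
    }
}
```

Either branch returning true means a flush is guaranteed to reduce the uncommitted translog, which is what terminates the loop.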

assertThat(engine.shouldPeriodicallyFlush(), equalTo(false));
}
} catch (EngineException ex) {
// This happened because the test may have opened too many files (max 2048 fds on test)
Contributor

this is no good :) maybe change the retention policy to clean up?

indexSettings.updateIndexMetaData(indexMetaData);
engine.onSettingsChanged();
final int iterations = scaledRandomIntBetween(100, 1000);
final List<Long> pendingSeqNo = new ArrayList<>();
Contributor

maybe instead of pendingSeqNo, randomly take a seq# from the range (localCheckpoint:localCheckpoint+5]?

Member Author

@dnhatn dnhatn Mar 20, 2018

Boaz, thanks for this hint. This change makes the test fail more than 50% of the time without the patch and also eliminates the file descriptor issue.

@dnhatn
Member Author

dnhatn commented Mar 20, 2018

@bleskes I've addressed your comments. Can you have another look? Thank you!

Contributor

@bleskes bleskes left a comment

Thx @dnhatn . I left some nits. Code looks good.

* This condition will change if the `uncommittedSize` of the new commit is smaller than
* the `uncommittedSize` of the current commit. This method is to maintain translog only,
* thus the IndexWriter#hasUncommittedChanges condition is not considered.
* We should only flush if the shouldFlush condition can become false after flushing. This condition will change if:
Contributor

nit - when we flush it will also reduce the size of the uncommitted gens, but strictly speaking that doesn't mean it will drop below the threshold

return uncommittedSizeOfNewCommit < uncommittedSizeOfCurrentCommit;
final long translogGenerationOfNewCommit =
translog.getMinGenerationForSeqNo(localCheckpointTracker.getCheckpoint() + 1).translogFileGeneration;
return translogGenerationOfLastCommit < translogGenerationOfNewCommit
Contributor

can you add a comment that if translogGenerationOfLastCommit == translogGenerationOfNewCommit and localCheckpointTracker.getCheckpoint() == localCheckpointTracker.getMaxSeqNo(), we know that the last generation must contain operations, as its size is above the threshold and the threshold is guaranteed to be higher than an empty translog gen by the setting validation. Therefore, flushing will improve things. This is tricky to figure out (which is why I didn't like this approach originally).

Contributor

PS - as far as I can tell this can only happen if the translog generation file size limit is close to or above the flush threshold, and we can end up here if one indexes faster than it takes to roll generations. If that's true, can you add this to the comment? This is just too subtle but I can't see how to avoid it.

Member Author

I've updated the comment.

*/
public long uncommittedSizeInBytes() {
return sizeInBytesByMinGen(deletionPolicy.getTranslogGenerationOfLastCommit());
public long uncommittedGeneration() {
Contributor

can this be removed (preferably) or made package private for testing? I don't want production code to use this - it just adds circular dependencies.

Member Author

I've removed this from Translog and moved this to TestTranslog.

Member Author

Finally, I prefer using translog stats :). All removed.

final long seqno = randomLongBetween(Math.max(0, localCheckPoint), localCheckPoint + 5);
final ParsedDocument doc = testParsedDocument(Long.toString(seqno), null, testDocumentWithTextField(), SOURCE, null);
engine.index(replicaIndexForDoc(doc, 1L, seqno, false));
if (rarely() || engine.getTranslog().shouldRollGeneration()) {
Contributor

can we sometimes skip rolling generations? (we should not rely on rolling always happening in the test)

Member Author

Done

if (rarely() || engine.getTranslog().shouldRollGeneration()) {
engine.rollTranslogGeneration();
}
if (engine.shouldPeriodicallyFlush()) {
Contributor

can we sometimes flush anyway? people may force a flush manually and we want to make sure everything works

Member Author

Done.

@dnhatn
Member Author

dnhatn commented Mar 21, 2018

@bleskes Can you take another look?

@dnhatn
Member Author

dnhatn commented Mar 22, 2018

PS - as far as I can tell this can only happen if the translog generation file size limit is close to or above the flush threshold, and we can end up here if one indexes faster than it takes to roll generations. If that's true, can you add this to the comment? This is just too subtle but I can't see how to avoid it.

We can avoid this if the default generation size is not a factor of the flush threshold (e.g. an empty translog + N * generations != flush threshold). WDYT?

@bleskes
Contributor

bleskes commented Mar 22, 2018

We can avoid if the default generation is not a factor of the flush threshold (eg. adding an empty translog + N * generations != flush). WDYT?

I'm not sure I follow. Can you please unpack this?

@dnhatn
Member Author

dnhatn commented Mar 22, 2018

@bleskes What happened may be slightly different from your statement. I think an endless loop may have occurred when the uncommitted size is close to the flush threshold, the current generation is also close to the generation threshold, and a faster operation rolls a new generation, after which a slower operation gets into an endless loop. If this is the case, the sizes of the N translog files and an empty translog satisfy these conditions:

  • Size of N translog files <= the flush threshold
  • Size of N translog files + an empty translog file > the flush threshold

Assuming that there was no manual flush, these conditions should not be satisfied at the same time if the generation size is not a factor of the flush threshold.

@dnhatn dnhatn added the v6.2.4 label Mar 22, 2018
@dnhatn
Member Author

dnhatn commented Mar 22, 2018

Discussed with Boaz on another channel. My last comment is not valid as it's based on the old code while Boaz's is on the new code. I've updated the comment in 4240241.

@dnhatn
Member Author

dnhatn commented Mar 22, 2018

@bleskes Thank you very much for your helpful reviews.

@dnhatn dnhatn merged commit 14157c8 into elastic:master Mar 22, 2018
@dnhatn dnhatn deleted the fix-infinite-flush branch March 22, 2018 18:31
dnhatn added a commit that referenced this pull request Mar 22, 2018
dnhatn added a commit that referenced this pull request Mar 22, 2018
martijnvg added a commit that referenced this pull request Mar 26, 2018
* es/master: (27 commits)
  [Docs] Add rank_eval size parameter k (#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (#29172)
  Docs: Link C++ client lib elasticlient (#28949)
  [DOCS] Unregister repository instead of deleting it (#29206)
  Docs: HighLevelRestClient#multiSearch (#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (#28878)
  REST : Split `RestUpgradeAction` into two actions (#29124)
  Add error file docs to important settings
  Add note to low-level client docs for DNS caching (#29213)
  Harden periodically check to avoid endless flush loop (#29125)
  Remove deprecated options for query_string (#29203)
  REST high-level client: add force merge API (#28896)
  Remove license information from README.textile (#29198)
  Decouple more classes from XContentBuilder and make builder strict (#29197)
  [Docs] Fix missing closing block in cluster/misc.asciidoc
  RankEvalRequest should implement IndicesRequest (#29188)
  Use EnumMap in ClusterBlocks (#29112)
  ...
martijnvg added a commit that referenced this pull request Mar 26, 2018
* es/6.x: (29 commits)
  [Docs] Add rank_eval size parameter k (#29218)
  Docs: Update docs/index_.asciidoc (#29172)
  Docs: Link C++ client lib elasticlient (#28949)
  Docs: HighLevelRestClient#multiSearch (#29144)
  [DOCS] Remove ignore_z_value parameter link
  Add Z value support to geo_shape
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (#28878)
  REST : Split `RestUpgradeAction` into two actions (#29124)
  [DOCS] Unregister repository instead of deleting it (#29206)
  Remove type casts in logging in server component (#28807)
  Add error file docs to important settings
  Add note to low-level client docs for DNS caching (#29213)
  testShrinkAfterUpgrade should only set mapping.single_type if bwc version > 5.5.0
  Harden periodically check to avoid endless flush loop (#29125)
  REST high-level client: add force merge API (#28896)
  Remove license information from README.textile (#29198)
  Decouple more classes from XContentBuilder and make builder strict (#29197)
  Propagate mapping.single_type setting on shrinked index (#29202)
  [Docs] Fix missing closing block in cluster/misc.asciidoc
  RankEvalRequest should implement IndicesRequest (#29188)
  ...
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.2.4 v6.3.0 v7.0.0-beta1

Successfully merging this pull request may close these issues.

Elasticsearch 6.2.2 nodes crash after reaching ulimit setting
4 participants