Harden periodically check to avoid endless flush loop #29125
Conversation
In #28350, we fixed an endless flushing loop which can happen on replicas by tightening the relation between the flush action and the periodically flush condition.

1. The periodically flush condition is enabled only if it will be disabled after a flush.
2. If the periodically flush condition is enabled then a flush will actually happen regardless of Lucene state.

(1) and (2) guarantee that a flushing loop will terminate. Sadly, condition 1 can be violated in edge cases because we used two different algorithms to evaluate the current and future uncommitted size.

- We use the method `uncommittedSizeInBytes` to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno); we pick the continuous range of translogs from the minGen onwards.
- We use the method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo; here we don't pick a range but select translogs one by one.

Suppose we have three translogs `gen1={#1,#2}, gen2={}, gen3={#3}` and `seqno=#1`: `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3, while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3 only. Gen2 is excluded because its maxSeqNo is still -1.

This commit ensures `sizeOfGensAboveSeqNoInBytes` uses the same algorithm as `uncommittedSizeInBytes`.

Closes #29097
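The discrepancy between the two size computations can be modeled in a few lines. This is a minimal sketch under a simplified translog model; `Gen`, `sizeByMinGen`, and `sizeAboveSeqNo` are illustrative names, not the actual Elasticsearch API.

```java
import java.util.List;

public class FlushSizeDemo {

    // A translog generation: its id, file size, and highest contained seq# (-1 if empty).
    public record Gen(long generation, long sizeInBytes, long maxSeqNo) {}

    // Range-based algorithm: find the minimum generation still holding ops
    // with seqNo >= the given seqNo, then sum every generation from there on,
    // including empty ones that fall inside the range.
    public static long sizeByMinGen(List<Gen> gens, long seqNo) {
        long minGen = gens.stream()
                .filter(g -> g.maxSeqNo() >= seqNo)
                .mapToLong(Gen::generation)
                .min().orElse(Long.MAX_VALUE);
        return gens.stream()
                .filter(g -> g.generation() >= minGen)
                .mapToLong(Gen::sizeInBytes).sum();
    }

    // One-by-one algorithm: sum only the generations whose maxSeqNo >= seqNo,
    // so an empty generation (maxSeqNo == -1) in the middle is skipped.
    public static long sizeAboveSeqNo(List<Gen> gens, long seqNo) {
        return gens.stream()
                .filter(g -> g.maxSeqNo() >= seqNo)
                .mapToLong(Gen::sizeInBytes).sum();
    }

    public static void main(String[] args) {
        // gen1={#1,#2}, gen2={} (maxSeqNo -1), gen3={#3}; 100 bytes each
        List<Gen> gens = List.of(new Gen(1, 100, 2), new Gen(2, 100, -1), new Gen(3, 100, 3));
        System.out.println(sizeByMinGen(gens, 1));   // gen1 + gen2 + gen3 = 300
        System.out.println(sizeAboveSeqNo(gens, 1)); // gen1 + gen3 = 200
    }
}
```

With the sizes disagreeing by exactly the empty gen2, the "future" size can stay below the "current" size forever, which is how condition 1 breaks.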
Pinging @elastic/es-distributed
This is a great catch. The source of this is Too-Many-Ways-To-Do-The-Same-Thing, and I don't think we quite removed it. I.e., the translog still exposes different methods that do slightly different things, and it's highly likely this will happen again. This was caused by the attempt to transition into a sequence-number-based translog recovery API. I don't think that will happen soon, and I don't feel happy leaving the hybrid state as is (we can do a full transition when and if we're ready).
I suggest removing the translog's `uncommittedOps` and `uncommittedBytes` and only exposing `sizeInBytesByMinGen` and a new equivalent `totalOpsByMinGen`. The engine will have to do the sequence number juggling using the `getMinGenerationForSeqNo` method. The API will feel less friendly, but at least the logic will not be spread out across multiple methods.
I also think we can have a stronger test: set the flush threshold to something smallish, then randomly index and flush (both based on the periodic check and just because). After each flush we should check that `shouldPeriodicallyFlush` returns false.
WDYT?
Yes, I will definitely give this a try.
Ok, I will add a new test for this.
@bleskes I've removed both `uncommittedOps` and `uncommittedBytes` methods from the translog and added a stress test for `shouldPeriodicallyFlush`. This test failed around 25% of the time without the patch. Please have a look, thank you!
Thanks @dnhatn . I think this is the right approach. Test looks much better. I left more comments.
```diff
@@ -1361,7 +1361,8 @@ final boolean tryRenewSyncCommit() {
         ensureOpen();
         ensureCanFlush();
         String syncId = lastCommittedSegmentInfos.getUserData().get(SYNC_COMMIT_ID);
-        if (syncId != null && translog.uncommittedOperations() == 0 && indexWriter.hasUncommittedChanges()) {
+        if (syncId != null && indexWriter.hasUncommittedChanges()
+            && translog.totalOperationsByMinGen(translog.uncommittedGeneration()) == 0) {
```
Can we extract the uncommitted gen from the `lastCommittedSegmentInfos`? Also, "uncommitted gen" is confusing because the gen's id is in the commit point.
```diff
@@ -1383,19 +1384,20 @@ final boolean tryRenewSyncCommit() {
     @Override
     public boolean shouldPeriodicallyFlush() {
         ensureOpen();
+        final long translogGenerationOfCurrentCommit = translog.uncommittedGeneration();
```
Please get this from the current commit. I don't think it makes sense for the engine to get this from the translog. It can know it on its own!
```diff
-        final long uncommittedSizeOfNewCommit = translog.sizeOfGensAboveSeqNoInBytes(localCheckpointTracker.getCheckpoint() + 1);
-        return uncommittedSizeOfNewCommit < uncommittedSizeOfCurrentCommit;
+        final long translogGenerationOfNewCommit =
+            translog.getMinGenerationForSeqNo(localCheckpointTracker.getCheckpoint() + 1, false).translogFileGeneration;
```
I'm not happy with the extra boolean flag to include/exclude the current generation as a fallback. It's too subtle and error prone. How about doing the following (I think you had it in the past and we moved away from it towards the uncommittedX API, sorry for that):

- If the min gen for the local checkpoint + 1 is greater than the current committed gen, return true.
- If the min gen is equal to the current translog gen, the current gen is not empty (using `totalOperationsByMinGen`), and the local checkpoint is equal to the max seq#, return true.
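The two-clause check proposed above can be sketched as follows. This is a minimal, self-contained model; the parameter names (`committedGen`, `currentGen`, `opsInCurrentGen`) are illustrative stand-ins, not the actual Elasticsearch engine API.

```java
public class FlushCheckSketch {

    public static boolean shouldPeriodicallyFlush(long uncommittedBytes, long flushThreshold,
                                                  long minGenForNewCommit, long committedGen,
                                                  long currentGen, long opsInCurrentGen,
                                                  long localCheckpoint, long maxSeqNo) {
        if (uncommittedBytes < flushThreshold) {
            return false; // uncommitted translog is still under the flush threshold
        }
        // Clause 1: a flush would advance the committed generation, so it
        // provably shrinks the uncommitted range.
        if (minGenForNewCommit > committedGen) {
            return true;
        }
        // Clause 2: the new commit would point at the current generation,
        // that generation is non-empty, and everything is persisted
        // (checkpoint == max seq#), so flush + roll still makes progress.
        return minGenForNewCommit == currentGen
                && opsInCurrentGen > 0
                && localCheckpoint == maxSeqNo;
    }

    public static void main(String[] args) {
        // Over threshold, and the new commit advances past the committed gen.
        System.out.println(shouldPeriodicallyFlush(1024, 512, 3, 2, 3, 10, 99, 99)); // true
        // Over threshold, same gen, but ops are still in flight (checkpoint < max seq#).
        System.out.println(shouldPeriodicallyFlush(1024, 512, 2, 2, 2, 10, 95, 99)); // false
    }
}
```

Both clauses guarantee the flush makes observable progress, which is exactly what terminates the flush loop.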
```java
                assertThat(engine.shouldPeriodicallyFlush(), equalTo(false));
            }
        } catch (EngineException ex) {
            // This happened because the test may have opened too many files (max 2048 fds on test)
```
this is no good :) maybe change the retention policy to clean up?
```java
        indexSettings.updateIndexMetaData(indexMetaData);
        engine.onSettingsChanged();
        final int iterations = scaledRandomIntBetween(100, 1000);
        final List<Long> pendingSeqNo = new ArrayList<>();
```
Maybe instead of pendingSeqNo, randomly take a seq# from the range (localCheckpoint:localCheckpoint+5]?
Boaz, thanks for this hint. This change makes the test fail more than 50% of the time without the patch and also eliminates the file descriptor issue.
@bleskes I've addressed your comments. Can you have another look? Thank you!
Thx @dnhatn . I left some nits. Code looks good.
```diff
-     * This condition will change if the `uncommittedSize` of the new commit is smaller than
-     * the `uncommittedSize` of the current commit. This method is to maintain translog only,
-     * thus the IndexWriter#hasUncommittedChanges condition is not considered.
+     * We should only flush if the shouldFlush condition can become false after flushing. This condition will change if:
```
nit - we also flush if it will reduce the size of the uncommitted gens, but strictly speaking that doesn't mean it will go below the threshold
```diff
-        return uncommittedSizeOfNewCommit < uncommittedSizeOfCurrentCommit;
+        final long translogGenerationOfNewCommit =
+            translog.getMinGenerationForSeqNo(localCheckpointTracker.getCheckpoint() + 1).translogFileGeneration;
+        return translogGenerationOfLastCommit < translogGenerationOfNewCommit
```
Can you add a comment that if `translogGenerationOfLastCommit == translogGenerationOfNewCommit` and `localCheckpointTracker.getCheckpoint() == localCheckpointTracker.getMaxSeqNo()`, we know that the last generation must contain operations, as its size is above the threshold and the threshold is guaranteed to be higher than an empty translog gen by the setting validation. Therefore, flushing will improve things. This is tricky to figure out (which is why I didn't like this approach originally).
PS - as far as I can tell, this can only happen if the translog generation file size limit is close to or above the flush threshold, and we can end up here if one indexes faster than it takes to roll generations. If that's true, can you add this to the comment? This is just too subtle, but I can't see how to avoid it.
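The equal-generation case above can be illustrated numerically. This is a sketch with hypothetical sizes (a 512-byte threshold and a 55-byte empty generation); the point is only the ordering that the setting validation guarantees.

```java
public class EqualGenCase {

    // Returns true when a flush terminates the loop under the stated sizes.
    public static boolean flushTerminates(long thresholdBytes, long emptyGenBytes, long currentGenBytes) {
        // Precondition from the discussion: shouldPeriodicallyFlush fired,
        // i.e. the single uncommitted (current) generation is over the threshold.
        if (currentGenBytes <= thresholdBytes) {
            throw new IllegalArgumentException("flush condition did not fire");
        }
        // Setting validation guarantees the flush threshold is larger than an
        // empty translog generation, so the current generation must hold ops.
        if (thresholdBytes <= emptyGenBytes) {
            throw new IllegalArgumentException("invalid settings");
        }
        // After flush + roll, only a fresh empty generation is uncommitted,
        // which is below the threshold: the flush loop terminates.
        return emptyGenBytes < thresholdBytes;
    }

    public static void main(String[] args) {
        System.out.println(flushTerminates(512, 55, 600)); // true
    }
}
```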
I've updated the comment.
```diff
      */
-    public long uncommittedSizeInBytes() {
-        return sizeInBytesByMinGen(deletionPolicy.getTranslogGenerationOfLastCommit());
+    public long uncommittedGeneration() {
```
can this be removed (preferably) or made package private for testing? I don't want production code to use this - it just adds circular dependencies.
I've removed this from `Translog` and moved it to `TestTranslog`.
Finally, I prefer using translog stats :). All removed.
```java
            final long seqno = randomLongBetween(Math.max(0, localCheckPoint), localCheckPoint + 5);
            final ParsedDocument doc = testParsedDocument(Long.toString(seqno), null, testDocumentWithTextField(), SOURCE, null);
            engine.index(replicaIndexForDoc(doc, 1L, seqno, false));
            if (rarely() || engine.getTranslog().shouldRollGeneration()) {
```
Can we sometimes skip rolling generations? (We should not rely on rolling always happening in the test.)
Done
```java
            if (rarely() || engine.getTranslog().shouldRollGeneration()) {
                engine.rollTranslogGeneration();
            }
            if (engine.shouldPeriodicallyFlush()) {
```
Can we sometimes flush anyway? People may force flush manually and we want to make sure everything works.
Done.
@bleskes Can you take another look?
We can avoid this if the default generation size is not a factor of the flush threshold (e.g. adding an empty translog + N * generations != flush). WDYT?
I'm not sure I follow. Can you please unpack this?
@bleskes What happened may be slightly different from your statement. I think an endless loop may have occurred when the uncommitted size is close to the flush threshold, the current generation is also close to the generation threshold, and a faster operation rolls a new generation, then a slower operation gets into an endless loop. If this is the case, the sizes of the N translog files and an empty translog satisfy these conditions:
Assuming that there was no manual flush, these conditions should not be satisfied at the same time if the generation size is not a factor of the flush threshold.
Discussed with Boaz on another channel. My last comment is not valid as it's based on the old code, while Boaz's is based on the new code. I've updated the comment in 4240241.
@bleskes Thank you very much for your helpful reviews.
In #28350, we fixed an endless flushing loop which may happen on replicas by tightening the relation between the flush action and the periodically flush condition. 1. The periodically flush condition is enabled only if it is disabled after a flush. 2. If the periodically flush condition is enabled then a flush will actually happen regardless of Lucene state. (1) and (2) guarantee that a flushing loop will terminate. Sadly, condition 1 can be violated in edge cases as we used two different algorithms to evaluate the current and future uncommitted translog size. - We use the method `uncommittedSizeInBytes` to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno); we pick the continuous range of translogs from the minGen onwards. - We use the method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo; here we don't pick a range but select translogs one by one. Suppose we have three translogs `gen1={#1,#2}, gen2={}, gen3={#3} and seqno=#1`: `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3, while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is excluded because its maxSeqNo is still -1. This commit removes both the `sizeOfGensAboveSeqNoInBytes` and `uncommittedSizeInBytes` methods, then enforces that an engine uses only the `sizeInBytesByMinGen` method to evaluate the periodically flush condition. Closes #29097 Relates #28350