Inactive shard flush should wait for ongoing one #89430
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
Hi @henningandersen and @DaveCTurner. After investigating the test failure, we (with the help of @fcofdez) figured out the situation, which is presented in the PR description. It can happen in rare situations when SHARD_MEMORY_INTERVAL_TIME_SETTING is set too low and/or a flush takes too long. There are 3 solutions we discussed so far:
Feel free to tell us your opinion on the selected approach and any other thoughts you may have.
It does look like the old synced-flush would use […]. A more straightforward solution could also be to move the setting of […].
I'm not sure if this solves the issue? Theoretically we could still miss the last flush if the first flush is still running after the remaining documents have been indexed. But maybe I'm missing something here.
You might be right 🙂. The idea would be that by marking active after the indexing has occurred, the next round of […]
Hi! Thanks for the awesome conversation. I think I agree with @fcofdez. If we follow that approach @henningandersen, I do see a rare situation where:
I think this is also a nice solution. I can try that if you agree.
Just to add more context here: this only affects cases where we stop indexing after the latest flush is skipped (a rare edge case).
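To make the edge case concrete, here is a minimal sketch of the pre-fix idle-flush path as discussed in this thread. The `active` flag, `flushOnIdle`, and `waitIfOngoing` come from the conversation; the surrounding scaffolding is simplified and is not the actual IndexShard code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class IdleFlushRaceSketch {
    // Set to true by every indexing operation; checked by the periodic
    // idle-flush task (interval: SHARD_MEMORY_INTERVAL_TIME_SETTING).
    private final AtomicBoolean active = new AtomicBoolean(false);
    private final AtomicBoolean flushOngoing = new AtomicBoolean(false);

    void onIndexOperation() {
        active.set(true);
    }

    // Pre-fix behavior: the flag is cleared even when the flush is skipped.
    void flushOnIdle() {
        if (active.getAndSet(false)) {
            // waitIfOngoing == false: if another flush is still running,
            // this request is silently dropped ...
            if (flushOngoing.compareAndSet(false, true)) {
                try {
                    // ... do the actual flush work ...
                } finally {
                    flushOngoing.set(false);
                }
            }
            // ... yet 'active' is already false, so if indexing has stopped,
            // no later periodic check ever schedules the missing flush.
        }
    }
}
```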
👍 let's try that. |
Makes sense, the […]
Force-pushed from 460eb61 to e95404b:
org.elasticsearch.indices.flush.FlushIT#testFlushOnInactive would sometimes fail in the following case:
* SHARD_MEMORY_INTERVAL_TIME_SETTING is set very low, e.g., 10ms.
* Multiple regularly scheduled flushes proceed to org.elasticsearch.index.shard.IndexShard#flushOnIdle.
* There, the first flush will handle, e.g., the first document that was indexed. The second flush will arrive shortly after, before the first flush finishes.
* The second flush will find that wasActive = true (due to the indexing of the remaining documents), and will set it to false.
* However, the second flush will not be executed, because waitIfOngoing = false and there is the ongoing first flush.
* No other flush is scheduled (since any next regularly scheduled flush will find wasActive = false), which creates the problem.

Solution: if a flush request does not happen, revert the active flag, so that a next flush request can happen.

Fixes elastic#87888
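A minimal sketch of that solution, extending the sketch above and assuming a hypothetical boolean-returning flush(waitIfOngoing) that reports whether the flush actually ran (the exact revert site in IndexShard may differ):

```java
// Post-fix behavior: if the flush was skipped because another flush was
// ongoing, restore 'active' so that a next periodic check retries it.
void flushOnIdle() {
    if (active.getAndSet(false)) {
        boolean flushed = flush(/* waitIfOngoing= */ false);
        if (flushed == false) {
            active.set(true); // revert the flag; a next flush request can happen
        }
    }
}
```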
Force-pushed from e95404b to c5509c4.
Hi @fcofdez, @henningandersen. Thanks for the conversation. The approach where the flush returns false if it does not wait for the ongoing one, and we then set the active flag back to true, works. I did that -- feel free to review the PR.
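At the engine level, the "returns false" behavior can be pictured as a tryLock on the flush lock. InternalEngine does guard flushes with a lock, but the bookkeeping below is a simplified assumption, not the verbatim change:

```java
import java.util.concurrent.locks.ReentrantLock;

class EngineFlushSketch {
    private final ReentrantLock flushLock = new ReentrantLock();

    /** Returns false when the flush is skipped because another one is ongoing. */
    boolean flush(boolean waitIfOngoing) {
        if (flushLock.tryLock() == false) {
            if (waitIfOngoing == false) {
                return false; // skipped; the caller can revert the active flag
            }
            flushLock.lock(); // wait for the ongoing flush to finish
        }
        try {
            // ... commit Lucene segments and trim the translog ...
            return true;
        } finally {
            flushLock.unlock();
        }
    }
}
```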
It looks like there are some related test failures. Additionally, we usually avoid force-pushing, since GitHub gets confused and hides/removes some review comments.
Fixed the test. Oh, about the force pushes: I will avoid them from now on. Either way the commits get squashed when merging the PR, so indeed I do not see a reason why I did it :)
No worries, it's just a trade-off: sometimes it's easier to review a set of clean commits. But I'm not sure GitHub will ever fix the force-push issue 🤔
This direction looks good to me. I have a few comments that I'd like to see addressed, though.
Resolved review threads (outdated) on:
- server/src/main/java/org/elasticsearch/index/engine/Engine.java
- server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
- server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
I left a small comment; the direction looks good.
And fix some PR review feedback
I think there is a problem with the new test; otherwise this looks good.
Resolved review threads (outdated) on:
- server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
- server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
- server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
- server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java
Co-authored-by: Henning Andersen <[email protected]>
LGTM.
LGTM 👍
Fix some javadoc
@elasticmachine run elasticsearch-ci/part-1 please