-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wait for prewarm when relocating searchable snapshot shards #65531
Wait for prewarm when relocating searchable snapshot shards #65531
Conversation
Add hooks to enable waiting for a condition before relocation handoff. Make timeout on relocation unbounded and add ability to disable recovery liveness checker temporarily while running prewarm.
Pinging @elastic/es-distributed (Team:Distributed) |
// Due to relocation conditions on the shard it could take a while for the hand-off to complete so we disable the recovery | ||
// monitor since we don't expect any transport messages from master for the duration of the handoff and activate it again | ||
// after the handoff. | ||
final Releasable disabledMonitor = recoveryRef.target().disableRecoveryMonitor(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit of a BwC issue I guess. If the hand-off request comes from 7.10
and doesn't wait indefinitely yet then it could timeout on the primary I guess but maybe we can just ignore it since it's so fringe?
recoveryMonitorEnabled = false; | ||
return () -> { | ||
setLastAccessTime(); | ||
recoveryMonitorEnabled = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little low-tech relative to the tricky ref-counting in the IndexShard
. I figured this was ok here since the hand-off request only comes in once (at least judging by the assertions we have in IndexShard
) while the other API has a more "feel" to it and there are no hard guarantees on the index shard state listener only being invoked once (though the "loaded" flag on the directory effectively guarantees we only add one condition for now) and it wasn't that much extra effort since the API was supposed to be non-blocking anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an initial read of the production code. I am in doubt about the approach taken, since we block all operations on the source during the relocation (in IndexShard.relocate
). I think this will prevent other recoveries from initiating until the request is done. For searchable snapshots with replicas this is unfortunate. Also, any operation happening trying to acquire the primary permit will be queued up.
It may not be too important for searchable snapshots, but it seems counter intuitive to wait in this specific critical operation rather than outside it? Maybe we could hook this into finalizeRecovery
instead?
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
} | ||
lastSeenAccessTime = accessTime; | ||
} else { | ||
lastSeenAccessTime = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be simpler to just fake the progress inside RecoveryTarget
by returning System.nanoTime()
?
Thanks @henningandersen I adjusted this PR now to work via the clean files handler like we discussed. Let me know what you think :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
}); | ||
|
||
logger.info("--> sleep for 5s to ensure we are actually stuck at the FINALIZE stage and that the primary has not yet relocated"); | ||
TimeUnit.SECONDS.sleep(5L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we instead find the shard using internalCluster().getInstance(IndicesService.class, node)
and assertBusy that it has a pending after cleanup action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ thanks, you actually prevented a likely test failure here as well :) I moved the check for translog stage to a busy assert and then added the check for one clean files condition after. Otherwise we'd only have had 5s to arrive at TRANSLOG
now at least we have 10 which should be a little safer.
...ts/src/test/java/org/elasticsearch/index/store/cache/CachedBlobContainerIndexInputTests.java
Outdated
Show resolved
Hide resolved
Thanks Henning! |
…65531) Add hooks to enable waiting for a condition before completing the clean files step for relocating searchable snapshot shards and use them to wait for pre-warm before responding to the clean files request.
This commit introduces a change where searchable snapshots skip the TRANSLOG stage. Since elastic#65531 was introduced, the cleanFiles peer recovery stage is blocked until the prewarming completes (this is done to avoid search latency spikes due to a cold cache). In that phase, the RecoveryState stage is TRANSLOG which can be confusing as we don't replay any ops during searchable snapshots recoveries. In order to avoid that confusion we transition directly to FINALIZE stage.
This commit introduces a change where searchable snapshots skip the RecoveryState TRANSLOG stage. Since #65531 was introduced, the cleanFiles peer recovery phase is blocked until the prewarming completes (this is done to avoid search latency spikes due to a cold cache). In that phase, the RecoveryState stage is TRANSLOG which can be confusing as we don't replay any ops during searchable snapshots recoveries. In order to avoid that confusion we transition directly to FINALIZE stage.
This commit introduces a change where searchable snapshots skip the RecoveryState TRANSLOG stage. Since elastic#65531 was introduced, the cleanFiles peer recovery phase is blocked until the prewarming completes (this is done to avoid search latency spikes due to a cold cache). In that phase, the RecoveryState stage is TRANSLOG which can be confusing as we don't replay any ops during searchable snapshots recoveries. In order to avoid that confusion we transition directly to FINALIZE stage. Backport of elastic#70311
This commit introduces a change where searchable snapshots skip the RecoveryState TRANSLOG stage. Since #65531 was introduced, the cleanFiles peer recovery phase is blocked until the prewarming completes (this is done to avoid search latency spikes due to a cold cache). In that phase, the RecoveryState stage is TRANSLOG which can be confusing as we don't replay any ops during searchable snapshots recoveries. In order to avoid that confusion we transition directly to FINALIZE stage. Backport of #70311
Add hooks to enable waiting for a condition before completing the clean files step for relocating searchable snapshot shards and use them to wait for pre-warm before responding to the clean files request.