-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tests] Do not use all permits acquisition in replica promotion tests #35862
Conversation
Pinging @elastic/es-distributed |
@elasticmachine test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@TRLX I think I need more info. This worries me as we want the exclusive replication action to behave exactly like a normal replication action - with all the rollback semantics etc. Can unpack a bit what you meant with "This does not play well"? |
@bleskes Sure, and thanks for insisting. The test is testing the rollback of the replica engine when the replica got promoted (ie when the primary term is increased). The logic to bump the primary term and to rollback the engine requires all the index shard permits, so as soon as we detect that the primary term must be modified the other operations will be delayed (this is a property of the asyncBlockOperations() method). This guarantees that a single permit replication action will be executed after the primary term is bumped/replica engine is rollbacked/global checkpoint updated. With the change made in #35540, the replication action randomly requires all permits to be executed. In those cases the replication action and the bumping term logic are competing for the acquisition of all permits, without the guarantee that the replication action will be executed after the bump of primary/engine rollback etc. Now I'm writing this I'm wondering if we should instead change the innerAcquireReplicaOperationPermit() logic to ensure that the replication action is always executed after the bump/rollback of the engine so that it does not rely on the delaying of ops in IndexShardOperationPermits. |
@bleskes I was thinking of something like tlrx@57f6c69 . Let me know what you think |
Closed in favor of #36116 |
The pull request #35540 changed the IndexShardTests to randomize usages between the acquisition of a single permit vs the acquisition of all permits when executing an operation on a replica. Although it works for most tests cases, the tests
testRestoreLocalHistoryFromTranslogOnPromotion()
andtestRollbackReplicaEngineOnPromotion()
promotes a replica and this promotion triggers a primary term bump which requires by nature all permits to be executed. This does not play well if the randomized usage of replica permit acquisition method already acquired all the permits when promoting the replica.This pull request unmute the tests and replaces the randomized usage with the single permit acquisition method like it was before #35540.
Closes #35850