Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tests] Do not use all permits acquisition in replica promotion tests #35862

Closed
wants to merge 1 commit into from

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Nov 23, 2018

The pull request #35540 changed the IndexShardTests to randomize usages between the acquisition of a single permit vs the acquisition of all permits when executing an operation on a replica. Although it works for most tests cases, the tests testRestoreLocalHistoryFromTranslogOnPromotion() and testRollbackReplicaEngineOnPromotion() promotes a replica and this promotion triggers a primary term bump which requires by nature all permits to be executed. This does not play well if the randomized usage of replica permit acquisition method already acquired all the permits when promoting the replica.

This pull request unmute the tests and replaces the randomized usage with the single permit acquisition method like it was before #35540.

Closes #35850

@tlrx tlrx added >test Issues or PRs that are addressing/adding tests v7.0.0 :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.6.0 labels Nov 23, 2018
@tlrx tlrx requested a review from bleskes November 23, 2018 15:20
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@tlrx
Copy link
Member Author

tlrx commented Nov 23, 2018

@elasticmachine test this please

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bleskes
Copy link
Contributor

bleskes commented Nov 26, 2018

@TRLX I think I need more info. This worries me as we want the exclusive replication action to behave exactly like a normal replication action - with all the rollback semantics etc. Can unpack a bit what you meant with "This does not play well"?

@tlrx
Copy link
Member Author

tlrx commented Nov 26, 2018

Can unpack a bit what you meant with "This does not play well"?

@bleskes Sure, and thanks for insisting. The test is testing the rollback of the replica engine when the replica got promoted (ie when the primary term is increased). The logic to bump the primary term and to rollback the engine requires all the index shard permits, so as soon as we detect that the primary term must be modified the other operations will be delayed (this is a property of the asyncBlockOperations() method). This guarantees that a single permit replication action will be executed after the primary term is bumped/replica engine is rollbacked/global checkpoint updated.

With the change made in #35540, the replication action randomly requires all permits to be executed. In those cases the replication action and the bumping term logic are competing for the acquisition of all permits, without the guarantee that the replication action will be executed after the bump of primary/engine rollback etc.

Now I'm writing this I'm wondering if we should instead change the innerAcquireReplicaOperationPermit() logic to ensure that the replication action is always executed after the bump/rollback of the engine so that it does not rely on the delaying of ops in IndexShardOperationPermits.

@tlrx
Copy link
Member Author

tlrx commented Nov 26, 2018

Now I'm writing this I'm wondering if we should instead change the innerAcquireReplicaOperationPermit() logic to ensure that the replication action is always executed after the bump/rollback of the engine so that it does not rely on the delaying of ops in IndexShardOperationPermits.

@bleskes I was thinking of something like tlrx@57f6c69 . Let me know what you think

@tlrx
Copy link
Member Author

tlrx commented Dec 4, 2018

Closed in favor of #36116

@tlrx tlrx closed this Dec 4, 2018
@tlrx tlrx deleted the fix-35850 branch December 4, 2018 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test Issues or PRs that are addressing/adding tests v6.6.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants