Specialize pre-closing checks for engine implementations #38702

Merged
merged 3 commits into from
Feb 11, 2019

@tlrx tlrx commented Feb 11, 2019

This pull request allows engine implementations to perform specialized sanity checks during the closing of index shards.

Co-authored-by: Martijn van Groningen <martijn.v.groningen@**.com>

@tlrx tlrx added the >enhancement, blocker, v7.0.0, :Distributed Indexing/Engine, v6.7.0, v8.0.0 and v7.2.0 labels Feb 11, 2019
@tlrx tlrx requested a review from ywelsch February 11, 2019 09:52
@elasticmachine

Pinging @elastic/es-distributed

@ywelsch ywelsch left a comment

I've left three smaller comments on naming and structure; looking good otherwise.

tlrx commented Feb 11, 2019

Thanks @ywelsch - I've applied your feedback.

@tlrx tlrx requested a review from ywelsch February 11, 2019 10:36
@ywelsch ywelsch left a comment

LGTM

@tlrx tlrx merged commit 514a762 into elastic:master Feb 11, 2019
@tlrx tlrx deleted the pre-close-checks branch February 11, 2019 13:27
tlrx commented Feb 11, 2019

Thanks @ywelsch and @martijnvg

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Feb 11, 2019
The Close Index API was refactored in 6.7.0 and now performs
pre-closing sanity checks on shards before an index is closed: the maximum
sequence number must be equal to the global checkpoint. While this is a
strong requirement for regular shards, we identified the need to relax this 
check in the case of CCR following shards.

The following shards are not in charge of managing the max sequence 
number or global checkpoint, which are pulled from a leader shard. They 
also fetch and process batches of operations from the leader in an unordered 
way, potentially leaving gaps in the history of ops. If the following shard lags
a lot, it's possible that the global checkpoint and max seq number never get
in sync, preventing the following shard from being closed and a new PUT Follow
action from being issued on this shard (which is our recommended way to
resume/restart a CCR following).
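
To make the gap scenario concrete, here is a minimal sketch of how processing operations out of order can leave a checkpoint behind the maximum sequence number. Everything in it is an assumption made for illustration: `SeqNoTrackerSketch` and its methods are hypothetical names, and the checkpoint modeled here is only a contiguous-prefix checkpoint on a single shard copy, not the real global checkpoint, which on a following shard comes from the leader.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker, not an Elasticsearch class.
final class SeqNoTrackerSketch {
    // Sequence numbers processed above the checkpoint (i.e. beyond a gap).
    private final Set<Long> pending = new HashSet<>();
    private long checkpoint = -1; // every seq no <= checkpoint has been processed
    private long maxSeqNo = -1;   // highest seq no seen so far

    void markProcessed(long seqNo) {
        pending.add(seqNo);
        maxSeqNo = Math.max(maxSeqNo, seqNo);
        // Advance the checkpoint over the contiguous run that is now complete.
        while (pending.contains(checkpoint + 1)) {
            checkpoint++;
            pending.remove(checkpoint);
        }
    }

    long checkpoint() {
        return checkpoint;
    }

    long maxSeqNo() {
        return maxSeqNo;
    }

    // The strict pre-closing condition for regular shards.
    boolean inSync() {
        return checkpoint == maxSeqNo;
    }
}
```

If operations 0, 1, 4 and 5 are processed, the checkpoint stays at 1 while the maximum sequence number is 5, so a strict equality check fails until operations 2 and 3 arrive.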

This commit allows each Engine implementation to define the specific 
verification it must perform before closing the index. In order to allow 
following/frozen/closed shards to be closed whatever the max seq number 
or global checkpoint are, the FollowingEngine and ReadOnlyEngine do 
not perform any check before the index is closed.

Co-authored-by: Martijn van Groningen <[email protected]>
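
The per-engine verification described above could look roughly like the following. This is a sketch under stated assumptions, not the code from this pull request: the `*Sketch` class names, the constructor arguments and the method name `verifyBeforeIndexClosing` are hypothetical stand-ins, and the real engine classes carry far more state.

```java
// Illustrative stand-ins only: these are not the real Elasticsearch classes.
abstract class EngineSketch {
    protected final long maxSeqNo;
    protected final long globalCheckpoint;

    EngineSketch(long maxSeqNo, long globalCheckpoint) {
        this.maxSeqNo = maxSeqNo;
        this.globalCheckpoint = globalCheckpoint;
    }

    // Default pre-closing check: strict equality, as required for regular shards.
    void verifyBeforeIndexClosing() {
        if (maxSeqNo != globalCheckpoint) {
            throw new IllegalStateException("global checkpoint [" + globalCheckpoint
                + "] does not match maximum sequence number [" + maxSeqNo + "]");
        }
    }
}

// A regular shard engine keeps the default strict check.
final class RegularEngineSketch extends EngineSketch {
    RegularEngineSketch(long maxSeqNo, long globalCheckpoint) {
        super(maxSeqNo, globalCheckpoint);
    }
}

// A following engine may lag behind its leader, so it skips the check.
final class FollowingEngineSketch extends EngineSketch {
    FollowingEngineSketch(long maxSeqNo, long globalCheckpoint) {
        super(maxSeqNo, globalCheckpoint);
    }

    @Override
    void verifyBeforeIndexClosing() {
        // Intentionally empty: the shard can be closed whatever the values are.
    }
}

// A read-only engine (frozen or closed shard) skips the check as well.
final class ReadOnlyEngineSketch extends EngineSketch {
    ReadOnlyEngineSketch(long maxSeqNo, long globalCheckpoint) {
        super(maxSeqNo, globalCheckpoint);
    }

    @Override
    void verifyBeforeIndexClosing() {
        // Intentionally empty.
    }
}
```

Putting the default on the base class means the close path only has to call one method and does not need to know which kind of engine backs the shard.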
tlrx added a commit that referenced this pull request Feb 11, 2019
…8722)
tlrx added a commit that referenced this pull request Feb 11, 2019
…8723)
tlrx added a commit that referenced this pull request Feb 11, 2019
…8727)

This commit also contains #37426.
Related #33888
tlrx added a commit that referenced this pull request Feb 26, 2019
Now that the test `CloseFollowerIndexIT` has been added in #38702, it needs to
be adapted for replicated closed indices.

The test closes the follower index which is lagging behind the leader index.
When it's closed, no sanity checks are executed because it's a follower index
(this is a consequence of #38702). But with replicated closed indices, the index
is reinitialized as a closed index with a `NoOpEngine`, and such engines make
strong assertions on the values of the maximum sequence number and the
global checkpoint. Since the values do not match, the shards fail to be created
and the cluster health turns RED.

This commit adapts the `CloseFollowerIndexIT` test so that it wraps the
default `UncaughtExceptionHandler` with a handler that tolerates any exception
thrown by `ReadOnlyEngine.assertMaxSeqNoEqualsToGlobalCheckpoint()`.
Replacing the default uncaught exception handler requires specific permissions,
and instead of creating another gradle project it duplicates the
`internalClusterTest` task to make it work without security manager for this
specific test only.

Relates to #33888
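
As an illustration of the wrapping approach described in this commit message, a tolerant handler could be sketched as below. This is an assumption about the general shape, not the code in `CloseFollowerIndexIT`: the class `TolerantHandlerSketch` is hypothetical, and deciding by inspecting stack frames is one possible way to recognize where an exception was thrown.

```java
// Hypothetical wrapper, not the handler used by the actual test.
final class TolerantHandlerSketch implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler delegate; // previous handler, may be null
    private final String toleratedClass;  // fully qualified class name to tolerate
    private final String toleratedMethod; // method name to tolerate

    TolerantHandlerSketch(Thread.UncaughtExceptionHandler delegate,
                          String toleratedClass, String toleratedMethod) {
        this.delegate = delegate;
        this.toleratedClass = toleratedClass;
        this.toleratedMethod = toleratedMethod;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (thrownByToleratedMethod(error)) {
            return; // swallow: this failure is expected in the test scenario
        }
        if (delegate != null) {
            delegate.uncaughtException(thread, error); // everything else is forwarded
        }
    }

    // True if any stack frame of the error belongs to the tolerated method.
    private boolean thrownByToleratedMethod(Throwable error) {
        for (StackTraceElement frame : error.getStackTrace()) {
            if (frame.getClassName().equals(toleratedClass)
                    && frame.getMethodName().equals(toleratedMethod)) {
                return true;
            }
        }
        return false;
    }
}
```

Such a handler would be installed with `Thread.setDefaultUncaughtExceptionHandler(new TolerantHandlerSketch(Thread.getDefaultUncaughtExceptionHandler(), className, methodName))`. Under a security manager that call needs the `setDefaultUncaughtExceptionHandler` runtime permission, which is consistent with the permission constraint mentioned above.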
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Mar 1, 2019
5 participants