-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exposed engine must include all operations below global checkpoint during rollback #36159
Conversation
…ring rollback Today we expose a new engine immediately during Lucene rollback. The new engine is started with a safe commit which might not include all acknowledged operation. With this change, we won't expose the new engine until it has recovered from the local translog. Note that this solution is not complete since it's able to reserve only acknowledged operations before the global checkpoint. This is because we replay translog up to the global checkpoint during rollback. A per-doc Lucene rollback would solve this issue entirely.
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left 2 smaller comments around the order of things and mutexes. Looking good otherwise
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
@ywelsch Two very good catches. It's ready for another round. Would you please have another look? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left 2 comments. LGTM otherwise
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
synchronized (mutex) { | ||
// we must create a new engine under mutex (see IndexShard#snapshotStoreMetadata). | ||
newEngine = engineFactory.newReadWriteEngine(newEngineConfig()); | ||
onNewEngine(newEngine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onNewEngine does't publish the new engine right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onNewEngine
does not expose the engine itself but exposes only the last refresh translog location to RefreshListeners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks everyone :) |
Today we expose a new engine immediately during Lucene rollback. The new engine is started with a safe commit which might not include all acknowledged operation. With this change, we won't expose the new engine until it has recovered from the local translog. Note that this solution is not complete since it's able to reserve only acknowledged operations before the global checkpoint. This is because we replay translog up to the global checkpoint during rollback. A per-doc Lucene rollback would solve this issue entirely. Relates #32867
* elastic/6.x: (37 commits) [HLRC] Added support for Follow Stats API (elastic#36253) Exposed engine must have all ops below gcp during rollback (elastic#36159) TEST: Always enable soft-deletes in ShardChangesTests Use delCount of SegmentInfos to calculate numDocs (elastic#36323) Add soft-deletes upgrade tests (elastic#36286) Remove LocalCheckpointTracker#resetCheckpoint (elastic#34667) Option to use endpoints starting with _security (elastic#36379) [CCR] Restructured QA modules (elastic#36404) RestClient: on retry timeout add root exception (elastic#25576) [HLRC] Add support for put privileges API (elastic#35679) HLRC: Add rollup search (elastic#36334) Explicitly recommend to forceMerge before freezing (elastic#36376) Rename internal repository actions to be internal (elastic#36377) Core: Remove parseDefaulting from DateFormatter (elastic#36386) [ML] Prevent stack overflow while copying ML jobs and datafeeds (elastic#36370) Docs: Fix Jackson reference (elastic#36366) [ILM] Fix issue where index may not yet be in 'hot' phase (elastic#35716) Undeprecate /_watcher endpoints (elastic#36269) Docs: Fix typo in bool query (elastic#36350) HLRC: Add delete template API (elastic#36320) ...
Today we expose a new engine immediately during Lucene rollback. The new engine is started with a safe commit which might not include all acknowledged operation. With this change, we won't expose the new engine until it has recovered from the local translog.
Note that this solution is not complete since it's able to reserve only acknowledged operations before the global checkpoint. This is because we replay translog up to the global checkpoint during rollback. A per-doc Lucene rollback would solve this issue entirely.
Relates #32867