Lucene merges should run on the target shard during recovery #10463

mikemccand · 2015-04-07T15:45:41Z

This is already fixed on 2.0, since we let Lucene launch its own merges again.

But in 1.x, Lucene merges might not run on the target during recovery, causing segment explosion when there are many docs to replay and/or the index buffer is low. This then makes recovery time O(N^2) and can cause issues like #9226.

I just moved launching of the mergeScheduleFuture out of startScheduledTasksIfNeeded (only called once recovery is done) and into createNewEngine. This way whenever the engine is created we also start checking for merges.

I also renamed startScheduledTasksIfNeeded -> startEngineRefresher, and cleaned up a couple unrelated things.

mikemccand · 2015-04-07T16:20:44Z

I moved the mergeScheduleFuture creation to ctor, so now we create it once when the IndexShard is created, not in newEngine.

And I fixed EngineMerge to use engineUnsafe and skip merging if engine is currently null...
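
The change described in the two comments above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: `SketchShard`, `SketchEngine`, `checkMerges`, and the interval parameter are hypothetical stand-ins. Only the names `mergeScheduleFuture`, `createNewEngine`, and `engineUnsafe` are borrowed from the discussion, and their real signatures may differ.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the engine: it only counts merge checks.
class SketchEngine {
    final AtomicInteger mergeChecks = new AtomicInteger();

    void maybeMerge() {
        mergeChecks.incrementAndGet();
    }
}

// Hypothetical stand-in for the shard (not the real IndexShard).
class SketchShard implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    // Null until the engine is created, e.g. early in recovery.
    private volatile SketchEngine engine;
    private final ScheduledFuture<?> mergeScheduleFuture;

    SketchShard(long intervalMillis) {
        // Scheduled once, in the constructor, so merge checks run for the
        // whole life of the shard, including while it is still recovering.
        mergeScheduleFuture = scheduler.scheduleWithFixedDelay(
                this::checkMerges, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Periodic task: read the engine without failing when it is absent.
    void checkMerges() {
        SketchEngine current = engineUnsafe();
        if (current == null) {
            return; // no engine yet, so there is nothing to merge
        }
        current.maybeMerge();
    }

    void createNewEngine() {
        engine = new SketchEngine();
    }

    // Returns the current engine, or null if none has been created yet.
    SketchEngine engineUnsafe() {
        return engine;
    }

    @Override
    public void close() {
        mergeScheduleFuture.cancel(false);
        scheduler.shutdownNow();
    }
}
```

Scheduling in the constructor means the periodic task can fire before any engine exists, which is why the task has to tolerate a null engine rather than throw.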

bleskes · 2015-04-07T19:32:46Z

LGTM

This does not affect 2.0, where we let Lucene launch merges normally (#8643). In 1.x, every 1 sec (default), we ask Lucene to kick off any new merges, but we unfortunately don't turn that logic on in the target shard until after recovery has finished. This means if you have a large translog, and/or a smallish index buffer, way too many segments can accumulate in the target shard during recovery, making version lookups slower and slower (O(N^2)) and possibly causing slow recovery issues like #9226. This fix changes IndexShard to launch merges as soon as the shard is created, so merging runs during recovery. Closes #10463
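
To illustrate why unmerged segments make replay cost grow quadratically, here is a toy model. It is an assumption-laden sketch, not a measurement of Lucene: it assumes every replayed operation probes each live segment once for its version lookup, that a new segment appears every `opsPerFlush` operations, and that merging (when enabled) collapses all segments into one as soon as the count exceeds `maxSegments`. The class and parameter names are hypothetical.

```java
// Toy cost model: counts segment probes while replaying operations.
class SegmentLookupModel {
    static long totalProbes(int ops, int opsPerFlush, int maxSegments) {
        long probes = 0;
        int segments = 0;
        for (int i = 0; i < ops; i++) {
            // The version lookup for this operation touches every live segment.
            probes += segments;
            if ((i + 1) % opsPerFlush == 0) {
                segments++; // the index buffer filled up and flushed a new segment
                if (segments > maxSegments) {
                    segments = 1; // a merge collapsed everything into one segment
                }
            }
        }
        return probes;
    }
}
```

With merging effectively disabled (`maxSegments = Integer.MAX_VALUE`) and `opsPerFlush = 10`, this model gives 450 probes for 100 operations and 1900 for 200, so doubling the work roughly quadruples the cost. Capping at 3 segments brings the 100-operation case down to 180.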


always launch the mergeScheduleFuture when we create the engine

46beb80

mikemccand added v1.6.0 v1.5.1 labels Apr 7, 2015

mikemccand self-assigned this Apr 7, 2015

fix typo

00c8bde

always create mergeScheduleFuture in ctor

a44ef58

mikemccand added the >bug label Apr 7, 2015

mikemccand closed this Apr 7, 2015

sylvae mentioned this pull request Apr 9, 2015

Shard stuck in relocating state with recovery stage=translog #9226

Closed

clintongormley added the :Core/Infra/Core label Apr 9, 2015

kimchy added the v1.4.5 label Apr 11, 2015

clintongormley changed the title ~~Core: Lucene merges should run on the target shard during recovery~~ Lucene merges should run on the target shard during recovery May 30, 2015
