
Fill LocalCheckpointTracker with Lucene commit #34474

Merged

dnhatn merged 13 commits into elastic:master from the checkpoint-tracker branch on Oct 19, 2018

Conversation

dnhatn
Member

@dnhatn dnhatn commented Oct 15, 2018

Today we rely on the LocalCheckpointTracker to ensure that there are no duplicates when enabling the optimization using max_seq_no_of_updates in the FollowingEngine. The problem is that the LocalCheckpointTracker is not fully reloaded when opening an engine with an out-of-order index commit. Suppose the starting commit contains seq#0 and seq#2; the current LocalCheckpointTracker would then return "false" when asked whether seq#2 was processed before, even though seq#2 is in the commit.

This change scans the existing sequence numbers in the starting commit and marks them as completed in the LocalCheckpointTracker, so that the LocalCheckpointTracker is consistent with the Lucene commit.
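
For readers following along, here is a minimal sketch of that idea, pieced together only from the diff excerpts quoted later in this review. The standalone method name bootstrapTracker and its signature are illustrative, not the actual engine code.

import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.elasticsearch.index.mapper.SeqNoFieldMapper;
import org.elasticsearch.index.seqno.LocalCheckpointTracker;

// Illustrative only: scan the _seq_no values present in the starting commit and replay
// them into the LocalCheckpointTracker so its processed set matches what Lucene holds.
static void bootstrapTracker(IndexReader reader, LocalCheckpointTracker tracker,
                             long localCheckpoint, long maxSeqNo) throws IOException {
    final IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setQueryCache(null); // one-off bootstrap query; caching buys nothing
    final Query query = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, localCheckpoint + 1, maxSeqNo);
    final Weight weight = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
    for (LeafReaderContext leaf : reader.leaves()) {
        final Scorer scorer = weight.scorer(leaf);
        if (scorer == null) {
            continue; // no docs above the local checkpoint in this segment
        }
        final NumericDocValues seqNoDocValues = leaf.reader().getNumericDocValues(SeqNoFieldMapper.NAME);
        final DocIdSetIterator iterator = scorer.iterator();
        int docId;
        while ((docId = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            if (seqNoDocValues == null || seqNoDocValues.advanceExact(docId) == false) {
                throw new IllegalStateException("doc [" + docId + "] has no " + SeqNoFieldMapper.NAME + " doc values");
            }
            tracker.markSeqNoAsCompleted(seqNoDocValues.longValue());
        }
    }
}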

(This is an unreleased bug in the FollowingEngine of CCR).

Relates #33656

@dnhatn dnhatn added >enhancement v7.0.0 :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.5.0 labels Oct 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
final Query query = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, localCheckpoint + 1, maxSeqNo);
for (LeafReaderContext leaf : reader.leaves()) {
Member Author

@bleskes Let me know if you still prefer using the "Snapshot" API.

Contributor

@bleskes bleskes left a comment

I left some comments on the testing and an ask for support from the Lucene SWAT team. Other than that, LGTM

localCheckpoint = seqNoStats.localCheckpoint;
logger.trace("recovered maximum sequence number [{}] and local checkpoint [{}]", maxSeqNo, localCheckpoint);
return localCheckpointTrackerSupplier.apply(maxSeqNo, localCheckpoint);
BiFunction<Long, Long, LocalCheckpointTracker> localCheckpointTrackerSupplier) {
Contributor

any chance we can make this method static and give it what it needs? I'm afraid we'll use an uninitialized field by mistake (as we call it from the constructor).
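
Purely for illustration (this is not the actual commit 779c5c7), the shape of the refactor being requested might look like the hypothetical static factory below, where everything the method needs arrives as a parameter rather than through instance fields.

import java.util.function.BiFunction;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.index.seqno.LocalCheckpointTracker;

// Hypothetical static factory: because it only sees its parameters, a call from the
// engine constructor cannot accidentally read instance fields that are not yet set.
private static LocalCheckpointTracker createLocalCheckpointTracker(
        long maxSeqNo,
        long localCheckpoint,
        Logger logger,
        BiFunction<Long, Long, LocalCheckpointTracker> localCheckpointTrackerSupplier) {
    logger.trace("recovered maximum sequence number [{}] and local checkpoint [{}]", maxSeqNo, localCheckpoint);
    return localCheckpointTrackerSupplier.apply(maxSeqNo, localCheckpoint);
}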

Member Author

+1. I pushed 779c5c7.

final DocIdSetIterator docIdSetIterator = scorer.iterator();
final NumericDocValues seqNoDocValues = leaf.reader().getNumericDocValues(SeqNoFieldMapper.NAME);
int docId;
while ((docId = docIdSetIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
Contributor

This takes another approach than LuceneChangesSnapshot. @jimczi @s1monw can you please double check this?

Contributor

It should be possible to use the reader's PointValues directly to speed up the search, since they are always indexed.
Something like the following untested snippet:

void markAsCompleted(IndexReader reader, LocalCheckpointTracker tracker, long minSeqNo, long maxSeqNo) throws IOException {
        for (LeafReaderContext ctx : reader.leaves()) {
            PointValues values = ctx.reader().getPointValues(SeqNoFieldMapper.NAME);
            byte[] lowerPoint = new byte[values.getBytesPerDimension()];
            LongPoint.encodeDimension(minSeqNo, lowerPoint, 0);
            byte[] upperPoint = new byte[values.getBytesPerDimension()];
            LongPoint.encodeDimension(maxSeqNo, upperPoint, 0);
            final Bits liveDocs = ctx.reader().getLiveDocs() == null ?
                new Bits.MatchAllBits(ctx.reader().maxDoc()) : ctx.reader().getLiveDocs();
            int numBytes = values.getBytesPerDimension();
            values.intersect(new PointValues.IntersectVisitor() {
                @Override
                public void visit(int docID) throws IOException {
                    throw new IllegalStateException("should never be called");
                }

                @Override
                public void visit(int docID, byte[] packedValue) throws IOException {
                    if (liveDocs.get(docID) == false) {
                        return;
                    }

                    if (FutureArrays.compareUnsigned(packedValue, 0, numBytes, lowerPoint, 0, numBytes) < 0) {
                        // Doc's value is too low, in this dimension
                        return;
                    }
                    if (FutureArrays.compareUnsigned(packedValue, 0, numBytes, upperPoint, 0, numBytes) > 0) {
                        // Doc's value is too high, in this dimension
                        return;
                    }
                    long seqNo = LongPoint.decodeDimension(packedValue, 0);
                    tracker.markSeqNoAsCompleted(seqNo);
                }

                @Override
                public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
                    if (FutureArrays.compareUnsigned(minPackedValue, 0, numBytes, upperPoint, 0, numBytes) > 0 ||
                            FutureArrays.compareUnsigned(maxPackedValue, 0, numBytes, lowerPoint, 0, numBytes) < 0) {
                        return PointValues.Relation.CELL_OUTSIDE_QUERY;
                    }
                    return PointValues.Relation.CELL_CROSSES_QUERY;
                }
            });

        }
    }

This would avoid the need to execute a query and read from doc values, but I am not sure it's worth the complexity. If only a few documents need to be visited, this is probably unnecessary and using a query is the right approach.

Member Author

I pulled this in, and it works perfectly. However, I am not sure whether we should use it either, since we would normally expect to visit only a few to several hundred documents.

Contributor

I agree this is probably a premature optimization. I wrote it mainly to see how it would look, but it shouldn't gain much if the number of documents to visit is low. The query + doc_values approach seems easier to maintain.

@@ -279,6 +279,33 @@ public void testRetryBulkShardOperations() throws Exception {
}
}

public void testOutOfOrderOnFollower() throws Exception {
Contributor

It's not about out-of-order delivery but rather over-recovery of ops you already have. I checked and I can't find any existing tests that cover recovery (which would cover this case). I think we should have some, i.e., rename this test and extend it to cover general peer recovery (with and without holes). We also need concurrent recovery + indexing.

Member Author

Yes, I'll do that.


searcher.setQueryCache(null);
final Query query = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, localCheckpoint + 1, maxSeqNo);
for (LeafReaderContext leaf : reader.leaves()) {
final Scorer scorer = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1.0f).scorer(leaf);
Contributor

You can create the Weight once for all leaves and reuse it to build the Scorer for each leaf?
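
A minimal sketch of that suggestion, assuming the same searcher, query, and reader variables as in the diff context above (the loop body is elided; only the hoisted Weight is the point here):

// Build the Weight once for the whole reader...
final Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1.0f);
for (LeafReaderContext leaf : reader.leaves()) {
    // ...and reuse it to get a per-segment Scorer (null when the segment has no matches).
    final Scorer scorer = weight.scorer(leaf);
    if (scorer == null) {
        continue;
    }
    // iterate scorer.iterator() and read the _seq_no doc values exactly as before
}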

Member Author

Thanks @jimczi. I pushed 4255bee

@s1monw
Contributor

s1monw commented Oct 16, 2018

Question, would it be enough to bootstrap the max_seq_no_of_updates with the max seqID in the index instead?

@dnhatn
Member Author

dnhatn commented Oct 16, 2018

Question, would it be enough to bootstrap the max_seq_no_of_updates with the max seqID in the index instead?

@s1monw Yes, it's good enough.
Boaz, Yannick, and I discussed two options: bootstrapping max_seq_no_of_updates with max_seq_no, or filling the LocalCheckpointTracker. We preferred to implement the latter so we can maintain consistency between the LocalCheckpointTracker and the contents of the index.

Contributor

@s1monw s1monw left a comment

approach LGTM

@s1monw
Contributor

s1monw commented Oct 16, 2018

@s1monw Yes, it's good enough.
Boaz, Yannick, and I discussed two options: bootstrapping max_seq_no_of_updates with max_seq_no, or filling the LocalCheckpointTracker. We preferred to implement the latter so we can maintain consistency between the LocalCheckpointTracker and the contents of the index.

can you leave a comment about this?

@dnhatn
Member Author

dnhatn commented Oct 16, 2018

@bleskes I've added two tests which add new replicas during indexing. Could you please have another look?

@dnhatn dnhatn requested a review from bleskes October 16, 2018 14:37
Contributor

@bleskes bleskes left a comment

LGTM. Left some minor suggestions. Feel free to reject.

public void testAddNewReplicas() throws Exception {
try (ReplicationGroup shards = createGroup(between(0, 1))) {
shards.startAll();
Thread[] threads = new Thread[between(1, 8)];
Contributor

maybe reduce the number of threads?

Member Author

I've reduced it to between(1, 3).

@Override
protected EngineFactory getEngineFactory(ShardRouting routing) {
if (routing.primary()) {
return new InternalEngineFactory(); // use the internal engine so we can index directly
Contributor

Why can't we use the standard following engine and supply ops the same way we do for real following? The goal is to check that everything works together as it should.

Member Author

@bleskes Yes, we can, but I chose not to because we have an integration test that does the same thing (https://github.com/elastic/elasticsearch/pull/34474/files#diff-d36012b5317805e8b3ff77b77dbaa5ecR672). I am fine with updating this. WDYT?

Contributor

OK, I didn't want to fully spec out the testing, but ideally I would have loved to see a randomized, non-multi-threaded test using replication groups that creates all kinds of scenarios (like the one you caught). The multi-threaded test can stay in integration land and make sure that if we miss a scenario, something will find it. Does that make sense?

Member Author

@bleskes Thanks for the explanation 👍. I will update this test.

@dnhatn
Member Author

dnhatn commented Oct 17, 2018

@bleskes I've updated the test in b8c1a4e. Can you please have another look?

Contributor

@bleskes bleskes left a comment

LGTM

@dnhatn
Member Author

dnhatn commented Oct 19, 2018

Thanks @bleskes @jimczi and @s1monw.

@dnhatn dnhatn merged commit 90ca5b1 into elastic:master Oct 19, 2018
@dnhatn dnhatn deleted the checkpoint-tracker branch October 19, 2018 16:38
dnhatn added a commit that referenced this pull request Oct 20, 2018
Today we rely on the LocalCheckpointTracker to ensure that there are no
duplicates when enabling the optimization using max_seq_no_of_updates.
The problem is that the LocalCheckpointTracker is not fully reloaded
when opening an engine with an out-of-order index commit. Suppose the
starting commit contains seq#0 and seq#2; the current
LocalCheckpointTracker would then return "false" when asked whether
seq#2 was processed before, even though seq#2 is in the commit.

This change scans the existing sequence numbers in the starting commit
and marks them as completed in the LocalCheckpointTracker to ensure a
consistent state between the LocalCheckpointTracker and the Lucene
commit.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Oct 22, 2018
Since elastic#34412 and elastic#34474, a follower must have soft-deletes enabled to
work correctly. This change requires soft-deletes on the follower.

Relates elastic#34412
Relates elastic#34474
dnhatn added a commit that referenced this pull request Oct 23, 2018
Since #34412 and #34474, a follower must have soft-deletes enabled 
to work correctly. This change requires soft-deletes on the follower.

Relates #34412
Relates #34474
dnhatn added a commit that referenced this pull request Oct 23, 2018
Since #34412 and #34474, a follower must have soft-deletes enabled
to work correctly. This change requires soft-deletes on the follower.

Relates #34412
Relates #34474
kcm pushed a commit that referenced this pull request Oct 30, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
Since #34412 and #34474, a follower must have soft-deletes enabled 
to work correctly. This change requires soft-deletes on the follower.

Relates #34412
Relates #34474
@colings86 colings86 removed the :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features label Nov 2, 2018
dnhatn added a commit that referenced this pull request Nov 7, 2018
A CCR test failure shows that the approach in #34474 is flawed.
Restoring the LocalCheckpointTracker from an index commit can cause both
FollowingEngine and InternalEngine to incorrectly ignore some deletes.

Here is a small scenario illustrating the problem:

1. Delete doc with seq=1 => engine will add a delete tombstone to Lucene

2. Flush a commit consisting of only the delete tombstone

3. Index doc with seq=0  => engine will add that doc to Lucene but soft-deleted

4. Restart an engine with the commit (step 2); the engine will fill its
LocalCheckpointTracker with the delete tombstone in the commit

5. Replay the local translog in reverse order: index#0 then delete#1

6. When processing index#0, the engine will add it to Lucene as a live doc
and advance the local checkpoint to 1 (seq#1 was restored from the
commit - step 4).

7. When processing delete#1, the engine will skip it because seq_no=1 is
less than or equal to the local checkpoint.

We should have zero documents after recovering from the translog, but
here we have one.

Since all operations after the local checkpoint of the safe commit are
retained, we should find them if the look-up also considers soft-deleted
documents. This PR closes the disparity between the version map and the
local checkpoint tracker by taking soft-deleted documents into account
while resolving the strategy for engine operations.

Relates #34474
Relates #33656
dnhatn added a commit that referenced this pull request Nov 8, 2018
dnhatn added a commit that referenced this pull request Nov 8, 2018
pgomulka pushed a commit to pgomulka/elasticsearch that referenced this pull request Nov 13, 2018
dnhatn added a commit that referenced this pull request Dec 7, 2018
In #34474, we added a new assertion to ensure that the
LocalCheckpointTracker is always consistent with the Lucene index.
However, we reset the LocalCheckpointTracker in testDedupByPrimaryTerm,
causing this assertion to be violated.

This commit removes resetCheckpoint from LocalCheckpointTracker and
rewrites testDedupByPrimaryTerm without resetting the local checkpoint.

Relates #34474
dnhatn added a commit that referenced this pull request Dec 9, 2018
In #34474, we added a new assertion to ensure that the
LocalCheckpointTracker is always consistent with the Lucene index.
However, we reset the LocalCheckpointTracker in testDedupByPrimaryTerm,
causing this assertion to be violated.

This commit removes resetCheckpoint from LocalCheckpointTracker and
rewrites testDedupByPrimaryTerm without resetting the local checkpoint.

Relates #34474
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.5.0 v7.0.0-beta1