
[CI] FollowingEngineTests testOptimizeMultipleVersions failing #72527

Closed
martijnvg opened this issue Apr 30, 2021 · 5 comments · Fixed by #75583
Labels
:Distributed Indexing/CCR · Team:Distributed (Obsolete) · >test-failure

Comments

@martijnvg
Member

Build scan:
https://gradle-enterprise.elastic.co/s/ul4broxaituxw/tests/:x-pack:plugin:ccr:test/org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests/testOptimizeMultipleVersions

Reproduction line:
./gradlew ':x-pack:plugin:ccr:test' --tests "org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests.testOptimizeMultipleVersions" -Dtests.seed=D5DEA1124A7EA74B -Dtests.locale=bg-BG -Dtests.timezone=Mexico/BajaNorte -Druntime.java=11 -Dtests.fips.enabled=true

Applicable branches:
master, 7.x

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests&tests.test=testOptimizeMultipleVersions

Failure excerpt:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=153, name=Thread-115, state=RUNNABLE, group=TGRP-FollowingEngineTests]

  at __randomizedtesting.SeedInfo.seed([D5DEA1124A7EA74B:4EA15F6C2F3BA23]:0)

  Caused by: java.lang.AssertionError: maxSeqNoOfUpdates is not advanced local_checkpoint=80 msu=80 seq_no=81

    at __randomizedtesting.SeedInfo.seed([D5DEA1124A7EA74B]:0)
    at org.elasticsearch.xpack.ccr.index.engine.FollowingEngine.advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary(FollowingEngine.java:125)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:895)
    at org.elasticsearch.index.engine.EngineTestCase.applyOperation(EngineTestCase.java:1004)
    at org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests.fetchOperations(FollowingEngineTests.java:534)
    at org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests.lambda$runFollowTest$13(FollowingEngineTests.java:465)
    at java.lang.Thread.run(Thread.java:829)
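
The assertion message encodes two counters. As a reading aid, here is a minimal stand-alone model of the check (loosely based on FollowingEngine.advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary; a sketch, not the actual Elasticsearch source):

```java
// Minimal model of the follower-side assertion; NOT Elasticsearch code.
// On a follower, msu (maxSeqNoOfUpdatesOrDeletes) is replicated from the
// leader, so an operation at seqNo should never need to advance it unless
// earlier updates/deletes are still in flight (localCheckpoint < msu).
public class MsuInvariantSketch {
    static boolean invariantHolds(long localCheckpoint, long msu, long seqNo) {
        return localCheckpoint < msu || seqNo <= msu;
    }

    public static void main(String[] args) {
        // Values from the failure excerpt: local_checkpoint=80 msu=80 seq_no=81
        System.out.println(invariantHolds(80, 80, 81)); // false -> assertion fires
        // Had the op at msu=80 still been in flight, the same advance is allowed:
        System.out.println(invariantHolds(79, 80, 81)); // true
    }
}
```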

@martijnvg added the :Distributed Indexing/CCR and >test-failure labels on Apr 30, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Apr 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@martijnvg
Member Author

This test failed 4 times in the last 90 days: https://build-stats.elastic.co/goto/8cd6df90651053be872090be9477a948

@henningandersen
Contributor

I wonder if this case:

    plan = IndexingStrategy.processButSkipLucene(false, index.version());

combined with the check in index:

    final boolean toAppend = plan.indexIntoLucene && plan.useLuceneUpdateDocument == false;
    if (toAppend == false) {
        advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary(index.seqNo());
    }

could lead to this assertion firing. The check in index is slightly lenient in that it advances even when the document is not indexed at all. In the non-indexing case it is fine on the primary/leader to update the max seq-no (it is a conservative estimate anyway), but the assertion on the follower will fail.
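
This interaction can be isolated in a small sketch (plain Java; reducing the IndexingStrategy fields to two booleans is an assumption made purely for illustration):

```java
// Sketch of the lenient advance logic quoted above; NOT Elasticsearch code.
public class ToAppendSketch {
    // Returns true when the engine would call
    // advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary for this plan.
    static boolean advancesMsu(boolean indexIntoLucene, boolean useLuceneUpdateDocument) {
        final boolean toAppend = indexIntoLucene && useLuceneUpdateDocument == false;
        return toAppend == false;
    }

    public static void main(String[] args) {
        System.out.println(advancesMsu(true, false));  // plain append: msu untouched
        System.out.println(advancesMsu(true, true));   // real update: msu advanced
        // processButSkipLucene: the document is not indexed at all, yet the
        // engine still advances msu -- the leniency described above
        System.out.println(advancesMsu(false, false));
    }
}
```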

@fcofdez
Contributor

fcofdez commented Jun 25, 2021

This was difficult to reproduce.

The scenario that's causing this failure can be summarized as:

  • Doc with id x was deleted with seqNo 2
  • Doc with id y is deleted with seqNo 4, advancing maxSeqNoOfUpdatesOrDeletes=4 while localCheckpoint=3
  • Doc with id x is indexed with seqNo 5, concurrently with the previous delete,
    while localCheckpoint=3 and maxSeqNoOfUpdatesOrDeletes=4
    • The plan for this indexing operation has to check the versionMap since
      localCheckpoint=3 and maxSeqNoOfUpdatesOrDeletes=4. Normally this would
      result in an append operation, but in the FollowingEngine we use the
      replica planner, and in that case we don't check whether the previous
      version was a delete.
      This ends up being interpreted as an update, and it tries to advance
      maxSeqNoOfUpdatesOrDeletes to 5, tripping the assertion, but only if the
      operation on the doc with id y has completed in that interval, meaning that
      localCheckpoint == maxSeqNoOfUpdatesOrDeletes
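
The timeline above can be replayed as a toy walk-through (plain Java tracking only the two counters involved in the assertion; a sketch, not engine code):

```java
// Toy replay of the delete-then-index race; NOT Elasticsearch code.
public class DeleteIndexRaceSketch {
    public static void main(String[] args) {
        long localCheckpoint = 3;  // ops up to seqNo 3 are fully processed
        long msu = 4;              // delete of doc y at seqNo 4 advanced msu

        // Index of doc x arrives at seqNo 5. The replica planner sees a
        // versionMap entry for x (left by the old delete at seqNo 2) and,
        // without checking that it was a delete, treats the op as an update,
        // which tries to advance msu to 5.
        long seqNo = 5;

        // While the delete of y is still in flight, the check passes:
        System.out.println(localCheckpoint < msu || seqNo <= msu); // true

        // Once the delete of y completes, localCheckpoint catches up to msu,
        // and the very same advance now trips the assertion:
        localCheckpoint = 4;
        System.out.println(localCheckpoint < msu || seqNo <= msu); // false
    }
}
```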

I created a test that reproduces the issue in https://github.com/elastic/elasticsearch/compare/master...fcofdez:reproduce-failure?expand=1

It seems wrong to me that we're not checking whether the current version was a delete in the replica planner, but this behaviour doesn't affect the correctness of the system, so maybe we should relax the assertions in FollowingEngine?
wdyt @henningandersen?

@henningandersen
Contributor

Great find @fcofdez. I agree with your evaluation.

Can we add a more systematic test catching the case?

I would prefer to relax the assertion to handle this case, at least for now. I do think it would be safe to rely on the delete flag, since we check it while holding a lock for the id and after having checked maxSeqNoOfUpdatesOrDeletes. But that departs from the principle imposed by maxSeqNoOfUpdatesOrDeletes and would certainly require more investigation and validation before we do it (I do not have time to think this through in detail right now). Also, it is likely not worth the effort, since the optimization here mainly concerns mostly append-only workloads. That said, we may want to return to it in case it does pop up in the future.

The comment here may need refinement. "extra safe in production code" signals to me that it really never does anything. But with the change done previously to compare against the local checkpoint, and now the relaxation done for this issue, we should probably refine it to explain that the update should not matter in production code.
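
Purely as an illustration of the direction discussed here (a hypothetical shape, not the change actually made in #75583): the invariant could be given a third escape hatch for operations whose current version in the versionMap is a delete tombstone.

```java
// Hypothetical relaxed invariant; NOT the actual #75583 change.
public class RelaxedInvariantSketch {
    static boolean invariantHolds(long localCheckpoint, long msu, long seqNo,
                                  boolean currentVersionIsDelete) {
        return localCheckpoint < msu
            || seqNo <= msu
            || currentVersionIsDelete; // hypothetical relaxation
    }

    public static void main(String[] args) {
        // The failing case from this issue: doc x's latest version is a delete,
        // so the advance would now be tolerated...
        System.out.println(invariantHolds(80, 80, 81, true));  // true
        // ...while a genuinely unexpected advance would still trip the assertion:
        System.out.println(invariantHolds(80, 80, 81, false)); // false
    }
}
```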

fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Jul 21, 2021

In certain concurrent indexing scenarios where deletes are executed and then a new indexing operation arrives, the following engine considers those operations updates, breaking one of the assumed invariants.

Closes elastic#72527

fcofdez added a commit that referenced this issue Sep 2, 2021 (Closes #72527)
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Sep 2, 2021 (Closes elastic#72527)
fcofdez added a commit that referenced this issue Sep 2, 2021 (Closes #72527; backport of #75583)
4 participants