
[CI] FollowingEngineTests testOptimizeMultipleVersions failing #72527

Closed
martijnvg opened this issue Apr 30, 2021 · 5 comments · Fixed by #75583
Labels
:Distributed Indexing/CCR · Team:Distributed (Obsolete) · >test-failure

Comments

@martijnvg
Member

Build scan:
https://gradle-enterprise.elastic.co/s/ul4broxaituxw/tests/:x-pack:plugin:ccr:test/org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests/testOptimizeMultipleVersions

Reproduction line:
./gradlew ':x-pack:plugin:ccr:test' --tests "org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests.testOptimizeMultipleVersions" -Dtests.seed=D5DEA1124A7EA74B -Dtests.locale=bg-BG -Dtests.timezone=Mexico/BajaNorte -Druntime.java=11 -Dtests.fips.enabled=true

Applicable branches:
master, 7.x

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests&tests.test=testOptimizeMultipleVersions

Failure excerpt:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=153, name=Thread-115, state=RUNNABLE, group=TGRP-FollowingEngineTests]

  at __randomizedtesting.SeedInfo.seed([D5DEA1124A7EA74B:4EA15F6C2F3BA23]:0)

  Caused by: java.lang.AssertionError: maxSeqNoOfUpdates is not advanced local_checkpoint=80 msu=80 seq_no=81

    at __randomizedtesting.SeedInfo.seed([D5DEA1124A7EA74B]:0)
    at org.elasticsearch.xpack.ccr.index.engine.FollowingEngine.advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary(FollowingEngine.java:125)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:895)
    at org.elasticsearch.index.engine.EngineTestCase.applyOperation(EngineTestCase.java:1004)
    at org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests.fetchOperations(FollowingEngineTests.java:534)
    at org.elasticsearch.xpack.ccr.index.engine.FollowingEngineTests.lambda$runFollowTest$13(FollowingEngineTests.java:465)
    at java.lang.Thread.run(Thread.java:829)
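
The assertion message encodes two counters. As a reading aid, here is a minimal stand-alone model of the check (loosely based on FollowingEngine.advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary; a sketch, not the actual Elasticsearch source):

```java
// Minimal model of the follower-side assertion; NOT Elasticsearch code.
// On a follower, msu (maxSeqNoOfUpdatesOrDeletes) is replicated from the
// leader, so an operation at seqNo should never need to advance it unless
// earlier updates/deletes are still in flight (localCheckpoint < msu).
public class MsuInvariantSketch {
    static boolean invariantHolds(long localCheckpoint, long msu, long seqNo) {
        return localCheckpoint < msu || seqNo <= msu;
    }

    public static void main(String[] args) {
        // Values from the failure excerpt: local_checkpoint=80 msu=80 seq_no=81
        System.out.println(invariantHolds(80, 80, 81)); // false -> assertion fires
        // Had the op at msu=80 still been in flight, the same advance is allowed:
        System.out.println(invariantHolds(79, 80, 81)); // true
    }
}
```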

@martijnvg added the :Distributed Indexing/CCR and >test-failure labels on Apr 30, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Apr 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@martijnvg
Member Author

This test failed 4 times in the last 90 days: https://build-stats.elastic.co/goto/8cd6df90651053be872090be9477a948

@henningandersen
Contributor

I wonder if this case:

    plan = IndexingStrategy.processButSkipLucene(false, index.version());

combined with the check in index:

    final boolean toAppend = plan.indexIntoLucene && plan.useLuceneUpdateDocument == false;
    if (toAppend == false) {
        advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary(index.seqNo());
    }

could lead to this assertion firing. The check in index is slightly lenient in that it advances even when the document is not indexed at all. In the non-indexing case it is fine on the primary/leader to update the max seq-no (it is a conservative estimate anyway), but the assertion on the follower will fail.
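
This interaction can be isolated in a small sketch (plain Java; reducing the IndexingStrategy fields to two booleans is an assumption made purely for illustration):

```java
// Sketch of the lenient advance logic quoted above; NOT Elasticsearch code.
public class ToAppendSketch {
    // Returns true when the engine would call
    // advanceMaxSeqNoOfUpdatesOrDeletesOnPrimary for this plan.
    static boolean advancesMsu(boolean indexIntoLucene, boolean useLuceneUpdateDocument) {
        final boolean toAppend = indexIntoLucene && useLuceneUpdateDocument == false;
        return toAppend == false;
    }

    public static void main(String[] args) {
        System.out.println(advancesMsu(true, false));  // plain append: msu untouched
        System.out.println(advancesMsu(true, true));   // real update: msu advanced
        // processButSkipLucene: the document is not indexed at all, yet the
        // engine still advances msu -- the leniency described above
        System.out.println(advancesMsu(false, false));
    }
}
```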

@fcofdez
Contributor

fcofdez commented Jun 25, 2021

This was difficult to reproduce.

The scenario that's causing this failure can be summarized as:

  • Doc with id x was deleted with seqNo 2
  • Doc with id y is deleted with seqNo 4, advancing maxSeqNoOfUpdatesOrDeletes=4 while localCheckpoint=3
  • Doc with id x is indexed with seqNo 5, concurrently with the previous delete,
    while localCheckpoint=3 and maxSeqNoOfUpdatesOrDeletes=4
    • The plan for this indexing operation has to check the versionMap since
      localCheckpoint=3 and maxSeqNoOfUpdatesOrDeletes=4. Normally this would
      result in an append operation, but in the FollowingEngine we use the
      replica planner, and in that case we don't check whether the previous
      version was a delete.
      This ends up being interpreted as an update, and it tries to advance
      maxSeqNoOfUpdatesOrDeletes to 5, tripping the assertion, but only if the
      operation on the doc with id y has completed in that interval, meaning that
      localCheckpoint == maxSeqNoOfUpdatesOrDeletes
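
The timeline above can be replayed as a toy walk-through (plain Java tracking only the two counters involved in the assertion; a sketch, not engine code):

```java
// Toy replay of the delete-then-index race; NOT Elasticsearch code.
public class DeleteIndexRaceSketch {
    public static void main(String[] args) {
        long localCheckpoint = 3;  // ops up to seqNo 3 are fully processed
        long msu = 4;              // delete of doc y at seqNo 4 advanced msu

        // Index of doc x arrives at seqNo 5. The replica planner sees a
        // versionMap entry for x (left by the old delete at seqNo 2) and,
        // without checking that it was a delete, treats the op as an update,
        // which tries to advance msu to 5.
        long seqNo = 5;

        // While the delete of y is still in flight, the check passes:
        System.out.println(localCheckpoint < msu || seqNo <= msu); // true

        // Once the delete of y completes, localCheckpoint catches up to msu,
        // and the very same advance now trips the assertion:
        localCheckpoint = 4;
        System.out.println(localCheckpoint < msu || seqNo <= msu); // false
    }
}
```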

I created a test that reproduces the issue in https://github.com/elastic/elasticsearch/compare/master...fcofdez:reproduce-failure?expand=1

It seems wrong to me that we're not checking whether the current version was a delete in the replica planner, but this behaviour doesn't affect the correctness of the system, so maybe we should relax the assertions in FollowingEngine?
wdyt @henningandersen?

@henningandersen
Contributor

Great find @fcofdez. I agree with your evaluation.

Can we add a more systematic test catching the case?

I would prefer to relax the assertion to handle this case, at least for now. I do think it would be safe to rely on the delete flag, since we check it while holding a lock for the id and after having checked maxSeqNoOfUpdatesOrDeletes. But that departs from the principle imposed by maxSeqNoOfUpdatesOrDeletes and would certainly require more investigation and validation before we do it (I do not have time to think this through in detail right now). Also, it is likely not worth the effort, since the optimization here mainly concerns mostly append-only workloads. That said, we may want to return to it in case it does pop up in the future.

The comment here may need refinement. "extra safe in production code" signals to me that it really never does anything. But with the change done previously to compare against the local checkpoint, and now the relaxation done for this issue, we should probably refine it to explain that the update should not matter in production code.
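
Purely as an illustration of the direction discussed here (a hypothetical shape, not the change actually made in #75583): the invariant could be given a third escape hatch for operations whose current version in the versionMap is a delete tombstone.

```java
// Hypothetical relaxed invariant; NOT the actual #75583 change.
public class RelaxedInvariantSketch {
    static boolean invariantHolds(long localCheckpoint, long msu, long seqNo,
                                  boolean currentVersionIsDelete) {
        return localCheckpoint < msu
            || seqNo <= msu
            || currentVersionIsDelete; // hypothetical relaxation
    }

    public static void main(String[] args) {
        // The failing case from this issue: doc x's latest version is a delete,
        // so the advance would now be tolerated...
        System.out.println(invariantHolds(80, 80, 81, true));  // true
        // ...while a genuinely unexpected advance would still trip the assertion:
        System.out.println(invariantHolds(80, 80, 81, false)); // false
    }
}
```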

fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Jul 21, 2021

In certain concurrent indexing scenarios where deletes are executed and then a new indexing operation arrives, the following engine considers those operations updates, breaking one of the assumed invariants.

Closes elastic#72527

fcofdez added a commit that referenced this issue Sep 2, 2021 (Closes #72527)
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Sep 2, 2021 (Closes elastic#72527)
fcofdez added a commit that referenced this issue Sep 2, 2021 (Closes #72527; backport of #75583)
4 participants