
[CI] SnapshotBasedRecoveryIT testSnapshotBasedRecovery failing #76595

Closed
ywangd opened this issue Aug 17, 2021 · 3 comments · Fixed by #77134
Assignees
henningandersen, fcofdez

Labels
:Distributed Coordination/Snapshot/Restore - Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete) - Meta label for the distributed team (obsolete); replaced by Distributed Indexing/Coordination
>test-failure - Triaged test failures from CI

Comments

@ywangd
Member

ywangd commented Aug 17, 2021

Normally I'd skip timeout errors because they are often just down to bad luck. But this test is new and has already failed 38 times in the past few days.

Build scan:
https://gradle-enterprise.elastic.co/s/523u2dz3olx3u

Repro line:
./gradlew ':qa:rolling-upgrade:v7.14.1#oneThirdUpgradedTest' -Dtests.class="org.elasticsearch.upgrades.SnapshotBasedRecoveryIT" -Dtests.method="testSnapshotBasedRecovery" -Dtests.seed=15D06BC7AFD0B42F -Dtests.bwc=true -Dtests.locale=ca-ES -Dtests.timezone=Africa/Banjul -Druntime.java=8

Reproduces locally?:
No

Applicable branches:
7.x

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?search.relativeStartTime=P7D&search.timeZoneId=Australia/Melbourne&tests.container=org.elasticsearch.upgrades.SnapshotBasedRecoveryIT&tests.sortField=FAILED&tests.test=testSnapshotBasedRecovery&tests.unstableOnly=true

Failure excerpt:

java.lang.AssertionError: timed out waiting for green state for index [snapshot_based_recovery] cluster state [{
  "cluster_name" : "v7.14.1",
  "cluster_uuid" : "vNuyvMUdTfaVJAKtMymzNQ",
  "version" : 620,
  "state_uuid" : "N5K_VdvhSVupuKwpFqXHBw",
  "master_node" : "2YFxMyn6Rrmw9JqR-zgMRg",
  "blocks" : {
    "indices" : {
      "index_mixed_7140199" : {
        "4" : {
          "description" : "index closed",
          "retryable" : false,
          "levels" : [
            "read",
            "write"
          ]
        }
      },
      "closed_index_replica_allocation" : {
        "4" : {
          "description" : "index closed",
          "retryable" : false,


@ywangd added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Aug 17, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Aug 17, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen self-assigned this on Aug 17, 2021
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Aug 17, 2021
Selectively muting parts of the rolling upgrade test for recover from
snapshot.

Relates elastic#76595
@henningandersen
Contributor

The failures are genuine. My initial analysis points to the primary ending up on the upgraded node. This is surprising (but may have a valid reason once we dig deeper). I muted this selectively in #76601, with that we should still be validating rolling upgrade works with recovery from snapshot (though less frequently).
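For context, the selective mute amounts to a JUnit assumption that skips only the mixed-cluster rounds of the rolling-upgrade suite. A minimal sketch of the idea follows; the `tests.mixed_cluster` system property and the class name are placeholders for illustration, not the actual change from #76601.

```java
import static org.junit.Assume.assumeFalse;

import org.junit.Test;

public class SelectiveMuteSketch {
    // Hypothetical stand-in for the rolling-upgrade phase flag; the real test suite
    // derives the cluster phase (old / mixed / upgraded) from its own system properties.
    private static final boolean MIXED_CLUSTER = Boolean.getBoolean("tests.mixed_cluster");

    @Test
    public void testSnapshotBasedRecovery() {
        // Skip only the mixed-cluster rounds while #76595 is investigated, so the
        // fully-upgraded round still exercises recovery from snapshot.
        assumeFalse("muted in mixed clusters, relates #76595", MIXED_CLUSTER);
        // ... the actual recovery assertions would run here
    }
}
```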

henningandersen added a commit that referenced this issue Aug 17, 2021
Selectively muting parts of the rolling upgrade test for recover from
snapshot.

Relates #76595
henningandersen added a commit that referenced this issue Aug 17, 2021
Selectively muting parts of the rolling upgrade test for recover from
snapshot.

Relates #76595
@henningandersen
Contributor

Pasting David's comment from #76601 here:

I suspect the problem is caused by a rebalance moving the primary onto the newly-upgraded node, but I haven't seen a failure in captivity to confirm that yet. If so I think we could do something a bit stronger here, e.g. apply an allocation filter to exclude the solitary upgraded node, then explicitly cancel any shards it holds to promote a replica on the old nodes, and then remove replicas.
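A rough sketch of that sequence against the cluster APIs, using the low-level REST client, is below. The node name `upgraded-node-0`, the host/port, and the shard number are placeholder assumptions, and this only illustrates the approach described above, not the eventual fix in #77134.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ExcludeUpgradedNodeSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Allocation filter: keep shards of the test index off the upgraded node.
            Request exclude = new Request("PUT", "/snapshot_based_recovery/_settings");
            exclude.setJsonEntity("{\"index.routing.allocation.exclude._name\": \"upgraded-node-0\"}");
            client.performRequest(exclude);

            // 2. Cancel the shard copy held by the upgraded node so a replica on an
            //    old-version node gets promoted to primary.
            Request cancel = new Request("POST", "/_cluster/reroute");
            cancel.setJsonEntity(
                "{\"commands\": [ { \"cancel\": { \"index\": \"snapshot_based_recovery\","
                    + " \"shard\": 0, \"node\": \"upgraded-node-0\", \"allow_primary\": true } } ] }");
            client.performRequest(cancel);

            // 3. Drop the replicas so the next upgrade round has to recover the shard,
            //    ideally via the snapshot-based recovery path under test.
            Request replicas = new Request("PUT", "/snapshot_based_recovery/_settings");
            replicas.setJsonEntity("{\"index.number_of_replicas\": 0}");
            client.performRequest(replicas);
        }
    }
}
```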

@fcofdez self-assigned this on Aug 30, 2021
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Sep 1, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes elastic#76595
fcofdez added a commit that referenced this issue Sep 29, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes #76595
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Sep 29, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes elastic#76595
fcofdez added a commit that referenced this issue Sep 29, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes #76595
Backport of #77134