
[CI] SnapshotBasedRecoveryIT testSnapshotBasedRecovery failing #76595

Closed
ywangd opened this issue Aug 17, 2021 · 3 comments · Fixed by #77134
Assignees
henningandersen, fcofdez

Labels
:Distributed Coordination/Snapshot/Restore - Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete) - Meta label for the distributed team (obsolete); replaced by Distributed Indexing/Coordination
>test-failure - Triaged test failures from CI

Comments

@ywangd
Member

ywangd commented Aug 17, 2021

Normally I'd skip timeout errors because they are often just down to bad luck. But this test is new and has already failed 38 times in the past few days.

Build scan:
https://gradle-enterprise.elastic.co/s/523u2dz3olx3u

Repro line:
./gradlew ':qa:rolling-upgrade:v7.14.1#oneThirdUpgradedTest' -Dtests.class="org.elasticsearch.upgrades.SnapshotBasedRecoveryIT" -Dtests.method="testSnapshotBasedRecovery" -Dtests.seed=15D06BC7AFD0B42F -Dtests.bwc=true -Dtests.locale=ca-ES -Dtests.timezone=Africa/Banjul -Druntime.java=8

Reproduces locally?:
No

Applicable branches:
7.x

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?search.relativeStartTime=P7D&search.timeZoneId=Australia/Melbourne&tests.container=org.elasticsearch.upgrades.SnapshotBasedRecoveryIT&tests.sortField=FAILED&tests.test=testSnapshotBasedRecovery&tests.unstableOnly=true

Failure excerpt:

java.lang.AssertionError: timed out waiting for green state for index [snapshot_based_recovery] cluster state [{
  "cluster_name" : "v7.14.1",
  "cluster_uuid" : "vNuyvMUdTfaVJAKtMymzNQ",
  "version" : 620,
  "state_uuid" : "N5K_VdvhSVupuKwpFqXHBw",
  "master_node" : "2YFxMyn6Rrmw9JqR-zgMRg",
  "blocks" : {
    "indices" : {
      "index_mixed_7140199" : {
        "4" : {
          "description" : "index closed",
          "retryable" : false,
          "levels" : [
            "read",
            "write"
          ]
        }
      },
      "closed_index_replica_allocation" : {
        "4" : {
          "description" : "index closed",
          "retryable" : false,


@ywangd added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Aug 17, 2021
@elasticmachine added the Team:Distributed (Obsolete) label on Aug 17, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen self-assigned this on Aug 17, 2021
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Aug 17, 2021
Selectively muting parts of the rolling upgrade test for recover from
snapshot.

Relates elastic#76595
@henningandersen
Contributor

The failures are genuine. My initial analysis points to the primary ending up on the upgraded node. This is surprising (but may have a valid reason once we dig deeper). I muted this selectively in #76601, with that we should still be validating rolling upgrade works with recovery from snapshot (though less frequently).
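For context, the selective mute amounts to a JUnit assumption that skips only the mixed-cluster rounds of the rolling-upgrade suite. A minimal sketch of the idea follows; the `tests.mixed_cluster` system property and the class name are placeholders for illustration, not the actual change from #76601.

```java
import static org.junit.Assume.assumeFalse;

import org.junit.Test;

public class SelectiveMuteSketch {
    // Hypothetical stand-in for the rolling-upgrade phase flag; the real test suite
    // derives the cluster phase (old / mixed / upgraded) from its own system properties.
    private static final boolean MIXED_CLUSTER = Boolean.getBoolean("tests.mixed_cluster");

    @Test
    public void testSnapshotBasedRecovery() {
        // Skip only the mixed-cluster rounds while #76595 is investigated, so the
        // fully-upgraded round still exercises recovery from snapshot.
        assumeFalse("muted in mixed clusters, relates #76595", MIXED_CLUSTER);
        // ... the actual recovery assertions would run here
    }
}
```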

henningandersen added a commit that referenced this issue Aug 17, 2021
Selectively muting parts of the rolling upgrade test for recover from
snapshot.

Relates #76595
henningandersen added a commit that referenced this issue Aug 17, 2021
Selectively muting parts of the rolling upgrade test for recover from
snapshot.

Relates #76595
@henningandersen
Contributor

Pasting David's comment from #76601 here:

I suspect the problem is caused by a rebalance moving the primary onto the newly-upgraded node, but I haven't seen a failure in captivity to confirm that yet. If so I think we could do something a bit stronger here, e.g. apply an allocation filter to exclude the solitary upgraded node, then explicitly cancel any shards it holds to promote a replica on the old nodes, and then remove replicas.
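A rough sketch of that sequence against the cluster APIs, using the low-level REST client, is below. The node name `upgraded-node-0`, the host/port, and the shard number are placeholder assumptions, and this only illustrates the approach described above, not the eventual fix in #77134.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ExcludeUpgradedNodeSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Allocation filter: keep shards of the test index off the upgraded node.
            Request exclude = new Request("PUT", "/snapshot_based_recovery/_settings");
            exclude.setJsonEntity("{\"index.routing.allocation.exclude._name\": \"upgraded-node-0\"}");
            client.performRequest(exclude);

            // 2. Cancel the shard copy held by the upgraded node so a replica on an
            //    old-version node gets promoted to primary.
            Request cancel = new Request("POST", "/_cluster/reroute");
            cancel.setJsonEntity(
                "{\"commands\": [ { \"cancel\": { \"index\": \"snapshot_based_recovery\","
                    + " \"shard\": 0, \"node\": \"upgraded-node-0\", \"allow_primary\": true } } ] }");
            client.performRequest(cancel);

            // 3. Drop the replicas so the next upgrade round has to recover the shard,
            //    ideally via the snapshot-based recovery path under test.
            Request replicas = new Request("PUT", "/snapshot_based_recovery/_settings");
            replicas.setJsonEntity("{\"index.number_of_replicas\": 0}");
            client.performRequest(replicas);
        }
    }
}
```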

@fcofdez self-assigned this on Aug 30, 2021
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Sep 1, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes elastic#76595
fcofdez added a commit that referenced this issue Sep 29, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes #76595
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Sep 29, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes elastic#76595
fcofdez added a commit that referenced this issue Sep 29, 2021
Move the shard to a replica in an older version when the primary
is located in the upgraded node during the first rolling restart
round.

Closes #76595
Backport of #77134