
[CI] testRetentionLeasesSyncOnRecovery fails #39105

Closed
hendrikmuhs opened this issue Feb 19, 2019 · 10 comments · Fixed by #42536
Assignees: jasontedor, dnhatn
Labels: :Distributed Indexing/Distributed, >test-failure (triaged test failures from CI)

Comments

@hendrikmuhs

IMPORTANT: This test is still muted on master, and the failure applies to several versions.

Relates to #38764, but I think this is a new issue.

Old fix: 5fad38e

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+periodic/77/console

The test compares two maps that are identical in content but differently ordered.

Log:

java.lang.AssertionError: 
Expected: <{SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432369, source='vbgznlgo'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}}>
     but: was <{YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432169, source='vbgznlgo'}}>
	at __randomizedtesting.SeedInfo.seed([D9CD8E08DAD8675F:93D9EB615232A8A8]:0)

/CC @jasontedor

hendrikmuhs added the >test-failure, v7.0.0, and :Distributed Indexing/Distributed labels on Feb 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jasontedor
Member

Note the test that is failing here is testRetentionLeasesSyncOnRecovery, not testRetentionLeasesSyncOnRemove.

04:44:03  FAILURE 2.10s J1 | RetentionLeaseIT.testRetentionLeasesSyncOnRecovery <<< FAILURES!
04:44:03     > Throwable #1: java.lang.AssertionError: 
04:44:03     > Expected: <{SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432369, source='vbgznlgo'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}}>
04:44:03     >      but: was <{YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432169, source='vbgznlgo'}}>
04:44:03     > 	at __randomizedtesting.SeedInfo.seed([D9CD8E08DAD8675F:93D9EB615232A8A8]:0)
04:44:03     > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
04:44:03     > 	at org.elasticsearch.index.seqno.RetentionLeaseIT.testRetentionLeasesSyncOnRecovery(RetentionLeaseIT.java:370)
04:44:03     > 	at java.lang.Thread.run(Thread.java:748)

jasontedor changed the title from "[CI] testRetentionLeaseSyncedOnRemove fails" to "[CI] testRetentionLeasesSyncOnRecovery fails" on Feb 19, 2019
@jasontedor
Member

Also, the maps are not equal; there is a timestamp difference.

Expected: <{SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432369, source='vbgznlgo'}
but: was..., SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432169, source='vbgznlgo'}}>
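
Worth noting: map ordering alone would not trip this assertion, since java.util.Map#equals compares entry sets regardless of iteration order; the differing timestamp on the SaqDQEfE lease is what makes the maps unequal. A minimal standalone sketch (not the actual test code, with the lease values reduced to bare timestamp longs) illustrating both points:

import java.util.LinkedHashMap;
import java.util.Map;

public class MapEqualityDemo {
    public static void main(String[] args) {
        // Same entries inserted in different orders still compare equal,
        // because Map#equals is defined over entry sets, not iteration order.
        Map<String, Long> expected = new LinkedHashMap<>();
        expected.put("SaqDQEfE", 1550569432369L);
        expected.put("eFvarfWn", 1550569432169L);

        Map<String, Long> actual = new LinkedHashMap<>();
        actual.put("eFvarfWn", 1550569432169L);
        actual.put("SaqDQEfE", 1550569432369L);

        System.out.println(expected.equals(actual)); // true

        // A single differing value, like the lease timestamp above, breaks equality.
        actual.put("SaqDQEfE", 1550569432169L);
        System.out.println(expected.equals(actual)); // false
    }
}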

@jasontedor
Member

I found the issue; it is a test bug. I will push a fix shortly.

jasontedor self-assigned this on Feb 19, 2019
jasontedor added a commit that referenced this issue Feb 19, 2019
This test had a bug. We attempt to allow only the primary to be
allocated, to force all replicas to recover from the primary after we
had set the state of the retention leases on the primary. However, in
building the index settings, we were overwriting the settings that
exclude the replicas from being allocated. This means that some of the
replicas would end up assigned and, rather than receive retention leases
during recovery, they would be part of the replication group receiving
retention leases as they are manipulated. Since retention lease renewals
are only synced periodically, the replica could be lagging a little
behind in some cases, leading to an assertion tripping in the test.
This commit addresses this by ensuring that the replicas are indeed not
allocated until after the retention leases are done being manipulated
on the primary. We did this by not overwriting the exclude settings.

Closes #39105
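
As a rough sketch of the failure mode the commit message describes (the allocation-filtering setting name is a real index setting, but the helpers and the node name here are hypothetical, not the actual RetentionLeaseIT code): building the index settings from scratch drops the previously applied allocation filter, whereas layering new settings on top of the existing ones preserves it.

import org.elasticsearch.common.settings.Settings;

// Hypothetical illustration only, not the actual test code.
class AllocationSettingsSketch {

    // Settings applied earlier so that only the primary's node may hold the shard.
    static final Settings allocationFilter = Settings.builder()
            .put("index.routing.allocation.include._name", "primary-node-name") // hypothetical node name
            .build();

    // Buggy pattern: builds the settings from scratch, silently dropping the
    // allocation filter, so replicas can be assigned before the test is ready.
    static Settings buggySettings(final int replicas) {
        return Settings.builder()
                .put("index.number_of_shards", 1)
                .put("index.number_of_replicas", replicas)
                .build();
    }

    // Fixed pattern: layer the new settings on top of the existing ones, so the
    // filter survives until the test later allows replica allocation.
    static Settings fixedSettings(final int replicas) {
        return Settings.builder()
                .put(allocationFilter)
                .put("index.number_of_shards", 1)
                .put("index.number_of_replicas", replicas)
                .build();
    }
}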
jasontedor added two further commits that referenced this issue on Feb 19, 2019, with the same message as above.
@romseygeek
Contributor

This test has failed again on 6.7: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+internalClusterTest/2792/console

java.lang.AssertionError: 
Expected: <{XGXiyXac=RetentionLease{id='XGXiyXac', retainingSequenceNumber=228887311162678927, timestamp=1552991928992, source='OWVEeRER'}, UfXemsvH=RetentionLease{id='UfXemsvH', retainingSequenceNumber=6460015575321227402, timestamp=1552991928992, source='RCaDMWcA'}, yxZjoXOD=RetentionLease{id='yxZjoXOD', retainingSequenceNumber=6274212870258453507, timestamp=1552991928992, source='QeMpImcA'}, VyiIjJYs=RetentionLease{id='VyiIjJYs', retainingSequenceNumber=2133028230013912907, timestamp=1552991928992, source='yQsjFtqy'}, sfjfrVor=RetentionLease{id='sfjfrVor', retainingSequenceNumber=8701391374091472416, timestamp=1552991928992, source='uFWFHpJw'}, pYQTNwQT=RetentionLease{id='pYQTNwQT', retainingSequenceNumber=4919630631661594402, timestamp=1552991928992, source='hdrfFeBC'}, oWrKJWSr=RetentionLease{id='oWrKJWSr', retainingSequenceNumber=7602602061015029303, timestamp=1552991929193, source='ZvAAwqpt'}}>
     but: was <{XGXiyXac=RetentionLease{id='XGXiyXac', retainingSequenceNumber=228887311162678927, timestamp=1552991928992, source='OWVEeRER'}, UfXemsvH=RetentionLease{id='UfXemsvH', retainingSequenceNumber=6460015575321227402, timestamp=1552991928992, source='RCaDMWcA'}, yxZjoXOD=RetentionLease{id='yxZjoXOD', retainingSequenceNumber=6274212870258453507, timestamp=1552991928992, source='QeMpImcA'}, VyiIjJYs=RetentionLease{id='VyiIjJYs', retainingSequenceNumber=2133028230013912907, timestamp=1552991928992, source='yQsjFtqy'}, sfjfrVor=RetentionLease{id='sfjfrVor', retainingSequenceNumber=8701391374091472416, timestamp=1552991928992, source='uFWFHpJw'}, pYQTNwQT=RetentionLease{id='pYQTNwQT', retainingSequenceNumber=4919630631661594402, timestamp=1552991928992, source='hdrfFeBC'}, oWrKJWSr=RetentionLease{id='oWrKJWSr', retainingSequenceNumber=7602602061015029303, timestamp=1552991928992, source='ZvAAwqpt'}}>

@jtibshirani
Contributor

This failed again on master intake: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/3053/

Excerpt from logs:

org.elasticsearch.index.seqno.RetentionLeaseIT > testRetentionLeasesSyncOnRecovery FAILED
    java.lang.AssertionError: 
    Expected: <{INNMdoBL=RetentionLease{id='INNMdoBL', retainingSequenceNumber=1244452474034992388, timestamp=1554858598089, source='oVZCIoTx'}, KKnWFaGS=RetentionLease{id='KKnWFaGS', retainingSequenceNumber=1290850561280207735, timestamp=1554858598089, source='VuJZzKIW'}, hovIznXR=RetentionLease{id='hovIznXR', retainingSequenceNumber=5545484624380650799, timestamp=1554858598089, source='jiCmznJs'}, mBhGDLXY=RetentionLease{id='mBhGDLXY', retainingSequenceNumber=6063347294587841909, timestamp=1554858598089, source='SeyuowYm'}, cEBQPZIr=RetentionLease{id='cEBQPZIr', retainingSequenceNumber=2620965069903679675, timestamp=1554858598089, source='jhPyzeGR'}, HIrqcIio=RetentionLease{id='HIrqcIio', retainingSequenceNumber=6143977810763443381, timestamp=1554858598289, source='XaeYAcpt'}}>
         but: was <{INNMdoBL=RetentionLease{id='INNMdoBL', retainingSequenceNumber=1244452474034992388, timestamp=1554858598089, source='oVZCIoTx'}, KKnWFaGS=RetentionLease{id='KKnWFaGS', retainingSequenceNumber=1290850561280207735, timestamp=1554858598089, source='VuJZzKIW'}, hovIznXR=RetentionLease{id='hovIznXR', retainingSequenceNumber=5545484624380650799, timestamp=1554858598089, source='jiCmznJs'}, mBhGDLXY=RetentionLease{id='mBhGDLXY', retainingSequenceNumber=6063347294587841909, timestamp=1554858598089, source='SeyuowYm'}, cEBQPZIr=RetentionLease{id='cEBQPZIr', retainingSequenceNumber=2620965069903679675, timestamp=1554858598089, source='jhPyzeGR'}, HIrqcIio=RetentionLease{id='HIrqcIio', retainingSequenceNumber=6143977810763443381, timestamp=1554858598089, source='XaeYAcpt'}}>
        at __randomizedtesting.SeedInfo.seed([DCDF8E360984B62B:96CBEB5F816E79DC]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.index.seqno.RetentionLeaseIT.testRetentionLeasesSyncOnRecovery(RetentionLeaseIT.java:431)

jtibshirani reopened this on Apr 10, 2019
dnhatn self-assigned this on May 17, 2019
@dnhatn
Member

dnhatn commented May 20, 2019

I can construct a sequence of events that leads to this failure. I enabled debug logging in 0ec7986 to confirm my assumption before proposing the fix.

@jasontedor
Member

@dnhatn What is the sequence of events that you constructed?

@dnhatn
Member

dnhatn commented May 22, 2019

@jasontedor If the replica is allocated on the master node, then the master can remove the primary shard when we manually break the connection. Since we don't persist retention leases on renewal, the primary loads the persisted version from disk, which is not the latest version, when it is reallocated.

I planned to fix this by persisting retention leases on renewal. However, with #42299, this approach is no longer valid.

@jasontedor
Member

@dnhatn I think that we can remove the renewal from the test; it doesn't add anything.

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019