[CI] testRetentionLeasesSyncOnRecovery fails #39105

hendrikmuhs · 2019-02-19T10:46:19Z

IMPORTANT : This test is still muted on master and failure applies to several versions.

Relates to #38764 - but I think this is a new issue.

Old fix: 5fad38e

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+periodic/77/console

The test compares 2 maps, which are identical by content but differently ordered.

Log:

java.lang.AssertionError: 
Expected: <{SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432369, source='vbgznlgo'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}}>
     but: was <{YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432169, source='vbgznlgo'}}>
	at __randomizedtesting.SeedInfo.seed([D9CD8E08DAD8675F:93D9EB615232A8A8]:0)

/CC @jasontedor

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-02-19T10:46:24Z

Pinging @elastic/es-distributed

jasontedor · 2019-02-19T13:05:27Z

Note the test that is failing here is testRetentionLeasesSyncOnRecovery, not testRetentionLeasesSyncOnRemove.

04:44:03  FAILURE 2.10s J1 | RetentionLeaseIT.testRetentionLeasesSyncOnRecovery <<< FAILURES!
04:44:03     > Throwable #1: java.lang.AssertionError: 
04:44:03     > Expected: <{SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432369, source='vbgznlgo'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}}>
04:44:03     >      but: was <{YzHeeQlt=RetentionLease{id='YzHeeQlt', retainingSequenceNumber=6253389575976647540, timestamp=1550569432169, source='lHsrBCLa'}, NZMLgOsO=RetentionLease{id='NZMLgOsO', retainingSequenceNumber=9123738011503356278, timestamp=1550569432169, source='hIqIRyLV'}, eFvarfWn=RetentionLease{id='eFvarfWn', retainingSequenceNumber=5153974965419564929, timestamp=1550569432169, source='pcjCyKmZ'}, ycwCgVUO=RetentionLease{id='ycwCgVUO', retainingSequenceNumber=1537622719951565698, timestamp=1550569432169, source='mWZziHld'}, SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432169, source='vbgznlgo'}}>
04:44:03     > 	at __randomizedtesting.SeedInfo.seed([D9CD8E08DAD8675F:93D9EB615232A8A8]:0)
04:44:03     > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
04:44:03     > 	at org.elasticsearch.index.seqno.RetentionLeaseIT.testRetentionLeasesSyncOnRecovery(RetentionLeaseIT.java:370)
04:44:03     > 	at java.lang.Thread.run(Thread.java:748)

jasontedor · 2019-02-19T13:08:01Z

Also, the maps are not equal, there is a timestamp difference.

Expected: <{SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432369, source='vbgznlgo'}

but: was..., SaqDQEfE=RetentionLease{id='SaqDQEfE', retainingSequenceNumber=4119153756517795675, timestamp=1550569432169, source='vbgznlgo'}}>

jasontedor · 2019-02-19T13:50:50Z

I found the issue, test bug. I will push a fix shortly.

This test had a bug. We attempt to allow only the primary to be allocated, to force all replicas to recovery from the primary after we had set the state of the retention leases on the primary. However, in building the index settings, we were overwriting the settings that exclude the replicas from being allocated. This means that some of the replicas would end up assigned and rather than receive retention leases during recovery, they would be part of the replication group receiving retention leases as they are manipulated. Since retention lease renewals are only synced periodically, this means that the replica could be lagging a little behind in some cases leading to an assertion tripping in the test. This commit addresses this by ensuring that the replicas are indeed not allocated until after the retention leases are done being manipulated on the replica. We did this by not overwriting the exclude settings. Closes #39105

romseygeek · 2019-03-19T11:12:22Z

This test has failed again on 6.7: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+internalClusterTest/2792/console

java.lang.AssertionError: 
Expected: <{XGXiyXac=RetentionLease{id='XGXiyXac', retainingSequenceNumber=228887311162678927, timestamp=1552991928992, source='OWVEeRER'}, UfXemsvH=RetentionLease{id='UfXemsvH', retainingSequenceNumber=6460015575321227402, timestamp=1552991928992, source='RCaDMWcA'}, yxZjoXOD=RetentionLease{id='yxZjoXOD', retainingSequenceNumber=6274212870258453507, timestamp=1552991928992, source='QeMpImcA'}, VyiIjJYs=RetentionLease{id='VyiIjJYs', retainingSequenceNumber=2133028230013912907, timestamp=1552991928992, source='yQsjFtqy'}, sfjfrVor=RetentionLease{id='sfjfrVor', retainingSequenceNumber=8701391374091472416, timestamp=1552991928992, source='uFWFHpJw'}, pYQTNwQT=RetentionLease{id='pYQTNwQT', retainingSequenceNumber=4919630631661594402, timestamp=1552991928992, source='hdrfFeBC'}, oWrKJWSr=RetentionLease{id='oWrKJWSr', retainingSequenceNumber=7602602061015029303, timestamp=1552991929193, source='ZvAAwqpt'}}>
     but: was <{XGXiyXac=RetentionLease{id='XGXiyXac', retainingSequenceNumber=228887311162678927, timestamp=1552991928992, source='OWVEeRER'}, UfXemsvH=RetentionLease{id='UfXemsvH', retainingSequenceNumber=6460015575321227402, timestamp=1552991928992, source='RCaDMWcA'}, yxZjoXOD=RetentionLease{id='yxZjoXOD', retainingSequenceNumber=6274212870258453507, timestamp=1552991928992, source='QeMpImcA'}, VyiIjJYs=RetentionLease{id='VyiIjJYs', retainingSequenceNumber=2133028230013912907, timestamp=1552991928992, source='yQsjFtqy'}, sfjfrVor=RetentionLease{id='sfjfrVor', retainingSequenceNumber=8701391374091472416, timestamp=1552991928992, source='uFWFHpJw'}, pYQTNwQT=RetentionLease{id='pYQTNwQT', retainingSequenceNumber=4919630631661594402, timestamp=1552991928992, source='hdrfFeBC'}, oWrKJWSr=RetentionLease{id='oWrKJWSr', retainingSequenceNumber=7602602061015029303, timestamp=1552991928992, source='ZvAAwqpt'}}>

jtibshirani · 2019-04-10T01:31:12Z

This failed again on master intake: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/3053/

Excerpt from logs:

org.elasticsearch.index.seqno.RetentionLeaseIT > testRetentionLeasesSyncOnRecovery FAILED
    java.lang.AssertionError: 
    Expected: <{INNMdoBL=RetentionLease{id='INNMdoBL', retainingSequenceNumber=1244452474034992388, timestamp=1554858598089, source='oVZCIoTx'}, KKnWFaGS=RetentionLease{id='KKnWFaGS', retainingSequenceNumber=1290850561280207735, timestamp=1554858598089, source='VuJZzKIW'}, hovIznXR=RetentionLease{id='hovIznXR', retainingSequenceNumber=5545484624380650799, timestamp=1554858598089, source='jiCmznJs'}, mBhGDLXY=RetentionLease{id='mBhGDLXY', retainingSequenceNumber=6063347294587841909, timestamp=1554858598089, source='SeyuowYm'}, cEBQPZIr=RetentionLease{id='cEBQPZIr', retainingSequenceNumber=2620965069903679675, timestamp=1554858598089, source='jhPyzeGR'}, HIrqcIio=RetentionLease{id='HIrqcIio', retainingSequenceNumber=6143977810763443381, timestamp=1554858598289, source='XaeYAcpt'}}>
         but: was <{INNMdoBL=RetentionLease{id='INNMdoBL', retainingSequenceNumber=1244452474034992388, timestamp=1554858598089, source='oVZCIoTx'}, KKnWFaGS=RetentionLease{id='KKnWFaGS', retainingSequenceNumber=1290850561280207735, timestamp=1554858598089, source='VuJZzKIW'}, hovIznXR=RetentionLease{id='hovIznXR', retainingSequenceNumber=5545484624380650799, timestamp=1554858598089, source='jiCmznJs'}, mBhGDLXY=RetentionLease{id='mBhGDLXY', retainingSequenceNumber=6063347294587841909, timestamp=1554858598089, source='SeyuowYm'}, cEBQPZIr=RetentionLease{id='cEBQPZIr', retainingSequenceNumber=2620965069903679675, timestamp=1554858598089, source='jhPyzeGR'}, HIrqcIio=RetentionLease{id='HIrqcIio', retainingSequenceNumber=6143977810763443381, timestamp=1554858598089, source='XaeYAcpt'}}>
        at __randomizedtesting.SeedInfo.seed([DCDF8E360984B62B:96CBEB5F816E79DC]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.index.seqno.RetentionLeaseIT.testRetentionLeasesSyncOnRecovery(RetentionLeaseIT.java:431)

Relates #39105

dnhatn · 2019-05-20T02:08:47Z

I can construct a sequence of events that leads to this failure. I enabled debug log in 0ec7986 to confirm my assumption before proposing the fix.

jasontedor · 2019-05-21T15:47:13Z

@dnhatn What is the sequence of events that you constructed?

dnhatn · 2019-05-22T02:14:42Z

@jasontedor If the replica allocated on the master node, then the master can remove the primary shard when we manually break the connection. Since we don't persist retention leases on renewal, the primary loads the persisted version from disk which is not the latest version when it's reallocated.

I planned to fix this by persisting retention leases on renewal. However, with #42299, this approach is no longer valid.

jasontedor · 2019-05-24T16:29:22Z

@dnhatn I think that we can remove the renewal from the test, it doesn't add anything.

Relates elastic#39105

hendrikmuhs added >test-failure Triaged test failures from CI v7.0.0 :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 19, 2019

jasontedor changed the title ~~[CI] testRetentionLeaseSyncedOnRemove fails~~ [CI] testRetentionLeasesSyncOnRecovery fails Feb 19, 2019

jasontedor closed this as completed in 44fc57f Feb 19, 2019

jasontedor self-assigned this Feb 19, 2019

jakelandis added v7.0.0-rc2 and removed v7.0.0 labels Apr 3, 2019

jtibshirani reopened this Apr 10, 2019

dnhatn self-assigned this May 17, 2019

dnhatn added a commit that referenced this issue May 20, 2019

Enable debug log in testRetentionLeasesSyncOnRecovery

0ec7986

Relates #39105

dnhatn added a commit that referenced this issue May 20, 2019

Enable debug log in testRetentionLeasesSyncOnRecovery

f765e21

Relates #39105

dnhatn unassigned jasontedor May 20, 2019

dnhatn removed the v7.0.0-rc2 label May 20, 2019

jkakavas mentioned this issue May 21, 2019

add 7.1.1 and 6.8.1 versions #42253

Merged

jasontedor mentioned this issue May 24, 2019

Remove renewal in retention lease recovery test #42536

Merged

jasontedor closed this as completed in #42536 May 24, 2019

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019

Enable debug log in testRetentionLeasesSyncOnRecovery

56b901e

Relates elastic#39105

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] testRetentionLeasesSyncOnRecovery fails #39105

[CI] testRetentionLeasesSyncOnRecovery fails #39105

hendrikmuhs commented Feb 19, 2019

elasticmachine commented Feb 19, 2019

jasontedor commented Feb 19, 2019

jasontedor commented Feb 19, 2019

jasontedor commented Feb 19, 2019

romseygeek commented Mar 19, 2019

jtibshirani commented Apr 10, 2019

dnhatn commented May 20, 2019

jasontedor commented May 21, 2019

dnhatn commented May 22, 2019

jasontedor commented May 24, 2019

[CI] testRetentionLeasesSyncOnRecovery fails #39105

[CI] testRetentionLeasesSyncOnRecovery fails #39105

Comments

hendrikmuhs commented Feb 19, 2019

elasticmachine commented Feb 19, 2019

jasontedor commented Feb 19, 2019

jasontedor commented Feb 19, 2019

jasontedor commented Feb 19, 2019

romseygeek commented Mar 19, 2019

jtibshirani commented Apr 10, 2019

dnhatn commented May 20, 2019

jasontedor commented May 21, 2019

dnhatn commented May 22, 2019

jasontedor commented May 24, 2019