[CI] testRetentionLeasesSyncOnRecovery fails #39105
Comments
Pinging @elastic/es-distributed |
Note the test that is failing here is
|
Also, the maps are not equal, there is a timestamp difference.
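To illustrate why a timestamp difference matters here while entry order would not, here is a minimal sketch. It assumes the compared objects behave like ordinary `java.util.Map` instances; the class name `LeaseMapComparison`, the lease ids, and the timestamp values are hypothetical and not taken from the test.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LeaseMapComparison {

    // Hypothetical helper: builds a map of lease id -> timestamp.
    // The ids and values are made up for illustration only.
    static Map<String, Long> leases(long timestampOfA, boolean reversed) {
        Map<String, Long> map = new LinkedHashMap<>();
        if (reversed) {
            map.put("b", 1000L);
            map.put("a", timestampOfA);
        } else {
            map.put("a", timestampOfA);
            map.put("b", 1000L);
        }
        return map;
    }

    public static void main(String[] args) {
        // Same entries, different insertion order: Map.equals ignores order.
        System.out.println(leases(1000L, false).equals(leases(1000L, true)));
        // Same order, one timestamp differs: the maps are not equal.
        System.out.println(leases(1000L, false).equals(leases(1001L, false)));
    }
}
```

Under that assumption, differing iteration order alone does not make two maps unequal, so a differing timestamp on one entry is the kind of difference that would trip an equality assertion.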
|
I found the issue, test bug. I will push a fix shortly. |
This test had a bug. We attempt to allow only the primary to be allocated, to force all replicas to recover from the primary after we have set the state of the retention leases on the primary. However, in building the index settings, we were overwriting the settings that exclude the replicas from being allocated. This means that some of the replicas would end up assigned and, rather than receiving retention leases during recovery, they would be part of the replication group, receiving retention leases as they are manipulated. Since retention lease renewals are only synced periodically, a replica could lag a little behind in some cases, tripping an assertion in the test. This commit addresses this by ensuring that the replicas are indeed not allocated until after the retention leases are done being manipulated on the primary. We did this by not overwriting the exclude settings. Closes #39105
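The overwrite described in the commit message can be sketched roughly as follows. This is not the actual test code: it models settings as a plain `Map`, and the class name, method names, and the replica count are hypothetical. The key `index.routing.allocation.exclude._name` is used only as a plausible example of an exclude setting.

```java
import java.util.HashMap;
import java.util.Map;

public class ExcludeSettingsSketch {

    // Example exclude key; whether the test used exactly this key is an assumption.
    static final String EXCLUDE_KEY = "index.routing.allocation.exclude._name";

    // Buggy shape: a later step replaces the whole settings map,
    // silently dropping the exclusion that was set first.
    static Map<String, String> overwriting(String excludedNodes) {
        Map<String, String> settings = new HashMap<>();
        settings.put(EXCLUDE_KEY, excludedNodes);
        settings = new HashMap<>(Map.of("index.number_of_replicas", "2"));
        return settings;
    }

    // Fixed shape: later settings are merged in, so the exclusion survives.
    static Map<String, String> preserving(String excludedNodes) {
        Map<String, String> settings = new HashMap<>();
        settings.put(EXCLUDE_KEY, excludedNodes);
        settings.putAll(Map.of("index.number_of_replicas", "2"));
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(overwriting("node-1,node-2").containsKey(EXCLUDE_KEY));
        System.out.println(preserving("node-1,node-2").get(EXCLUDE_KEY));
    }
}
```

In the first variant nothing keeps the replicas off their nodes, which matches the described symptom of replicas being assigned before the leases were finalized.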
This test has failed again on 6.7: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+internalClusterTest/2792/console
|
This failed again on master intake: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/3053/ Excerpt from logs:
|
I can construct a sequence of events that leads to this failure. I enabled debug logging in 0ec7986 to confirm my assumption before proposing a fix. |
@dnhatn What is the sequence of events that you constructed? |
@jasontedor If the replica is allocated on the master node, then the master can remove the primary shard when we manually break the connection. Since we don't persist retention leases on renewal, when the primary is reallocated it loads the persisted version from disk, which is not the latest version. I planned to fix this by persisting retention leases on renewal. However, with #42299, this approach is no longer valid. |
@dnhatn I think that we can remove the renewal from the test, it doesn't add anything. |
IMPORTANT: This test is still muted on master, and the failure applies to several versions.
Relates to #38764 - but I think this is a new issue.
Old fix: 5fad38e
Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+periodic/77/console
The test compares 2 maps, which are identical by content but differently ordered.
Log:
/CC @jasontedor