
[CI] IndexRecoveryIT testPeerRecoveryTrimsLocalTranslog failing #97183

Closed
DaveCTurner opened this issue Jun 28, 2023 · 4 comments · Fixed by #115341
Assignees: kingherc
Labels: :Distributed Indexing/CRUD, low-risk, Team:Distributed (Obsolete), >test-failure

Comments

@DaveCTurner (Contributor)

The failing assertion was added recently in #97133. I'll suppress it while investigating.

Build scan:
https://gradle-enterprise.elastic.co/s/qlsep6gfoto36/tests/:server:internalClusterTest/org.elasticsearch.indices.recovery.IndexRecoveryIT/testPeerRecoveryTrimsLocalTranslog

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.indices.recovery.IndexRecoveryIT.testPeerRecoveryTrimsLocalTranslog" -Dtests.seed=6C5DEBB10362A47D -Dtests.locale=mt-MT -Dtests.timezone=America/Thule -Druntime.java=20

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.indices.recovery.IndexRecoveryIT&tests.test=testPeerRecoveryTrimsLocalTranslog

Failure excerpt:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=4364, name=elasticsearch[node_t1][write][T#2], state=RUNNABLE, group=TGRP-IndexRecoveryIT]

  at __randomizedtesting.SeedInfo.seed([6C5DEBB10362A47D:91DBA9F3E716E8A1]:0)

  Caused by: java.lang.AssertionError: (No message provided)

    at __randomizedtesting.SeedInfo.seed([6C5DEBB10362A47D]:0)
    at org.elasticsearch.index.shard.IndexShard.syncAfterWrite(IndexShard.java:3630)
    at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:518)
    at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.runPostReplicationActions(TransportWriteAction.java:318)
    at org.elasticsearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:173)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:169)
    at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:94)
    at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:319)
    at org.elasticsearch.action.bulk.TransportShardBulkAction$2.finishRequest(TransportShardBulkAction.java:273)
    at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:235)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:286)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:137)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74)
    at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:215)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1623)

@DaveCTurner added the :Distributed Indexing/Recovery and >test-failure labels Jun 28, 2023
@elasticsearchmachine added the Team:Distributed (Obsolete) label Jun 28, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 28, 2023
@DaveCTurner self-assigned this Jun 28, 2023
elasticsearchmachine pushed a commit that referenced this issue Jun 28, 2023
@DaveCTurner (Contributor, Author)

The issue here is that when we shut a node down we may fail some operations, but still try to sync the translog afterwards. The failures may arise from shutting down the ClusterService or TransportService before the IndicesService, so the shard itself is not closed yet.

I think this highlights some deficiencies in our shutdown behaviour: while we're shutting down we shouldn't really be trying to fail shards on other nodes, nor to sync the translog.
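
To make the race concrete, here is a minimal, self-contained Java sketch (not Elasticsearch code; FakeShard, PermitRaceSketch and the semaphore-based permit are illustrative stand-ins): the shard asserts that an operation permit is held whenever the translog is synced, and an asynchronous failure path that releases the permit early makes the subsequent sync trip the assertion, which is the shape of the failure above.

import java.util.concurrent.Semaphore;

// Illustrative stand-in for a shard whose translog sync asserts that an
// operation permit is still held (roughly what IndexShard.syncAfterWrite checks).
class FakeShard {
    private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);

    // Acquire an operation permit; the returned Runnable releases it.
    Runnable acquireOperationPermit() throws InterruptedException {
        permits.acquire();
        return permits::release;
    }

    void syncTranslog() {
        // Approximation of the real assertion: at least one permit must be outstanding.
        assert permits.availablePermits() < Integer.MAX_VALUE
                : "translog sync must run under an operation permit";
        // ... fsync the translog ...
    }
}

public class PermitRaceSketch {
    public static void main(String[] args) throws Exception {
        FakeShard shard = new FakeShard();
        Runnable releasePermit = shard.acquireOperationPermit();

        // Happy path: sync while the permit is still held.
        shard.syncTranslog();

        // Shutdown-style race: an async failure path releases the permit early,
        // then the post-write action still tries to sync, so the assertion trips
        // (run with -ea to see it).
        releasePermit.run();
        shard.syncTranslog();
    }
}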

@andreidan (Contributor)

I think we have some more of this:

09:47:13 org.elasticsearch.indices.recovery.IndexRecoveryIT > testWaitForClusterStateToBeAppliedOnSourceNode FAILED
09:47:13     java.lang.AssertionError: timed out waiting for green state
09:47:13         at org.junit.Assert.fail(Assert.java:88)
09:47:13         at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:1004)
09:47:13         at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:928)
09:47:13         at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:917)
09:47:13         at org.elasticsearch.indices.recovery.IndexRecoveryIT.testWaitForClusterStateToBeAppliedOnSourceNode(IndexRecoveryIT.java:1694)
09:47:13 
09:47:13     com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=4872, name=elasticsearch[node_t2][clusterApplierService#updateTask][T#1], state=RUNNABLE, group=TGRP-IndexRecoveryIT]
09:47:13 
09:47:13         Caused by:
09:47:13         java.lang.AssertionError
09:47:13             at __randomizedtesting.SeedInfo.seed([8EBE8D5A7A8C107B]:0)
09:47:13             at org.junit.Assert.fail(Assert.java:86)
09:47:13             at org.junit.Assert.assertTrue(Assert.java:41)
09:47:13             at org.junit.Assert.assertFalse(Assert.java:64)
09:47:13             at org.junit.Assert.assertFalse(Assert.java:74)
09:47:13             at org.elasticsearch.indices.recovery.RecoveryClusterStateDelayListeners.delayUntilRecoveryStart(RecoveryClusterStateDelayListeners.java:77)
09:47:13             at org.elasticsearch.indices.recovery.IndexRecoveryIT.lambda$testWaitForClusterStateToBeAppliedOnSourceNode$44(IndexRecoveryIT.java:1684)

org.elasticsearch.transport.TransportException: Cannot send message, event loop is shutting down.

https://gradle-enterprise.elastic.co/s/vmb6afdkis2wk

@volodk85 added the low-risk label Oct 23, 2023
@DaveCTurner removed their assignment Nov 22, 2023
@DaveCTurner added the :Distributed Indexing/CRUD label and removed the :Distributed Indexing/Recovery label Oct 4, 2024
@DaveCTurner (Contributor, Author)

I'm relabelling this as :Distributed/CRUD because the problem is on the document write path and IndexShardOperationPermits.

kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 22, 2024
Problem:
finishAsFailed could be called asynchronously in the
middle of operations like runPostReplicationActions which try to
sync the translog. finishAsFailed immediately triggers the failure
of the resultListener which releases the index shard primary
operation permit. This means that runPostReplicationActions may
try to sync the translog without an operation permit.

Solution:
We refactor the infrastructure of ReplicationOperation regarding
pendingActions and the resultListener, by replacing them with a
RefCountingListener. This way, if there are async failures, they
are aggregated, and the result listener is called once, after all
mid-way operations are done.

For the specific error we got in issue elastic#97183, this means that
a call to onNoLongerPrimary (which can happen if we fail to fail
a replica shard or mark it as stale) will not immediately release
the primary operation permit and the assertion in the translog sync
will be honored.

Fixes elastic#97183
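
The pattern the fix describes can be sketched as follows (an illustrative, self-contained Java sketch, not the actual RefCountingListener or ReplicationOperation code; all names below are hypothetical): every in-flight sub-action holds a reference, failures are collected instead of completing the result immediately, and the final callback, which would release the operation permit, fires exactly once when the last reference is released.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Illustrative ref-counting completion: sub-actions acquire references,
// failures are aggregated, and the final callback fires exactly once when
// the last reference is released.
class RefCountedCompletion {
    private final AtomicInteger refs = new AtomicInteger(1); // one ref for the owner
    private final List<Exception> failures = new ArrayList<>();
    private final Consumer<List<Exception>> onDone;

    RefCountedCompletion(Consumer<List<Exception>> onDone) {
        this.onDone = onDone;
    }

    // Acquire a reference for a sub-action; the returned callback must be
    // invoked exactly once, with null on success or the failure otherwise.
    Consumer<Exception> acquire() {
        refs.incrementAndGet();
        return failure -> {
            if (failure != null) {
                synchronized (failures) {
                    failures.add(failure);
                }
            }
            release();
        };
    }

    // Release the owner's reference once all sub-actions have been started.
    void close() {
        release();
    }

    private void release() {
        if (refs.decrementAndGet() == 0) {
            onDone.accept(failures); // fires once, after every sub-action has finished
        }
    }
}

public class ReplicationRefCountSketch {
    public static void main(String[] args) {
        RefCountedCompletion completion = new RefCountedCompletion(
                failures -> System.out.println("all sub-actions done, failures: " + failures));

        Consumer<Exception> replicaAck = completion.acquire();
        Consumer<Exception> staleShardFailure = completion.acquire();
        completion.close(); // owner reference released; result still pending

        staleShardFailure.accept(new RuntimeException("failed to mark shard as stale"));
        // The failure is recorded, but the result (and hence the permit) is not released yet.
        replicaAck.accept(null); // last reference released, result fires exactly once
    }
}

In this arrangement an early failure (for example, failing to mark a stale replica) is recorded but cannot complete the result while other sub-actions, such as the primary's translog sync, are still running.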
kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 22, 2024
kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 22, 2024
@kingherc self-assigned this Oct 22, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024