
[CI] IndexRecoveryIT testPeerRecoveryTrimsLocalTranslog failing #97183

Closed
DaveCTurner opened this issue Jun 28, 2023 · 4 comments · Fixed by #115341
Assignees: kingherc
Labels: :Distributed Indexing/CRUD, low-risk, Team:Distributed (Obsolete), >test-failure

Comments

@DaveCTurner (Contributor)

The failing assertion was added recently in #97133. I'll suppress it while investigating.

Build scan:
https://gradle-enterprise.elastic.co/s/qlsep6gfoto36/tests/:server:internalClusterTest/org.elasticsearch.indices.recovery.IndexRecoveryIT/testPeerRecoveryTrimsLocalTranslog

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.indices.recovery.IndexRecoveryIT.testPeerRecoveryTrimsLocalTranslog" -Dtests.seed=6C5DEBB10362A47D -Dtests.locale=mt-MT -Dtests.timezone=America/Thule -Druntime.java=20

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.indices.recovery.IndexRecoveryIT&tests.test=testPeerRecoveryTrimsLocalTranslog

Failure excerpt:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=4364, name=elasticsearch[node_t1][write][T#2], state=RUNNABLE, group=TGRP-IndexRecoveryIT]

  at __randomizedtesting.SeedInfo.seed([6C5DEBB10362A47D:91DBA9F3E716E8A1]:0)

  Caused by: java.lang.AssertionError: (No message provided)

    at __randomizedtesting.SeedInfo.seed([6C5DEBB10362A47D]:0)
    at org.elasticsearch.index.shard.IndexShard.syncAfterWrite(IndexShard.java:3630)
    at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:518)
    at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.runPostReplicationActions(TransportWriteAction.java:318)
    at org.elasticsearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:173)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:169)
    at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:94)
    at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:319)
    at org.elasticsearch.action.bulk.TransportShardBulkAction$2.finishRequest(TransportShardBulkAction.java:273)
    at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:235)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:286)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:137)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74)
    at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:215)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1623)

@DaveCTurner added the :Distributed Indexing/Recovery and >test-failure labels Jun 28, 2023
@elasticsearchmachine added the Team:Distributed (Obsolete) label Jun 28, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 28, 2023
@DaveCTurner self-assigned this Jun 28, 2023
elasticsearchmachine pushed a commit that referenced this issue Jun 28, 2023
@DaveCTurner (Contributor, Author)

The issue here is that when we shut a node down we may fail some operations, but still try to sync the translog afterwards. The failures may arise from shutting down the ClusterService or TransportService before the IndicesService, so the shard itself is not closed yet.

I think this highlights some deficiencies in our shutdown behaviour: while we're shutting down we shouldn't really be trying to fail shards on other nodes, nor to sync the translog.
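
To make the race concrete, here is a minimal, self-contained Java sketch (not Elasticsearch code; FakeShard, PermitRaceSketch and the semaphore-based permit are illustrative stand-ins): the shard asserts that an operation permit is held whenever the translog is synced, and an asynchronous failure path that releases the permit early makes the subsequent sync trip the assertion, which is the shape of the failure above.

import java.util.concurrent.Semaphore;

// Illustrative stand-in for a shard whose translog sync asserts that an
// operation permit is still held (roughly what IndexShard.syncAfterWrite checks).
class FakeShard {
    private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);

    // Acquire an operation permit; the returned Runnable releases it.
    Runnable acquireOperationPermit() throws InterruptedException {
        permits.acquire();
        return permits::release;
    }

    void syncTranslog() {
        // Approximation of the real assertion: at least one permit must be outstanding.
        assert permits.availablePermits() < Integer.MAX_VALUE
                : "translog sync must run under an operation permit";
        // ... fsync the translog ...
    }
}

public class PermitRaceSketch {
    public static void main(String[] args) throws Exception {
        FakeShard shard = new FakeShard();
        Runnable releasePermit = shard.acquireOperationPermit();

        // Happy path: sync while the permit is still held.
        shard.syncTranslog();

        // Shutdown-style race: an async failure path releases the permit early,
        // then the post-write action still tries to sync, so the assertion trips
        // (run with -ea to see it).
        releasePermit.run();
        shard.syncTranslog();
    }
}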

@andreidan (Contributor)

I think we have some more of this:

09:47:13 org.elasticsearch.indices.recovery.IndexRecoveryIT > testWaitForClusterStateToBeAppliedOnSourceNode FAILED
09:47:13     java.lang.AssertionError: timed out waiting for green state
09:47:13         at org.junit.Assert.fail(Assert.java:88)
09:47:13         at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:1004)
09:47:13         at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:928)
09:47:13         at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:917)
09:47:13         at org.elasticsearch.indices.recovery.IndexRecoveryIT.testWaitForClusterStateToBeAppliedOnSourceNode(IndexRecoveryIT.java:1694)
09:47:13 
09:47:13     com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=4872, name=elasticsearch[node_t2][clusterApplierService#updateTask][T#1], state=RUNNABLE, group=TGRP-IndexRecoveryIT]
09:47:13 
09:47:13         Caused by:
09:47:13         java.lang.AssertionError
09:47:13             at __randomizedtesting.SeedInfo.seed([8EBE8D5A7A8C107B]:0)
09:47:13             at org.junit.Assert.fail(Assert.java:86)
09:47:13             at org.junit.Assert.assertTrue(Assert.java:41)
09:47:13             at org.junit.Assert.assertFalse(Assert.java:64)
09:47:13             at org.junit.Assert.assertFalse(Assert.java:74)
09:47:13             at org.elasticsearch.indices.recovery.RecoveryClusterStateDelayListeners.delayUntilRecoveryStart(RecoveryClusterStateDelayListeners.java:77)
09:47:13             at org.elasticsearch.indices.recovery.IndexRecoveryIT.lambda$testWaitForClusterStateToBeAppliedOnSourceNode$44(IndexRecoveryIT.java:1684)

org.elasticsearch.transport.TransportException: Cannot send message, event loop is shutting down.

https://gradle-enterprise.elastic.co/s/vmb6afdkis2wk

@volodk85 added the low-risk label Oct 23, 2023
@DaveCTurner removed their assignment Nov 22, 2023
@DaveCTurner added the :Distributed Indexing/CRUD label and removed the :Distributed Indexing/Recovery label Oct 4, 2024
@DaveCTurner (Contributor, Author)

I'm relabelling this as :Distributed/CRUD because the problem is on the document write path and IndexShardOperationPermits.

kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 22, 2024
Problem:
finishAsFailed could be called asynchronously in the
middle of operations like runPostReplicationActions which try to
sync the translog. finishAsFailed immediately triggers the failure
of the resultListener which releases the index shard primary
operation permit. This means that runPostReplicationActions may
try to sync the translog without an operation permit.

Solution:
We refactor the infrastructure of ReplicationOperation regarding
pendingActions and the resultListener, by replacing them with a
RefCountingListener. This way, if there are async failures, they
are aggregated, and the result listener is called once, after all
mid-way operations are done.

For the specific error we got in issue elastic#97183, this means that
a call to onNoLongerPrimary (which can happen if we fail to fail
a replica shard or mark it as stale) will not immediately release
the primary operation permit and the assertion in the translog sync
will be honored.

Fixes elastic#97183
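
The pattern the fix describes can be sketched as follows (an illustrative, self-contained Java sketch, not the actual RefCountingListener or ReplicationOperation code; all names below are hypothetical): every in-flight sub-action holds a reference, failures are collected instead of completing the result immediately, and the final callback, which would release the operation permit, fires exactly once when the last reference is released.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Illustrative ref-counting completion: sub-actions acquire references,
// failures are aggregated, and the final callback fires exactly once when
// the last reference is released.
class RefCountedCompletion {
    private final AtomicInteger refs = new AtomicInteger(1); // one ref for the owner
    private final List<Exception> failures = new ArrayList<>();
    private final Consumer<List<Exception>> onDone;

    RefCountedCompletion(Consumer<List<Exception>> onDone) {
        this.onDone = onDone;
    }

    // Acquire a reference for a sub-action; the returned callback must be
    // invoked exactly once, with null on success or the failure otherwise.
    Consumer<Exception> acquire() {
        refs.incrementAndGet();
        return failure -> {
            if (failure != null) {
                synchronized (failures) {
                    failures.add(failure);
                }
            }
            release();
        };
    }

    // Release the owner's reference once all sub-actions have been started.
    void close() {
        release();
    }

    private void release() {
        if (refs.decrementAndGet() == 0) {
            onDone.accept(failures); // fires once, after every sub-action has finished
        }
    }
}

public class ReplicationRefCountSketch {
    public static void main(String[] args) {
        RefCountedCompletion completion = new RefCountedCompletion(
                failures -> System.out.println("all sub-actions done, failures: " + failures));

        Consumer<Exception> replicaAck = completion.acquire();
        Consumer<Exception> staleShardFailure = completion.acquire();
        completion.close(); // owner reference released; result still pending

        staleShardFailure.accept(new RuntimeException("failed to mark shard as stale"));
        // The failure is recorded, but the result (and hence the permit) is not released yet.
        replicaAck.accept(null); // last reference released, result fires exactly once
    }
}

In this arrangement an early failure (for example, failing to mark a stale replica) is recorded but cannot complete the result while other sub-actions, such as the primary's translog sync, are still running.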
kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 22, 2024
kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 22, 2024
@kingherc self-assigned this Oct 22, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024