
[CI] CreateIndexIT testCreateAndDeleteIndexConcurrently failing #96578

Closed

edsavage opened this issue Jun 5, 2023 · 11 comments · Fixed by #100610

Labels
blocker :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@edsavage
Contributor

edsavage commented Jun 5, 2023

Build scan:
https://gradle-enterprise.elastic.co/s/dxnvahurwubcu/tests/:server:internalClusterTest/org.elasticsearch.action.admin.indices.create.CreateIndexIT/testCreateAndDeleteIndexConcurrently

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently" -Dtests.seed=3ED3285ADA7A0152 -Dtests.locale=is -Dtests.timezone=Africa/Tunis -Druntime.java=20

Applicable branches:
main, 8.8

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.action.admin.indices.create.CreateIndexIT&tests.test=testCreateAndDeleteIndexConcurrently

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([3ED3285ADA7A0152]:0)

@edsavage edsavage added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI labels Jun 5, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 5, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@martijnvg
Member

I also noticed this assertion failure in a recent test failure:

org.elasticsearch.action.admin.indices.create.CreateIndexIT > classMethod FAILED	
    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=233, name=elasticsearch[node_t0][write][T#1], state=RUNNABLE, group=TGRP-CreateIndexIT]	
        Caused by:	
        java.lang.AssertionError: must be started to do primary indexing	
            at __randomizedtesting.SeedInfo.seed([F68F828EDF794A8]:0)	
            at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:2154)	
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:953)	
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:900)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:354)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:219)	
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:286)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:137)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74)	
            at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:215)	
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)	
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)	
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
            at java.base/java.lang.Thread.run(Thread.java:1623)
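
For context, a rough Java sketch of the kind of state check that fires here (the enum and class names below are illustrative, not the real `IndexShard` code): primary indexing is only allowed once the shard has reached the started state, so a write that races with the index being deleted and recreated can land on a shard that is still recovering or already closed.

```java
// Rough sketch (not the actual Elasticsearch source) of the state gate that trips here.
enum ShardStateSketch { CREATED, RECOVERING, POST_RECOVERY, STARTED, CLOSED }

final class WriteGateSketch {
    private volatile ShardStateSketch state = ShardStateSketch.RECOVERING;

    void ensureWriteAllowedOnPrimary() {
        // Mirrors the failing assertion "must be started to do primary indexing".
        assert state == ShardStateSketch.STARTED : "must be started to do primary indexing";
        if (state != ShardStateSketch.STARTED) {
            throw new IllegalStateException("shard is " + state + ", primary writes not allowed");
        }
    }

    void markStarted() {
        state = ShardStateSketch.STARTED;
    }
}
```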

@martijnvg
Member

Btw, I don't think this is a failure for the search team? This test creates and deletes the index that is the target of the writes, concurrently with the indexing.
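
To make the race concrete, here is a hypothetical outline (not the real CreateIndexIT source; `FakeClient` is a stand-in for the test's client calls, not an Elasticsearch API) of the pattern the test exercises: writer threads keep indexing while the index is deleted and immediately recreated underneath them.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical outline of the create/delete-while-indexing race.
class CreateDeleteRaceSketch {
    interface FakeClient {
        void deleteIndex(String name);
        void createIndex(String name);
        void indexDoc(String name, String doc);
    }

    static void run(FakeClient client) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                try {
                    started.await();
                    for (int d = 0; d < 100; d++) {
                        try {
                            client.indexDoc("test", "{\"field\":" + d + "}");
                        } catch (Exception e) {
                            // The index may be missing or its shard not yet started;
                            // the race itself is the point of the test.
                        }
                    }
                } catch (InterruptedException ignored) {
                }
            });
            writers[i].start();
        }
        started.countDown();
        client.deleteIndex("test");   // delete while the writers are still running
        client.createIndex("test");   // and immediately recreate the index
        for (Thread t : writers) {
            t.join();
        }
    }
}
```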

@martijnvg martijnvg added :Data Management/Indices APIs APIs to create and manage indices and templates Team:Data Management Meta label for data/management team and removed :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team labels Jun 15, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@kingherc
Contributor

I also just saw a similar error to the one mentioned above, here for a different test (testChangeLeaderIndex):

12:28:26 REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' --tests "org.elasticsearch.xpack.ccr.LocalIndexFollowingIT.testChangeLeaderIndex" -Dtests.seed=514E751CCA76CD03 -Dtests.locale=ar-QA -Dtests.timezone=Etc/GMT+7 -Druntime.java=20
12:28:27 
12:28:27 org.elasticsearch.xpack.ccr.LocalIndexFollowingIT > testChangeLeaderIndex FAILED
12:28:27     com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=80, name=elasticsearch[node_s_0][write][T#1], state=RUNNABLE, group=TGRP-LocalIndexFollowingIT]
12:28:27         at __randomizedtesting.SeedInfo.seed([514E751CCA76CD03:4DC2AFDC89283D04]:0)
12:28:27 
12:28:27         Caused by:
12:28:27         java.lang.AssertionError: must be started to do primary indexing
12:28:27             at __randomizedtesting.SeedInfo.seed([514E751CCA76CD03]:0)
12:28:27             at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:2154)
12:28:27             at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:953)
12:28:27             at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1832)
12:28:27             at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1819)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.shardOperationOnPrimary(TransportBulkShardOperationsAction.java:181)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.lambda$dispatchedShardOperationOnPrimary$0(TransportBulkShardOperationsAction.java:102)
12:28:27             at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:313)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.dispatchedShardOperationOnPrimary(TransportBulkShardOperationsAction.java:101)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.dispatchedShardOperationOnPrimary(TransportBulkShardOperationsAction.java:41)
12:28:27             at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:215)
12:28:27             at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
12:28:27             at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
12:28:27             at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
12:28:27             at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
12:28:27             at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
12:28:27             at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
12:28:27             at java.base/java.lang.Thread.run(Thread.java:1623)

I think it probably has the same cause, which is why I mention it on this ticket.

@cbuescher
Member

cbuescher commented Sep 12, 2023

Another one today with timeouts on 8.9: https://gradle-enterprise.elastic.co/s/63qen2ntob2u4

Looking at the thread dump, these seem to be the interesting parts to me, but I might be wrong:

  2> "TEST-CreateIndexIT.testCreateAndDeleteIndexConcurrently-seed#[7335A8C9448F4C48]" ID=452 WAITING on org.elasticsearch.action.support.PlainActionFuture$Sync@5da22735
  2> 	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  2> 	- waiting on org.elasticsearch.action.support.PlainActionFuture$Sync@5da22735
  2> 	at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
  2> 	at app//org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:326)
  2> 	at app//org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:95)
  2> 	at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
  2> 	at app//org.elasticsearch.indices.recovery.PeerRecoverySourceService$OngoingRecoveries.awaitEmpty(PeerRecoverySourceService.java:283)

2> "elasticsearch[node_t1][generic][T#6]" ID=502 WAITING on org.elasticsearch.index.seqno.ReplicationTracker@6828e261
  2> 	at [email protected]/java.lang.Object.wait0(Native Method)
  2> 	- waiting on org.elasticsearch.index.seqno.ReplicationTracker@6828e261
  2> 	at [email protected]/java.lang.Object.wait(Object.java:366)
  2> 	at [email protected]/java.lang.Object.wait(Object.java:339)
  2> 	at app//org.elasticsearch.index.seqno.ReplicationTracker.waitForLocalCheckpointToAdvance(ReplicationTracker.java:1565)
  2> 	at app//org.elasticsearch.index.seqno.ReplicationTracker.markAllocationIdAsInSync(ReplicationTracker.java:1275)
  2> 	at app//org.elasticsearch.index.shard.IndexShard.markAllocationIdAsInSync(IndexShard.java:2706)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$48(RecoverySourceHandler.java:1257)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6160/0x00007f7980bdf938.run(Unknown Source)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$21(RecoverySourceHandler.java:431)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6122/0x00007f7980bd1b98.get(Unknown Source)
  2> 	at app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:313)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$22(RecoverySourceHandler.java:430)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6117/0x00007f7980bd12f8.accept(Unknown Source)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$19(RecoverySourceHandler.java:420)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5544/0x00007f7980888d00.accept(Unknown Source)
  2> 	at app//org.elasticsearch.action.ActionListener.run(ActionListener.java:382)
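
Reading the two stacks together: the test thread is waiting for ongoing recoveries to drain, while the generic thread doing the recovery is parked in `Object.wait()` until the local checkpoint advances; since the index being recovered has been deleted, nothing will ever advance that checkpoint or notify the waiter. A minimal illustration of that kind of hang (assumed shape, not the real `ReplicationTracker` code):

```java
// Minimal sketch of a monitor wait that can never be satisfied once the shard is gone.
final class CheckpointWaitSketch {
    private long localCheckpoint = -1;

    synchronized void waitForLocalCheckpointToAdvance(long required) throws InterruptedException {
        while (localCheckpoint < required) {
            wait();              // never woken if indexing stopped because the index was deleted
        }
    }

    synchronized void updateLocalCheckpoint(long checkpoint) {
        localCheckpoint = checkpoint;
        notifyAll();             // the wake-up that never arrives in the failing scenario
    }
}
```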

@DaveCTurner
Contributor

This failed again today with the same symptoms: stopping the node leaves a recovery in a state that will never complete, which prevents the node from closing. Relabelling this as a recovery bug.

@DaveCTurner DaveCTurner added :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. medium-risk An open issue or test failure that is a medium risk to future releases and removed :Data Management/Indices APIs APIs to create and manage indices and templates Team:Data Management Meta label for data/management team labels Oct 10, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 10, 2023
DaveCTurner added a commit that referenced this issue Oct 10, 2023
@DaveCTurner
Contributor

Possibly related to #100589.

@DaveCTurner
Contributor

Ugh yeah we're missing a CancellableThreads#execute here:

diff --git a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
index fc5df1a4aa2..81bc226102f 100644
--- a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
+++ b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
@@ -33,7 +33,6 @@ import org.elasticsearch.common.unit.ByteSizeValue;
 import org.elasticsearch.common.util.CancellableThreads;
 import org.elasticsearch.common.util.concurrent.CountDown;
 import org.elasticsearch.common.util.set.Sets;
-import org.elasticsearch.core.CheckedRunnable;
 import org.elasticsearch.core.IOUtils;
 import org.elasticsearch.core.Nullable;
 import org.elasticsearch.core.Releasable;
@@ -426,7 +425,7 @@ public class RecoverySourceHandler {
     }

     static void runUnderPrimaryPermit(
-        CheckedRunnable<Exception> action,
+        Runnable action,
         IndexShard primary,
         CancellableThreads cancellableThreads,
         ActionListener<Void> listener
@@ -1260,7 +1259,7 @@ public class RecoverySourceHandler {
          */
         final SubscribableListener<Void> markInSyncStep = new SubscribableListener<>();
         runUnderPrimaryPermit(
-            () -> shard.markAllocationIdAsInSync(request.targetAllocationId(), targetLocalCheckpoint),
+            () -> cancellableThreads.execute(() -> shard.markAllocationIdAsInSync(request.targetAllocationId(), targetLocalCheckpoint)),
             shard,
             cancellableThreads,
             markInSyncStep

@DaveCTurner DaveCTurner added blocker and removed medium-risk An open issue or test failure that is a medium risk to future releases labels Oct 10, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 10, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in elastic#95270.

Closes elastic#96578
Closes elastic#100589
DaveCTurner added a commit that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in #95270.

Closes #96578
Closes #100589
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in elastic#95270.

Closes elastic#96578
Closes elastic#100589
elasticsearchmachine pushed a commit that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in #95270.

Closes #96578
Closes #100589