
[CI] CreateIndexIT testCreateAndDeleteIndexConcurrently failing #96578

Closed

edsavage opened this issue Jun 5, 2023 · 11 comments · Fixed by #100610

Labels
blocker :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@edsavage
Contributor

edsavage commented Jun 5, 2023

Build scan:
https://gradle-enterprise.elastic.co/s/dxnvahurwubcu/tests/:server:internalClusterTest/org.elasticsearch.action.admin.indices.create.CreateIndexIT/testCreateAndDeleteIndexConcurrently

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently" -Dtests.seed=3ED3285ADA7A0152 -Dtests.locale=is -Dtests.timezone=Africa/Tunis -Druntime.java=20

Applicable branches:
main, 8.8

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.action.admin.indices.create.CreateIndexIT&tests.test=testCreateAndDeleteIndexConcurrently

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([3ED3285ADA7A0152]:0)

@edsavage edsavage added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI labels Jun 5, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 5, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@martijnvg
Member

I also noticed this assertion failure in a recent test failure:

org.elasticsearch.action.admin.indices.create.CreateIndexIT > classMethod FAILED	
    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=233, name=elasticsearch[node_t0][write][T#1], state=RUNNABLE, group=TGRP-CreateIndexIT]	
        Caused by:	
        java.lang.AssertionError: must be started to do primary indexing	
            at __randomizedtesting.SeedInfo.seed([F68F828EDF794A8]:0)	
            at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:2154)	
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:953)	
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:900)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:354)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:219)	
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:286)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:137)	
            at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74)	
            at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:215)	
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)	
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)	
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)	
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)	
            at java.base/java.lang.Thread.run(Thread.java:1623)
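
For context, a rough Java sketch of the kind of state check that fires here (the enum and class names below are illustrative, not the real `IndexShard` code): primary indexing is only allowed once the shard has reached the started state, so a write that races with the index being deleted and recreated can land on a shard that is still recovering or already closed.

```java
// Rough sketch (not the actual Elasticsearch source) of the state gate that trips here.
enum ShardStateSketch { CREATED, RECOVERING, POST_RECOVERY, STARTED, CLOSED }

final class WriteGateSketch {
    private volatile ShardStateSketch state = ShardStateSketch.RECOVERING;

    void ensureWriteAllowedOnPrimary() {
        // Mirrors the failing assertion "must be started to do primary indexing".
        assert state == ShardStateSketch.STARTED : "must be started to do primary indexing";
        if (state != ShardStateSketch.STARTED) {
            throw new IllegalStateException("shard is " + state + ", primary writes not allowed");
        }
    }

    void markStarted() {
        state = ShardStateSketch.STARTED;
    }
}
```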

@martijnvg
Member

Btw, I don't think this is a failure for the search team? This test creates and deletes the index that is the target of the writes, concurrently with the indexing.
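
To make the race concrete, here is a hypothetical outline (not the real CreateIndexIT source; `FakeClient` is a stand-in for the test's client calls, not an Elasticsearch API) of the pattern the test exercises: writer threads keep indexing while the index is deleted and immediately recreated underneath them.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical outline of the create/delete-while-indexing race.
class CreateDeleteRaceSketch {
    interface FakeClient {
        void deleteIndex(String name);
        void createIndex(String name);
        void indexDoc(String name, String doc);
    }

    static void run(FakeClient client) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                try {
                    started.await();
                    for (int d = 0; d < 100; d++) {
                        try {
                            client.indexDoc("test", "{\"field\":" + d + "}");
                        } catch (Exception e) {
                            // The index may be missing or its shard not yet started;
                            // the race itself is the point of the test.
                        }
                    }
                } catch (InterruptedException ignored) {
                }
            });
            writers[i].start();
        }
        started.countDown();
        client.deleteIndex("test");   // delete while the writers are still running
        client.createIndex("test");   // and immediately recreate the index
        for (Thread t : writers) {
            t.join();
        }
    }
}
```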

@martijnvg martijnvg added :Data Management/Indices APIs APIs to create and manage indices and templates Team:Data Management Meta label for data/management team and removed :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team labels Jun 15, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@kingherc
Contributor

I also just saw a similar error to the one mentioned above, here for a different test (testChangeLeaderIndex):

12:28:26 REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' --tests "org.elasticsearch.xpack.ccr.LocalIndexFollowingIT.testChangeLeaderIndex" -Dtests.seed=514E751CCA76CD03 -Dtests.locale=ar-QA -Dtests.timezone=Etc/GMT+7 -Druntime.java=20
12:28:27 
12:28:27 org.elasticsearch.xpack.ccr.LocalIndexFollowingIT > testChangeLeaderIndex FAILED
12:28:27     com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=80, name=elasticsearch[node_s_0][write][T#1], state=RUNNABLE, group=TGRP-LocalIndexFollowingIT]
12:28:27         at __randomizedtesting.SeedInfo.seed([514E751CCA76CD03:4DC2AFDC89283D04]:0)
12:28:27 
12:28:27         Caused by:
12:28:27         java.lang.AssertionError: must be started to do primary indexing
12:28:27             at __randomizedtesting.SeedInfo.seed([514E751CCA76CD03]:0)
12:28:27             at org.elasticsearch.index.shard.IndexShard.ensureWriteAllowed(IndexShard.java:2154)
12:28:27             at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:953)
12:28:27             at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1832)
12:28:27             at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1819)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.shardOperationOnPrimary(TransportBulkShardOperationsAction.java:181)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.lambda$dispatchedShardOperationOnPrimary$0(TransportBulkShardOperationsAction.java:102)
12:28:27             at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:313)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.dispatchedShardOperationOnPrimary(TransportBulkShardOperationsAction.java:101)
12:28:27             at org.elasticsearch.xpack.ccr.action.bulk.TransportBulkShardOperationsAction.dispatchedShardOperationOnPrimary(TransportBulkShardOperationsAction.java:41)
12:28:27             at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:215)
12:28:27             at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
12:28:27             at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
12:28:27             at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
12:28:27             at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
12:28:27             at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
12:28:27             at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
12:28:27             at java.base/java.lang.Thread.run(Thread.java:1623)

I think it probably has the same cause, which is why I mention it on this ticket.

@cbuescher
Member

cbuescher commented Sep 12, 2023

Another one today with timeouts on 8.9: https://gradle-enterprise.elastic.co/s/63qen2ntob2u4

Looking at the thread dump, these seem to be the interesting parts to me, but I might be wrong:

  2> "TEST-CreateIndexIT.testCreateAndDeleteIndexConcurrently-seed#[7335A8C9448F4C48]" ID=452 WAITING on org.elasticsearch.action.support.PlainActionFuture$Sync@5da22735
  2> 	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  2> 	- waiting on org.elasticsearch.action.support.PlainActionFuture$Sync@5da22735
  2> 	at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
  2> 	at app//org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:326)
  2> 	at app//org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:95)
  2> 	at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
  2> 	at app//org.elasticsearch.indices.recovery.PeerRecoverySourceService$OngoingRecoveries.awaitEmpty(PeerRecoverySourceService.java:283)

2> "elasticsearch[node_t1][generic][T#6]" ID=502 WAITING on org.elasticsearch.index.seqno.ReplicationTracker@6828e261
  2> 	at [email protected]/java.lang.Object.wait0(Native Method)
  2> 	- waiting on org.elasticsearch.index.seqno.ReplicationTracker@6828e261
  2> 	at [email protected]/java.lang.Object.wait(Object.java:366)
  2> 	at [email protected]/java.lang.Object.wait(Object.java:339)
  2> 	at app//org.elasticsearch.index.seqno.ReplicationTracker.waitForLocalCheckpointToAdvance(ReplicationTracker.java:1565)
  2> 	at app//org.elasticsearch.index.seqno.ReplicationTracker.markAllocationIdAsInSync(ReplicationTracker.java:1275)
  2> 	at app//org.elasticsearch.index.shard.IndexShard.markAllocationIdAsInSync(IndexShard.java:2706)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$48(RecoverySourceHandler.java:1257)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6160/0x00007f7980bdf938.run(Unknown Source)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$21(RecoverySourceHandler.java:431)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6122/0x00007f7980bd1b98.get(Unknown Source)
  2> 	at app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:313)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$22(RecoverySourceHandler.java:430)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$6117/0x00007f7980bd12f8.accept(Unknown Source)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$19(RecoverySourceHandler.java:420)
  2> 	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5544/0x00007f7980888d00.accept(Unknown Source)
  2> 	at app//org.elasticsearch.action.ActionListener.run(ActionListener.java:382)
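
Reading the two stacks together: the test thread is waiting for ongoing recoveries to drain, while the generic thread doing the recovery is parked in `Object.wait()` until the local checkpoint advances; since the index being recovered has been deleted, nothing will ever advance that checkpoint or notify the waiter. A minimal illustration of that kind of hang (assumed shape, not the real `ReplicationTracker` code):

```java
// Minimal sketch of a monitor wait that can never be satisfied once the shard is gone.
final class CheckpointWaitSketch {
    private long localCheckpoint = -1;

    synchronized void waitForLocalCheckpointToAdvance(long required) throws InterruptedException {
        while (localCheckpoint < required) {
            wait();              // never woken if indexing stopped because the index was deleted
        }
    }

    synchronized void updateLocalCheckpoint(long checkpoint) {
        localCheckpoint = checkpoint;
        notifyAll();             // the wake-up that never arrives in the failing scenario
    }
}
```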

@DaveCTurner
Contributor

This failed again today with the same symptoms: stopping the node leaves a recovery in a state that will never complete, which prevents the node from closing. Relabelling this as a recovery bug.

@DaveCTurner DaveCTurner added :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. medium-risk An open issue or test failure that is a medium risk to future releases and removed :Data Management/Indices APIs APIs to create and manage indices and templates Team:Data Management Meta label for data/management team labels Oct 10, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 10, 2023
DaveCTurner added a commit that referenced this issue Oct 10, 2023
@DaveCTurner
Contributor

Possibly related to #100589.

@DaveCTurner
Contributor

Ugh yeah we're missing a CancellableThreads#execute here:

diff --git a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
index fc5df1a4aa2..81bc226102f 100644
--- a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
+++ b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
@@ -33,7 +33,6 @@ import org.elasticsearch.common.unit.ByteSizeValue;
 import org.elasticsearch.common.util.CancellableThreads;
 import org.elasticsearch.common.util.concurrent.CountDown;
 import org.elasticsearch.common.util.set.Sets;
-import org.elasticsearch.core.CheckedRunnable;
 import org.elasticsearch.core.IOUtils;
 import org.elasticsearch.core.Nullable;
 import org.elasticsearch.core.Releasable;
@@ -426,7 +425,7 @@ public class RecoverySourceHandler {
     }

     static void runUnderPrimaryPermit(
-        CheckedRunnable<Exception> action,
+        Runnable action,
         IndexShard primary,
         CancellableThreads cancellableThreads,
         ActionListener<Void> listener
@@ -1260,7 +1259,7 @@ public class RecoverySourceHandler {
          */
         final SubscribableListener<Void> markInSyncStep = new SubscribableListener<>();
         runUnderPrimaryPermit(
-            () -> shard.markAllocationIdAsInSync(request.targetAllocationId(), targetLocalCheckpoint),
+            () -> cancellableThreads.execute(() -> shard.markAllocationIdAsInSync(request.targetAllocationId(), targetLocalCheckpoint)),
             shard,
             cancellableThreads,
             markInSyncStep

@DaveCTurner DaveCTurner added blocker and removed medium-risk An open issue or test failure that is a medium risk to future releases labels Oct 10, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 10, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in elastic#95270.

Closes elastic#96578
Closes elastic#100589
DaveCTurner added a commit that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in #95270.

Closes #96578
Closes #100589
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in elastic#95270.

Closes elastic#96578
Closes elastic#100589
elasticsearchmachine pushed a commit that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in #95270.

Closes #96578
Closes #100589