[CI] RemoteClusterConnectionTests.testTriggerUpdatesConcurrently #28695

Closed
andyb-elastic opened this issue Feb 15, 2018 · 4 comments
Labels
:Core/Infra/Core (Core issues without another label), >test-failure (Triaged test failures from CI)

Comments

@andyb-elastic
Contributor

Doesn't reproduce locally

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1385/consoleText

build-1385-RemoteClusterConnectionTests.txt

  2> REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=56BD32C042A1AA87 -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests -Dtests.method="testTriggerUpdatesConcurrently" -Dtests.security.manager=true -Dtests.locale=sv -Dtests.timezone=America/Blanc-Sablon
ERROR   0.44s J1 | RemoteClusterConnectionTests.testTriggerUpdatesConcurrently <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=368, name=elasticsearch[org.elasticsearch.transport.RemoteClusterConnectionTests][management][T#2], state=RUNNABLE, group=TGRP-RemoteClusterConnectionTests]
   > 	at __randomizedtesting.SeedInfo.seed([56BD32C042A1AA87:D034FFCE57266DE0]:0)
   > Caused by: java.lang.AssertionError
   > 	at __randomizedtesting.SeedInfo.seed([56BD32C042A1AA87]:0)
   > 	at org.elasticsearch.transport.RemoteClusterConnectionTests$6.lambda$run$1(RemoteClusterConnectionTests.java:596)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:138)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.lambda$doRun$1(RemoteClusterConnection.java:407)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:474)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:412)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   > 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

@elastic/es-core-infra can someone take a look at this?

andyb-elastic added the :Core/Infra/Core and >test-failure labels Feb 15, 2018
@andyb-elastic
Contributor Author

This has failed several times recently in CI across different workers and operating systems, but doesn't reproduce locally. I've muted it on master (70b279d), 6.x (04b77fd), and 6.2 (a6c01b7).
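For reference, a minimal sketch of how a flaky test is typically muted in the Elasticsearch test suite, using the Lucene test framework's @AwaitsFix annotation; this is illustrative only and the muting commits above may differ in detail:

```java
// Illustrative only: muting a test with the Lucene test framework's
// @AwaitsFix annotation, as commonly done in the Elasticsearch test suite.
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
import org.elasticsearch.test.ESTestCase;

public class RemoteClusterConnectionTests extends ESTestCase {

    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/28695")
    public void testTriggerUpdatesConcurrently() throws Exception {
        // existing test body unchanged; the annotation skips the test until the
        // referenced issue is fixed
    }
}
```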

@martijnvg
Member

martijnvg commented Feb 16, 2018

Other test failures in the same test class fail with the same assertion error. Whichever test runs last gets blamed for the uncaught assertion error that the testTriggerUpdatesConcurrently test causes:

REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=C57042CE2984974E -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests -Dtests.method="testGetConnectionInfo" -Dtests.security.manager=true -Dtests.locale=es-VE -Dtests.timezone=SystemV/PST8
ERROR   0.35s J1 | RemoteClusterConnectionTests.testGetConnectionInfo <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1222, name=elasticsearch[org.elasticsearch.transport.RemoteClusterConnectionTests][management][T#2], state=RUNNABLE, group=TGRP-RemoteClusterConnectionTests]
   > 	at __randomizedtesting.SeedInfo.seed([C57042CE2984974E:CCBBF5CDD215623A]:0)
   > Caused by: java.lang.AssertionError
   > 	at __randomizedtesting.SeedInfo.seed([C57042CE2984974E]:0)
   > 	at org.elasticsearch.transport.RemoteClusterConnectionTests$6.lambda$run$1(RemoteClusterConnectionTests.java:596)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:138)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.lambda$doRun$1(RemoteClusterConnection.java:407)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:474)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:412)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   > 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

I'm also unable to reproduce locally and still trying to understand this failure.

I suspect that the changes in #28667 are why RemoteClusterConnectionTests has been failing so often recently: those changes uncovered assertion errors that would otherwise have gone unnoticed.
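To illustrate the failure mode, here is a hypothetical sketch (not the actual test code) of the kind of one-shot listener guard the stack traces above point at (RemoteClusterConnectionTests.java:596): a second onFailure call from a lingering management-pool thread trips the assertion, and that AssertionError surfaces as an uncaught exception on the pool thread, which the test framework then attributes to whatever test is currently running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a listener that asserts it is completed at most once.
final class OneShotListener {
    private final AtomicBoolean completed = new AtomicBoolean();
    private final CountDownLatch latch = new CountDownLatch(1);

    void onResponse(Void ignored) {
        boolean firstCompletion = completed.compareAndSet(false, true);
        assert firstCompletion; // must only complete once
        latch.countDown();
    }

    void onFailure(Exception e) {
        boolean firstCompletion = completed.compareAndSet(false, true);
        assert firstCompletion : e; // a second invocation trips this assert
        latch.countDown();
    }

    void await() throws InterruptedException {
        latch.await();
    }
}
```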

This test also fails on the 5.6 branch, so I will mute it there too:

ERROR   0.11s J2 | RemoteClusterConnectionTests.testNodeDisconnected <<< FAILURES!
 Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=417, name=elasticsearch[org.elasticsearch.transport.RemoteClusterConnectionTests][management][T#2], state=RUNNABLE, group=TGRP-RemoteClusterConnectionTests]
   > 	at __randomizedtesting.SeedInfo.seed([CF40251205CBC3BE:6143103BEBD223C]:0)
  2> NOTE: test params are: codec=Asserting(Lucene62): {}, docValues:{}, maxPointsInLeafNode=1971, maxMBSortInHeap=6.355498547106645, sim=RandomSimilarity(queryNorm=false,coord=no): {}, locale=es-EC, timezone=Etc/UCT
  2> NOTE: Linux 4.4.0-1050-aws amd64/Oracle Corporation 1.8.0_162 (64-bit)/cpus=4,threads=1,free=430621368,total=516423680
  2> NOTE: All tests run in this JVM: [NodeAllocationResultTests, RestTableTests, QueriesTests, InternalFilterTests, SpanOrQueryBuilderTests, BalancedSingleShardTests, CodecTests, TransportReplicationActionTests, BootstrapChecksTests, InternalEngineSettingsTests, BucketScriptTests, ExistsQueryBuilderTests, CustomUnifiedHighlighterTests, PrimaryTermsTests, ContextPreservingActionListenerTests, SearchScrollAsyncActionTests, DateTimeUnitTests, MinTests, BulkShardRequestTests, NodeRemovalClusterStateTaskExecutorTests, RemoteClusterConnectionTests]
   > Caused by: java.lang.AssertionError
   > 	at __randomizedtesting.SeedInfo.seed([CF40251205CBC3BE]:0)
   > 	at org.elasticsearch.transport.RemoteClusterConnectionTests$5.lambda$run$1(RemoteClusterConnectionTests.java:504)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67)
   > 	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:101)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.lambda$doRun$1(RemoteClusterConnection.java:397)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:464)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:402)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   > 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

5.6 build URL: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+oracle-java9-periodic/92/consoleText

martijnvg added a commit that referenced this issue Feb 16, 2018
Relates to #28695
jasontedor added a commit that referenced this issue Feb 16, 2018
This test has a race condition. The action listener used to listen for
connections has a guard against being executed twice; however, the
listener can in fact be executed twice. After onResponse is invoked, the
test starts to tear down. At that point, the threads the test forked
terminate and the remote cluster connection is closed. However, a thread
forked to the management thread pool by the remote cluster connection can
still be executing and trying to continue connecting. That thread is
cancelled when the remote cluster connection is closed, which leads to
the action listener being invoked again. To address this, we explicitly
check that onFailure was invoked because of cancellation, and we assert
that the listener had already been invoked. Interestingly, this issue has
always been present, but a recent change (#28667) exposed errors that
occur on tasks submitted to the thread pool and that were previously
being silently lost.

Relates #28695
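A rough sketch of the shape of that fix, paraphrasing the commit message above rather than reproducing the actual patch: a second onFailure invocation is tolerated only when it was caused by cancellation of the connect attempt during teardown. The cancellation exception type used here is a stand-in, not necessarily what the production code throws.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

// Rough sketch of the fix described in the commit message above; not the real patch.
final class CancellationTolerantListener {
    private final AtomicBoolean completed = new AtomicBoolean();

    void onResponse(Void ignored) {
        boolean firstCompletion = completed.compareAndSet(false, true);
        assert firstCompletion; // success must be the first completion
    }

    void onFailure(Exception e) {
        if (completed.compareAndSet(false, true) == false) {
            // Already completed: only acceptable if this call came from the
            // connect handler being cancelled when the connection was closed.
            assert e instanceof CancellationException : e;
        }
    }
}
```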
jasontedor added a commit that referenced this issue Feb 16, 2018
jasontedor added a commit that referenced this issue Feb 16, 2018
This test has a race condition. The action listener used to listen for
connections has a guard against being executed twice. However, this
listener can be executed twice. After on success is invoked the test
starts to tear down. At this point, the threads the test forked will
terminate and the remote cluster connection will be closed. However, a
thread forked to the management thread pool by the remote cluster
connection can still be executing and try to continue connecting. This
thread will be cancelled when the remote cluster connection is closed
and this leads to the action listener being invoked again. To address
this, we explicitly check that the reason that on failure was invoked
was cancellation, and we assert that the listener was already previously
invoked. Interestingly, this issue has always been present yet a recent
change (#28667) exposed errors that occur on tasks submitted to the
thread pool and were silently being lost.

Relates #28695
@jasontedor
Member

Closed by 10666a4

jasontedor added a commit that referenced this issue Feb 16, 2018