[CI] RemoteClusterConnectionTests.testTriggerUpdatesConcurrently #28695

Closed
andyb-elastic opened this issue Feb 15, 2018 · 4 comments
Labels
:Core/Infra/Core (Core issues without another label), >test-failure (Triaged test failures from CI)

Comments

@andyb-elastic
Contributor

Doesn't reproduce locally

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1385/consoleText

build-1385-RemoteClusterConnectionTests.txt

  2> REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=56BD32C042A1AA87 -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests -Dtests.method="testTriggerUpdatesConcurrently" -Dtests.security.manager=true -Dtests.locale=sv -Dtests.timezone=America/Blanc-Sablon
ERROR   0.44s J1 | RemoteClusterConnectionTests.testTriggerUpdatesConcurrently <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=368, name=elasticsearch[org.elasticsearch.transport.RemoteClusterConnectionTests][management][T#2], state=RUNNABLE, group=TGRP-RemoteClusterConnectionTests]
   > 	at __randomizedtesting.SeedInfo.seed([56BD32C042A1AA87:D034FFCE57266DE0]:0)
   > Caused by: java.lang.AssertionError
   > 	at __randomizedtesting.SeedInfo.seed([56BD32C042A1AA87]:0)
   > 	at org.elasticsearch.transport.RemoteClusterConnectionTests$6.lambda$run$1(RemoteClusterConnectionTests.java:596)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:138)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.lambda$doRun$1(RemoteClusterConnection.java:407)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:474)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:412)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   > 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

@elastic/es-core-infra can someone take a look at this?

andyb-elastic added the :Core/Infra/Core and >test-failure labels Feb 15, 2018
@andyb-elastic
Contributor Author

This has failed several times recently in CI across different workers and operating systems, but doesn't reproduce locally. I've muted it on master (70b279d), 6.x (04b77fd), and 6.2 (a6c01b7).
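For reference, a minimal sketch of how a flaky test is typically muted in the Elasticsearch test suite, using the Lucene test framework's @AwaitsFix annotation; this is illustrative only and the muting commits above may differ in detail:

```java
// Illustrative only: muting a test with the Lucene test framework's
// @AwaitsFix annotation, as commonly done in the Elasticsearch test suite.
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
import org.elasticsearch.test.ESTestCase;

public class RemoteClusterConnectionTests extends ESTestCase {

    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/28695")
    public void testTriggerUpdatesConcurrently() throws Exception {
        // existing test body unchanged; the annotation skips the test until the
        // referenced issue is fixed
    }
}
```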

@martijnvg
Member

martijnvg commented Feb 16, 2018

Other test failures in the same test class fail with the same assertion error. Whichever test runs last gets blamed for the uncaught assertion error that the testTriggerUpdatesConcurrently test causes:

REPRODUCE WITH: ./gradlew :server:test -Dtests.seed=C57042CE2984974E -Dtests.class=org.elasticsearch.transport.RemoteClusterConnectionTests -Dtests.method="testGetConnectionInfo" -Dtests.security.manager=true -Dtests.locale=es-VE -Dtests.timezone=SystemV/PST8
ERROR   0.35s J1 | RemoteClusterConnectionTests.testGetConnectionInfo <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1222, name=elasticsearch[org.elasticsearch.transport.RemoteClusterConnectionTests][management][T#2], state=RUNNABLE, group=TGRP-RemoteClusterConnectionTests]
   > 	at __randomizedtesting.SeedInfo.seed([C57042CE2984974E:CCBBF5CDD215623A]:0)
   > Caused by: java.lang.AssertionError
   > 	at __randomizedtesting.SeedInfo.seed([C57042CE2984974E]:0)
   > 	at org.elasticsearch.transport.RemoteClusterConnectionTests$6.lambda$run$1(RemoteClusterConnectionTests.java:596)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:138)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.lambda$doRun$1(RemoteClusterConnection.java:407)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:474)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:412)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   > 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

I'm also unable to reproduce locally and still trying to understand this failure.

I suspect that the changes in #28667 are why RemoteClusterConnectionTests has been failing so often recently: those changes uncovered assertion errors that would otherwise have gone unnoticed.
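To illustrate the failure mode, here is a hypothetical sketch (not the actual test code) of the kind of one-shot listener guard the stack traces above point at (RemoteClusterConnectionTests.java:596): a second onFailure call from a lingering management-pool thread trips the assertion, and that AssertionError surfaces as an uncaught exception on the pool thread, which the test framework then attributes to whatever test is currently running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a listener that asserts it is completed at most once.
final class OneShotListener {
    private final AtomicBoolean completed = new AtomicBoolean();
    private final CountDownLatch latch = new CountDownLatch(1);

    void onResponse(Void ignored) {
        boolean firstCompletion = completed.compareAndSet(false, true);
        assert firstCompletion; // must only complete once
        latch.countDown();
    }

    void onFailure(Exception e) {
        boolean firstCompletion = completed.compareAndSet(false, true);
        assert firstCompletion : e; // a second invocation trips this assert
        latch.countDown();
    }

    void await() throws InterruptedException {
        latch.await();
    }
}
```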

This test also fails on the 5.6 branch, so I will mute it there too:

ERROR   0.11s J2 | RemoteClusterConnectionTests.testNodeDisconnected <<< FAILURES!
 Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=417, name=elasticsearch[org.elasticsearch.transport.RemoteClusterConnectionTests][management][T#2], state=RUNNABLE, group=TGRP-RemoteClusterConnectionTests]
   > 	at __randomizedtesting.SeedInfo.seed([CF40251205CBC3BE:6143103BEBD223C]:0)
  2> NOTE: test params are: codec=Asserting(Lucene62): {}, docValues:{}, maxPointsInLeafNode=1971, maxMBSortInHeap=6.355498547106645, sim=RandomSimilarity(queryNorm=false,coord=no): {}, locale=es-EC, timezone=Etc/UCT
  2> NOTE: Linux 4.4.0-1050-aws amd64/Oracle Corporation 1.8.0_162 (64-bit)/cpus=4,threads=1,free=430621368,total=516423680
  2> NOTE: All tests run in this JVM: [NodeAllocationResultTests, RestTableTests, QueriesTests, InternalFilterTests, SpanOrQueryBuilderTests, BalancedSingleShardTests, CodecTests, TransportReplicationActionTests, BootstrapChecksTests, InternalEngineSettingsTests, BucketScriptTests, ExistsQueryBuilderTests, CustomUnifiedHighlighterTests, PrimaryTermsTests, ContextPreservingActionListenerTests, SearchScrollAsyncActionTests, DateTimeUnitTests, MinTests, BulkShardRequestTests, NodeRemovalClusterStateTaskExecutorTests, RemoteClusterConnectionTests]
   > Caused by: java.lang.AssertionError
   > 	at __randomizedtesting.SeedInfo.seed([CF40251205CBC3BE]:0)
   > 	at org.elasticsearch.transport.RemoteClusterConnectionTests$5.lambda$run$1(RemoteClusterConnectionTests.java:504)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67)
   > 	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:101)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.lambda$doRun$1(RemoteClusterConnection.java:397)
   > 	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:464)
   > 	at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:402)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   > 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

5.6 build URL: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+5.6+oracle-java9-periodic/92/consoleText

martijnvg added a commit that referenced this issue Feb 16, 2018
Relates to #28695
jasontedor added a commit that referenced this issue Feb 16, 2018
This test has a race condition. The action listener used to listen for
connections has a guard against being executed twice; however, the
listener can in fact be executed twice. After onResponse is invoked, the
test starts to tear down. At that point, the threads the test forked
terminate and the remote cluster connection is closed. However, a thread
forked to the management thread pool by the remote cluster connection can
still be executing and trying to continue connecting. That thread is
cancelled when the remote cluster connection is closed, which leads to
the action listener being invoked again. To address this, we explicitly
check that onFailure was invoked because of cancellation, and we assert
that the listener had already been invoked. Interestingly, this issue has
always been present, but a recent change (#28667) exposed errors that
occur on tasks submitted to the thread pool and that were previously
being silently lost.

Relates #28695
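A rough sketch of the shape of that fix, paraphrasing the commit message above rather than reproducing the actual patch: a second onFailure invocation is tolerated only when it was caused by cancellation of the connect attempt during teardown. The cancellation exception type used here is a stand-in, not necessarily what the production code throws.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

// Rough sketch of the fix described in the commit message above; not the real patch.
final class CancellationTolerantListener {
    private final AtomicBoolean completed = new AtomicBoolean();

    void onResponse(Void ignored) {
        boolean firstCompletion = completed.compareAndSet(false, true);
        assert firstCompletion; // success must be the first completion
    }

    void onFailure(Exception e) {
        if (completed.compareAndSet(false, true) == false) {
            // Already completed: only acceptable if this call came from the
            // connect handler being cancelled when the connection was closed.
            assert e instanceof CancellationException : e;
        }
    }
}
```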
jasontedor added a commit that referenced this issue Feb 16, 2018
jasontedor added a commit that referenced this issue Feb 16, 2018
This test has a race condition. The action listener used to listen for
connections has a guard against being executed twice. However, this
listener can be executed twice. After on success is invoked the test
starts to tear down. At this point, the threads the test forked will
terminate and the remote cluster connection will be closed. However, a
thread forked to the management thread pool by the remote cluster
connection can still be executing and try to continue connecting. This
thread will be cancelled when the remote cluster connection is closed
and this leads to the action listener being invoked again. To address
this, we explicitly check that the reason that on failure was invoked
was cancellation, and we assert that the listener was already previously
invoked. Interestingly, this issue has always been present yet a recent
change (#28667) exposed errors that occur on tasks submitted to the
thread pool and were silently being lost.

Relates #28695
@jasontedor
Member

Closed by 10666a4

jasontedor added a commit that referenced this issue Feb 16, 2018