
[CI] o.o.cluster.ClusterHealthIT.testHealthOnMasterFailover test failure #1693

Closed

opensearch-ci-bot opened this issue Dec 10, 2021 · 10 comments

Labels: bug, cicd, flaky-test, >test-failure, v2.0.0

opensearch-ci-bot (Collaborator) commented Dec 10, 2021

@opensearch-project/opensearch-core team, please take a look at Build 1420 in gradle_check log for more information.

opensearch-ci-bot added the bug and cicd labels on Dec 10, 2021
tlfeng (Collaborator) commented Dec 10, 2021

The test failure in the gradle check log:


> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=93B51E5BAF1F40EE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=America/Rosario -Druntime.java=17

org.opensearch.cluster.ClusterHealthIT > testHealthOnMasterFailover FAILED
    java.util.concurrent.ExecutionException: RemoteTransportException[[node_s3][127.0.0.1:43689][cluster:monitor/health]]; nested: MasterNotDiscoveredException[NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}];
        at __randomizedtesting.SeedInfo.seed([93B51E5BAF1F40EE:D5A3176ACF55C08]:0)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:281)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:268)
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:99)
        at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:393)

        Caused by:
        RemoteTransportException[[node_s3][127.0.0.1:43689][cluster:monitor/health]]; nested: MasterNotDiscoveredException[NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}];

            Caused by:
            MasterNotDiscoveredException[NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}];
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:275)
                at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:369)
                at org.opensearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:174)
                at org.opensearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:142)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retry(TransportMasterNodeAction.java:258)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retryOnMasterChange(TransportMasterNodeAction.java:239)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.access$200(TransportMasterNodeAction.java:143)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.handleException(TransportMasterNodeAction.java:224)
                at org.opensearch.transport.TransportService$6.handleException(TransportService.java:742)
                at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1357)
                at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:345)
                at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296)
                at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:343)
                at org.opensearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:335)
                at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:152)
                at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:108)
                at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759)
                at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170)
                at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145)
                at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110)
                at org.opensearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:341)
                at org.opensearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:246)
                at org.opensearch.nio.BytesChannelContext.read(BytesChannelContext.java:59)
                at org.opensearch.nio.EventHandler.handleRead(EventHandler.java:152)
                at org.opensearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:167)
                at org.opensearch.nio.NioSelector.handleRead(NioSelector.java:438)
                at org.opensearch.nio.NioSelector.processKey(NioSelector.java:264)
                at org.opensearch.nio.NioSelector.singleLoop(NioSelector.java:191)
                at org.opensearch.nio.NioSelector.runLoop(NioSelector.java:148)
                at java.lang.Thread.run(Thread.java:833)

                Caused by:
                NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]
                    at org.opensearch.action.admin.cluster.health.TransportClusterHealthAction$3.onClusterServiceClose(TransportClusterHealthAction.java:274)
                    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:362)
                    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:269)
                    at org.opensearch.cluster.service.ClusterApplierService.doStop(ClusterApplierService.java:189)
                    at org.opensearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:92)
                    at org.opensearch.cluster.service.ClusterService.doStop(ClusterService.java:135)
                    at org.opensearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:92)
                    at org.opensearch.node.Node.stop(Node.java:1187)
                    at org.opensearch.node.Node.close(Node.java:1212)
                    at org.opensearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1098)
                    at org.opensearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:1040)
                    at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1971)
                    at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1937)
                    at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:380)

andrross (Member) commented:

The assertion that fails is here:

https://github.com/andrross/OpenSearch/blob/996d33adb22ac6a523962166f3677f8dc6ac9c1f/server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java#L393

which asserts that the response to the cluster health call is a success, and it fails with a MasterNotDiscoveredException. I was able to semi-regularly reproduce this error on my Mac, and tracked it down in the debugger to the timeout here:

https://github.com/andrross/OpenSearch/blob/996d33adb22ac6a523962166f3677f8dc6ac9c1f/server/src/main/java/org/opensearch/action/support/master/TransportMasterNodeAction.java#L275
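
For anyone else picking this up, here is a minimal sketch of the pattern involved, assuming the test fires a cluster health request at a data node while the elected master is restarted; the variable names and the exact builder calls are illustrative, not the actual test source:

```java
// Illustrative sketch (not the actual ClusterHealthIT source); assumes the usual
// internal-cluster test helpers client(name) and internalCluster(), plus a known
// data node name.
ActionFuture<ClusterHealthResponse> future = client(dataNodeName).admin().cluster()
    .prepareHealth()
    .setWaitForEvents(Priority.LANGUID)   // wait until pending cluster-state tasks are processed
    .setWaitForGreenStatus()
    .execute();

// Restarting the elected master forces TransportMasterNodeAction to retry the
// request until a new master is discovered or its retry timeout fires.
internalCluster().restartNode(internalCluster().getMasterName());

// ClusterHealthIT.java:393 -- this get() is where the ExecutionException wrapping
// MasterNotDiscoveredException surfaces when the timeout at
// TransportMasterNodeAction.java:275 fires before a new master is elected.
ClusterHealthResponse response = future.get();
assertFalse(response.isTimedOut());
```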

nknize changed the title from "Jenkins Randomized Gradle Check Failed on main" to "[CI] o.o.cluster.ClusterHealthIT.testHealthOnMasterFailover test failure" on Dec 16, 2021
nknize (Collaborator) commented Dec 16, 2021

Another failure on PR #1742.

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=58E01485FD8F1669 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m -Djava.security.manager=allow" -Dtests.locale=sr-Latn-RS -Dtests.timezone=America/Montserrat -Druntime.java=17
org.opensearch.cluster.ClusterHealthIT > testHealthOnMasterFailover FAILED
    java.util.concurrent.ExecutionException: MasterNotDiscoveredException[NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}];
        at __randomizedtesting.SeedInfo.seed([58E01485FD8F1669:C60F3BA8FE650A8F]:0)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:281)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:268)
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:99)
        at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:393)

        Caused by:
        MasterNotDiscoveredException[NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}];
            at app//org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:275)
            at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:369)

dblock (Member) commented Dec 22, 2021

Another one in #1791

dreamer-89 (Member) commented:

Looking into it

dreamer-89 (Member) commented:

After running the test multiple times, the failure appears to be much more frequent when index creation is enabled and the master node timeout is shorter.

dreamer-89 (Member) commented Dec 23, 2021

Failure occurrences correlate with the master node timeout and with whether index creation is enabled.

A summary of test failure rates based on these two properties:

| Failures % | Index Not Used | Index Used |
| --- | --- | --- |
| Timeout 60 sec (current) | 0% | ~1% |
| Timeout 10 sec | <0-1% | ~50-70% |
| Timeout 1 sec | <1% | >90% |

dreamer-89 (Member) commented:

The experiment shows that the master node generally takes more time to boot up when indices are used in the test.

Next Steps:

  1. Reduce the flakiness of the test by increasing the timeout to 2 minutes (a rough sketch of this change is below).
  2. Dissect the master bootup time and check why it takes longer when indices are used.
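
As a rough sketch of what the timeout increase in step 1 could look like on the health request (`setMasterNodeTimeout` is the standard master-node request option; the 2-minute value comes from step 1, and where exactly the test applies it is an assumption):

```java
// Sketch only: give the TransportMasterNodeAction retry loop more headroom while
// a new master is elected. The 2-minute value is the proposal from step 1 above;
// the surrounding builder calls mirror the sketch in the earlier comment.
ClusterHealthResponse response = client(dataNodeName).admin().cluster()
    .prepareHealth()
    .setWaitForEvents(Priority.LANGUID)
    .setWaitForGreenStatus()
    .setMasterNodeTimeout(TimeValue.timeValueMinutes(2))
    .get();
```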

anasalkouz added the flaky-test label and removed the untriaged label on Dec 28, 2021
dreamer-89 (Member) commented:

Created a new issue to debug/improve the higher master bootup time when indices are used: #1828
