Investigate the spike in flaky test failures as a function of the gradle check configuration and Jenkins runner instance sizing #321

nknize · 2023-07-07T22:27:01Z

Is your feature request related to a problem? Please describe

Coming out of this public Slack discussion, I'd like to explore a possible spike in flaky test failures during gradlew check on PRs in the OpenSearch core repository during regular business hours.

The concrete test failures we're noticing are similar to:

Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at org.opensearch.nio.SocketChannelContext.connect(SocketChannelContext.java:157) ~[opensearch-nio-2.9.0-SNAPSHOT.jar:2.9.0-SNAPSHOT]

This can be seen in this one instance. It seems mostly related to socket issues on the runner and seems to occur in "aggressive" integration tests (e.g., those using Scope.Test level, which fires up a new cluster for each test method).

With Jenkins having its own runner for each invocation, I wouldn't expect the high level of activity (e.g., multiple PRs throughout the day) to contribute, so maybe this is more related to the test intensity, the --parallel gradle invocation, and the size of the runner instance?

Describe the solution you'd like

As a parallel effort to leaning out the intense integration tests in the core repo, I'd like us to see if we can root-cause these timeouts as a function of instance resources (e.g., CPU, memory) and the test configuration (e.g., number of concurrent integration tests, number of sockets).

It may be that we just aren't closing the sockets in the core IntegrationTest class (we can explore that separately).

Describe alternatives you've considered

Check that the core integration test harness is properly closing sockets.
Check the socket pool configuration in the core test framework.
... other core improvements not explicitly mentioned here.

Additional context

Thank you!


peterzhuamazon · 2023-07-07T23:08:03Z

We will try to create a new runner matching @nknize's own env specs: 32/128 (cores/GB of memory), similar to m5.8xlarge.
The difference may matter: Nick has 32/128 but we have 96/192, which means --parallel creates 3 times more parallel tasks on our instance, and each job is assigned half the memory.

Also, his desktop setup means the single-core clock frequency of his CPU is much higher than that of Intel server CPUs. That needs to be taken into account as well. I will start investigating this next week.
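
The core/memory comparison above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes one parallel worker per core (which is roughly how Gradle sizes its worker pool by default) and that memory is split evenly between workers. Neither assumption has been verified against the actual runner configuration, and the function name is made up for illustration.

```python
def memory_per_worker_gb(cores: int, memory_gb: int) -> float:
    """Memory available to each worker, ASSUMING one worker per core
    and an even split of memory (both are simplifications)."""
    return memory_gb / cores


# The (cores, memory in GB) pairs quoted in this thread.
runners = {
    "32/128 (similar to m5.8xlarge)": (32, 128),
    "96/192 (current runner)": (96, 192),
}

for name, (cores, memory_gb) in runners.items():
    per_worker = memory_per_worker_gb(cores, memory_gb)
    print(f"{name}: up to {cores} parallel workers, {per_worker:.1f} GB each")

# Ratios between the two setups.
print("worker ratio:", 96 // 32)
print("memory-per-worker ratio:",
      memory_per_worker_gb(32, 128) / memory_per_worker_gb(96, 192))
```

Under these assumptions the larger instance runs three times as many workers with half the memory each, which matches the estimate given above.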

Thanks.

peterzhuamazon · 2023-07-10T22:08:46Z

PRs:

[Python 3.9 Upgrade] Update macos agent to Python 3.9 export and add M58xlarge gradle check runner #323
Add new runner for gradle check on Ubuntu opensearch-build#3721
Tweak description of gradle check to include better agent information opensearch-build#3726
Add AL2023 Docker Host Runner and Make M58xlarge Gradle Check Host #330
Switch validation workflow for docker to run on AL2023 and Gradle Check to new M58xlarge runners opensearch-build#3810

peterzhuamazon · 2023-07-11T16:52:21Z

Test main:

https://build.ci.opensearch.org/job/gradle-check/19916/console

peterzhuamazon · 2023-07-20T23:33:14Z

Several days of data show the new setup has a 90% unstable rate vs. a 10% success rate, but we have yet to see a complete failure.

So it is possible that the new m5.8xlarge spec is better than the original c5.24xlarge setup.

Thanks.

peterzhuamazon · 2023-07-21T17:14:23Z

We have decided to test switching the default runner to m5.8xlarge next week.

peterzhuamazon · 2023-07-25T17:46:44Z

New spec live. Monitoring a bit.

peterzhuamazon · 2023-07-25T23:04:52Z

More successful runs.

bbarani · 2023-07-31T17:09:02Z

Closing this issue as the changes were completed.

nknize added the enhancement and untriaged labels Jul 7, 2023

nknize assigned peterzhuamazon Jul 7, 2023

peterzhuamazon removed the untriaged label Jul 7, 2023

peterzhuamazon added this to OpenSearch Engineering Effectiveness Jul 7, 2023

github-project-automation bot moved this to Backlog in OpenSearch Engineering Effectiveness Jul 7, 2023

peterzhuamazon added bug packer agents and removed bug labels Jul 7, 2023

peterzhuamazon moved this from Backlog to In Progress in OpenSearch Engineering Effectiveness Jul 7, 2023


bbarani closed this as completed Jul 31, 2023

github-project-automation bot moved this from In Progress to Done in OpenSearch Engineering Effectiveness Jul 31, 2023

peterzhuamazon mentioned this issue Oct 4, 2023

Remove old Windows runner and replace Windows Gradle Check Runner AMI ID with M58xlarge #355

Merged
