
Darwin build times out #58286

Closed
nik9000 opened this issue Jun 17, 2020 · 25 comments
Assignees: mark-vieira
Labels: :Delivery/Build (Build or test infrastructure) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI)

Comments

@nik9000
Member

nik9000 commented Jun 17, 2020

The darwin build seems to fail due to timeouts a lot. I'm not sure if anything happened recently, but I figure the build folks will have more of the right tools to track it down, so I'm tagging them.

Build scan:
scan

Repro line:

./gradlew -p x-pack/plugin check

Reproduces locally?:
No.

Applicable branches: mostly master

Failure history:
link

Failure excerpt:

java.lang.Exception: Suite timeout exceeded (>= 2400000 msec).
nik9000 added the :Delivery/Build (Build or test infrastructure) and >test-failure (Triaged test failures from CI) labels on Jun 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

elasticmachine added the Team:Core/Infra (Meta label for core/infra team) label on Jun 17, 2020
@droberts195
Contributor

Some more examples are:

The same thing happened last year - see #48148. That was found to be because the macOS worker couldn't cope with running multiple integration test suites in parallel. The suite timeouts were stopped by reducing the parallelism, setting --max-workers=2 on the Gradle invocation on macOS workers.
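
For reference, that cap is just a flag on the Gradle command line. An illustrative invocation, based on the repro line from this issue rather than the exact Jenkins command (which lives in the private config repo):

./gradlew -p x-pack/plugin check --max-workers=2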

That change is still in effect (in a file in the private repo that contains the Jenkins config), but something caused the parallelism to increase in macOS CI around 11th June. You can see the effect in the "Build Time Trend" in Jenkins. For example, in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-darwin-compatibility/buildTimeTrend the time taken drops significantly between build 30 and 31, but at the cost of a large proportion of the subsequent builds failing due to suite timeouts.

You can see the parallelism increase in the build scan timelines:

Those two builds both succeeded, but based on the investigation in #48148 I'm sure it's that increase in parallelism that's led to the macOS builds being flaky ever since.

Do the macOS CI workers have spinning disks rather than SSDs? They seem to be staggeringly slow under load. In one of the Gradle scans the log shows it took 6 seconds to install 1 index template:

[2020-06-17T22:20:47,902][INFO ][o.e.c.m.MetadataIndexTemplateService] [node_t0] adding template [.ml-stats] for index patterns [.ml-stats-*]
[2020-06-17T22:20:53,436][INFO ][o.e.c.m.MetadataIndexTemplateService] [node_t0] adding template [.ml-meta] for index patterns [.ml-meta]

Or maybe it's because the macOS worker has less RAM than the Linux workers and is having to use swap to run four test suites at the same time.

@ywelsch
Contributor

ywelsch commented Jul 29, 2020

Can someone from @elastic/es-core-infra pick this up please?

@droberts195
Contributor

Some more recent failures:

Why did the number of suites run in parallel increase from 1 to 4 on 11th June? Is there an easy way to cut it to 3 and see if that helps?

@pugnascotia
Contributor

@mark-vieira I know we've been talking about the Darwin builds - do we have any concrete plans yet?

@mark-vieira
Contributor

I think realistically our best option is just to bump the test suite timeouts until we have better macOS workers.
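
For context, the suite timeout in the failures above is 2,400,000 ms, i.e. 40 minutes. A minimal sketch of what bumping a suite timeout looks like with the randomized-testing framework these suites run under; the class name and the 60-minute value are illustrative, not the exact change from the commits below:

import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
import org.apache.lucene.util.TimeUnits;
import org.elasticsearch.test.ESTestCase;

// Hypothetical suite; the real change would annotate the affected IT classes.
@TimeoutSuite(millis = 60 * TimeUnits.MINUTE)
public class SlowDarwinExampleTests extends ESTestCase {
    public void testPlaceholder() {
        // placeholder so the suite has at least one test method
        assertTrue(true);
    }
}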

rjernst added a commit to rjernst/elasticsearch that referenced this issue Aug 13, 2020
The Darwin CI hosts continue to struggle with timeouts. This commit
increases the timeouts for docs and client rest tests.

relates elastic#58286
rjernst added a commit that referenced this issue Aug 14, 2020
The Darwin CI hosts continue to struggle with timeouts. This commit
increases the timeouts for docs and client rest tests.

relates #58286
rjernst added a commit that referenced this issue Aug 14, 2020
The Darwin CI hosts continue to struggle with timeouts. This commit
increases the timeouts for docs and client rest tests.

relates #58286
mark-vieira self-assigned this on Sep 9, 2020
@astefan
Contributor

astefan commented Oct 1, 2020

Another timeout... https://gradle-enterprise.elastic.co/s/jr4i67y5woiz6

@droberts195
Contributor

until we have better macOS workers.

The current Mac Minis used for ES CI have 32GB RAM. So this explains why they struggle to run 16 test suites in parallel compared to the Linux CI workers that have 128GB RAM.

It looks like it is possible to hire Mac Minis with 64GB RAM from Mac Stadium. These are the biggest Apple has ever made. So they'd still be half the size of the Linux workers and would still probably suffer some spurious timeouts if the CI setup is tuned for the Linux workers. However, doubling the size of the macOS workers would probably mean we suffer far fewer timeouts than we do on macOS today. Might be worth a conversation with Infra (if this isn't happening already)? I don't know what our constraints are around switching machines hired from Mac Stadium.

@nik9000
Member Author

nik9000 commented Mar 4, 2021

@ywangd
Member

ywangd commented May 6, 2021

Failed again at oldEs1Fixture (7.12) https://gradle-enterprise.elastic.co/s/axym2bhwyrljg

@nik9000
Member Author

nik9000 commented May 13, 2021

It looks like the last few darwin builds have timed out cloning:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+platform-support-darwin/14/

Only one of the last fourteen builds succeeded.

@nik9000
Member Author

nik9000 commented May 13, 2021

I wonder if the reference repo is empty or something. We do log:

06:22:41 Using reference repository: /var/lib/jenkins/.git-references/elasticsearch.git

when fetching, but if that doesn't have useful commits in it, the clone can still take a long time.
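
For anyone unfamiliar with reference repositories: the Jenkins git plugin effectively does something along these lines, and if the reference repo is empty or stale, almost every object still has to be fetched over the network, which is where the time would go. The command below is illustrative, reusing the path from the log above:

git clone --reference /var/lib/jenkins/.git-references/elasticsearch.git https://github.com/elastic/elasticsearch.git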

@benwtrent
Member

It may be worth it to create separate issues for all these Mac timeouts, but I am not sure.

Another timeout occurred today: https://gradle-enterprise.elastic.co/s/c4wfdnj2l4oxs/

This one timed out on three separate tests. All REST tests.

This is NOT timing out due to cloning the repo (there is another issue for that).

@benwtrent
Member

BWC timeout in 7.x against 6.8.17: https://gradle-enterprise.elastic.co/s/7d4ihrpf4dkcg

@cbuescher
Member

Here's another one from today that looks related: :modules:reindex:oldEs1Fixture takes 45s and then fails. https://gradle-enterprise.elastic.co/s/fltzy5qj2grno

@DaveCTurner
Contributor

@ywangd
Member

ywangd commented Sep 16, 2021

This one seems to be related as well: https://gradle-enterprise.elastic.co/s/em6byt5yt2dlq

@mark-vieira
Contributor

I've reached out to infra to get our Mac build agents rebuilt.

@benwtrent
Member

Another timeout: https://gradle-enterprise.elastic.co/s/adgded3jzzxvs/

This one was for snapshot/restore tests against Azure.

org.elasticsearch.repositories.encrypted.EncryptedAzureBlobStoreRepositoryIntegTests > testLargeBlobCountDeletion FAILED
   java.lang.Exception: Test abandoned because suite timeout was reached.
./gradlew ':x-pack:plugin:repository-encrypted:internalClusterTest' --tests "org.elasticsearch.repositories.encrypted.EncryptedAzureBlobStoreRepositoryIntegTests.testLargeBlobCountDeletion" -Dtests.seed=AD9895C5EC067598 -Dtests.locale=nl-NL -Dtests.timezone=Indian/Mayotte -Druntime.java=8

@martijnvg
Member

Another Darwin build failure on master: https://gradle-enterprise.elastic.co/s/hdgccgk5fx6di
Several test suites timed out: SearchServiceTests, ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT.

@jkakavas
Member

jkakavas commented Feb 9, 2022

https://gradle-enterprise.elastic.co/s/darif25jbecym and https://gradle-enterprise.elastic.co/s/vlpeh4oo6avcs today. ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT

@jbaiera
Member

jbaiera commented Mar 17, 2022

Well well well if it isn't ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT again

@benwtrent
Member

Yet another darwin timeout: https://gradle-enterprise.elastic.co/s/7rhvm6ivqz3bm

./gradlew ':docs:integTest' --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/mapping/runtime/line_1154}" -Dtests.seed=A4469EAEBAD6BA0D -Dtests.locale=hu -Dtests.timezone=Australia/ACT -Druntime.java=17
org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT > classMethod FAILED
    java.lang.Exception: Suite timeout exceeded (>= 2400000 msec).
        at __randomizedtesting.SeedInfo.seed([A4469EAEBAD6BA0D]:0)

@ywelsch
Contributor

ywelsch commented Mar 30, 2022

And just like any test triage day, the failures continue here...

None of the darwin builds seem to ever pass (see https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+8.1+multijob+platform-support-darwin/ or https://elasticsearch-ci.elastic.co/view/All/job/elastic+elasticsearch+main+multijob+platform-support-darwin/), so what's the point of running them if we're not working on a timely fix?

@mark-vieira
Contributor

We've removed these jobs.
