cicd: set test-threads to 16 and add retries to reduce flaky failures #1507

Merged
merged 1 commit into from
Jun 20, 2022

@geekflyer geekflyer commented Jun 19, 2022

This PR sets test-threads to 16 and adds retries to reduce the chance of flaky tests failing the test build.

Previously, tests implicitly ran with test-threads=60, since that is the number of CPU cores our test runners have.
These settings are the result of a lot of trial and error and thousands of test runs on GHA, CircleCI, and Cirrus CI.

Note that this is essentially just a workaround. Longer term, we should find out why certain tests fail when the number of test threads is higher.
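
As a rough sketch of what settings like these can look like in a GitHub Actions job: this assumes the tests are run with cargo-nextest (the runner is not named in this description, although the `hash:1/1` partition syntax discussed in this thread matches nextest's `--partition` flag). The job name, step name, and retry count of 3 are hypothetical and not taken from this PR.

```yaml
# Sketch only: job name, step name, and the retry count are illustrative.
# Assumes cargo-nextest is already installed on the runner.
jobs:
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tests with capped parallelism and retries
        # --test-threads caps the number of tests run concurrently instead of
        #   defaulting to the number of CPU cores on the runner.
        # --retries re-runs a failing test before reporting it as failed.
        run: cargo nextest run --workspace --test-threads 16 --retries 3
```

Capping the thread count trades some wall-clock time for less contention between tests, and retries only hide flakiness rather than remove it, which is why the description calls this a workaround.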

@geekflyer geekflyer changed the title cicd: set test-threads to 16 and add retries to reduce flaky failure cicd: set test-threads to 16 and add retries to reduce flaky failures Jun 19, 2022
@geekflyer geekflyer requested review from JoshLind and davidiw June 19, 2022 23:49
@geekflyer geekflyer force-pushed the flakyfix branch 2 times, most recently from ca336c9 to 78b6e11 Compare June 19, 2022 23:53
@geekflyer geekflyer requested a review from sitalkedia June 19, 2022 23:53
# cancel redundant builds
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

canceling redundant builds to reduce load on CI

@@ -11,6 +11,12 @@ on:

env:
  HAS_BUILDPULSE_SECRETS: ${{ secrets.BUILDPULSE_ACCESS_KEY_ID != '' && secrets.BUILDPULSE_SECRET_ACCESS_KEY != '' }}
  CARGO_INCREMENTAL: "0"

Disabling incremental compilation gives slightly faster builds and smaller cache sizes, as described here: https://matklad.github.io/2021/09/04/fast-rust-builds.html

@geekflyer geekflyer force-pushed the flakyfix branch 2 times, most recently from 774e851 to c09d2d8 Compare June 20, 2022 00:04

@JoshLind JoshLind left a comment

LGTM, thanks @geekflyer! Question: what's the reason for removing the partition argument?

geekflyer commented Jun 20, 2022

> LGTM, thanks @geekflyer! Question: what's the reason for removing the partition argument?

The partition argument was useless, since it split the test suite into a single partition and then ran that one partition. The whole idea of the partition argument is to split the test suite into multiple partitions and have different test workers/machines each run one of them. In other words, partition hash:1/1 is always a no-op.
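
For contrast, here is a minimal sketch of partitioning that does something, assuming cargo-nextest and a GitHub Actions matrix. The partition count of 3 and the job layout are hypothetical and not part of this PR.

```yaml
# Sketch only: the partition count and job layout are illustrative.
# Assumes cargo-nextest is already installed on the runner.
jobs:
  unit-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # One job per partition; each job runs a different slice of the suite.
        partition: [1, 2, 3]
    steps:
      - uses: actions/checkout@v3
      - name: Run one slice of the test suite
        # hash:M/N assigns each test to one of N buckets by hashing its name
        # and runs only bucket M. With N = 1 every test lands in bucket 1,
        # which is why hash:1/1 is equivalent to passing no partition at all.
        run: cargo nextest run --workspace --partition hash:${{ matrix.partition }}/3
```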

@geekflyer

/land

@github-actions

Forge run: https://github.com/aptos-labs/aptos-core/actions/runs/2525806940
Forge test result: Forge test runner is terminated

@aptos-bot aptos-bot closed this in e9c6bbe Jun 20, 2022
@aptos-bot aptos-bot merged commit e9c6bbe into main Jun 20, 2022
@aptos-bot aptos-bot deleted the flakyfix branch June 20, 2022 00:31