
[CI] Optimize our CircleCI machine sizes #11670

Closed
cameel opened this issue Jul 15, 2021 · 5 comments · Fixed by #12215

cameel (Member) commented Jul 15, 2021

In our CI config we only use medium and xlarge machines. This probably does not give us the best cost/performance ratio. We should try optimizing it a bit.

See the list of all available machine types.

On a call with CircleCI we were recommended to try the larger Windows and macOS machines to speed up CI runs. Currently we're using medium instances and there are larger ones available. I'm not sure if that's because the larger ones weren't originally available or if we just decided the smaller ones were more cost-effective. We should at least try them out and compare performance and cost (if we haven't already).

Also, we have some jobs that do not require much processing and only make use of one core, for example the ones I'm adding in #12181 and #12165, but I'm sure there are more. These can very likely be switched to small without affecting the total runtime too much.

Finally, we should make sure that all the Linux jobs that use xlarge actually benefit from it.
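
For illustration, this is roughly how per-job sizing looks in a CircleCI 2.1 config. It's only a sketch, not our actual config; the images and class choices below are placeholders:

```yaml
# Sketch only -- not our real .circleci/config.yml.
version: 2.1

jobs:
  chk_spelling:
    docker:
      - image: cimg/python:3.10
    resource_class: small      # single-core check; 1 vCPU is plenty
    steps:
      - checkout
      - run: echo "run the spell checker here"

  b_ubu:
    docker:
      - image: cimg/base:stable
    resource_class: xlarge     # parallel C++ build can actually use the extra cores
    steps:
      - checkout
      - run: echo "cmake --build . -- -j$(nproc) would go here"
```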

chriseth (Contributor) commented:

Please check first if we are using all the cores on the machine. I think the intent back then was that Windows and macOS failures are rare, and thus it is not a big deal if those tests take a bit longer to run.

It would be nice if we could select the machine based on the type of run: for PRs and releases it might be important to run quickly, for regular develop builds not so much. Also, for PRs we could default to a "slow run" unless specified otherwise.
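
One way this could look (just a sketch; whether resource_class accepts pipeline-parameter substitution like this would need to be verified against current CircleCI behaviour):

```yaml
# Sketch only: choose the resource class per pipeline instead of hardcoding it.
version: 2.1

parameters:
  build-resource-class:
    type: enum
    enum: ["medium", "xlarge"]
    default: "medium"          # regular develop pushes stay on the cheaper size

jobs:
  b_ubu:
    docker:
      - image: cimg/base:stable
    # PR and release pipelines could be triggered with build-resource-class=xlarge
    # (e.g. via the API); everything else falls back to the default.
    resource_class: << pipeline.parameters.build-resource-class >>
    steps:
      - checkout
      - run: echo "build here"
```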

chriseth (Contributor) commented:

If we switch from medium to large, we get double the amount of cores and double the RAM for triple the price. Not sure if that is worth it.

cameel (Member, Author) commented Jul 27, 2021

OK. If that's the pricing then it's probably not worth it. I assumed their pricing would be more linear :)

cameel changed the title from "Switch to larger macOS/Windows machines on CircleCI" to "[CI] Switch to larger macOS/Windows machines on CircleCI" on Aug 3, 2021
cameel changed the title from "[CI] Switch to larger macOS/Windows machines on CircleCI" to "[CI] Optimize our CircleCI machine sizes" on Oct 22, 2021
cameel (Member, Author) commented Oct 22, 2021

> If we switch from medium to large, we get double the amount of cores and double the RAM for triple the price. Not sure if that is worth it.

That's true for Windows, but it looks like for macOS the cost/performance scaling is more linear (double cores+RAM = double price). So it seems worth trying.
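
Back-of-the-envelope, with $r$ the medium credit rate and $T$ the medium runtime, and assuming a job actually scales to the extra cores (runtime roughly halves on double the cores):

$$
\text{cost} \approx \text{rate} \times \text{time}: \qquad
\text{Windows large} \approx 3r \cdot \tfrac{T}{2} = 1.5\,rT, \qquad
\text{macOS large} \approx 2r \cdot \tfrac{T}{2} = rT
$$

So with 3x pricing a large run costs ~50% more for ~2x speed, while with linear pricing it's the same cost for ~2x speed.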

Anyway, after working with the CI today I realized that our machine allocation in general could be optimized further. We never use small even though some jobs are pretty light and only use one core. I have updated the title and description to make this issue more general.

cameel (Member, Author) commented Oct 28, 2021

Observations

  • Most jobs don't use all CPU cores. We can safely give them smaller machines.
  • I found only two cases where a bigger machine significantly speeds things up:
    1. Arch Linux: We do not set MAKEFLAGS, so the build does not even use the 2 cores it has on medium. Even so, a large machine makes it significantly faster (see the sketch below this list).
    2. Windows: large is 2x faster than medium. xlarge is faster than large, but not by much. I'd recommend switching to large. It is indeed 3x more expensive, but I think the trade-off is still good enough; we'll be saving a ton on other jobs, so we can spend some of those savings on faster Windows builds.
  • Interestingly, switching to large on macOS does not help at all. Maybe the number of parallel jobs is hardcoded somewhere outside of the CI config? MAKEFLAGS=-j10 does not help. Might be worth investigating.
  • Some jobs take less time on medium than in the original run (which was also on medium in most cases). I think this is because some jobs were missing MAKEFLAGS.
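
A sketch of the MAKEFLAGS fix mentioned in the first point above (the image and core count are illustrative; the vCPU count per resource class should be double-checked against CircleCI's specs):

```yaml
# Sketch: export MAKEFLAGS in the job environment so make uses all the cores
# of whatever resource class the job runs on.
jobs:
  b_archlinux:
    docker:
      - image: archlinux:base
    resource_class: large
    environment:
      MAKEFLAGS: "-j4"         # assuming 4 vCPUs on a Docker `large`; verify
    steps:
      - checkout
      - run: echo "cmake --build . here would pick up MAKEFLAGS via make"
```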

Questions

  • We have a codecov check in the nightly run, but I have never seen it fail and our codecov config is not very effective anyway. Do we actually need it? It's relatively heavy (the build needs an xlarge machine and the test run takes 1h on medium).
    • Maybe we should rename b_ubu_codecov to b_ubu_debug? That's basically what it is.

Linux jobs

| job | current size | current time | xlarge | large | medium | small | better size |
|---|---|---|---|---|---|---|---|
| b_archlinux | medium | 22m 16s | 7m 30s | 7m 57s | 12m 52s | CRASH | large |
| b_bytecode_ems | medium | 5m 56s | 5m 51s | 5m 33s | 5m 30s | 6m 07s | small |
| b_bytecode_ubu | medium | 8m 38s | 7m 53s | 8m 40s | 8m 23s | 9m 25s | small |
| b_docs | medium | 43s | 56s | 39s | 36s | 1m 52s | small |
| b_ems | xlarge | 7m 50s | 7m 39s | 8m 14s | 10m 30s | 20m 18s | large |
| b_ubu | xlarge | 5m 51s | 6m 24s | 8m 13s | 11m 13s | CRASH | |
| b_ubu_asan | xlarge | 13m 08s | 12m 05s | 14m 59s | 17m 12s | CRASH | medium |
| b_ubu_asan_clang | medium | 19m 59s | 7m 48s | 10m 25s | 12m 27s | 27m 17s | |
| b_ubu_clang | xlarge | 6m 31s | 5m 58s | 7m 32s | 11m 05s | 19m 46s | large |
| b_ubu_codecov | xlarge | 7m 57s | 7m 37s | 8m 07s | 12m 35s | CRASH | medium |
| b_ubu_cxx20 | xlarge | 6m 52s | 6m 50s | 7m 50s | 10m 16s | CRASH | large |
| b_ubu_ossfuzz | medium | 16m 28s | 16m 55s | 19m 32s | 15m 35s | CRASH | |
| b_ubu_release | xlarge | 5m 53s | 7m 48s | 8m 50s | 11m 37s | CRASH | |
| b_ubu_static | xlarge | 8m 03s | 7m 59s | 8m 17s | 10m 29s | CRASH | medium |
| b_ubu_ubsan_clang | medium | 20m 18s | 8m 21s | 10m 29s | 12m 57s | 29m 13s | |
| chk_antlr_grammar | medium | 8m 00s | 8m 02s | 9m 07s | 9m 06s | 8m 00s | small |
| chk_buglist | medium | 16s | 16s | 32s | 14s | 17s | small |
| chk_coding_style | medium | 23s | 35s | 23s | 26s | 28s | small |
| chk_docs_pragma_min_version | medium | 38s | 33s | 33s | 29s | 20s | small |
| chk_errorcodes | medium | 10s | 8s | 9s | 25s | 18s | small |
| chk_proofs | medium | 18s | 28s | 15s | 17s | 19s | small |
| chk_pylint | medium | 35s | 39s | 47s | 36s | 31s | small |
| chk_spelling | medium | 20s | 11s | 22s | 13s | 24s | small |
| t_archlinux_soltest | medium | 15m 21s | 13m 03s | 13m 54s | 12m 20s | CRASH | |
| t_bytecode_compare | medium | 40s | 43s | 46s | 41s | 45s | small |
| t_ems_compile_ext_colony | medium | 2m 32s | 2m 17s | 2m 15s | 2m 23s | 3m 07s | small |
| t_ems_compile_ext_gnosis | medium | 2m 55s | 3m 30s | 4m 03s | 3m 04s | 4m 06s | small |
| t_ems_compile_ext_zeppelin | medium | 6m 08s | 4m 22s | 5m 02s | 5m 18s | 5m 21s | small |
| t_ems_ext_hardhat | medium | 2m 41s | 2m 38s | 2m 29s | 2m 23s | 2m 28s | small |
| t_ems_solcjs | medium | 5m 15s | 5m 41s | 4m 17s | 5m 19s | 8m 01s | |
| t_ems_test_ext_colony | medium | 39m 59s | 37m 00s | 43m 18s | 38m 51s | CRASH | small |
| t_ems_test_ext_ens | medium | 2m 26s | 2m 28s | 2m 23s | 2m 57s | 3m 08s | small |
| t_ems_test_ext_gnosis_v2 | medium | 2m 36s | 2m 31s | 2m 51s | 2m 55s | 3m 34s | small |
| t_ems_test_ext_zeppelin | medium | 9m 41s | 8m 01s | 7m 42s | 8m 01s | 9m 09s | small |
| t_ubu_asan_clang_soltest | medium | 11m 06s | 9m 11s | 9m 27s | 10m 12s | CRASH | |
| t_ubu_asan_cli | medium | 22m 17s | 19m 07s | 17m 36s | 20m 49s | 25m 24s | small |
| t_ubu_asan_soltest | medium | 17m 46s | 16m 31s | 17m 11s | 14m 40s | CRASH | |
| t_ubu_clang_soltest | medium | 14m 44s | 14m 28s | 14m 24s | 15m 00s | CRASH | |
| t_ubu_cli | medium | 8m 36s | 8m 20s | 7m 33s | 7m 54s | 9m 33s | small |
| t_ubu_codecov | medium | 1h 2m 32s | 58m 04s | 1h 1m 56s | 59m 47s | CRASH | |
| t_ubu_ossfuzz | medium | 31m 16s | 29m 11s | 29m 53s | 30m 23s | 30m 56s | small |
| t_ubu_pyscripts | medium | 18s | 23s | 21s | 21s | 9s | small |
| t_ubu_release_cli | medium | 7m 59s | 8m 15s | 7m 16s | 8m 08s | 8m 21s | small |
| t_ubu_release_soltest_all | medium | 32m 57s | 32m 11s | 32m 15s | 31m 51s | CRASH | |
| t_ubu_soltest_all | medium | 32m 19s | 32m 51s | 31m 38s | 33m 25s | CRASH | |
| t_ubu_soltest_enforce_yul | medium | 16m 23s | 17m 28s | 14m 07s | 15m 15s | CRASH | |
| t_ubu_ubsan_clang_cli | medium | 9m 58s | 9m 10s | 8m 43s | 9m 31s | 10m 09s | small |
| t_ubu_ubsan_clang_soltest | medium | 15m 58s | 15m 55s | 17m 09s | 16m 30s | CRASH | |

Windows jobs

| job | current size | current time | xlarge | large | medium | better size |
|---|---|---|---|---|---|---|
| b_win | medium | 28m 46s | 12m 49s | 14m 26s | 27m 25s | large |
| b_win_release | medium | 30m 50s | 11m 39s | 15m 35s | 26m 51s | large |
| b_bytecode_win | medium | 9m 21s | 10m 16s | 10m 43s | 10m 07s | |
| t_win_release_soltest | medium | 11m 09s | 11m 54s | 11m 36s | 11m 37s | |
| t_win_soltest | medium | 10m 35s | 11m 24s | 11m 49s | 11m 10s | |
| t_win_pyscripts | medium | 2m 1s | 52s | 42s | | |

macOS jobs

| job | current size | current time | large | medium | better size |
|---|---|---|---|---|---|
| b_osx | medium | 19m 42s | 21m 38s | 20m 51s | |
| b_bytecode_osx | medium | 13m 13s | 13m 39s | 13m 07s | |
| t_osx_soltest | medium | 25m 11s | 25m 39s | 26m 31s | |
| t_osx_cli | medium | 16m 34s | 17m 50s | 17m 46s | |

Details

  • The original test run was on top of a branch containing commits from Parallelize external tests #12214 (not the same exact branch but close).
    • It also ran the nightly jobs as PR checks so that I could include them in the tables.
    • On top of that branch I created separate branches for different resource classes.
  • The t_ubu_ossfuzz job from the nightly run was failing in all runs.
  • Values in each column do not all come from the same run. Getting proper averages over multiple runs would have been a lot more work, and I had to rerun with fixes a few times anyway. Instead I tried to choose fairly representative values that give a good picture of the situation. Values come from these runs:
    • optimize-ci: 1, 2
    • optimize-ci-resource-class-small: 1, 2, 3, 4
      • In (2), (3) and (4) the machine sizes are not uniform: I had to rerun the b_ jobs on larger machines in cases where there were crashes, to check whether the dependent t_ jobs could still work on smaller machines. Also, only (4) had proper MAKEFLAGS set.
    • optimize-ci-resource-class-medium: 1, 2, 3, 4
      • (1), (2) and (3) were run with MAKEFLAGS=j10, which made some jobs run out of memory and crash just like on small machines. In (4) I reran with proper MAKEFLAGS and there were no crashes. The table only has values from that run.
    • optimize-ci-resource-class-large: 1, 2
    • optimize-ci-resource-class-xlarge: 1, 2
    • I took the Windows and macOS values from various runs (their resource classes do not match the Linux ones). Also from multicore-soltest-sh: 1.
      • Some runs had 2x longer run times for Windows jobs because I made a mistake when adjusting the options that set the number of threads. I did not use those values.
