
[CI] Optimize our CircleCI machine sizes #11670

Closed
cameel opened this issue Jul 15, 2021 · 5 comments · Fixed by #12215

cameel (Member) commented Jul 15, 2021

In our CI config we only use medium and xlarge machines. This probably does not give us the best cost/performance ratio. We should try optimizing it a bit.

See the list of all available machine types.

On a call with CircleCI we were recommended to try the larger Windows and macOS machines to speed up CI runs. Currently we're using medium instances and there are larger ones available. I'm not sure if that's because the larger ones weren't originally available or if we just decided the smaller ones were more cost-effective. We should at least try them out and compare performance and cost (if we haven't already).

Also, we have some jobs that do not require much processing and only make use of one core, for example the ones I'm adding in #12181 and #12165, but I'm sure there are more. These can very likely be switched to small without affecting the total runtime too much.

Finally, we should make sure that all the Linux jobs that use xlarge actually benefit from it.
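
For illustration, this is roughly how per-job sizing looks in a CircleCI 2.1 config. It's only a sketch, not our actual config; the images and class choices below are placeholders:

```yaml
# Sketch only -- not our real .circleci/config.yml.
version: 2.1

jobs:
  chk_spelling:
    docker:
      - image: cimg/python:3.10
    resource_class: small      # single-core check; 1 vCPU is plenty
    steps:
      - checkout
      - run: echo "run the spell checker here"

  b_ubu:
    docker:
      - image: cimg/base:stable
    resource_class: xlarge     # parallel C++ build can actually use the extra cores
    steps:
      - checkout
      - run: echo "cmake --build . -- -j$(nproc) would go here"
```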

chriseth (Contributor) commented:

Please check first if we are using all the cores on the machine. I think the intent back then was that Windows and macOS failures are rare, and thus it is not a big deal if those tests take a bit longer to run.

It would be nice if we could select the machine based on the type of run: for PRs and releases it might be important to run quickly, for regular develop builds not so much. Also, for PRs we could default to a "slow run" unless specified otherwise.
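
One way this could look (just a sketch; whether resource_class accepts pipeline-parameter substitution like this would need to be verified against current CircleCI behaviour):

```yaml
# Sketch only: choose the resource class per pipeline instead of hardcoding it.
version: 2.1

parameters:
  build-resource-class:
    type: enum
    enum: ["medium", "xlarge"]
    default: "medium"          # regular develop pushes stay on the cheaper size

jobs:
  b_ubu:
    docker:
      - image: cimg/base:stable
    # PR and release pipelines could be triggered with build-resource-class=xlarge
    # (e.g. via the API); everything else falls back to the default.
    resource_class: << pipeline.parameters.build-resource-class >>
    steps:
      - checkout
      - run: echo "build here"
```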

chriseth (Contributor) commented:

If we switch from medium to large, we get double the amount of cores and double the RAM for triple the price. Not sure if that is worth it.

cameel (Member, Author) commented Jul 27, 2021

OK. If that's the pricing then it's probably not worth it. I assumed their pricing would be more linear :)

cameel changed the title from "Switch to larger macOS/Windows machines on CircleCI" to "[CI] Switch to larger macOS/Windows machines on CircleCI" on Aug 3, 2021
cameel changed the title from "[CI] Switch to larger macOS/Windows machines on CircleCI" to "[CI] Optimize our CircleCI machine sizes" on Oct 22, 2021
cameel (Member, Author) commented Oct 22, 2021

> If we switch from medium to large, we get double the amount of cores and double the RAM for triple the price. Not sure if that is worth it.

That's true for Windows, but it looks like for macOS the cost/performance scaling is more linear (double cores+RAM = double price). So it seems worth trying.
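
Back-of-the-envelope, with $r$ the medium credit rate and $T$ the medium runtime, and assuming a job actually scales to the extra cores (runtime roughly halves on double the cores):

$$
\text{cost} \approx \text{rate} \times \text{time}: \qquad
\text{Windows large} \approx 3r \cdot \tfrac{T}{2} = 1.5\,rT, \qquad
\text{macOS large} \approx 2r \cdot \tfrac{T}{2} = rT
$$

So with 3x pricing a large run costs ~50% more for ~2x speed, while with linear pricing it's the same cost for ~2x speed.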

Anyway, after working with the CI today I realized that our machine allocation in general could be optimized further. We never use small even though some jobs are pretty light and only use one core. I have updated the title and description to make this issue more general.

cameel (Member, Author) commented Oct 28, 2021

Observations

  • Most jobs don't use all CPU cores. We can safely give them smaller machines.
  • I found only two cases where a bigger machine significantly speeds things up:
    1. Arch Linux: We do not set MAKEFLAGS, so the build does not even use the 2 cores it has on medium. Even so, a large machine makes it significantly faster (see the sketch below this list).
    2. Windows: large is 2x faster than medium. xlarge is faster than large, but not by much. I'd recommend switching to large. It is indeed 3x more expensive, but I think the trade-off is still good enough; we'll be saving a ton on other jobs, so we can spend some of those savings on faster Windows builds.
  • Interestingly, switching to large on macOS does not help at all. Maybe the number of parallel jobs is hardcoded somewhere outside of the CI config? MAKEFLAGS=-j10 does not help. Might be worth investigating.
  • Some jobs take less time on medium than in the original run (which was also on medium in most cases). I think this is because some jobs were missing MAKEFLAGS.
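
A sketch of the MAKEFLAGS fix mentioned in the first point above (the image and core count are illustrative; the vCPU count per resource class should be double-checked against CircleCI's specs):

```yaml
# Sketch: export MAKEFLAGS in the job environment so make uses all the cores
# of whatever resource class the job runs on.
jobs:
  b_archlinux:
    docker:
      - image: archlinux:base
    resource_class: large
    environment:
      MAKEFLAGS: "-j4"         # assuming 4 vCPUs on a Docker `large`; verify
    steps:
      - checkout
      - run: echo "cmake --build . here would pick up MAKEFLAGS via make"
```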

Questions

  • We have a codecov check in the nightly run, but I have never seen it fail and our codecov config is not very effective anyway. Do we actually need it? It's relatively heavy (the build needs an xlarge machine and the test run takes 1h on medium).
    • Maybe we should rename b_ubu_codecov to b_ubu_debug? That's basically what it is.

Linux jobs

| job | current size | current time | xlarge | large | medium | small | better size |
|---|---|---|---|---|---|---|---|
| b_archlinux | medium | 22m 16s | 7m 30s | 7m 57s | 12m 52s | CRASH | large |
| b_bytecode_ems | medium | 5m 56s | 5m 51s | 5m 33s | 5m 30s | 6m 07s | small |
| b_bytecode_ubu | medium | 8m 38s | 7m 53s | 8m 40s | 8m 23s | 9m 25s | small |
| b_docs | medium | 43s | 56s | 39s | 36s | 1m 52s | small |
| b_ems | xlarge | 7m 50s | 7m 39s | 8m 14s | 10m 30s | 20m 18s | large |
| b_ubu | xlarge | 5m 51s | 6m 24s | 8m 13s | 11m 13s | CRASH | |
| b_ubu_asan | xlarge | 13m 08s | 12m 05s | 14m 59s | 17m 12s | CRASH | medium |
| b_ubu_asan_clang | medium | 19m 59s | 7m 48s | 10m 25s | 12m 27s | 27m 17s | |
| b_ubu_clang | xlarge | 6m 31s | 5m 58s | 7m 32s | 11m 05s | 19m 46s | large |
| b_ubu_codecov | xlarge | 7m 57s | 7m 37s | 8m 07s | 12m 35s | CRASH | medium |
| b_ubu_cxx20 | xlarge | 6m 52s | 6m 50s | 7m 50s | 10m 16s | CRASH | large |
| b_ubu_ossfuzz | medium | 16m 28s | 16m 55s | 19m 32s | 15m 35s | CRASH | |
| b_ubu_release | xlarge | 5m 53s | 7m 48s | 8m 50s | 11m 37s | CRASH | |
| b_ubu_static | xlarge | 8m 03s | 7m 59s | 8m 17s | 10m 29s | CRASH | medium |
| b_ubu_ubsan_clang | medium | 20m 18s | 8m 21s | 10m 29s | 12m 57s | 29m 13s | |
| chk_antlr_grammar | medium | 8m 00s | 8m 02s | 9m 07s | 9m 06s | 8m 00s | small |
| chk_buglist | medium | 16s | 16s | 32s | 14s | 17s | small |
| chk_coding_style | medium | 23s | 35s | 23s | 26s | 28s | small |
| chk_docs_pragma_min_version | medium | 38s | 33s | 33s | 29s | 20s | small |
| chk_errorcodes | medium | 10s | 8s | 9s | 25s | 18s | small |
| chk_proofs | medium | 18s | 28s | 15s | 17s | 19s | small |
| chk_pylint | medium | 35s | 39s | 47s | 36s | 31s | small |
| chk_spelling | medium | 20s | 11s | 22s | 13s | 24s | small |
| t_archlinux_soltest | medium | 15m 21s | 13m 03s | 13m 54s | 12m 20s | CRASH | |
| t_bytecode_compare | medium | 40s | 43s | 46s | 41s | 45s | small |
| t_ems_compile_ext_colony | medium | 2m 32s | 2m 17s | 2m 15s | 2m 23s | 3m 07s | small |
| t_ems_compile_ext_gnosis | medium | 2m 55s | 3m 30s | 4m 03s | 3m 04s | 4m 06s | small |
| t_ems_compile_ext_zeppelin | medium | 6m 08s | 4m 22s | 5m 02s | 5m 18s | 5m 21s | small |
| t_ems_ext_hardhat | medium | 2m 41s | 2m 38s | 2m 29s | 2m 23s | 2m 28s | small |
| t_ems_solcjs | medium | 5m 15s | 5m 41s | 4m 17s | 5m 19s | 8m 01s | |
| t_ems_test_ext_colony | medium | 39m 59s | 37m 00s | 43m 18s | 38m 51s | CRASH | small |
| t_ems_test_ext_ens | medium | 2m 26s | 2m 28s | 2m 23s | 2m 57s | 3m 08s | small |
| t_ems_test_ext_gnosis_v2 | medium | 2m 36s | 2m 31s | 2m 51s | 2m 55s | 3m 34s | small |
| t_ems_test_ext_zeppelin | medium | 9m 41s | 8m 01s | 7m 42s | 8m 01s | 9m 09s | small |
| t_ubu_asan_clang_soltest | medium | 11m 06s | 9m 11s | 9m 27s | 10m 12s | CRASH | |
| t_ubu_asan_cli | medium | 22m 17s | 19m 07s | 17m 36s | 20m 49s | 25m 24s | small |
| t_ubu_asan_soltest | medium | 17m 46s | 16m 31s | 17m 11s | 14m 40s | CRASH | |
| t_ubu_clang_soltest | medium | 14m 44s | 14m 28s | 14m 24s | 15m 00s | CRASH | |
| t_ubu_cli | medium | 8m 36s | 8m 20s | 7m 33s | 7m 54s | 9m 33s | small |
| t_ubu_codecov | medium | 1h 2m 32s | 58m 04s | 1h 1m 56s | 59m 47s | CRASH | |
| t_ubu_ossfuzz | medium | 31m 16s | 29m 11s | 29m 53s | 30m 23s | 30m 56s | small |
| t_ubu_pyscripts | medium | 18s | 23s | 21s | 21s | 9s | small |
| t_ubu_release_cli | medium | 7m 59s | 8m 15s | 7m 16s | 8m 08s | 8m 21s | small |
| t_ubu_release_soltest_all | medium | 32m 57s | 32m 11s | 32m 15s | 31m 51s | CRASH | |
| t_ubu_soltest_all | medium | 32m 19s | 32m 51s | 31m 38s | 33m 25s | CRASH | |
| t_ubu_soltest_enforce_yul | medium | 16m 23s | 17m 28s | 14m 07s | 15m 15s | CRASH | |
| t_ubu_ubsan_clang_cli | medium | 9m 58s | 9m 10s | 8m 43s | 9m 31s | 10m 09s | small |
| t_ubu_ubsan_clang_soltest | medium | 15m 58s | 15m 55s | 17m 09s | 16m 30s | CRASH | |

Windows jobs

| job | current size | current time | xlarge | large | medium | better size |
|---|---|---|---|---|---|---|
| b_win | medium | 28m 46s | 12m 49s | 14m 26s | 27m 25s | large |
| b_win_release | medium | 30m 50s | 11m 39s | 15m 35s | 26m 51s | large |
| b_bytecode_win | medium | 9m 21s | 10m 16s | 10m 43s | 10m 07s | |
| t_win_release_soltest | medium | 11m 09s | 11m 54s | 11m 36s | 11m 37s | |
| t_win_soltest | medium | 10m 35s | 11m 24s | 11m 49s | 11m 10s | |
| t_win_pyscripts | medium | 2m 1s | 52s | 42s | | |

macOS jobs

| job | current size | current time | large | medium | better size |
|---|---|---|---|---|---|
| b_osx | medium | 19m 42s | 21m 38s | 20m 51s | |
| b_bytecode_osx | medium | 13m 13s | 13m 39s | 13m 07s | |
| t_osx_soltest | medium | 25m 11s | 25m 39s | 26m 31s | |
| t_osx_cli | medium | 16m 34s | 17m 50s | 17m 46s | |

Details

  • The original test run was on top of a branch containing commits from Parallelize external tests #12214 (not the same exact branch but close).
    • It also ran the nightly jobs as PR checks so that I could include them in the tables.
    • On top of that branch I created separate branches for different resource classes.
  • The t_ubu_ossfuzz job from the nightly run was failing in all runs.
  • Values in each column do not all come from the same run. Getting proper averages over multiple runs would have been a lot more work, and I had to rerun with fixes a few times anyway. Instead I tried to choose fairly representative values that give a good picture of the situation. Values come from these runs:
    • optimize-ci: 1, 2
    • optimize-ci-resource-class-small: 1, 2, 3, 4
      • In (2), (3) and (4) the machine sizes are not uniform: I had to rerun the b_ jobs on larger machines in cases where there were crashes, to check whether the dependent t_ jobs could still work on smaller machines. Also, only (4) had proper MAKEFLAGS set.
    • optimize-ci-resource-class-medium: 1, 2, 3, 4
      • (1), (2) and (3) were run with MAKEFLAGS=j10, which made some jobs run out of memory and crash just like on small machines. In (4) I reran with proper MAKEFLAGS and there were no crashes. The table only has values from that run.
    • optimize-ci-resource-class-large: 1, 2
    • optimize-ci-resource-class-xlarge: 1, 2
    • I took the Windows and macOS values from various runs (their resource classes do not match the Linux ones). Also from multicore-soltest-sh: 1.
      • Some runs had 2x longer run times for Windows jobs because I made a mistake when adjusting the options that set the number of threads. I did not use those values.
