CI: try limiting the memory for 1.10 / nightly CI to get at least some jobs to succeed... #2620

Closed
benlorenz wants to merge 3 commits into master from bl/ci_limit_mem

Conversation

benlorenz
Member

These changes would work around some of the issues from #2441; the tests (for nightly + 1.10) do go through with the added heap-size hint and the disabled tests, possibly after one or two restarts.
See for example here or here.

The values were determined through several experiments on github actions.

I am setting the heap-size hint via ccall to make it easier to compute and inject the value while julia is already running.
In addition, we could run the long and large tests that I disabled here in a separate job instead.

This would probably allow us to catch other issues with 1.10 (or nightly) better than in the current state.

cc: @thofma @fingolfin

@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Merging #2620 (675513d) into master (35bf9a4) will increase coverage by 0.07%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master    #2620      +/-   ##
==========================================
+ Coverage   80.61%   80.68%   +0.07%     
==========================================
  Files         456      457       +1     
  Lines       64663    64667       +4     
==========================================
+ Hits        52125    52175      +50     
+ Misses      12538    12492      -46     
Files Changed             Coverage
src/utils/limit_mem.jl    50.00%

# 10GB should be fine on macos
# On the Linux runners limit the GC heap to 3.5 GiB (2^30 bytes = 1 GiB).
maxmem = Sys.islinux() ? 3.5*(2^30) : 10*(2^30)
println("OscarCI: Limiting memory to ", Base.format_bytes(maxmem));
# Equivalent to the --heap-size-hint startup flag, but set while julia is already running.
@ccall jl_gc_set_max_memory(maxmem::UInt64)::Cvoid
Member

Wait, does that mean using jl_gc_set_max_memory mitigates some of the GC regressions we are seeing? Perhaps that's a useful clue for debugging, @gbaraldi ?

Member Author

Note that jl_gc_set_max_memory is basically the same as setting the --heap-size-hint (except for some small reserved amount for llvm and julia). But setting it this way was easier to integrate into the CI.
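
For reference, a minimal sketch of the two (roughly equivalent) ways to cap the heap; the 3.5 GiB value mirrors the Linux limit used in this PR, and the small reserved overhead mentioned above is ignored here:

# Option 1: pass the hint when launching julia (available since 1.9), e.g.
#   julia --heap-size-hint=3584M ...

# Option 2 (what this PR does): compute the value at runtime and hand it to the GC directly.
maxmem = UInt64(3.5 * 2^30)   # 3.5 GiB in bytes
@ccall jl_gc_set_max_memory(maxmem::UInt64)::Cvoid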

@fingolfin
Member

This is perhaps not beautiful, but if it allows us to get passing tests against Julia 1.9 and 1.10 again, that would be good for catching further regressions there. But I am a bit wary that this might lead the Julia folks to not take this as seriously anymore and punt a fix to after the 1.10 release, which would be rather unfortunate... :-(

@gbaraldi

That setting the max memory works is a good sign at least. We detect that we are running under a container, but we don't see all the memory that julia is using, only what the GC allocated, and potentially there is extra memory that we don't see, especially on bigger workloads.

Could you try increasing it to something like 5GB to see where the limit is?

@thofma
Collaborator

thofma commented Aug 15, 2023

I am with Max. We should not monkey patch around a serious regression. Last time I checked, I could not even run the tests locally with 1.10.

@benlorenz
Member Author

Last time I checked, I could not even run the tests locally with 1.10.

That should be better once JuliaLang/julia#50682 is backported to 1.10 (once beta2 is released).
As far as I can see, that change did bring some improvements, but the very large testsets still cause the testsuite to fail on master. The doctests were successful on master after it was merged.

In its current state this PR was more a test of whether the tests go through with the disabled test groups; it is not ready for merging.

I was away for the last two weeks so I couldn't do much testing, but I will do some more tests this week.

The first experiment is currently running here: https://github.com/oscar-system/Oscar.jl/actions/runs/5864773536; the extra number in the job name is the max memory in GB that is set for the tests. This might give some idea about the limit.

I still believe it might be helpful to investigate why running the example mentioned in #2599 results in 4.4 GB RSS on nightly but only 2.4 GB on 1.9.
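
(As an aside, a minimal sketch of such an investigation, assuming the example code from #2599 is at hand: run it and query the peak RSS afterwards.)

# Sketch only: compare peak RSS across julia versions for the #2599 example.
using Oscar
# ... run the example from #2599 here ...
println("peak RSS: ", Base.format_bytes(Sys.maxrss()))   # Sys.maxrss() reports the peak resident set size in bytes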

@benlorenz
Member Author

First experiment is currently running here: oscar-system/Oscar.jl/actions/runs/5864773536 the extra number in the job name is the max memory in GB that is set for the tests. This might give some ideas about the limit.

The github actions overview for that job (with 2:0.5:6 GB limits for Linux) looks way better than it really is: plenty of jobs have a green checkmark but were in fact cancelled after 2.5 hours.

Successful jobs:

  • macOS: both jobs (with 8GB and 11GB limit, of 14GB total)
  • 1.10: 3GB, 4GB, 4.5GB
  • nightly: 4GB, 4.5GB

Everything else was killed or cancelled.
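
(For context, a small sketch of how the GB values from the 2:0.5:6 matrix translate into the byte limits handed to jl_gc_set_max_memory; the actual workflow code may differ.)

# Illustration only: map the per-job GB limits to byte values.
for gb in 2:0.5:6
    println(gb, " GB -> ", round(UInt64, gb * 2^30), " bytes")
end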

@gbaraldi

Yeah, it's a fine balance: too low makes the GC thrash because it needs more heap, and too high makes the job get canceled.

Could you try JuliaLang/julia#50909? The GC heuristics follow a paper, and the paper has an ambiguity; that PR implements the other interpretation, and I wonder if it does better here.

@benlorenz
Member Author

benlorenz commented Aug 15, 2023

Yeah, it's a fine balance: too low makes the GC thrash because it needs more heap, and too high makes the job get canceled.

Could you try JuliaLang/julia#50909? The GC heuristics follow a paper, and the paper has an ambiguity; that PR implements the other interpretation, and I wonder if it does better here.

It did not seem to make much of a difference; this time only the 3.5GB and 4GB jobs succeeded: https://github.com/oscar-system/Oscar.jl/actions/runs/5866636788
6GB was probably oom-killed, the other jobs timed out again.
From a quick look at the line numbers it looks like there were significantly more GC calls: about 8k lines of output in the (successful) 4GB job on nightly vs 6k lines in the previous 4GB job. (The test output should not change much, but the GC logging adds extra lines.)

This was with a build artifact from JuliaLang/julia@dd5a826 using julia-actions/install-julia-from-url.
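
(The line-count comparison above relies on the GC logging that was enabled for these runs; a minimal way to turn it on, assuming Julia 1.8 or newer:)

# Print a line for every collection (pause time, heap size) to stderr.
GC.enable_logging(true)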

@benlorenz
Member Author

The latest run after the merge of master looks way worse than a month ago; I restarted the tests to see if that happens again.

1.6 took 1:44, 1.9 timed out after 2.5 hours, and the only change affecting these versions is that two demanding test groups were moved to the end.
From the julia 1.9 log:

elliptic surfaces |    2      2  28m46.7s

1.10 and nightly also died, even without these large test groups.

A month ago 1.10 and nightly succeeded (with the changes in this PR) after 1 and 1.5 hours. 1.6 and 1.9 succeeded after 1:26 and 1:42.

@benlorenz benlorenz closed this Feb 14, 2024
@benlorenz benlorenz deleted the bl/ci_limit_mem branch February 14, 2024 09:27