CI: try limiting the memory for 1.10 / nightly CI to get at least some jobs to succeed... #2620

Closed
benlorenz wants to merge 3 commits into master from bl/ci_limit_mem

Conversation

benlorenz
Member

These changes would work around some of the issues from #2441; the tests (for nightly + 1.10) do go through with the added heap-size hint and the disabled tests, possibly after one or two restarts.
See for example here or here.

The values were determined through several experiments on github actions.

I am setting the heap-size hint via ccall to make it easier to compute and inject the value while julia is already running.
In addition, we could run the long and large tests that I disabled here in a separate job instead.

This would probably allow us to catch other issues with 1.10 (or nightly) better than in the current state.

cc: @thofma @fingolfin

@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Merging #2620 (675513d) into master (35bf9a4) will increase coverage by 0.07%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master    #2620      +/-   ##
==========================================
+ Coverage   80.61%   80.68%   +0.07%     
==========================================
  Files         456      457       +1     
  Lines       64663    64667       +4     
==========================================
+ Hits        52125    52175      +50     
+ Misses      12538    12492      -46     
Files Changed             Coverage
src/utils/limit_mem.jl    50.00%

# 10GB should be fine on macos
# On the Linux runners limit the GC heap to 3.5 GiB (2^30 bytes = 1 GiB).
maxmem = Sys.islinux() ? 3.5*(2^30) : 10*(2^30)
println("OscarCI: Limiting memory to ", Base.format_bytes(maxmem));
# Equivalent to the --heap-size-hint startup flag, but set while julia is already running.
@ccall jl_gc_set_max_memory(maxmem::UInt64)::Cvoid
Member

Wait, does that mean using jl_gc_set_max_memory mitigates some of the GC regressions we are seeing? Perhaps that's a useful clue for debugging, @gbaraldi ?

Member Author

Note that jl_gc_set_max_memory is basically the same as setting the --heap-size-hint (except for some small reserved amount for llvm and julia). But setting it this way was easier to integrate into the CI.
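
For reference, a minimal sketch of the two (roughly equivalent) ways to cap the heap; the 3.5 GiB value mirrors the Linux limit used in this PR, and the small reserved overhead mentioned above is ignored here:

# Option 1: pass the hint when launching julia (available since 1.9), e.g.
#   julia --heap-size-hint=3584M ...

# Option 2 (what this PR does): compute the value at runtime and hand it to the GC directly.
maxmem = UInt64(3.5 * 2^30)   # 3.5 GiB in bytes
@ccall jl_gc_set_max_memory(maxmem::UInt64)::Cvoid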

@fingolfin
Member

This is perhaps not beautiful, but if it allows us to get passing tests against Julia 1.9 and 1.10 again, that would be good for catching further regressions there. But I am a bit wary that this might lead the Julia folks to not take this as seriously anymore and punt a fix to after the 1.10 release, which would be rather unfortunate... :-(

@gbaraldi

That setting the max memory works is a good sign at least. We detect that we are running under a container, but we don't see all the memory that julia is using, only what the GC allocated, and potentially there is extra memory that we don't see, especially on bigger workloads.

Could you try increasing it to something like 5GB to see where the limit is?

@thofma
Collaborator

thofma commented Aug 15, 2023

I am with Max. We should not monkey patch around a serious regression. Last time I checked, I could not even run the tests locally with 1.10.

@benlorenz
Member Author

Last time I checked, I could not even run the tests locally with 1.10.

That should be better once JuliaLang/julia#50682 is backported to 1.10 (once beta2 is released).
As far as I can see, that change did bring some improvements, but the very large testsets still cause the testsuite to fail on master. The doctests were successful on master after it was merged.

In its current state this PR was more a test of whether the tests go through with the disabled test groups; it is not ready for merging.

I was away for the last two weeks so I couldn't do much testing, but I will do some more tests this week.

The first experiment is currently running here: https://github.com/oscar-system/Oscar.jl/actions/runs/5864773536; the extra number in the job name is the max memory in GB that is set for the tests. This might give some idea about the limit.

I still believe it might be helpful to investigate why running the example mentioned in #2599 results in 4.4 GB RSS on nightly but only 2.4 GB on 1.9.
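
(As an aside, a minimal sketch of such an investigation, assuming the example code from #2599 is at hand: run it and query the peak RSS afterwards.)

# Sketch only: compare peak RSS across julia versions for the #2599 example.
using Oscar
# ... run the example from #2599 here ...
println("peak RSS: ", Base.format_bytes(Sys.maxrss()))   # Sys.maxrss() reports the peak resident set size in bytes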

@benlorenz
Member Author

First experiment is currently running here: oscar-system/Oscar.jl/actions/runs/5864773536 the extra number in the job name is the max memory in GB that is set for the tests. This might give some ideas about the limit.

The github actions overview for that job (with 2:0.5:6 GB limits for Linux) looks way better than it really is: plenty of jobs have a green checkmark but were in fact cancelled after 2.5 hours.

Successful jobs:

  • macOS: both jobs (with 8GB and 11GB limit, of 14GB total)
  • 1.10: 3GB, 4GB, 4.5GB
  • nightly: 4GB, 4.5GB

Everything else was killed or cancelled.
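
(For context, a small sketch of how the GB values from the 2:0.5:6 matrix translate into the byte limits handed to jl_gc_set_max_memory; the actual workflow code may differ.)

# Illustration only: map the per-job GB limits to byte values.
for gb in 2:0.5:6
    println(gb, " GB -> ", round(UInt64, gb * 2^30), " bytes")
end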

@gbaraldi

Yeah, it's a fine balance: too low makes the GC thrash because it needs more heap, and too high makes the job get canceled.

Could you try JuliaLang/julia#50909? The GC heuristics follow a paper, and the paper has an ambiguity; that PR implements the other interpretation, and I wonder if it does better here.

@benlorenz
Member Author

benlorenz commented Aug 15, 2023

Yeah, it's a fine balance: too low makes the GC thrash because it needs more heap, and too high makes the job get canceled.

Could you try JuliaLang/julia#50909? The GC heuristics follow a paper, and the paper has an ambiguity; that PR implements the other interpretation, and I wonder if it does better here.

It did not seem to make much of a difference; this time only the 3.5GB and 4GB jobs succeeded: https://github.com/oscar-system/Oscar.jl/actions/runs/5866636788
6GB was probably oom-killed, the other jobs timed out again.
From a quick look at the line numbers it looks like there were significantly more GC calls: about 8k lines of output in the (successful) 4GB job on nightly vs 6k lines in the previous 4GB job. (The test output should not change much, but the GC logging adds extra lines.)

This was with a build artifact from JuliaLang/julia@dd5a826 using julia-actions/install-julia-from-url.
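
(The line-count comparison above relies on the GC logging that was enabled for these runs; a minimal way to turn it on, assuming Julia 1.8 or newer:)

# Print a line for every collection (pause time, heap size) to stderr.
GC.enable_logging(true)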

@benlorenz
Member Author

The latest run after the merge of master looks way worse than a month ago; I restarted the tests to see if that happens again.

1.6 took 1:44, 1.9 timed out after 2.5 hours, and the only change affecting these versions is that two demanding test groups were moved to the end.
From the julia 1.9 log:

elliptic surfaces |    2      2  28m46.7s

1.10 and nightly also died, even without these large test groups.

A month ago 1.10 and nightly succeeded (with the changes in this PR) after 1 and 1.5 hours. 1.6 and 1.9 succeeded after 1:26 and 1:42.

@benlorenz benlorenz closed this Feb 14, 2024
@benlorenz benlorenz deleted the bl/ci_limit_mem branch February 14, 2024 09:27