CI: try limiting the memory for 1.10 / nightly CI to get at least some jobs to succeed... #2620
Conversation
Codecov Report
```diff
@@            Coverage Diff            @@
##           master    #2620     +/-  ##
=========================================
+ Coverage   80.61%   80.68%   +0.07%
=========================================
  Files         456      457       +1
  Lines       64663    64667       +4
=========================================
+ Hits        52125    52175      +50
+ Misses      12538    12492      -46
```
```julia
# 10GB should be fine on macos
maxmem = Sys.islinux() ? 3.5*(2^30) : 10*(2^30)
println("OscarCI: Limiting memory to ", Base.format_bytes(maxmem))
@ccall jl_gc_set_max_memory(maxmem::UInt64)::Cvoid
```
Wait, does that mean using `jl_gc_set_max_memory` mitigates some of the GC regressions we are seeing? Perhaps that's a useful clue for debugging, @gbaraldi?
Note that `jl_gc_set_max_memory` is basically the same as setting `--heap-size-hint` (except for a small amount reserved for LLVM and Julia). But setting it this way was easier to integrate into the CI.
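For reference, a minimal sketch of the two equivalent ways to cap the heap (the 3.5 GiB value mirrors the Linux limit used in this PR; the CLI flag requires Julia >= 1.9):

```julia
# Runtime route (the approach in this PR): cap the GC heap via the C API.
maxmem = UInt64(3.5 * 2^30)   # 3.5 GiB, matching the Linux CI limit above
@ccall jl_gc_set_max_memory(maxmem::UInt64)::Cvoid

# Startup route: the same cap via the command-line flag (Julia >= 1.9):
#   julia --heap-size-hint=3584M script.jl
println("GC heap capped at ", Base.format_bytes(maxmem))
```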
This is perhaps not beautiful, but if it allows us to get passing tests against Julia 1.9 and 1.10 again, that would be good for catching further regressions there. But I am a bit wary that this might lead the Julia folks to not take the issue as seriously anymore and punt a fix to after the 1.10 release, which would be rather unfortunate... :-(
That setting the max memory works is a good sign at least. We detect that we are running under a container, but we don't see all the memory that Julia is using, only what the GC allocated; there is potentially extra memory that we don't see, especially on bigger workloads. Could you try increasing it to, say, 5GB to see where the limit is?
I am with Max. We should not monkey-patch around a serious regression. Last time I checked, I could not even run the tests locally with 1.10.
That should be better once JuliaLang/julia#50682 is backported to 1.10 (once beta2 is released). This PR was, in its current state, more a test of whether the tests go through with the disabled tests, and not ready for merging. I was away for the last two weeks so I couldn't do much testing, but I will do some more tests this week. A first experiment is currently running here: https://github.com/oscar-system/Oscar.jl/actions/runs/5864773536; the extra number in the job name is the max memory in GB that is set for the tests. This might give some ideas about the limit. I still believe it might be helpful to investigate why running the example mentioned in #2599 results in 4.4 GB RSS on nightly but only 2.4 GB on 1.9.
The GitHub Actions overview for that job shows the successful jobs; everything else was killed or cancelled.
Yeah, it's a fine balance: too low makes the GC thrash because it needs more heap, and too high makes the job get cancelled. Could you try JuliaLang/julia#50909? The GC heuristics follow a paper, and the paper has an ambiguity; that PR implements the other interpretation. I wonder if it does better here.
It did not seem to make much of a difference; this time only the 3.5GB and 4GB jobs succeeded: https://github.com/oscar-system/Oscar.jl/actions/runs/5866636788. This was with a build artifact from JuliaLang/julia@
The latest run after the merge of master looks way worse than a month ago; I restarted the tests to see if that happens again. 1.6 took 1:44, and 1.9 timed out after 2.5 hours, yet the only change affecting these versions is that two demanding test groups were moved to the end.
1.10 and nightly also died, even without these large test groups. A month ago 1.10 and nightly succeeded (with the changes in this PR) after 1 and 1.5 hours; 1.6 and 1.9 succeeded after 1:26 and 1:42.
These changes would work around some of the issues from #2441: the tests (for nightly + 1.10) do go through with the added heap-size hint and the disabled tests, maybe after one or two restarts.
See for example here or here.
The values were determined through several experiments on github actions.
I am setting the heap-size-hint via `ccall` to make it easier to compute and inject the value while Julia is already running. We could in addition run the long + large tests that I disabled in a separate job instead.
This would probably allow us to catch other issues with 1.10 (or nightly) better than in the current state.
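If we go the separate-job route for the disabled long/large test groups, a rough sketch in GitHub Actions could look like this (the job name, timeout, and test invocation are hypothetical placeholders, not the actual Oscar.jl workflow):

```yaml
# Hypothetical sketch only: run the disabled long/large test groups
# in their own job so the main matrix stays under the memory cap.
test-long:
  runs-on: ubuntu-latest
  timeout-minutes: 150
  steps:
    - uses: actions/checkout@v4
    - uses: julia-actions/setup-julia@v1
      with:
        version: '1.10'
    - uses: julia-actions/julia-buildpkg@v1
    # "long_tests" is a placeholder for however the disabled groups get selected
    - run: julia --project --heap-size-hint=10G -e 'using Pkg; Pkg.test("Oscar"; test_args=["long_tests"])'
```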
cc: @thofma @fingolfin