Optionally enable GC for front-end #2916
On Win64, I can now compile all default libs (incl. unittests) without crashing. I don't see the memory drop though in the VS tools (and neither in the task manager); adding an explicit …
I've also tried NOT disabling the GC at startup; that seems to work too, at least for 2 Phobos module unittests, and does reduce the memory footprint (…).
I don't think the GC (or regular free) would return pages to the OS.
Yeah, I think it's the fragmentation, as in a few cases, …
GC.minimize only returns memory to the OS if a complete pool (usually 4 MB or larger, IIRC) is unused, so that's very unlikely for "small"-object pools. It might happen for "large" allocations (>2k) eventually.
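For reference, the explicit "collect + minimize" being discussed here boils down to something like the following (a minimal sketch; `releaseUnusedMemory` is a hypothetical helper name, not code from this PR):

```d
import core.memory;

/// Hypothetical helper: run one explicit collection, then try to
/// return completely unused pools to the OS. GC.minimize() only hands
/// back whole pools (several MB each), so the actual savings depend on
/// how fragmented the live data is across pools.
void releaseUnusedMemory()
{
    GC.collect();  // sweep dead temporaries now
    GC.minimize(); // return fully empty pools to the OS
}
```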
Did you also measure the impact on compile times? Last time I tried, it was some 10-20% IIRC. My actual target was to compile the Phobos unittests all at once, but that still used more memory than my system was able to provide (16 GB) and crashed at some point.
That's what I thought. Is the pool/page size configurable via a cmdline switch? I haven't checked the compile times yet; the single collect + minimize appears pretty cheap though, that's why I wanted to focus on that. Enabling the GC from the start is surely a different matter, but probably not that costly (relatively) with …

I also tried compiling the Phobos unittests all at once a while back, and IIRC, it was something like 16 GB (edit: or maybe 16-24, I really don't remember; my system has 32), and it was also very slow compared to parallel separate compilation with my 4 cores.
You can tweak it with …
Ah yeah, nice. LDC doesn't use the D main args but the druntime args, i.e., these DRT options aren't filtered out and so aren't usable with LDC yet. Still, I tried embedding the option as a global:

```d
extern(C) __gshared string[] rt_options = [ "gcopt=maxPoolSize:1" ];
```

That has quite an effect for the regex unittests, reducing the total size from 1.6 GB (see above) with default settings to 0.8 GB:
For the …
It'd be interesting to know where all the temporary allocations come from. If it turns out to be mostly CTFE/the interpreter (just a guess), using a reference-counted scheme or some sort of separate GC domain for these parts would certainly help wrt. the comparatively high memory requirements of D compilers. While I'm not too interested in lower memory requirements myself yet, it's a recurring problem for CI, limiting parallelization (e.g., I have to build serially for SemaphoreCI, as we only have ~2.4 GB of memory when using Docker). And if I had a Ryzen with 16 logical cores, I'd prefer not having to have something like 32 GB of memory to make use of all of them when compiling the unittests. I guess it's similar for Joakim's phone with 8 cores.
Oh wow. I've just tried additionally enabling the GC selectively during CTFE:

```diff
--- a/dmd/dinterpret.d
+++ b/dmd/dinterpret.d
@@ -54,6 +54,8 @@ import dmd.visitor;
  */
 public Expression ctfeInterpret(Expression e)
 {
+    import core.memory;
+
     if (e.op == TOK.error)
         return e;
     assert(e.type); // https://issues.dlang.org/show_bug.cgi?id=14642
@@ -61,6 +63,8 @@ public Expression ctfeInterpret(Expression e)
     if (e.type.ty == Terror)
         return new ErrorExp();

+    GC.enable();
+
     // This code is outside a function, but still needs to be compiled
     // (there are compiler-generated temporary variables such as __dollar).
     // However, this will only be run once and can then be discarded.
@@ -75,6 +79,8 @@ public Expression ctfeInterpret(Expression e)
     if (CTFEExp.isCantExp(result))
         result = new ErrorExp();

+    GC.disable();
+
     return result;
 }
```

Results, without …
Additionally with …
That cuts the regex unittests' front-end (OS) memory requirements by 75% (2 GB -> 0.5 GB).
Runtime-wise, selectively enabling the GC makes almost no difference compared to having it enabled all the time, while enabling it from the start yields a 10% lower total size for the …
Sounds promising.
I guess a collection that was avoided while the GC was disabled will always be performed with the next allocation during CTFE.
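That matches druntime's documented GC.disable() semantics: automatic collections are suppressed, but the GC may still collect as a last resort when it cannot otherwise satisfy an allocation. A minimal sketch of the enable/disable bracketing pattern (the surrounding function is illustrative, not actual compiler code):

```d
import core.memory;

void interpretSomething()
{
    // The GC is disabled for the bulk of the front-end...
    GC.enable();               // ...and enabled only around CTFE,
    scope (exit) GC.disable(); // restored when leaving this scope.

    // Allocations made here are collectable; a collection that was
    // "postponed" while the GC was disabled can fire on any of them.
    auto tmp = new int[](1024);
}
```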
That's consistent with what I remember.
That's probably because it will cause more collections whenever the next pool is exhausted. At that size, each collection will take considerable time. You can show some stats with gcopt …
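If I recall the druntime option spelling correctly, the GC stats can be requested via the same embedding trick used for maxPoolSize above (hedged sketch; `gcopt=profile:1` makes druntime print collection counts and pause times at program exit):

```d
// Assumed spelling of the druntime GC profiling option; the stats are
// printed when the program terminates.
extern(C) __gshared string[] rt_options = [ "gcopt=profile:1" ];
```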
Force-pushed from f26ce12 to 6a1280b.
Damn, just enabling …

So while the savings with …
I guess a command line switch would have to select the allocation method (GC vs. bump-pointer) at runtime in rmem.d. Memory overhead is expected to be about 25% for the GC, as it allocates in blocks that are a power of 2 (up to 8k). Building Phobos without unittests is probably not a good candidate for freeing memory, as most data is still referenced in the AST. There have to be more intermediate results caused by CTFE, for example.
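Such a runtime switch could look roughly like this (a sketch only; `useGC` and `xmalloc` mirror the style of dmd's rmem.d, but the actual rmem.d code differs):

```d
import core.memory : GC;
import core.stdc.stdlib : malloc, exit;
import core.stdc.stdio : fprintf, stderr;

__gshared bool useGC; // chosen once at startup from a command-line switch

/// Allocation entry point in the spirit of rmem.d's Mem.xmalloc:
/// either GC-managed (collectable) or C-heap/bump-pointer style
/// (never freed until process exit).
extern(C) void* xmalloc(size_t size)
{
    void* p = useGC ? GC.malloc(size) : malloc(size);
    if (!p && size)
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return p;
}
```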
Well, for Win64, the actually referenced data after a final collection was 544 MB (one third of the allocated GC memory); see the first post.
The difficulty is that the non-GC version also overrides the allocation functions (…).
In the Phobos all-at-once case it's 34% (edit: overall memory, incl. C++ heap, so it's even higher). Do you think the implicit …
What's apparently eating lots of memory is CTFE and templates. IIRC, building the …
You mean for each instance? So if we have …
With Rainer's druntime PR above, this is beginning to look useful. Wall-clock + max resident set sizes (in MB and for the full process, not just GC memory) for some compilations (no optimizations, enabled LDC/LLVM assertions) on Linux x64, comparing …
The memory requirements can most likely be lowered further by tweaking the GC params.
I got rid of the CMake option and now switch at runtime; it's now either running with an enabled GC (…).
Looks good. It seems the default GC options strike the best balance between speed and memory size. I wonder how much the enforced collection before termination costs (https://github.com/dlang/druntime/blob/master/src/gc/proxy.d#L94). As it is unnecessary for 99% of all programs, I think it should be hidden behind a GC option.
Yeah, the defaults seem reasonable; playing with a slightly lower …

I guess the one more collection at shutdown probably doesn't really matter here; the finalization cost is probably very low for the front-end (…).
@rainers: Do you think it'd be worth trying to upstream the 2 modes at runtime (…)?
I'd approve, but it might get some opposition from nano-cycle-counting people (who seem fine with calling the virtual function ti.initializer multiple times). Maybe it can still be versioned, but integrated with the other code, i.e., …

2 observations:
That's what I was afraid of ;) - but yeah, keeping …
I think it should; …
Yep; I took the …
Makes sense.
Rebased (the 2.085 druntime features the GC improvements wrt. less memory overhead) and resynced with the upstream PR, incl. …
@rainers: Wow, I'm now getting much better timings for …
My previous test (see above) was on Linux (and using an …).

Can this huge speed-up (3.5x) be attributed to your 2.085 GC improvements?

Edit: That module was split in 2 with 2.085.
For those regex tests, I also see a consistent ~3% runtime improvement when switching from bump-ptr to a simple disabled GC without cleanup (…).
The only explicit performance improvement was for sweeping large allocations (with maybe a slight effect on allocation, too), but I doubt there will be many large allocations (>2 kB) by dmd.
Not sure why it can be faster, but it's similar to my observation of compiling Phobos not showing any difference within the accuracy of measurements. Maybe it's because the GC malloc/free functions are faster than C malloc/free for Mem.xmalloc/xfree.
This isn't looking too bad, although druntime cannot be compiled yet:
Phobos all-at-once compilation, Win64: …
std.string unittests: …
std.algorithm.iteration unittests: …
std.regex.internal.tests unittests: …

It doesn't reduce the peak memory requirements (well, maybe a little bit if LLVM allocates significant amounts during codegen), but it reduces the time that memory is claimed, which should especially pay off with -O, as much more time will be spent in the LLVM optimizer. This might enable the usage of more CPU cores when building on systems with limited memory, as it reduces the chance of overlapping LDC processes with fat memory requirements.