Optionally enable GC for front-end #2916

Merged (6 commits) on Mar 9, 2019

Conversation

@kinke (Member) commented Nov 17, 2018

This isn't looking too bad, although druntime cannot be compiled yet:

Phobos all-at-once compilation, Win64:

GC: used bytes = 1707268112
GC: used bytes = 570342720 [after collection]

std.string unittests:

GC: used bytes = 1651332656
GC: used bytes = 361209920

std.algorithm.iteration unittests:

GC: used bytes = 1496288448
GC: used bytes = 644807264

std.regex.internal.tests unittests:

GC: used bytes = 2079689120
GC: used bytes = 292884624

It doesn't reduce peak memory requirements (well, maybe a little if LLVM allocates significant amounts during codegen), but it shortens the time for which that memory is held, which should pay off especially with -O, as much more time is then spent in the LLVM optimizer. This might make it possible to use more CPU cores when building on systems with limited memory, as it reduces the chance of overlapping LDC processes with large memory footprints.
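For illustration, here's a minimal sketch (not the PR's actual code) of the "collect once manually before codegen" idea, using only public core.memory APIs; the assumption is that the "GC: used bytes" lines above come from GC.stats():

import core.memory : GC;
import core.stdc.stdio : printf;

// Hypothetical helper: run once, right before handing the AST over to LLVM codegen.
void collectOnceBeforeCodegen()
{
    printf("GC: used bytes = %zu\n", GC.stats().usedSize);

    GC.enable();   // the GC was started in disabled mode
    GC.collect();  // free everything the AST no longer references
    GC.minimize(); // try to return unused pools to the OS
    GC.disable();  // keep automatic collections off for the rest of the run

    printf("GC: used bytes = %zu [after collection]\n", GC.stats().usedSize);
}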

@kinke (Member Author) commented Nov 17, 2018

On Win64, I can now compile all default libs (incl. unittests) without crashing. I don't see the memory drop in the VS tools though (nor in the task manager); adding an explicit GC.minimize() makes no difference...

@kinke (Member Author) commented Nov 17, 2018

I've also tried NOT disabling the GC at startup; that seems to work too, at least for 2 Phobos module unittests, and does reduce the memory footprint (std.string: from ~1.9 GB to ~1.1 GB).

@dnadlinger (Member) commented:

I don't think the GC (or regular free) would return pages to the OS.

@kinke (Member Author) commented Nov 18, 2018

GC.minimize() is supposed to do exactly that (according to the docs). Maybe fragmentation, in combination with the page size, is so high that there isn't a relevant number of fully unused pages to free.

@kinke (Member Author) commented Nov 18, 2018

Yeah, I think it's the fragmentation; in a few cases, GC.minimize() does seem to have a (pretty small) effect:

> bin\ldc2 -c -unittest ..\ldc\runtime\phobos\std\regex\internal\tests.d
GC: 1983M used, 76M free, 2059M total
GC: 280M used, 1331M free, 1611M total

@rainers (Contributor) commented Nov 18, 2018

GC.minimize only returns memory to the OS if a complete pool (usually 4MB or larger IIRC) is unused, so very unlikely for "small" object pools. Might happen for "large" allocations (>2k) eventually.

@rainers (Contributor) commented Nov 18, 2018

Did you also measure the impact on compile times? Last time I tried it was some 10-20% IIRC.

My actual target was to compile phobos unittests all at once, but that still used more memory than my system was able to provide (16GB) and crashed at some point.

@kinke (Member Author) commented Nov 18, 2018

That's what I thought. Is the pool/page size configurable via cmdline switch?

I haven't checked the compile times yet; the single collect + minimize appears pretty cheap though, which is why I wanted to focus on that. Enabling the GC from the start is surely a different matter, but probably not that costly (relatively) with -O.

I also tried compiling the Phobos unittests all at once a while back; IIRC it needed something like 16 GB (edit: or maybe 16-24, I really don't remember; my system has 32), and it was also very slow compared to parallel separate compilation on my 4 cores.

@rainers (Contributor) commented Nov 18, 2018

That's what I thought. Is the pool/page size configurable via cmdline switch?

You can tweak it with --DRT-gcopt, but not below 1MB:

>dmd --DRT-gcopt=help
GC options are specified as whitespace separated assignments:
    disable:0|1    - start disabled (0)
    profile:0|1|2  - enable profiling with summary when terminating program (0)
    gc:conservative|manual - select gc implementation (default = conservative)

    initReserve:N  - initial memory to reserve in MB (0)
    minPoolSize:N  - initial and minimum pool size in MB (1)
    maxPoolSize:N  - maximum pool size in MB (64)
    incPoolSize:N  - pool size increment MB (3)
    heapSizeFactor:N - targeted heap size to used memory ratio (2)
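For reference, several of these can be combined into a single quoted --DRT-gcopt argument on the command line of a D program (illustrative values and a hypothetical program name; as the next comment notes, LDC itself doesn't pick these up yet):

>myapp --DRT-gcopt="profile:1 heapSizeFactor:1.5"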

@kinke (Member Author) commented Nov 18, 2018

Ah, nice. LDC doesn't use the D main() args, but the raw runtime args, i.e., these DRT options aren't filtered out and so aren't usable with LDC yet.

Still, I tried embedding the option as a global:

extern(C) __gshared string[] rt_options = [ "gcopt=maxPoolSize:1" ];

That has quite an effect for the regex unittests, reducing the total size from 1.6 GB (see above) with default settings to 0.8 GB:

GC: 1984M used, 1M free, 1985M total
GC: 279M used, 534M free, 813M total

For the std.string unittests, the drop is still negligible:

GC: 1575M used, 1M free, 1576M total
GC: 344M used, 1126M free, 1470M total

@kinke (Member Author) commented Nov 18, 2018

It'd be interesting to know where all the temporary allocations come from. If it turns out to be mostly CTFE/the interpreter (just a guess), using a reference-counted scheme or some sort of separate GC domain for these parts would certainly help with the comparatively high memory requirements of D compilers.

While I'm not too interested in lower memory requirements myself yet, it's a recurring problem for CI, limiting parallelization (e.g., I have to build serially on SemaphoreCI, as we only have ~2.4 GB of memory when using Docker). And if I had a Ryzen with 16 logical cores, I'd prefer not needing something like 32 GB of memory to make use of all of them when compiling the unittests. I guess it's similar for Joakim's phone with 8 cores.

@kinke (Member Author) commented Nov 18, 2018

Oh wow. I've just tried additionally enabling the GC selectively during ctfeInterpret():

--- a/dmd/dinterpret.d
+++ b/dmd/dinterpret.d
@@ -54,6 +54,8 @@ import dmd.visitor;
  */
 public Expression ctfeInterpret(Expression e)
 {
+    import core.memory;
+
     if (e.op == TOK.error)
         return e;
     assert(e.type); // https://issues.dlang.org/show_bug.cgi?id=14642
@@ -61,6 +63,8 @@ public Expression ctfeInterpret(Expression e)
     if (e.type.ty == Terror)
         return new ErrorExp();

+    GC.enable();
+
     // This code is outside a function, but still needs to be compiled
     // (there are compiler-generated temporary variables such as __dollar).
     // However, this will only be run once and can then be discarded.
@@ -75,6 +79,8 @@ public Expression ctfeInterpret(Expression e)
     if (CTFEExp.isCantExp(result))
         result = new ErrorExp();

+    GC.disable();
+
     return result;
 }

Results, without gcopt=maxPoolSize:1:

> bin\ldc2 -c -unittest ..\ldc\runtime\phobos\std\regex\internal\tests.d
GC: 491M used, 41M free, 532M total
GC: 281M used, 251M free, 532M total

>bin\ldc2 -c -unittest ..\ldc\runtime\phobos\std\string.d
GC: 995M used, 40M free, 1035M total
GC: 345M used, 690M free, 1035M total

Additionally with gcopt=maxPoolSize:1 (seems rather costly in comparison):

bin\ldc2 -c -unittest ..\ldc\runtime\phobos\std\regex\internal\tests.d
GC: 491M used, 1M free, 492M total
GC: 281M used, 210M free, 491M total

C:\LDC\ninja-ldc>bin\ldc2 -c -unittest ..\ldc\runtime\phobos\std\string.d
GC: 631M used, 133M free, 764M total
GC: 346M used, 400M free, 746M total

This cuts the front-end (OS) memory requirements for the regex unittests by 75% (2 GB -> 0.5 GB).

@kinke (Member Author) commented Nov 18, 2018

Runtime-wise, selectively enabling the GC makes almost no difference compared to having it enabled all the time, while enabling it from the start yields a 10% lower total size for the std.string unittests.
After some rudimentary checks, enabling it from the start adds very roughly 10-25% to overall runtime without -O.
gcopt=maxPoolSize:4 makes the regex tests compilation run twice as long (with enabled GC), compared to the default setting (!).

@rainers (Contributor) commented Nov 19, 2018

This cuts the front-end (OS) memory requirements for the regex unittests by 75% (2 GB -> 0.5 GB).

Sounds promising.

Runtime-wise, selectively enabling the GC makes almost no difference compared to having it enabled all the time, while enabling it from the start yields a 10% lower total size for the std.string unittests.

I guess a collection that has been avoided while the GC is disabled will always be done with the next allocation during CTFE.

After some rudimentary checks, enabling it from the start adds very roughly 10-25% to overall runtime without -O.

That's consistent with what I remember.

gcopt=maxPoolSize:4 makes the regex tests compilation run twice as long (with enabled GC), compared to the default setting (!).

That's probably because it will cause more collections whenever the next pool is exhausted. At that size, each collection will take considerable time. You can show some stats with gcopt=profile:1.
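For example, that profiling could presumably be switched on via the same rt_options trick used above; the summary then gets printed when the compiler process terminates (per the help text):

extern(C) __gshared string[] rt_options = [ "gcopt=profile:1" ];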

@kinke (Member Author) commented Nov 25, 2018

Damn - just enabling version = GC in dmd.root.rmem (while keeping the GC disabled) costs not just runtime performance, but also significant memory overhead. I've compared the runtimes and memory requirements for ninja -j1 phobos2-ldc-debug (using /usr/bin/time -v):

LDC_WITH_GC=OFF:

	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.54
	Maximum resident set size (kbytes): 1407776

LDC_WITH_GC=ON:

	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:40.35
	Maximum resident set size (kbytes): 1890952

LDC_WITH_GC=ON -lowmem:

	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:44.84
	Maximum resident set size (kbytes): 1284764

So while the savings with -lowmem look quite nice compared to not using the new switch, it hardly pays off in this case compared to current master (23% runtime overhead for 9% lower memory requirements).

@rainers (Contributor) commented Nov 25, 2018

I guess a command line switch would have to select the allocation method GC/bump-pointer at runtime in rmem.d. Memory overhead is expected to be about 25% for the GC as it allocates in blocks that are a power of 2 (up to 8k).

Building phobos without unittests is probably not a good candidate for freeing memory as most data is still referenced in the AST. There have to be more intermediate results caused by CTFE, for example.

@kinke (Member Author) commented Nov 25, 2018

Building phobos without unittests is probably not a good candidate for freeing memory as most data is still referenced in the AST.

Well, for Win64 the actually referenced data after a final collection was 544 MB (one third of the allocated GC memory), see the first post.

@kinke (Member Author) commented Nov 25, 2018

I guess a command line switch would have to select the allocation method GC/bump-pointer at runtime in rmem.d.

The difficulty is that the non-GC version also overrides the allocation functions (_d_allocmemory etc.) at link-time. No idea how to optionally still call the original druntime functions.

Memory overhead is expected to be about 25% for the GC as it allocates in blocks that are a power of 2 (up to 8k).

In the Phobos all-at-once case it's 34% (edit: overall memory, incl. C++ heap, so it's even higher). Do you think the implicit align(16) (for 64-bit code) is another reason for the waste? Edit: Ah, the non-GC allocation function also pads to a multiple of 16 bytes (even for 32-bit).

@kinke (Member Author) commented Nov 25, 2018

What apparently eats lots of memory is CTFE and templates. IIRC, building the ldc-build-runtime tool (300 lines, but importing Phobos templates) [edit: and ldc-prune-cache too, 100 lines] requires about twice as much memory as the full DMD frontend all-at-once.

@kinke (Member Author) commented Nov 25, 2018

as it allocates in blocks that are a power of 2 (up to 8k).

You mean for each instance? So if we have extern(C++) class C { size_t[4] blub; }, the size is padded from 8 (vptr) + 4x8 = 40 bytes to 64? And so from 72 bytes to 128, 136 to 256 etc.?

@kinke changed the title from "WIP: Enable GC for front-end and collect once manually before codegen" to "WIP: Optionally enable GC for front-end" on Nov 25, 2018
@kinke (Member Author) commented Dec 1, 2018

With Rainer's druntime PR above, this is beginning to look useful. Wall-clock times + max resident set sizes (in MB, for the full process, not just GC memory) for some compilations (no optimizations, LDC/LLVM assertions enabled) on Linux x64, comparing LDC_WITH_GC=ON -lowmem + additional GC bins ("new") against current master:

druntime all-at-once
	master:            0:07.22 /  198 MB 
	new:               0:07.99 /  204 MB (+10.7% / +3.0%)
	new hsf=1.5:       0:08.07 /  187 MB (+11.8% / -5.6%)
	new mps=1:         0:08.09 /  206 MB (+12.0% / +4.0%)
	new hsf=1.5 mps=1: 0:08.44 /  180 MB (+16.9% / -9.1%)
Phobos all-at-once
	master:            0:33.84 / 1374 MB 
	new:               0:39.43 / 1092 MB (+ 16.5% / -20.5%)
	new hsf=1.5:       0:40.88 /  863 MB (+ 20.8% / -37.2%)
	new mps=1:         0:57.78 /  834 MB (+ 70.7% / -39.3%)
	new hsf=1.5 mps=1: 1:13.42 /  787 MB (+117.0% / -42.7%)
std.regex.internal.tests unittests
	master:            0:22.35 / 1734 MB 
	new:               0:31.56 /  663 MB (+ 41.2% / -61.8%)
	new hsf=1.5:       1:17.76 /  624 MB (+247.9% / -64.0%)
	new mps=1:         1:38.90 /  619 MB (+342.5% / -64.3%)
	new hsf=1.5 mps=1: 2:36.28 /  555 MB (+599.2% / -68.0%)
std.string unittests
	master:            0:27.06 / 1491 MB 
	new:               0:31.82 /  913 MB (+17.6% / -38.8%)
	new hsf=1.5:       0:33.53 /  806 MB (+23.9% / -45.9%)
	new mps=1:         0:33.44 /  917 MB (+23.6% / -38.5%)
	new hsf=1.5 mps=1: 0:35.63 /  756 MB (+31.7% / -49.3%)
std.typecons unittests
	master:            0:20.02 / 1673 MB 
	new:               0:26.69 / 1447 MB (+33.3% / -13.5%)
	new hsf=1.5:       0:28.68 / 1336 MB (+43.3% / -20.1%)
	new mps=1:         0:29.98 / 1392 MB (+49.8% / -16.8%)
	new hsf=1.5 mps=1: 0:33.19 / 1307 MB (+65.8% / -21.9%)
std.range.package unittests
	master:            0:34.36 / 2023 MB 
	new:               0:43.30 / 1689 MB (+26.0% / -16.5%)
	new hsf=1.5:       0:45.29 / 1547 MB (+31.8% / -23.5%)
	new mps=1:         0:44.15 / 1700 MB (+28.5% / -16.0%)
	new hsf=1.5 mps=1: 0:49.74 / 1479 MB (+44.8% / -26.9%)
std.algorithm.searching unittests
	master:            0:21.00 / 1795 MB 
	new:               0:27.51 / 1236 MB (+31.0% / -31.1%)
	new hsf=1.5:       0:29.00 / 1100 MB (+38.1% / -38.7%)
	new mps=1:         0:29.72 / 1196 MB (+41.5% / -33.4%)
	new hsf=1.5 mps=1: 0:34.04 / 1122 MB (+62.1% / -37.5%)
std.container.rbtree unittests
	master:            0:29.56 / 1805 MB 
	new:               0:36.94 / 1380 MB (+25.0% / -23.5%)
	new hsf=1.5:       0:37.97 / 1464 MB (+28.5% / -18.9%)
	new mps=1:         0:40.57 / 1408 MB (+37.2% / -22.0%)
	new hsf=1.5 mps=1: 0:43.32 / 1342 MB (+46.5% / -25.7%)

The memory requirements can most likely be lowered further by tweaking the GC params.
Edit: I added measurements for experimental runs with a heapSizeFactor of 1.5 (default: 2.0), labelled hsf=1.5 above.
Edit2: And with maxPoolSize=1 (default: 64), labelled mps=1.
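For the record, the hsf/mps rows correspond to GC settings applied via the rt_options global from earlier in the thread (the --DRT-* command-line route only came later), e.g.:

extern(C) __gshared string[] rt_options = [ "gcopt=heapSizeFactor:1.5 maxPoolSize:1" ];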

@kinke (Member Author) commented Dec 2, 2018

I got rid of the CMake option and now switch at runtime: the compiler either runs with an enabled GC (-lowmem) or keeps running with a disabled GC + bump-pointer allocation scheme (faster, and with less memory overhead than a disabled GC alone). The additional runtime overhead of supporting the 2 modes isn't really measurable.
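A minimal sketch of what that runtime dispatch in dmd.root.rmem might look like (hypothetical names; the real code differs, and the flag would be set by -lowmem):

import core.memory : GC;
import core.stdc.stdlib : malloc;

__gshared bool isGCEnabled; // set to true when -lowmem is passed

// Hypothetical stand-in for the existing bump-pointer scheme (memory is never freed).
private void* bumpPointerAlloc(size_t size)
{
    return malloc(size); // the real allocator carves from large pre-allocated chunks
}

// Error handling omitted; the real Mem.xmalloc aborts on allocation failure.
void* xmalloc(size_t size)
{
    return isGCEnabled
        ? GC.malloc(size)         // GC-managed: reclaimable by collections
        : bumpPointerAlloc(size); // fast, but the heap only ever grows
}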

@rainers (Contributor) commented Dec 4, 2018

Looks good. It seems the default GC options strike the best balance between speed and memory size. I wonder how much the enforced collection before termination costs (https://github.com/dlang/druntime/blob/master/src/gc/proxy.d#L94). As it is unnecessary for 99% of all programs, I think it should be hidden behind a GC option.

@kinke (Member Author) commented Dec 4, 2018

Yeah, the defaults seem reasonable; playing with a slightly lower heapSizeFactor (something like 1.67 / 1.75) might pay off in some cases.

I guess the extra collection at shutdown probably doesn't really matter here; the finalization cost should be very low for the front-end (Array dtors, nothing else coming to mind immediately). Hiding it may be a good idea; I'm wondering what that 'popular demand' for reenabling it was. ;)
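As an aside, the cleanup GC option used further down in this thread looks like exactly that kind of knob: cleanup:none skips the collection/finalization pass at shutdown, e.g. (mirroring the later Win64 command, purely illustrative here):

>bin\ldc2 -unittest -c ..\ldc\runtime\phobos\std\regex\internal\tests.d -lowmem "--DRT-gcopt=cleanup:none"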

@kinke (Member Author) commented Dec 7, 2018

@rainers: Do you think it'd be worth it trying to upstream the 2 modes at runtime (& -lowmem switch or whatever), getting rid of compile-time version (GC)?

@rainers (Contributor) commented Dec 8, 2018

Do you think it'd be worth it trying to upstream the 2 modes at runtime (& -lowmem switch or whatever), getting rid of compile-time version (GC)?

I'd approve, but it might get some opposition from nano-cycle-counting people (who seem fine with calling the virtual function ti.initializer multiple times). Maybe it can still be versioned, but integrated with the other code, i.e. version(GC) if(!isGCDisabled) {}. Will it still work with GC and OVERRIDE_MEMALLOC=0?

Two observations:

  • xstrdup now crashes for s==null
  • !isGCDisabled is a double negative used a lot, I'd prefer isGCEnabled

@kinke (Member Author) commented Dec 8, 2018

but it might get some opposition from nano-cycle-counting people

That's what I was afraid of ;) - but yeah, keeping version (GC) with different semantics shouldn't make much of a difference.

Will it still work with GC and OVERRIDE_MEMALLOC=0?

I think it should; OVERRIDE_MEMALLOC should only matter for a disabled GC, decreasing runtime and memory requirements.

xstrdup now crashes for s==null

Yep; I took the version (GC) code as baseline, which doesn't check for null, but only later realized that that code isn't used at the moment. I'll change it to the safe variant, even though it's apparently not really required.

!isGCDisabled is a double negative used a lot, I'd prefer isGCEnabled

Makes sense.

@kinke (Member Author) commented Mar 2, 2019

Rebased (the 2.085 druntime features the GC improvements regarding lower memory overhead) and resynced with the upstream PR, incl. -lowmem tests for all targets and --DRT-* CLI support.

@kinke changed the title from "WIP: Optionally enable GC for front-end" to "Optionally enable GC for front-end" on Mar 2, 2019
@kinke changed the title from "Optionally enable GC for front-end" to "WIP: Optionally enable GC for front-end" on Mar 2, 2019
@kinke (Member Author) commented Mar 3, 2019

@rainers: Wow, I'm now getting much better timings for hsf=1.5 mps=1. Timings on Win64 with this PR:

  • ldc2 -unittest -c ..\ldc\runtime\phobos\std\regex\internal\tests.d: 8.99 secs
  • ldc2 -unittest -c ..\ldc\runtime\phobos\std\regex\internal\tests.d -lowmem "--DRT-gcopt=heapSizeFactor:1.5 maxPoolSize:1": 17.44 secs (+94%)

My previous test (see above) was on Linux (and using an rt_options global), where it ran 7x longer than with bump-ptr (!). The memory savings are unchanged (~1,700 MB -> 560 MB).

Can this huge speed-up (3.5x) be attributed to your 2.085 GC improvements? Edit: That module was split in 2 with 2.085.

@kinke (Member Author) commented Mar 3, 2019

For those regex tests, I also see a consistent ~3% runtime improvement when switching from bump-ptr to a simple disabled GC without cleanup (-lowmem "--DRT-gcopt=disable:1 cleanup:none"), at the cost of ~6% higher memory requirements.

@kinke changed the title from "WIP: Optionally enable GC for front-end" to "Optionally enable GC for front-end" on Mar 3, 2019
@rainers (Contributor) commented Mar 3, 2019

The only explicit performance improvement was for sweeping large allocations (with maybe a slight effect on allocation, too), but I doubt dmd makes many large allocations (>2kB).

For those regex tests, I also see a consistent ~3% runtime improvement when switching from bump-ptr to a simple disabled GC without cleanup (-lowmem "--DRT-gcopt=disable:1 cleanup:none")

Not sure why it can be faster, but it's similar to my observation that compiling phobos showed no difference within measurement accuracy. Maybe it's because the GC malloc/free functions are faster than C malloc/free for Mem.xmalloc/xfree.
