Optionally reallocate large+small buffer after each module #537

Merged on Nov 21, 2024 (5 commits)

jodavies
Collaborator

@jodavies jodavies commented Jun 7, 2024

This is a test implementation of a request from @msloechner, which makes FORM's resident set size better track its actual memory usage. This comes on top of the split allocations branch, since it only deals with the largest buffer. Since FORM allocates its buffers and keeps them for the whole run, the apparent memory usage remains constant after a "peak". If the OS is under memory pressure it will swap pages, but they might be pages that FORM actually isn't using any more.

It seems there are two ways to do it: free and re-allocate the buffer, or use madvise to specify MADV_DONTNEED. Both have the same effect and about the same performance, but both cost more than 10% in runtime compared to doing nothing. This shouldn't be done by default, and maybe never for "every module". (However, the longer each module takes, with lots of disk access during sorting etc., the smaller the performance impact will be.)
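
For readers less familiar with the two approaches, here is a minimal C sketch of each. It is not the code in this PR: the function names are made up for illustration, and it assumes a Linux/glibc system. Note that free followed by malloc typically only returns pages to the OS when the allocator serves the request with its own mapping (as glibc usually does for large buffers), and that madvise requires a page-aligned address.

```c
#define _DEFAULT_SOURCE   /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Option 1 (hypothetical helper): free the buffer and allocate a fresh one
 * of the same size. The old contents are lost.
 * Returns the new buffer, or NULL if the allocation fails. */
void *release_by_realloc(void *buf, size_t size)
{
    free(buf);
    return malloc(size);
}

/* Option 2 (hypothetical helper): keep the allocation but tell the kernel
 * that its contents are no longer needed. buf must be page-aligned.
 * Returns 0 on success, -1 on error (as madvise does). */
int release_by_madvise(void *buf, size_t size)
{
    assert((uintptr_t)buf % (uintptr_t)sysconf(_SC_PAGESIZE) == 0);
    return madvise(buf, size, MADV_DONTNEED);
}
```

In both cases the buffer stays usable afterwards, but its pages only count towards RSS again once they are touched.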

The options could be a preprocessor statement like #reallocatesort, which must be specified when the user really wants this (say, after a heavy module), or maybe just On reallocatesort; to enable it for every module that follows. Any thoughts?

@msloechner, can you test this and see if it helps your multiple concurrent jobs scenario? I could also see it being useful when running FORM on cluster machines which kill jobs based on RSS: both of these commits have a lower peak RSS in my tests compared to the original behaviour.

Here is a small MINCER performance test for orig (unmodified form), split (checking the split allocations have no impact), realloc (the first commit) and madv (the second):

Benchmark 1: nice -n -10 ../bin/tform-test-SR-orig -w12 calcdia.frm > calcdia.log1
  Time (mean ± σ):     16.686 s ±  0.216 s    [User: 152.647 s, System: 1.269 s]
  Range (min … max):   16.252 s … 16.909 s    8 runs

Benchmark 2: nice -n -10 ../bin/tform-test-SR-split -w12 calcdia.frm > calcdia.log2
  Time (mean ± σ):     16.718 s ±  0.320 s    [User: 152.869 s, System: 1.197 s]
  Range (min … max):   16.166 s … 17.113 s    8 runs

Benchmark 3: nice -n -10 ../bin/tform-test-SR-realloc -w12 calcdia.frm > calcdia.log3
  Time (mean ± σ):     18.932 s ±  0.154 s    [User: 156.316 s, System: 6.020 s]
  Range (min … max):   18.730 s … 19.200 s    8 runs

Benchmark 4: nice -n -10 ../bin/tform-test-SR-madv -w12 calcdia.frm > calcdia.log4
  Time (mean ± σ):     18.942 s ±  0.248 s    [User: 157.591 s, System: 5.870 s]
  Range (min … max):   18.585 s … 19.308 s    8 runs

Summary
  nice -n -10 ../bin/tform-test-SR-orig -w12 calcdia.frm > calcdia.log1 ran
    1.00 ± 0.02 times faster than nice -n -10 ../bin/tform-test-SR-split -w12 calcdia.frm > calcdia.log2
    1.13 ± 0.02 times faster than nice -n -10 ../bin/tform-test-SR-realloc -w12 calcdia.frm > calcdia.log3
    1.14 ± 0.02 times faster than nice -n -10 ../bin/tform-test-SR-madv -w12 calcdia.frm > calcdia.log4

Here is the RSS profile of this test:
[Image: rss]

@jodavies jodavies marked this pull request as draft June 7, 2024 10:48
@msloechner

Thanks a lot, @jodavies. This indeed looks promising. I will look into it when I find a bit of time.

@jodavies
Collaborator Author

jodavies commented Jun 7, 2024

In case it is useful to see how your own scripts behave, I logged the RSS during the run with something like

# Run the job in the background and sample its RSS (in KiB) every 0.2 s.
../bin/tform-test-SR-split -w8 calcdia.frm > calcdia.log1 &
pid=$!
while kill -0 "$pid" 2>/dev/null; do
   ps --pid "$pid" -o rss=
   sleep 0.2
done > tform-rss-orig.dat

@jodavies
Collaborator Author

jodavies commented Jun 13, 2024

I added an On/Off switch for this. I think it is better than just a preprocessor instruction, since the option can easily be enabled for multiple modules (nice when calling procedures etc.) and can still be used for a "single module" (by turning it off again in the next).

I think the "free and reallocate" version is simpler in that we don't need to worry about aligned memory allocations and whether or not this works in Windows.

Edit: I don't know what is going on with valgrind on the runners these days, I see a lot of:

valgrind:  Fatal error at startup: a function redirection
valgrind:  which is mandatory for this platform-tool combination
valgrind:  cannot be set up.  Details of the redirection are:
valgrind:  
valgrind:  A must-be-redirected function
valgrind:  whose name matches the pattern:      strlen
valgrind:  in an object with soname matching:   ld-linux-x86-64.so.2
valgrind:  was not found whilst processing
valgrind:  symbols from the object with soname: ld-linux-x86-64.so.2

@coveralls

Coverage Status

coverage: 49.981% (-0.02%) from 49.999%
when pulling 6a5c6d5 on jodavies:sort-realloc
into 83e3d41 on vermaseren:master.

@msloechner

msloechner commented Jun 13, 2024

I've run an example for my use case and it looks quite suitable: I don't see any significant loss in performance in terms of CPU time, while benefiting from the reduction in resident set size.

For me this improvement will mainly matter in (single-threaded) form, for jobs where the time spent in single modules greatly exceeds the time needed for sorting and allocating memory, but which do not benefit much from tform parallelisation. But I also see the benefit for tform as soon as it is able to combine a lot of single terms in a .sort.

Note that in the example below the ratio between the runtimes after and before the peak in memory consumption is rather low; in other cases (that would take too long to benchmark) the ratio might be around 5-10.
[Image: RSS]

@msloechner

@jodavies Maybe it would be great to also have a preprocessor command #sortreallocate to be active only for a single module. Could you possibly add this?

@jodavies
Collaborator Author

Do you strongly prefer that to doing

On sortreallocate;
...
.sort
Off sortreallocate;

?

Of course it is straightforward to have both options, so I can add a #sortreallocate too.

@msloechner

I think it would be a nice feature when you're dynamically generating code with the preprocessor and happen to not have access to the previous .sort statement (but don't want to sort right now). What do you think?

@jodavies jodavies force-pushed the sort-realloc branch 2 times, most recently from d3eaca4 to 5302a8d June 13, 2024 20:45
@coveralls

Coverage Status

coverage: 49.967% (-0.03%) from 49.999%
when pulling 5302a8d on jodavies:sort-realloc
into 83e3d41 on vermaseren:master.

@jodavies jodavies added this to the v5 milestone Nov 6, 2024
@coveralls

coveralls commented Nov 7, 2024

Coverage Status

coverage: 50.25% (+0.02%) from 50.227%
when pulling 12fa261 on jodavies:sort-realloc
into 0289435 on vermaseren:master.

@jodavies jodavies changed the title from "Reallocate large+small buffer after each module" to "Optionally reallocate large+small buffer after each module" Nov 7, 2024
@jodavies
Collaborator Author

jodavies commented Nov 8, 2024

In trying to test this further, it seems that actually it does not work in tform. The SortBlocks contain pointers into the master's lBuffer (set in IniSortBlocks) which need to be updated. I have no idea how my benchmarks worked before.
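
To illustrate the problem in general terms: any structure that caches pointers into a buffer is left dangling when that buffer is freed and reallocated, so the pointers have to be recomputed from the new base address. The following C sketch shows the idea only. `BlockTable`, `NUM_BLOCKS` and both functions are hypothetical stand-ins, not FORM's SortBlocks or IniSortBlocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_BLOCKS 4   /* hypothetical number of blocks */

/* Hypothetical stand-in for a set of sort blocks: each entry points to the
 * start of one slice of a shared buffer that is owned elsewhere. */
typedef struct {
    long  *block_start[NUM_BLOCKS];
    size_t block_len;              /* elements per block */
} BlockTable;

/* Recompute every block pointer from the buffer's current base address.
 * Called after the first allocation and again after each reallocation. */
void update_block_pointers(BlockTable *t, long *base, size_t total_len)
{
    assert(t != NULL && base != NULL);
    t->block_len = total_len / NUM_BLOCKS;
    for (size_t i = 0; i < NUM_BLOCKS; i++)
        t->block_start[i] = base + i * t->block_len;
}

/* Free and reallocate the shared buffer, then repair the pointers that
 * referred to the old allocation. Returns the new base, or NULL on failure,
 * in which case the table's pointers are dangling and must not be used. */
long *reallocate_shared_buffer(BlockTable *t, long *old_base, size_t total_len)
{
    free(old_base);
    long *new_base = malloc(total_len * sizeof *new_base);
    if (new_base != NULL)
        update_block_pointers(t, new_base, total_len);
    return new_base;
}
```

Skipping the pointer update after the reallocation would leave every `block_start` entry pointing into freed memory, which is consistent with a crash on first use.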

@msloechner if you happen to have your test code still around, are you by chance able to check if you used the reallocation mode or the (now deleted) madvise version?

@msloechner

I just tried to run with the tform binaries I compiled in June for buffer reallocation. The tform buffer reallocation immediately aborts with a segfault.
Regarding the madvise version, I can't find the binary on my disk anymore and it seems you removed any trace of it from GitHub, so I'm no longer able to test it. But it must have been the madvise version that I got working with tform back in the day.

@jodavies
Collaborator Author

Rebased on master (which now has the separated sort allocation). Merged the tform fix into the first commit, to avoid a broken commit in the master history.

@jodavies
Collaborator Author

This branch just force-enables the reallocation for every module, just to see how it goes on the CI: https://github.com/jodavies/form/tree/sort-realloc-ci-test

The test for Issue508 fails, where we have an 8G ulimit, and a reallocation fails. Outside of valgrind it runs OK with the ulimit, so I think we can ignore this.

@jodavies jodavies marked this pull request as ready for review November 14, 2024 14:26
@jodavies jodavies requested a review from tueda November 14, 2024 14:27
Review comment on doc/manual/prepro.tex (outdated, resolved)
Review comment on sources/structs.h (outdated, resolved)
@tueda
Collaborator

tueda commented Nov 15, 2024

Maybe I'll add a minor review comment to this PR today, but I'll probably review the core part tomorrow. (Right now, I'm hitting some minor VSCode C++ extension issues...)

Review comment on sources/threads.c (outdated, resolved)
@tueda
Collaborator

tueda commented Nov 16, 2024

The code looks fine.

But why did you choose #reallocatesort instead of ModuleOption reallocatesort for reallocation at the end of the module? To me, #reallocatesort sounds like an immediate reallocation (which would be fine during preprocessing).

@jodavies
Collaborator Author

Not really any reason, I suppose. It would also fit nicely as a moduleoption alongside "inparallel" and "noparallel", i.e. options which don't really have anything to do with the processing of the actual terms (like "local" etc.) but more to do with the high-level operation of the module.

I am not opposed to adding a moduleoption (and removing the pre-proc command?) if you prefer.

@tueda
Collaborator

tueda commented Nov 18, 2024

Actually, I don't have a strong opinion either :-)

ModuleOption sortreallocate seems like a good option, but prioritizing user experience is more important.
One disadvantage of using ModuleOption is that it has to be placed at the end of modules (see also #188).

I'm not entirely sure about the use case described in

"I think it would be a nice feature when you're dynamically generating code with the preprocessor and happen to not have access to the previous .sort statement (but don't want to sort right now)."

So, @msloechner, what do you think? Which syntax option would be easy to use?

The current implementation supports both

* reallocate buffer after the sort in this module.
On sortreallocate;
*
* heavy task here...
*
.sort
Off sortreallocate;

and

* reallocate buffer after the sort in this module.
#sortreallocate
*
* heavy task here...
*
.sort

Instead of the latter, we could implement

*
* heavy task here...
*
* reallocate buffer after the sort in this module.
ModuleOption sortreallocate;
.sort

and/or

*
* heavy task here...
*
.sort
* reallocate buffer immediately.
#sortreallocate

RSS decreases. If the OS is under memory pressure, it will not be
swapping out useless pages.

The SortBlocks contain pointers into the master thread's lBuffer, which
need updating after reallocation. Use an "UpdateSortBlocks" function to
do this: a trimmed-down version of IniSortBlocks that only sets the
pointers.

Enable the reallocation for a single module. If specified in the same
module as "Off sortreallocate;", the reallocation will still happen in
that module.
@jodavies
Collaborator Author

Rebased and review addressed.

@msloechner

@tueda: Very sorry for answering only now. I would prefer the first option, because the sort reallocation can only take place at a .sort or .store, and in FORM, statements like ModuleOption become effective at the end of the module.

@tueda
Collaborator

tueda commented Nov 20, 2024

@msloechner Thanks for your input! So, by "the first option", you mean the current implementation (#sortreallocate for delayed reallocation)? (Or the first alternative ModuleOption sortreallocate?)

If the current one is the best, then I think this PR can be merged.

@msloechner

By "first option" I meant On sortreallocate and #sortreallocate for delayed reallocation.

@jodavies jodavies merged commit 328e810 into vermaseren:master Nov 21, 2024
70 checks passed