New PERL_STRLEN_NEW_MIN & PERL_ARRAY_NEW_MIN_KEY definitions #20529

richardleach · 2022-11-19T23:17:54Z

This PR consists of two commits, both concerned with the
hardcoded default minimum length for PV buffers and AV arrays.

Extracts the hardcoded numbers into two new definitions:

        PERL_ARRAY_NEW_MIN_KEY
        PERL_STRLEN_NEW_MIN

Increases the value of PERL_STRLEN_NEW_MIN to be
closer to the likely actual allocation returned by malloc().

The intention is to reduce the potential for unnecessary
calls to sv_grow, such as when building a string through
concatenation, when the PV buffer is actually big enough
for the operation and perl just does not know it.

The minimum value chosen for 64-bit is very likely too
small, but I am unfamiliar with memory-constrained
systems and wanted to play it safe. (Happy to raise it to
the very common 3*8 bytes if people are comfortable
with that.)

Actually measuring the minimum allocation size at
Configure time, for malloc() implementations that support
it, would be more accurate and I might try that in the next
development cycle.

The PERL_UNWARANTED_CHUMMINESS_WITH_MALLOC
stuff would also help with this problem, but I've previously been
told that it was disabled due to problems with COW that have
not been ironed out yet.

In a nutshell, for a long time the minimum PV length (hardcoded in Perl_sv_grow) has been 10 bytes and the minimum AV array size (hardcoded in av_extend_guts) has been 4 elements. These numbers have been used elsewhere for consistency (e.g. Perl_sv_grow_fresh) in the past couple of development cycles. Having a standard definition, rather than hardcoding in multiple places, is more maintainable. This commit therefore introduces into perl.h: PERL_ARRAY_NEW_MIN_KEY PERL_STRLEN_NEW_MIN (Note: Subsequent commit(s) will actually change the values.)

Major malloc implementations, including the popular dlmalloc derivatives all return chunks of memory that are a multiple of the platform's pointer size. Perl's traditional default string allocation of 10 bytes will almost certainly result in a larger allocation than requested. Consequently, the interpreter may try to Renew() an allocation to increase the PV buffer size when it does not actually need to do so. This commit increases the default string size to the nearest pointer multiple. (12 bytes for 32-bit pointers, 16 bytes for 64-bit pointers). This is almost certainly unnecessarily small for 64-bit platforms, since most common malloc implementations seem to return 3*pointer size (i.e. 24 bytes) as the smallest allocation. However, 16 bytes was chosen to prevent an increase in memory usage in memory-constrained platforms which might have a smaller minimum memory allocation.

demerphq · 2022-11-20T09:54:02Z

comment replaced. see below.

demerphq · 2022-11-20T10:05:04Z

This PR consists of two commits, both concerned with the
The PERL_UNWARANTED_CHUMMINESS_WITH_MALLOC
stuff would also help with this problem, but I've previously been
told that it was disabled due to problems with COW that have
not been ironed out yet.

I vaguely remember that. I believe it is related to sv_gets(). IIRC we end
up massively over-allocating when we use it. We need to be super careful
with adjusting some of these things, they can have surprising and
unintended action at a distance on some of our code.

If you look for a thread titled "sv_grow() and malloc" you will find a
discussion by Dave M and me about this. He provided the following summary
which I am quoting below verbatim, his summary of the work i did to fix
sv_gets() is better than any I did myself I think, and he did a lot of
additional research on the subject:

Sayeth @iabyn (Dave Mitchel): in https://marc.info/?l=perl5-porters&m=141147301828156&w=3

I've spent the last week or two (on and off) looking into how all the
malloc and COW stuff interacts.

TL;DR: COW caused problems with memory wastage when using readline, so
sv_setsv_flags() was modified to skip COW if SvLEN was quite bigger than
SvCUR; this then caused problems on platforms with malloc_size() (Darwin
and perl's malloc as far as I know), so calling malloc_size() was
disabled. I believe that the SvCUR verses SvLEN test is sub-optimal in
some cases, and can be tweaked, after which malloc_size() can be
re-enabled. Further, I believe a run-time startup probe of malloc() and
realloc() behaviour on platforms without malloc_size() can provide much of
the benefit of malloc_size().

Background:

There are two core functions I'm interested in here: sv_setsv_flags(),
which for strings, decides when to copy, and when to COW; and sv_grow(),
which mallocs or reallocs the string buffer when the SV is first used as a
string, and when it is subsequently needed to hold a bigger string.

sv_grow() applies various fiddle factors to the requested length; these
come in two classes; first those where we think we're smarter than the
caller (e.g. sometimes over-allocating by typically 25% to avoid lots of
reallocs if the string is being repeatedly appended to), and second where
we start being chummy with the malloc library. The first class I think
can be improved, but I'm going to leave that discussion for another time.

As regards chumminess, some platforms have a malloc_size() function which
returns, for a malloc()ed address, the real size allocated, which is likely
to be larger than the requested size due to alignment, implementation and
efficiency considerations. Where available, perl uses this function to
post-hoc adjust SvLEN. On platforms without malloc_size(), we instead
round the request up to sizeof(size_t), on the grounds that the malloc
library is likely to do the same. So overall, sv_grow(sv, len) looks
something like:

    len += adjust for various adjusts;

    #ifdef Perl_safesysmalloc_size
        s = malloc(len);
        SvPV_set(sv, s);
        SvLEN_set(sv, Perl_safesysmalloc_size(s));
    #else
        len = round_up(len);
        SvPV_set(sv, malloc(len));
        SvLEN_set(sv, len);
    #endif

The other function of interest, sv_setsv_flags() (which is the thing
called for perl-level scalar assignment), has a set of criteria to decide
when to copy the string, when to steal it, and when to share it (COW). For
example, if the string is relatively short and the destination string
already has a buffer big enough to hold it, then it's cheaper to just copy
the string rather than set up COW.

The issues:

A while back we had an issue with COW and readline() interacting:
code like this:

    push @a, $_ while (<>);

was causing each pushed SV to be large, even if the lines were short. This
was due to sv_gets() initially allocating a large buffer to read the first
line; rather than a copy being pushed, the original large SV is COWed and
pushed; when readline read the second line, it noticed that the buffer was
shared, abandoned it, and created a new (large) buffer. It's generally
agreed that the whole readline()/sv_gets() buffer allocation strategy
needs rethinking anyway, but in the meantime, and also to help avoid
similar situations elsewhere, Yves added to 5.20.0, with commit
e8c6a47, some extra criteria to sv_setsv_flags() on when to skip
COW: namely when the src's SvLEN is quite a bit bigger than its SvCUR,
(for some definition of "quite".) This solved the readline() issue.

This subsequently caused an issue on Darwin (and any platform that has
malloc_size()):

    [perl #121975] COW speedup lost after e8c6a474

In that ticket, something like:

    $x = "x" x 1_000_000;
    $y = $x for 1..1_000_000;

was slow (took minutes) up to and including 5.18.0; it then became fast
(< 1 sec) when COW was enabled; it became slow again when the SvCUR verses
SvLEN checks were added. This was because perl was doing malloc(1000002),
and Darwin's malloc() library was rounding this to a nice block, 1003520
(0xf4242 => 0xf5000). When we set SvLEN to malloc_size(), we ended up with
SvCUR = 100000, SvLEN = 1003520; the difference between the two was enough
to trigger the "skip COW" test. Yves worked round this by disabling calls
to malloc_size() by default, with ce861ea:

temporary fix for [perl #121975] COW speedup lost after e8c6a474

What I propose:

I think that the criteria for skipping COW on divergent SvCUR/SvLEN
need tweaking; I'll discuss this further below. Once that's fixed,
I think it will be safe to revert ce861ea, and platforms that have
malloc_size() will again benefit from this slight optimisation.

Further, based on the outputs of that little program I asked people
to run on various platforms, I think it is possible at perl startup time,
to do a probe consisting approximately of

    malloc(1); realloc($_) for 2..N

which will allow us to determine two constants A and B such that the size
in any call to malloc() or realloc() in sv_grow() can be rounded up to to
A+nB for some n with no extra memory wasted (for example on 64-bit linux,
A=24, B=16). This would replace the existing "round up to sizeof(size_t)"
on non-malloc_size() platforms (equivalent to A=8,B=8). We'd continue to
use malloc_size() on platforms that support it. It may also be the case
that there are other platform-specific malloc() introspection functions
that we could add Configure probes for and use where available. But I'll
want to think further about the A+nB and probing issues and discuss this
later.

Finally, there are issues with sv_grow()'s "I know better than the caller"
over-allocation strategies; which I want to fix, but again I'll discuss
this further later.

What now:

So the main thing I want to do right now is tweak the SvCUR verses SvLEN
testing, then revert the malloc_size() disabling. The three tables below
show how sv_setsv_flags() decides whether to steal, copy or COW based on
various values of SvCUR of the source string, plus various other criteria
(some of which I have not mentioned since they are irrelevant to this
discussion, such as the cow refcount being > SV_COW_REFCNT_MAX).

The main action shown in each table is that which prevailed before the
e8c6a47 additional "SvLEN >> SvCUR" restrictions; the comments in
[square brackets] describe the additional restrictions following that
commit, while the numbers in {} are examples of my proposed changes to the
criteria (see later more details).

The main pre-patch criterion was to copy rather than COW if the string was
short (< 1250 bytes) and would already fit in the destination buffer (and
the buffer already existed); the added post-patch criteria were to skip
COW (and thus favour copy) if SvLEN > 2*SvCUR or SvLEN > SvCUR + 80.

First, test whether the buffer is swipeable:

    just steal the buffer from the src SV if:
        src not COW, RC==1, and various other constraints,
        and the SV is either SVs_TEMP or SVs_PADTMP:

     src:CUR  SVs_TEMP    SVs_PADTMP
         ---  --------    ----------
           1     STEAL    DONT-STEAL
           2     STEAL    DONT-STEAL
           3     STEAL    DONT-STEAL
          ..
          78     STEAL    DONT-STEAL
          79     STEAL    DONT-STEAL
          80     STEAL    DONT-STEAL
          81     STEAL    DONT-STEAL
          82     STEAL    DONT-STEAL
          83     STEAL    DONT-STEAL
         ...
        1249     STEAL    DONT-STEAL
        1250     STEAL    STEAL [if src:LEN < 1330   {1562} ]
        1251     STEAL    STEAL [if src:LEN < 1331   {1563} ]
        1252     STEAL    STEAL [if src:LEN < 1333   {1565} ]
        ....
     1000000     STEAL    STEAL [if src:LEN < 100080 {1250000} ]
     1000001     STEAL    STEAL [if src:LEN < 100081 {1250001} ]

If not swipeable, then test for COWability. There are two sets of
criteria, based on whether the src SV is already marked as COW or not
(there is a greater cost in upgrading it to COW)

If the src is already COW:

    COW if:
       src:LEN == 0 (shared hash key)
    || (src:CUR >= 1250 && src:LEN-src:CUR < 80 && src:LEN < src:CUR*2)
    || dst:LEN < src:CUR+1

     src:CUR
         ---
           1     COW if wont fit in dst buf
           2     COW if wont fit in dst buf
           3     COW if wont fit in dst buf
          ..
          78     COW if wont fit in dst buf
          79     COW if wont fit in dst buf
          80     COW if wont fit in dst buf
          81     COW if wont fit in dst buf
          82     COW if wont fit in dst buf
          83     COW if wont fit in dst buf
         ...
        1249     COW if wont fit in dst buf
        1250     COW always [ if src:LEN < 1330    {1562}    ]
        1251     COW always [ if src:LEN < 1331    {1563}    ]
        1252     COW always [ if src:LEN < 1332    {1565}    ]
        ....
     1000000     COW always [ if src:LEN < 1000080 {1250000} ]
     1000001     COW always [ if src:LEN < 1000081 {1250001} ]

If the src is not already COW:

    COW if:
       src SV is capable of being converted to COW
          (inc 1 spare byte available for refcnt)
    && (src:CUR >= 0 && src:LEN-src:CUR < 80 && src:LEN < src:CUR*2)
    && (   (src:CUR >= 1250 && src:LEN-src:CUR < 80 && src:LEN < src:CUR*2)
        || dst:LEN < src:CUR+1
       )

     src:CUR
         ---
           1     COW if wont fit in dst buf [ and if src:LEN <   2  {24}]
           2     COW if wont fit in dst buf [ and if src:LEN <   4  {24}]
           3     COW if wont fit in dst buf [ and if src:LEN <   6  {24}]
          ..
          78     COW if wont fit in dst buf [ and if src:LEN < 156  {97}]
          79     COW if wont fit in dst buf [ and if src:LEN < 158  {98}]
          80     COW if wont fit in dst buf [ and if src:LEN < 160  {100}]
          81     COW if wont fit in dst buf [ and if src:LEN < 161  {101}]
          82     COW if wont fit in dst buf [ and if src:LEN < 162  {102}]
          83     COW if wont fit in dst buf [ and if src:LEN < 163  {103}]
         ...
        1249     COW if wont fit in dst buf [ and if src:LEN < 1329 {1561}]
        1250     COW always                 [     if src:LEN < 1330 {1562}]
        1251     COW always                 [     if src:LEN < 1331 {1563}]
        1252     COW always                 [     if src:LEN < 1332 {1565}]
        ....
     1000000     COW always                 [     if src:LEN < 1000080 {1250000} ]
     1000001     COW always                 [     if src:LEN < 1000081 {1250001} ]

As can be seen, for large strings the SvLEN < SvCUR + 80 test dominates,
while for small strings, the existing "< 1250" test and/or the new SvLEN < SvCUR * 2 test dominate.

I think this is basically the wrong way round; I'm not sure if this was
intentional or a coding error (the && should have been a || maybe??). I
think for large strings, the test should be SvLEN < SvCUR * N for some
small constant N like 1.25, (but with N big enough such that
malloc_size() rounding up won't cause issues), while for small strings,
the SvLEN < SvCUR + M test should dominate, but I think M should be
changed to something smaller than 80. Ideally, we should round up SvLEN
using the A+nB values found from startup probing, and use that as the
limit.

So together, I propose that the criteria should be something like:
    skip COW if:
           SvLEN >= SvCUR * 1.25
        && round_up(SvLEN) >= SvCUR

With Linux/glibc/64-bit where A=24 and B=16, this would yield thresholds
that are marked with {NNN} in the tables above.

So in conclusion, I am proposing for discussion right now,
a) re-enabling malloc_size() where available, and
probing at runtime startup to determine constants A and B where mallocs
will be of the form A+nB;
b) adjusting the SvCUR verses SvLEN algorithm as suggested above.
I'm not totally wedded to my exact suggested criteria 2 paragraphs
above; I'm open to suggestions.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

richardleach · 2022-11-20T22:20:53Z

Thanks, @demerphq. I'd remembered Dave talking about runtime probing but couldn't unearth the correct discussion. That thread seemed to end on (FC's email?):

Dave Mitchell wrote:
> So in conclusion, I am proposing for discussion right now,
> a) re-enabling malloc_size() where available, and
>    probing at runtime startup to determine constants A and B where mallocs
>    will be of the form A+nB;
> b) adjusting the SvCUR verses SvLEN algorithm as suggested above.
>    I'm not totally wedded to my exact suggested criteria 2 paragraphs
>    above; I'm open to suggestions.

I agree with both.

Thoughts:

So (a) hasn't been done yet, but has (b)?
I'm unclear what the A and B constants would be. A is the smallest possible allocation? If B is some kind of scaling constant, it could break down if, as someone suggested in the thread, the malloc() has multiple different sizing strategies.
I suppose just for the point of this PR, we'd only need to determine that A, and it seems like this could be done independently of the scaling question.
Could we do the runtime test prior to runtime and bake the constants in to the interpreter, rather than having to e.g. read them out of an interpreter variable? e.g. Could miniperl do the probe and adjust a definition then used to build perl?

I also saw this, which seemed like an interesting approach: https://doc.qt.io/qt-5/containers.html#growth-strategies

Independent of all this, do you think this PR could be applied as-is, with us revisiting the rest as a follow up?

demerphq · 2022-11-21T11:57:19Z

@richardleach Yes, sorry, my mail was just explaining why this stuff is sensitive. It doesn't change the picture much except for replacing the hard coded constants with defines, with is an obvious improvement. I don't think the change in min string size makes much of a difference. So I approved and merged the patch.

I think further discussion on this should go into an issue where it can be tracked independently. FWIW, as an issue for another ticket: hashes are presized to 8 buckets, why should arrays be any different? Should hashes be reduced or arrays expanded? Food for though for a future ticket.

iabyn · 2022-11-22T12:40:38Z

On Sun, Nov 20, 2022 at 02:21:04PM -0800, Richard Leach wrote: * I'm unclear what the `A` and `B` constants would be. `A` is the smallest possible allocation? If `B` is some kind of scaling constant, it could break down if, as someone suggested in the thread, the` malloc()` has multiple different sizing strategies.

Well that was 8 years ago and I've I forgotten much; but I think the idea was based on the observation that malloc() will almost certainly round up small allocations to some sort of block size, e.g. 8 bytes. So malloc(2) followed by realloc(8) will likely re-use the same address. But further, the first block it uses may contain some malloc housekeeping overhead, meaning there is less space to use for the first block. So for example if two bytes are used for overhead, then the first block can only contain up to 6 bytes. So in this context, A=6, B=8 and blocks are filled on a A+Bn basis for n=1,2,.... Or to put it another way, whenever perl's string allocating code wants to do a malloc for a block sized anywhere from 1..6, it should do a malloc(6) and set SvLEN to 6. Similarly, for lengths of 7..14 it should malloc(14), 15..22, malloc(22) and so on. Of course this won't hold for large block sizes. But the idea being that for systems which don't have a malloc_size()-type introspection facility, perl could, by doing some basic probing at startup, do a better job of guessing what sorts of rounding-up of malloc request sizes than it does at the moment. I don't think I ever acted upon any of the suggestions in that thread. It got put on my back-burner and then forgotten about. The whole COW things needs some careful love and tweaking. In particular, IIRC, COW strings being freed or normalized (e.g.having an integer assigned to them) currently take a very slow code path. COW being set triggers SV_CHECK_THINKFIRST(), which tends to then take a slow code path checking for lots of things like SvREADONLY, SvOOK() etc. See the 'case SAVEt_CLEARSV' branch in Perl_leave_scope() for example. Finally, here's a trivial example of a malloc probe. A real one would be less crude, and would increment i by greater than 1 in later stages. #include <stdio.h> #include <malloc.h> typedef long unsigned int UV; typedef unsigned char U8; int main(int argc, char**argv) { void *m, *r, *throwaway; int i, prev_i; /* the throwaway mallocs are to encourage the allocated block to be * hemmed in, necessitating realloc() using new memory */ throwaway = malloc(1); m = malloc(1); throwaway = malloc(1); prev_i = 1; for (i = 2; i <= 2048; i++) { r = realloc(m, i); //printf("i=%3d; r=%p\n", i, r); if (r != m) { printf("at size %3d realloc used new buffer, diff=%3d\n", i, i-prev_i); m = r; prev_i = i; throwaway = malloc(1); } } } On my Linux system this output the following. It shows that for up to about 1K, A=24 bytes, B=16 bytes. $ ./c at size 25 realloc used new buffer, diff= 24 at size 41 realloc used new buffer, diff= 16 at size 57 realloc used new buffer, diff= 16 at size 73 realloc used new buffer, diff= 16 at size 89 realloc used new buffer, diff= 16 at size 105 realloc used new buffer, diff= 16 at size 121 realloc used new buffer, diff= 16 at size 137 realloc used new buffer, diff= 16 at size 153 realloc used new buffer, diff= 16 at size 169 realloc used new buffer, diff= 16 at size 185 realloc used new buffer, diff= 16 at size 201 realloc used new buffer, diff= 16 at size 217 realloc used new buffer, diff= 16 at size 233 realloc used new buffer, diff= 16 at size 249 realloc used new buffer, diff= 16 at size 265 realloc used new buffer, diff= 16 at size 281 realloc used new buffer, diff= 16 at size 297 realloc used new buffer, diff= 16 at size 313 realloc used new buffer, diff= 16 at size 329 realloc used new buffer, diff= 16 at size 345 realloc used new buffer, diff= 16 at size 361 realloc used new buffer, diff= 16 at size 377 realloc used new buffer, diff= 16 at size 393 realloc used new buffer, diff= 16 at size 409 realloc used new buffer, diff= 16 at size 425 realloc used new buffer, diff= 16 at size 441 realloc used new buffer, diff= 16 at size 457 realloc used new buffer, diff= 16 at size 473 realloc used new buffer, diff= 16 at size 489 realloc used new buffer, diff= 16 at size 505 realloc used new buffer, diff= 16 at size 521 realloc used new buffer, diff= 16 at size 537 realloc used new buffer, diff= 16 at size 553 realloc used new buffer, diff= 16 at size 569 realloc used new buffer, diff= 16 at size 585 realloc used new buffer, diff= 16 at size 601 realloc used new buffer, diff= 16 at size 617 realloc used new buffer, diff= 16 at size 633 realloc used new buffer, diff= 16 at size 649 realloc used new buffer, diff= 16 at size 665 realloc used new buffer, diff= 16 at size 681 realloc used new buffer, diff= 16 at size 697 realloc used new buffer, diff= 16 at size 713 realloc used new buffer, diff= 16 at size 729 realloc used new buffer, diff= 16 at size 745 realloc used new buffer, diff= 16 at size 761 realloc used new buffer, diff= 16 at size 777 realloc used new buffer, diff= 16 at size 793 realloc used new buffer, diff= 16 at size 809 realloc used new buffer, diff= 16 at size 825 realloc used new buffer, diff= 16 at size 841 realloc used new buffer, diff= 16 at size 857 realloc used new buffer, diff= 16 at size 873 realloc used new buffer, diff= 16 at size 889 realloc used new buffer, diff= 16 at size 905 realloc used new buffer, diff= 16 at size 921 realloc used new buffer, diff= 16 at size 937 realloc used new buffer, diff= 16 at size 953 realloc used new buffer, diff= 16 at size 969 realloc used new buffer, diff= 16 at size 985 realloc used new buffer, diff= 16 at size 1001 realloc used new buffer, diff= 16 at size 1017 realloc used new buffer, diff= 16 at size 1033 realloc used new buffer, diff= 16 at size 1049 realloc used new buffer, diff= 16

…

-- A power surge on the Bridge is rapidly and correctly diagnosed as a faulty capacitor by the highly-trained and competent engineering staff. -- Things That Never Happen in "Star Trek" #9

richardleach added 2 commits November 8, 2022 23:40

demerphq approved these changes Nov 21, 2022

View reviewed changes

demerphq merged commit 06dbfff into Perl:blead Nov 21, 2022

richardleach deleted the hydahy/min_sizes branch September 1, 2023 11:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New PERL_STRLEN_NEW_MIN & PERL_ARRAY_NEW_MIN_KEY definitions #20529

New PERL_STRLEN_NEW_MIN & PERL_ARRAY_NEW_MIN_KEY definitions #20529

richardleach commented Nov 19, 2022

demerphq commented Nov 20, 2022 via email •

edited

Loading

demerphq commented Nov 20, 2022 •

edited

Loading

richardleach commented Nov 20, 2022

demerphq commented Nov 21, 2022

iabyn commented Nov 22, 2022 via email

New PERL_STRLEN_NEW_MIN & PERL_ARRAY_NEW_MIN_KEY definitions #20529

New PERL_STRLEN_NEW_MIN & PERL_ARRAY_NEW_MIN_KEY definitions #20529

Conversation

richardleach commented Nov 19, 2022

demerphq commented Nov 20, 2022 via email • edited Loading

demerphq commented Nov 20, 2022 • edited Loading

richardleach commented Nov 20, 2022

demerphq commented Nov 21, 2022

iabyn commented Nov 22, 2022 via email

demerphq commented Nov 20, 2022 via email •

edited

Loading

demerphq commented Nov 20, 2022 •

edited

Loading