Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better errors when (Sub)SmallExtension is filled. #513

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

jodavies
Copy link
Collaborator

@jodavies jodavies commented May 4, 2024

Previously, FORM warns only about SmallExtension, even if it is a sub-buffer sort.

This prompts the user to adjust the wrong buffer in the setup.

Fixes #512

@jodavies
Copy link
Collaborator Author

jodavies commented May 4, 2024

It looks like parform hangs after printing the error here.

check/fixes.frm Show resolved Hide resolved
check/fixes.frm Show resolved Hide resolved
@coveralls
Copy link

coveralls commented May 7, 2024

Coverage Status

coverage: 50.296% (+0.07%) from 50.227%
when pulling 4411553 on jodavies:full-subsmallextension
into 0289435 on vermaseren:master.

@tueda
Copy link
Collaborator

tueda commented May 7, 2024

Hmm...For some reason tform on macos-11 doesn't print "Please increase SubSmallExtension setup parameter." before terminating.

@jodavies
Copy link
Collaborator Author

jodavies commented May 7, 2024

I have no way to test if it crashes for some other reason.

@tueda
Copy link
Collaborator

tueda commented May 11, 2024

This may be irrelevant, but on MacOS (Sonoma 14.4.1, Apple M2, Apple clang version 15.0.0), I got some malloc errors for Issue512_2:

./build/sources/vorm -D TEST=Issue512_2 check/fixes.frm 
FORM 5.0.0-beta.1 (May  4 2024, v5.0.0-beta.1-57-g1a75289)  Run: Sat May 11 17:47:37 2024 
    #ifndef `TEST' 
      #message Use -D TEST=XXX 
      #terminate 
    #else 
      #include `NAME_' # `TEST' 
    #- 
 
Time =       0.00 sec    Generated terms =          1 
            test         Terms in output =          1 
                         Bytes used      =         20 
Please increase SubSmallExtension setup parameter. 
Program terminating at check/fixes.frm Line 2722 -->  
  0.12 sec out of 0.13 sec 
vorm(25084,0x1fcafbac0) malloc: *** error for object 0x300000001: pointer being freed was not allocated 
vorm(25084,0x1fcafbac0) malloc: *** set a breakpoint in malloc_error_break to debug 

The corresponding Valgrind run on Linux is:

valgrind ./build/sources/vorm -D TEST=Issue512_2 check/fixes.frm
FORM 5.0.0-beta.1 (May  4 2024, v5.0.0-beta.1-57-g1a75289)  Run: Sat May 11 17:55:02 2024
...
Time =       0.05 sec    Generated terms =          1
            test         Terms in output =          1
                         Bytes used      =         20
==118620== Invalid write of size 4
==118620==    at 0x20C263: GarbHand (sort.c:3599)
==118620==    by 0x20CA39: AddPoly (sort.c:2199)
==118620==    by 0x20CD2F: SplitMerge (sort.c:3345)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBBB: SplitMerge (sort.c:3331)
==118620==    by 0x20CBBB: SplitMerge (sort.c:3331)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20E7EB: StoreTerm (sort.c:4440)
==118620==    by 0x1DF55F: Generator (proces.c:3366)
==118620==  Address 0x4bf8a5b8 is 0 bytes after a block of size 22,857,080 alloc'd
==118620==    at 0x4849839: malloc (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==118620==    by 0x22C03F: Malloc1 (tools.c:2322)
==118620==    by 0x206335: AllocSort (setfile.c:916)
==118620==    by 0x209261: NewSort (sort.c:641)
==118620==    by 0x126FDB: CoIdExpression (comexpr.c:824)
==118620==    by 0x12765A: CoId (comexpr.c:398)
==118620==    by 0x13ADF2: CompileStatement (compiler.c:658)
==118620==    by 0x1D76E1: PreProcessor (pre.c:1123)
==118620==    by 0x214BAE: main (startup.c:1690)
==118620==
Please increase SubSmallExtension setup parameter.
Program terminating at check/fixes.frm Line 2722 -->
==118620== Invalid read of size 4
==118620==    at 0x21443B: Terminate (startup.c:1792)
==118620==    by 0x4BF78A6F: ???
==118620==    by 0x20CA86: AddPoly (sort.c:2249)
==118620==    by 0x20CD2F: SplitMerge (sort.c:3345)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20CBBB: SplitMerge (sort.c:3331)
==118620==    by 0x20CBBB: SplitMerge (sort.c:3331)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==    by 0x20E7EB: StoreTerm (sort.c:4440)
==118620==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==118620==
  6.78 sec out of 6.79 sec
==118620== Invalid free() / delete / delete[] / realloc()
==118620==    at 0x484C89F: free (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==118620==    by 0x22C0B0: M_free (tools.c:2449)
==118620==    by 0x1404D8: ClearSortGZIP (compress.c:731)
==118620==    by 0x20FCDC: CleanUpSort (sort.c:4737)
==118620==    by 0x2140EB: CleanUp (startup.c:1710)
==118620==    by 0x21440E: Terminate (startup.c:1850)
==118620==    by 0x214A53: onErrSig (startup.c:1558)
==118620==    by 0x4D8E51F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==118620==    by 0x21443A: Terminate (startup.c:1790)
==118620==    by 0x4BF78A6F: ???
==118620==    by 0x20CA86: AddPoly (sort.c:2249)
==118620==    by 0x20CD2F: SplitMerge (sort.c:3345)
==118620==  Address 0x1f00000098 is not stack'd, malloc'd or (recently) free'd
==118620==
==118620== Invalid free() / delete / delete[] / realloc()
==118620==    at 0x484C89F: free (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==118620==    by 0x22C0B0: M_free (tools.c:2449)
==118620==    by 0x1404E8: ClearSortGZIP (compress.c:732)
==118620==    by 0x20FCDC: CleanUpSort (sort.c:4737)
==118620==    by 0x2140EB: CleanUp (startup.c:1710)
==118620==    by 0x21440E: Terminate (startup.c:1850)
==118620==    by 0x214A53: onErrSig (startup.c:1558)
==118620==    by 0x4D8E51F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==118620==    by 0x21443A: Terminate (startup.c:1790)
==118620==    by 0x4BF78A6F: ???
==118620==    by 0x20CA86: AddPoly (sort.c:2249)
==118620==    by 0x20CD2F: SplitMerge (sort.c:3345)
==118620==  Address 0x41fffffff0 is not stack'd, malloc'd or (recently) free'd
==118620==
==118620== Syscall param unlink(pathname) points to unaddressable byte(s)
==118620==    at 0x4E620AB: unlink (syscall-template.S:120)
==118620==    by 0x4DADE6C: remove (remove.c:33)
==118620==    by 0x20FCF4: CleanUpSort (sort.c:4741)
==118620==    by 0x2140EB: CleanUp (startup.c:1710)
==118620==    by 0x21440E: Terminate (startup.c:1850)
==118620==    by 0x214A53: onErrSig (startup.c:1558)
==118620==    by 0x4D8E51F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==118620==    by 0x21443A: Terminate (startup.c:1790)
==118620==    by 0x4BF78A6F: ???
==118620==    by 0x20CA86: AddPoly (sort.c:2249)
==118620==    by 0x20CD2F: SplitMerge (sort.c:3345)
==118620==    by 0x20CBA2: SplitMerge (sort.c:3330)
==118620==  Address 0x5 is not stack'd, malloc'd or (recently) free'd

Something wrong seems to be happening before calling Terminate()...

@jodavies
Copy link
Collaborator Author

I'll have a look at it. It might be related to #468 (comment)

@jodavies
Copy link
Collaborator Author

The valgrind error here is due to GarbHand making no attempt to avoid writing out of bounds. The old content of the small buffer, plus the stuff in SplitScratch, is copied back to the small buffer here:

form/sources/sort.c

Lines 3581 to 3611 in 0112aa6

s2 = S->sBuffer;
t = garbuf;
Point = S->sPointer;
k = S->sTerms;
while ( --k >= 0 ) {
if ( *Point ) {
*Point++ = s2;
i = *t;
NCOPY(s2,t,i);
}
else { Point++; }
}
Point = AN.SplitScratch;
k = AN.InScratch;
while ( --k >= 0 ) {
if ( *Point ) {
*Point++ = s2;
i = *t;
NCOPY(s2,t,i);
}
else Point++;
}
S->sFill = s2;
#ifdef TESTGARB
MLOCK(ErrorMessageLock);
MesPrint("out: S->sFill = %x, S->sTop2 = %x",S->sFill,S->sTop2);
if ( S->sFill >= S->sTop2 ) {
MesPrint("We are in deep trouble");
}
MUNLOCK(ErrorMessageLock);
#endif

The function certainly knows things have gone wrong; if you define TESTGARB it prints "We are in deep trouble", and form is certainly going to Terminate after return. But this can be made "valgrind clean" by comparing s2+i with sTop2 in each iteration of the while loops (maybe just the second: the original content of the small buffer should certainly fit again), before the NCOPY.

That we get a valgrind error here means that the content of the old small buffer + SplitScratch has blown through the small extension, and also through the PObuffer, which follows the small buffer in the allocation of AllocSort.

@jodavies
Copy link
Collaborator Author

(maybe just the second: the original content of the small buffer should certainly fit again)

This is actually not the case here; it is the first loop which causes the problem. I don't understand why, though. The general idea is,

  1. In AddCoef and AddPoly, try to put a new term into the small buffer
  2. If it doesn't fit within the SmallExtension (sTop2), call GarbHand
  3. Check total size of existing sPointer and SplitScratch terms. If they fit in the spare space in the LargeBuffer use that, otherwise allocate space
  4. Copy the terms of sPointer and SplitScratch to the temporary space, skipping holes.
  5. Copy terms back to the SmallBuffer+Extension, (again skipping holes?)

I would expect that the original terms pointed to by sPointer should be guaranteed to fit in the SmallBuffer+Extension?

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 3, 2024

I added some prints to try to work out what is going on here. For a "working" call of GarbHand, I get something like the following:

sort.c AddPoly 2196, make new term (S->sFill 932624) (new term 7106) --> 939730
sort.c AddPoly 2196, make new term (S->sFill 939730) (new term 9192) --> 948922
sort.c AddPoly 2196, make new term (S->sFill 948922) (new term 12614) --> 961536
------Garbage collection-------
in:  S->sFill = 948922, S->sTop2 = 960008
total = 926812, nterms = 0, SmallSize = 640004, SmallESize = 960008
out:  S->sFill = 926812, S->sTop2 = 960008

so as soon as sFill plus the size of the new term is >= sTop2, GarbHand is called which clears out the gaps in the small buffer (this is all during a SplitMerge so terms are being merged a lot).

At the problematic call:

sort.c AddPoly 2196, make new term (S->sFill 959886) (new term 72) --> 959958
sort.c AddPoly 2196, make new term (S->sFill 959958) (new term 48) --> 960006
sort.c AddPoly 2196, make new term (S->sFill 960006) (new term 48) --> 960054
------Garbage collection-------
in:  S->sFill = 960006, S->sTop2 = 960008
total = 982988, nterms = 0, SmallSize = 640004, SmallESize = 960008
==96593== Invalid write of size 4
==96593==    at 0x221EB0: GarbHand (sort.c:3597)
==96593==    by 0x22277E: AddPoly (sort.c:2206)
==96593==    by 0x222B07: SplitMerge (sort.c:3321)

Here "total" is larger than sTop2 (I changed the print compared to the master branch, which prints 2*total for some reason). But sFill was 960006 before the call, and it was only trying to add a term of 48 words. So I am not sure how a "total" of 982988 is possible, unless the sPointer array contains terms unaccounted for in sFill -- as I understand, this should not be the case.

I am testing on top of #529 so writes over sTop2 are an immediate crash, unlike in the master branch which needs a write beyond PObuffer also (the test rebased onto master actually doesn't cause a valgrind error anymore, due to the increased value of SIOsize from #515).

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 3, 2024

The issue is that "NEWSPLITMERGE" is not strict about setting pointers to zero, for terms which have merged (or cancel, but in this test nothing cancels). If we use the old code, GarbHand correctly computes the size of the terms which remain.

Maybe this was done for speed, but what is means is that the "new" code fails in some cases where the old succeeds because GarbHand cannot work properly. If you run Issue512_2 with a SubSmallExtension of 5090K, it will run to completion with the old code, but crash with the new because it determines (incorrectly) that SubSmallExtension is too small.. The Timsort code is commented for this tests, but it doesn't change anything.

@jodavies jodavies force-pushed the full-subsmallextension branch from e679123 to 710d340 Compare June 4, 2024 15:18
@jodavies
Copy link
Collaborator Author

jodavies commented Jun 4, 2024

@tueda can you test this on your mac?

In case it is needed again, I used the following print code (and diff) in the two versions of SplitMerge, to make sure the pointer array now looks the same in both cases.

// Print current state
printf("%4ld: ", number);
for (int ii = 0; ii < S->sTerms; ii++) {
       if ( (S->sPointer)[ii] ) {
               printf("X");
       }
       else {
               printf(".");
       }
}
printf("\n");
fflush(stdout);

@tueda
Copy link
Collaborator

tueda commented Jun 5, 2024

Now it works on my MacBook Air.

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 5, 2024

OK, that is good. I have run a few mincer benchmarks and did not see any performance impact. I think in FORM the term comparison is so expensive that no amount of pointer operations during the sort is noticable.

I have a "re-written" version of SplitMerge with some variable name changes and a bunch of commentary to make sure I understand it. Maybe the "improvement in the style of Timsort" can be further improved; at the moment it checks the right-hand half of the left array for the merge point of the right array with a binary search. In principle one could check the entire left array for the merge point, instead of just giving up immediately if the terms are not ordered at the midpoint. Maybe Jos already tested this at the time, and found it not to help? I will test this, but it would be for a new PR.

@jodavies jodavies added this to the v4.3.2 milestone Nov 6, 2024
Previously, FORM warns only about SmallExtension, even if it is a sub-buffer sort.

This prompts the user to adjust the wrong buffer in the setup.
@jodavies jodavies force-pushed the full-subsmallextension branch from 710d340 to 96b9485 Compare November 15, 2024 09:04
@jodavies
Copy link
Collaborator Author

I missed some pointer zeroing which was causing GarbHand to still double-count terms. The crash was due to the merge of the split sort allocation, so now errors here are overruns of the small extension buffer (this passed the CI before, where the overruns were just spilling into the PO buffer).

The "Timsort" part needed to be modified slightly, when there are not two copies of the terms present, but it is functionally the same.

I will do some performance regression checks later.

@jodavies
Copy link
Collaborator Author

I don't see any performance impact due to this compared to master, and indeed I can see a ~1% improvement due to the Timsort code (I could not measure a difference there in the past). I'll merge the fix commit into the previous so that there isn't a buggy commit on the master branch.

Commit f1b83ae removed various zeroed pointers from SplitMerge, which
does not produce wrong results, since they are "out of range" of the
sorted and merged region of the pointer array, but it leads to GarbHand
computing the total size of the terms incorrectly.

This causes valgrind errors when we write over the end of the SmallExtension,
and causes termination due to a "too small SmallExtension" which might
not actually be the case.
@jodavies jodavies force-pushed the full-subsmallextension branch from 7de7e57 to c9d5abd Compare November 18, 2024 10:31
@jodavies
Copy link
Collaborator Author

This should be good to merge now I think.

@tueda
Copy link
Collaborator

tueda commented Nov 18, 2024

Before merging, can you remove warnings in the test cases added by this PR? You can see them by running:

./check/check.rb ./build/sources/vorm 'Issue512*' -v

These warning messages are not caught by the test suite for now (#581) but should be resolved.

@jodavies
Copy link
Collaborator Author

Yes, this is just a case of using the buffer sizes that form uses anyway.

I could also add a "debug mode" for splitmerge which I used to test the changes (hidden behind a preproc variable) which could be useful in the future?

@tueda
Copy link
Collaborator

tueda commented Nov 18, 2024

You mean debug code disabled by a C preprocessor #ifdef directive? Yes, that would be useful.

@jodavies
Copy link
Collaborator Author

Before merging, can you remove warnings in the test cases added by this PR? You can see them by running:

./check/check.rb ./build/sources/vorm 'Issue512*' -v

These warning messages are not caught by the test suite for now (#581) but should be resolved.

To do this the test needs to be ~4x larger. Otherwise if I use buffers like

#: SmallSize 2580K
#: SmallExtension 3870K
#: TermsInSmall 2M

#: SubSmallSize 2580K
#: SubSmallExtension 3870K
#: SubTermsInSmall 2M

the tform workers allocate 1/4 of these (for eg) and then are scaled back up to the same numbers, and the warning still prints. Then the tform valgrind tests take about 1min-3min each on my machine.

The test for Issue468 also gives the warnings.

@tueda
Copy link
Collaborator

tueda commented Nov 18, 2024

So, the problem is that the buffer size should be adjusted for the number of workers, right?

One possible (and tedious) solution is to make several test cases for, e.g., form & parform, tform -w2 and tform -3, by setting #require serial? || mpi?, #require threaded? && ncpu == 2 and #require threaded? && ncpu == 4.

A more sophisticated solution would be to introduce a kind of template

#: SmallSize #{some_Ruby_expression}K

but it requires some work in check.rb.

@jodavies
Copy link
Collaborator Author

jodavies commented Nov 18, 2024

So, the problem is that the buffer size should be adjusted for the number of workers, right?

Yes and no. With sizes per-test-config, the single-threaded valgrind runs could be made smaller and faster, but the -w4 test is going to be the bottleneck regardless, if the test is configured to test the intended thing for tform.

Optionally print the state of the pointer array on entry to SplitMerge,
for ease of potential future debugging here.
@tueda
Copy link
Collaborator

tueda commented Nov 18, 2024

Maybe we have to allow such slow test cases. To allow tests that take more time than usual (10 seconds for non-Valgrind and 300 seconds for Valgrind), we can use, e.g., #time_dilation 2.0.

@jodavies
Copy link
Collaborator Author

OK, then I will get the proper tests in and see how they run in the CI. It could run a bit faster in general, by splitting the tests into a larger number of groups -- usually the CI is waiting for just a few (<20, the limit) tests to finish for a while.

Make sure the tests run properly for tform -w4 also. (This makes the
tests larger and more time consuming, also for form).

Also fix the buffers for the Issue468 test.
@jodavies
Copy link
Collaborator Author

Now if you define SPLITMERGEDEBUG in sort.c, you will get an output which shows the pointers into the small buffer on entry to SplitMerge.

FORM 5.0.0-beta.1 (Nov 18 2024, v5.0.0-beta.1-85-g4411553-dirty)  Run: Mon Nov 18 15:25:27 2024
    Symbol x;

    Local test = (1+x)^10;
    .sort
  11:    0    4   12   20   28   36   44   52   60   68   76
   5:    0    4   12   20   28   36   44   52   60   68   76
   2:    0    4   12   20   28   36   44   52   60   68   76
   3:    0    4   12   20   28   36   44   52   60   68   76
   1:    0    4   12   20   28   36   44   52   60   68   76
   2:    0    4   12   20   28   36   44   52   60   68   76
   6:    0    4   12   20   28   36   44   52   60   68   76
   3:    0    4   12   20   28   36   44   52   60   68   76
   1:    0    4   12   20   28   36   44   52   60   68   76
   2:    0    4   12   20   28   36   44   52   60   68   76
   3:    0    4   12   20   28   36   44   52   60   68   76
   1:    0    4   12   20   28   36   44   52   60   68   76
   2:    0    4   12   20   28   36   44   52   60   68   76

Time =       0.00 sec    Generated terms =         11
            test         Terms in output =         11
                         Bytes used      =        268

    Identify x = -1;
   1:    0

    Print;
    .end
  11:    0    4    8   12   16   20   24   28   32   36   40
   5:    0    4    8   12   16   20   24   28   32   36   40
   2:    0    4    8   12   16   20   24   28   32   36   40
   3:    0 ....    8   12   16   20   24   28   32   36   40
   1:    0 ....    8   12   16   20   24   28   32   36   40
   2:    0 ....    8   12   16   20   24   28   32   36   40
   6:    0 .... .... .... ....   20   24   28   32   36   40
   3:    0 .... .... .... ....   20   24   28   32   36   40
   1:    0 .... .... .... ....   20   24   28   32   36   40
   2:    0 .... .... .... ....   20   24   28   32   36   40
   3:    0 .... .... .... ....   20 .... ....   32   36   40
   1:    0 .... .... .... ....   20 .... ....   32   36   40
   2:    0 .... .... .... ....   20 .... ....   32   36   40

Time =       0.00 sec    Generated terms =         11
            test         Terms in output =          0
                         Bytes used      =          4

   test = 0;

  0.00 sec out of 0.00 sec

This produces huge output for larger problems, but it allows you do diff the logs and make sure things are going OK. You should get the same output, regardless of your NEWSPLITMERGE or NEWSPLITMERGETIMSORT settings.

In the master branch if you put the same printing code, you will see for the Issue512 test cases (best not run with tform...) that the array contains some non-zero pointers in the NEWSPLITMERGE mode, compared to the original code. It is these non-zero pointers which were causing the issues in GarbHand.

@jodavies
Copy link
Collaborator Author

@tueda if you are happy with this I would like to merge it, a collaborator has smallextension problems with some jobs and this should fix it for them.

@tueda
Copy link
Collaborator

tueda commented Nov 25, 2024

Yes, I think this can be merged. I'll merge it.

@tueda tueda merged commit fc829af into vermaseren:master Nov 25, 2024
70 checks passed
@jodavies
Copy link
Collaborator Author

Thanks. This one should go into 4.3 as well, if you are not already on it.

@tueda
Copy link
Collaborator

tueda commented Nov 25, 2024

Yes, I'm trying to resolve conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bad error message for full SubSmallExtension
3 participants