gzip errors #95

Closed
jodavies opened this issue May 20, 2016 · 38 comments · Fixed by #593
Labels
bug Something isn't working

Comments

@jodavies
Collaborator

Several times, I have seen errors like

1FillInputGZIP: Error in gzip handling of input.
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 124F80 bytes
Program terminating in thread 1 at replace Line 18 -->

They are not reproducible: generally I can re-run the calculation and everything goes through without problems. It is hard to know what is to blame here. Perhaps they are genuine read errors from the disk, although I have seen this issue on more than one machine.

I don't know whether there would be much point in retrying the inflate after receiving an error from zlib.

Have you seen this before? Andreas says he does not recall ever seeing a crash like this. Not a very useful report I suppose, but I thought I would mention it in case you have any ideas.

@vermaseren
Owner

Hi Josh,

For now it does not ring a bell.
The problem with retrying is that compression and decompression are stateful. That means you cannot jump into the middle of a stream, and you also cannot just retry: you would have to start again from the 'beginning'. This would be possible only if the compression were arranged as separate entities and not as one complete object for a whole patch, unless of course the error occurs at the start of a patch.

How often has this occurred to you?

There is a still-undocumented feature (built by Ali Mirsoleimani; I have not yet tested it) to use BZIP2 instead of GZIP. This could give shorter output and, depending on the level, it could also be a little bit faster (definitely compared to GZIP n with n > 6). If you try this, let me know the results.

Jos
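
The point about decompression being stateful can be checked with a short sketch. It uses Python's standard zlib module rather than FORM's own C code, and the names `patch`, `inflate_from` and `can_restart_at` are hypothetical, chosen only for illustration. It suggests that a fresh decompressor can start at the beginning of a stream but not at an arbitrary byte offset inside it.

```python
import zlib


def inflate_from(stream: bytes, offset: int) -> bytes:
    """Inflate `stream` with a fresh decompressor, starting at byte `offset`."""
    return zlib.decompressobj().decompress(stream[offset:])


def can_restart_at(payload: bytes, offset: int) -> bool:
    """Report whether inflating from `offset` recovers the tail of `payload`."""
    stream = zlib.compress(payload, 1)  # a fast, low compression level
    try:
        out = inflate_from(stream, offset)
    except zlib.error:
        return False  # the decompressor rejected the mid-stream bytes
    # Success only if real data came out and it matches the end of the payload.
    return len(out) > 0 and payload.endswith(out)


# Hypothetical stand-in for one sorted patch: squares written as text.
patch = b" ".join(str(i * i).encode() for i in range(20000))

if __name__ == "__main__":
    middle = len(zlib.compress(patch, 1)) // 2
    print("restart at start :", can_restart_at(patch, 0))
    print("restart in middle:", can_restart_at(patch, middle))
```

Under these assumptions a retry is only meaningful from a stream boundary, which for the sort files would mean the start of a patch.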

@jodavies
Collaborator Author

I have seen this maybe 5 times over the last few months. They were all in cases with rather large (O(TB)) files.

It looks like the branch with the bzip code is too old to run forcer, so I guess that would need to be merged. Alternatively, I could set up some tests with mincer.

Out of interest, are the patches in the large buffer compressed, or just after sorting them and writing to disk? (Another thing one could try is to disable FORM's zlib compression entirely and leave it all to the btrfs filesystem).

@vermaseren
Owner

Hi Josh,

gzip compression is only used for the sort files. In the large buffer only the standard FORM compression is used, which is rather simple.

Jos

@jodavies
Collaborator Author

I have also seen, twice, errors like

Time =  108303.46 sec    Generated terms =   71246936
             d5d90522070 Terms left      =   70733861
     dotrewrite-d89-0-25 Bytes used      =15614237544
Read error in SetScratch
Program terminating in thread 3 at replace Line 18 -->

It is possible that these are real hardware errors. I will try to investigate whether this is the case.

Would you anticipate any issues with the following setup parameters? I run a very large TermsInSmall, as otherwise the serial sections of the reduction become very slow, with much master sorting into a .sor file: a smaller TermsInSmall limit is hit very often and produces a lot of patches.

I also allow very large numbers of LargePatches and FilePatches. For disk capacity reasons, it is important not to go into stage4 sorting.

Perhaps at some point some of these parameters are multiplied together, and the product could exceed a 32-bit int? I don't know how such errors would manifest.

* 192GB
#: MaxTermSize 300K
#: WorkSpace 400M
#: LargeSize 44G
#: SmallSize 6G
#: ScratchSize 16G
#: HideSize 8G
#: TermsInSmall 50M
#: LargePatches 16384
#: FilePatches 16384
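
As a rough illustration of how a 32-bit overflow would manifest, the sketch below wraps values into a signed 32-bit range. It assumes that the K/M/G suffixes are decimal multipliers and that a wrapped value follows two's-complement arithmetic. It does not claim that FORM computes any of these particular products; `parse_size`, `wrap_int32` and `oversized` are hypothetical helpers.

```python
INT32_MAX = 2**31 - 1

# Assumption: K/M/G are decimal multipliers (10^3, 10^6, 10^9).
SUFFIX = {"K": 10**3, "M": 10**6, "G": 10**9}

# Setup values copied from the comment above.
SETUP = {
    "MaxTermSize": "300K",
    "WorkSpace": "400M",
    "LargeSize": "44G",
    "SmallSize": "6G",
    "ScratchSize": "16G",
    "HideSize": "8G",
    "TermsInSmall": "50M",
    "LargePatches": "16384",
    "FilePatches": "16384",
}


def parse_size(text: str) -> int:
    """Convert a setup value such as '300K' to an integer."""
    text = text.strip()
    unit = text[-1].upper()
    if unit in SUFFIX:
        return int(text[:-1]) * SUFFIX[unit]
    return int(text)


def wrap_int32(value: int) -> int:
    """Value a signed 32-bit int would hold after two's-complement wraparound."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def oversized(setup: dict) -> dict:
    """Map each parameter that does not fit in 32 bits to its wrapped value."""
    return {
        name: wrap_int32(parse_size(text))
        for name, text in setup.items()
        if parse_size(text) > INT32_MAX
    }


if __name__ == "__main__":
    print(oversized(SETUP))
    # A hypothetical product, chosen only to show the effect:
    product = parse_size(SETUP["LargePatches"]) * parse_size(SETUP["MaxTermSize"])
    print(product, "->", wrap_int32(product))
```

A wrapped value is not necessarily negative: it can be a plausible-looking but much smaller positive number, which would make such a bug hard to spot from the output alone.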

@vermaseren
Owner

Hi Josh,

A good setting for TermsInSmall is SmallSize/(typical term size). In your crash, a typical term is about 200 bytes in FORM's compressed form. In the small buffer the terms are not compressed, hence about 400 bytes is more realistic. 6 Gbytes/50M terms is 120 bytes per term, hence there should be no problem there.

It is of course possible that the above is a very rare error in FORM. I have had some of those in the past, where something went wrong only in some very special cases and only when things were very big. And when this happens with tform, it may occur only once out of many runs. This is very hard to debug, as you can imagine.

The setup parameters look fine to me. If the ratio between LargeSize and SmallSize is not very large, you may not need so many LargePatches, unless sorting the small buffer consistently gives a very large collapse in the number of terms. This is different with the file patches. But indeed, it is important to avoid stage4, because that is never fast (I have been in stage5 only once, but it worked).

If you can find out more about these crashes, maybe we can sooner or later make them deterministic and
have a chance to find the cause.

Jos
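
The rule of thumb above can be written out as a small worked calculation. The figures come from the setup and statistics quoted in this thread; the decimal reading of G and M and the helper `first_limit_hit` are assumptions made for illustration.

```python
SMALL_SIZE = 6 * 10**9       # "#: SmallSize 6G", assuming G = 10^9
TERMS_IN_SMALL = 50 * 10**6  # "#: TermsInSmall 50M", assuming M = 10^6

# Bytes available per term if the small buffer held exactly TermsInSmall terms.
bytes_per_term_budget = SMALL_SIZE // TERMS_IN_SMALL

# Average compressed term size from the statistics
# (Bytes used / Terms left in the quoted output).
compressed_term = 15614237544 / 70733861


def first_limit_hit(small_size: int, terms_in_small: int, term_bytes: int) -> str:
    """Name the setting that fills up first for terms of `term_bytes` bytes."""
    terms_that_fit = small_size // term_bytes
    return "SmallSize" if terms_that_fit < terms_in_small else "TermsInSmall"


if __name__ == "__main__":
    print("budget per term:", bytes_per_term_budget, "bytes")
    print("compressed term: about", round(compressed_term), "bytes")
    print("limit hit first at 400 bytes/term:",
          first_limit_hit(SMALL_SIZE, TERMS_IN_SMALL, 400))
```

With terms of about 400 bytes the byte limit is reached long before the term-count limit, which is consistent with the statement that TermsInSmall is not the problem here.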

@burp

burp commented May 26, 2016

I have exactly the same issue (with default FORM settings). Since it is reproducible for me, I did a bisection between a good 4.1 tag and a bad head, which tracked it down to commit 43a5b1e, "Fixed the compress/gzip/hide bug." Hope this helps!

Cheers, Tobias

@jodavies
Collaborator Author

@burp , are you able to provide a script for your reproducible example, if it is small and crashes quickly?

I have a calculation which has crashed the last 4 times I have run it, but it produces 2TB of scratch files and takes more than a day to crash.

@burp

burp commented May 30, 2016

It crashes within 1-2 minutes, that's why I was able to do a quick bisection. Unfortunately it is part of a bigger setup with tons of includes etc. I will try to boil it down to one simple script/file.

@jodavies
Collaborator Author

jodavies commented Jun 9, 2016

@burp , were you able to create a simple example which crashes like this in the end?

I have had this crash on a second machine now, making genuine disk errors less likely.

Should we also have && fi->zsp != 0 in the if statement of line 1266, sort.c, in the function Sflush ?

@burp

burp commented Jun 13, 2016

@jodavies: I have not had time for it yet. Maybe I can just send you the setup in private. You will have to adjust a few include paths etc., but that is probably simpler for me than trying to minimize the reproducible example.

@jodavies
Collaborator Author

That works, if it is OK with you!

Probably it is best if you can send Jos a copy also, he is much more likely to be able to find possible bugs.

@jodavies
Collaborator Author

I had a quick look at this example. With GZIPDEBUG enabled at various places, during normal operation, one sees output like the following,

Preparing z-stream 0 with compression 1
...
Preparing z-stream 4 with compression 1

-+Reading 160000 bytes in stream 0 at position          0; stop at    5320619
Want to read in stream 0 at position     160000
--Reading 160000 bytes in stream 0 at position     160000
...
Closing stream 0
--Reading 160000 bytes in stream 1 at position    5480619
...
Closing stream 1
--Reading 160000 bytes in stream 2 at position   10310142
...
...
Closing stream 4
etc

In the example which crashes, we find

Preparing z-stream 0 with compression 1
Preparing z-stream 1 with compression 1
-+Reading 160000 bytes in stream 0 at position          0; stop at    5486806
--Reading 160000 bytes in stream 0 at position     160000
...
Closing stream 0
  -Last words: 1 1 2 -3 0
 zerror = -2 in stream 0. At position    5486806

It seems like FORM is trying to read from stream 0, which was already "closed"?

The offending call of PutIn() is the one below the label NextTerm: in sort.c. It seems ki has the value 0 when it should be 1?

@jodavies
Collaborator Author

jodavies commented Jun 16, 2017

This crash does not occur if one changes MaxTermSize (and thus the buffersize passed to FillInputGZIP).

The crash scenario seems to be:

  • finish a stream (print "Closing stream 0")
  • call PutIn (from sort.c:4068) and then FillInputGZIP with the same stream number
  • enter if branch in compress.c:490
  • enter else branch in compress.c:497. Obtain value toread = 0.
  • attempt to inflate at compress.c:545. Get zerror = -2.

EDIT: there are no valgrind errors
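
For reference, the zerror values printed in these logs are zlib return codes. The sketch below lists the constants as defined in zlib.h and decodes the other numbers in the error message. That the RetCode and byte counts are printed in hexadecimal is an assumption based on their appearance; `describe` and `to_signed64` are hypothetical helper names.

```python
# Return codes as defined in zlib.h.
ZLIB_CODES = {
    0: "Z_OK",
    1: "Z_STREAM_END",
    2: "Z_NEED_DICT",
    -1: "Z_ERRNO",
    -2: "Z_STREAM_ERROR",
    -3: "Z_DATA_ERROR",
    -4: "Z_MEM_ERROR",
    -5: "Z_BUF_ERROR",
    -6: "Z_VERSION_ERROR",
}


def describe(zerror: int) -> str:
    """Return the zlib constant name for a numeric return code."""
    return ZLIB_CODES.get(zerror, "unknown")


def to_signed64(hex_text: str) -> int:
    """Interpret a hexadecimal string as a signed 64-bit integer."""
    value = int(hex_text, 16)
    return value - 2**64 if value >= 2**63 else value


if __name__ == "__main__":
    print("zerror = -2 :", describe(-2))
    print("RetCode     :", to_signed64("FFFFFFFFFFFFFFFF"))
    print("bytes wanted:", int("124F80", 16))  # from the first report
```

Z_STREAM_ERROR indicates an inconsistent stream structure rather than corrupted input, which fits a stream being used after it has been closed better than it fits a disk read error.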

@tueda
Collaborator

tueda commented Jun 16, 2017

@jodavies Is there any public test code to reproduce the bug?

@jodavies
Collaborator Author

I am using the code from @burp , of which I think Jos also has a copy. I have not been able to reduce it to a particularly minimal crashing example.

@jodavies
Collaborator Author

jodavies commented Sep 21, 2017

I have made a nice example which demonstrates this crash. The script is

#-
* Try to find something which causes the gzip crash. It seems to be due to some
* tricky combination of number of patches or buffer sizes etc.
* We make an expression with an increasing number of terms until it crashes.

* Buffer settings, smaller than default. Hopefully we can hit the error with smaller expressions?
* Divide everything by 8
#: filepatches 32
#: largepatches 32
#: largesize 6250000
#: maxtermsize 1250
#: smallsize 1250000
#: smallextension 2500000
#: termsinsmall 12500

Off Statistics;

Symbol x,n;

#message terms = `NTERMS'
Local test = sum_(n,1,`NTERMS',x^n);
.sort

* Read and write terms
.sort

* Check all terms present
Identify x^n?pos_ = n;
.sort
Local test = test - `NTERMS'*(`NTERMS'+1)/2;

Print;
.end

I run this as an array task on a cluster, to scan over NTERMS. I grep the log file for "test = 0;" and keep a copy of failures.

I am using the current git version of form, with GZIPDEBUG enabled in sort.c and compress.c.
These examples crash with 4.2.0 and 4.1 also.
EDIT: an old FORM 4.0 binary on our network does not crash. I don't know whether FORM 4.0 supported GZIP or not.

I have found the following crashing examples in the range 250K to 500K:

266906
279406
291906
304406
316906
329406
341906
354406
366906
379406
391906
404406
416906
429406
441906
454406
466906
479406
491906

(generated by 254406 + 12500 n)
(same setup, maxtermsize = 1300: same as above)
(same setup, termsinsmall = 25000: crashes at 254246 + 12500 n)

with output like

FORM 4.2.0 (Sep 21 2017, v4.2.0-16-g480a787-dirty) 64-bits  Run: Thu Sep 21 14:31:21 2017
    #-
~~~terms = 266906
 MergePatches created output file /formswap/1286300.266000.moon2/xformxxx.sor
Writing 100000 bytes at          0: -81 -31 -81 -29 111
Writing 100000 bytes at     100000: -105 -16 -105 -15 87
Writing 100000 bytes at     200000: -113 -65 -128 -1 15
Writing 100000 bytes at     300000: -4 105 -4 25 -4
Writing 100000 bytes at     400000
Writing 96024 bytes at     500000
   Last bytes written: 63 1 11 -116 70
     Perceived position in FlushOutputGZIP is     503976
Writing 66334 bytes at     503976
   Last bytes written: 0 -45 -119 81 -114
     Perceived position in FlushOutputGZIP is     537642
 EndSort: fPatchN = 2, lPatch = 2, position =       537642
 fPatchesStop[0] =     503976
 fPatchesStop[1] =     537642
Preparing z-stream 0 with compression 1
Preparing z-stream 1 with compression 1
-+Reading 100000 bytes in stream 0 at position          0; stop at     503976
 read: 100000 +Last bytes read: -81 -31 -81 -29 111 in /formswap/1286300.266000
.moon2/xformxxx.sor, newpos =     100000
-+Reading 33666 bytes in stream 1 at position     503976; stop at     537642
 read: 33666 +Last bytes read: 0 -45 -119 81 -114 in /formswap/1286300.266000.m
oon2/xformxxx.sor, newpos =     537642
Want to read in stream 0 at position     100000
--Reading 100000 bytes in stream 0 at position     100000
   Last bytes read: -105 -16 -105 -15 87
Want to read in stream 0 at position     200000
--Reading 100000 bytes in stream 0 at position     200000
   Last bytes read: -113 -65 -128 -1 15
Want to read in stream 0 at position     300000
--Reading 100000 bytes in stream 0 at position     300000
   Last bytes read: -4 105 -4 25 -4
Want to read in stream 0 at position     400000
--Reading 100000 bytes in stream 0 at position     400000
   Last bytes read: -2 4 -2 36 -2
Want to read in stream 0 at position     500000
--Reading 3976 bytes in stream 0 at position     500000
   Last bytes read: 63 1 11 -116 70
 zerror = 1 in stream 0. At position     503976
Closing stream 0
  -Last words: 250240 1 1 3 0
 zerror = 1 in stream 1. At position     537642
Closing stream 1
  -Last words: 266906 1 1 3 0
 zerror = -2 in stream 1. At position     537642
FillInputGZIP: Error in gzip handling of input. zerror = -2
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 186A0 bytes
Program terminating at test.frm Line 21 --> 
  0.34 sec out of 0.43 sec
 CleanUpSort removed file /formswap/1286300.266000.moon2/xformxxx.sor

I hope these "small-ish" and quickly running examples are useful.
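
The list of failing values and the pass/fail check described above can be reproduced with a few lines. This is a sketch only: `reported_crashes` and `run_passed` are hypothetical names, and the range n = 1..19 is chosen to match the reported list (the n = 0 value, 254406, is not among the reported crashes).

```python
def reported_crashes(base: int = 254406, step: int = 12500, count: int = 19) -> list:
    """NTERMS values of the form base + step * n for n = 1..count."""
    return [base + step * n for n in range(1, count + 1)]


def run_passed(log_text: str) -> bool:
    """A run counts as passed if the final expression printed as zero."""
    return "test = 0;" in log_text


if __name__ == "__main__":
    values = reported_crashes()
    print(values[0], "...", values[-1], f"({len(values)} values)")
    print(run_passed("   test = 0;\n"))
    print(run_passed("Program terminating at test.frm Line 21 -->"))
```

In the first setup the step of 12500 coincides with the termsinsmall value used in the script.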

@jodavies
Collaborator Author

Changing >= S->pStop[ki] to > S->pStop[ki] at

    if ( !par && ( (poin[k] + im + COMPINC) >= S->pStop[ki] )

appears to fix my example in the post above. It doesn't, however, fix @burp's example. Making the same change at

    if ( !par && (poin[ul] + im + COMPINC) >= S->pStop[ki]

also does not fix it.

Perhaps it fixes my example "by accident"...

@jodavies
Collaborator Author

jodavies commented Sep 26, 2017

I have just tried running this script with a FORM binary configured --without-zlib (inspired by today's post in the forum).

Again, I ran for NTERMS values 250k-500k. Now, form crashes like

test.frm Line 18 --> Warning: gzip compression not supported on this platform
~~~terms = 262751
 EndSort: lPatch = 20, MaxPatches = 32,lFill = 7F7E1A95BD28, sSpace = 75069d, M
axTer = 1250, lTop = 7F7E1A97EFA8
 MergePatches created output file /formswap/davies/xformxxx.sor
 EndSort: fPatchN = 1, lPatch = 0, position =      6005772
 EndSort+: fPatchN = 2, lPatch = 0, position =      6306040
Ran into precompressed term
Called from MergePatches with k = 2 (stream 1)
Called from EndSort
EndSort: sortfile /formswap/davies/xformxxx.sor removed
Program terminating at test.frm Line 34 --> 
  0.07 sec out of 0.07 sec

for EVERY value of NTERMS in the range [255823,262752].

As before, there are no valgrind errors before the crash.

If I use a binary with or without zlib, but run with Off compress;, I find no crashing examples.

@tueda
Collaborator

tueda commented Dec 21, 2017

The commit 26793e4 is expected to fix the error like

FillInputGZIP: Error in gzip handling of input. zerror = -2
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 186A0 bytes

which was caused by calling inflateEnd() twice.
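
Why a second inflateEnd() yields -2 can be shown with a toy model of the stream lifecycle. This is not FORM's code and not the actual change in commit 26793e4. `ToyStream` and `close_once` are hypothetical and mimic only one documented zlib behaviour: once inflateEnd() has released the internal state, further calls on that stream return Z_STREAM_ERROR.

```python
Z_OK, Z_STREAM_END, Z_STREAM_ERROR = 0, 1, -2


class ToyStream:
    """Toy stand-in for a z_stream; `state is None` models a released state."""

    def __init__(self):
        self.state = "allocated"

    def inflate(self) -> int:
        if self.state is None:
            return Z_STREAM_ERROR  # stream was already ended
        return Z_STREAM_END        # pretend all input has been consumed

    def inflate_end(self) -> int:
        if self.state is None:
            return Z_STREAM_ERROR  # second call: nothing left to release
        self.state = None
        return Z_OK


def close_once(stream: ToyStream) -> int:
    """One possible guard (an assumption): end the stream only if still open."""
    if stream.state is None:
        return Z_OK
    return stream.inflate_end()


if __name__ == "__main__":
    s = ToyStream()
    print(s.inflate())      # end of stream reached
    print(s.inflate_end())  # first close succeeds
    print(s.inflate_end())  # second close reports a stream error
```

In this model the sequence "zerror = 1", "Closing stream", "zerror = -2" from the logs corresponds to reaching the end of the stream, closing it, and then touching it again.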

@jodavies
Collaborator Author

Excellent! I have re-tested the script in #95 (comment) , and everything seems OK for NTERMS in the range 250K to 500K.

Also @burp 's example now runs without crashing.

The --without-zlib test of #95 (comment) still fails.

I tested #214 also: still fails.

@tueda
Collaborator

tueda commented Dec 22, 2017

Thanks for the testing. Nice to hear that the GZIP crash is gone.

The crash with --without-zlib may be another bug.

@tueda
Collaborator

tueda commented Dec 22, 2017

I just tried to reduce numbers in #95 (comment). The following example crashes at N=323 if I use a 64-bit executable configured with --without-zlib:

#:filepatches        4
#:largesize      25600
#:maxtermsize      200
#:smallsize      12800
#:termsinsmall      16

#do N=320,330
  #message N=`N'
  S x,k;
  L F = sum_(k,1,`N',x^k);
* size: (8 + 6 * (N - 1) + 1) * 4 = 24 * N + 12 (bytes)
  .sort
  Drop;
  L CheckZero = F - {`N'*(`N'+1)/2};
  id x^k?pos_ = k;
  P;
  .sort
  Drop;
  .sort
#enddo

.end

Ran into precompressed term
Called from MergePatches with k = 2 (stream 1)
Called from EndSort

Somehow I had Valgrind errors:

==18866== Memcheck, a memory error detector
==18866== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18866== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==18866== Command: ./vorm test
==18866== 
FORM 4.2.0 (Dec 15 2017, v4.2.0-29-g26793e4) 64-bits  Run: Fri Dec 22 19:01:37 2017

...

Time =       0.29 sec    Generated terms =        323
               F       1 Terms left      =        323
                         Bytes used      =       8004

Time =       0.29 sec
               F         Terms active    =        323
                         Bytes used      =       7692
==18866== Conditional jump or move depends on uninitialised value(s)
==18866==    at 0x4DE9FF: Compare1 (sort.c:2547)
==18866==    by 0x4E110A: MergePatches (sort.c:3889)
==18866==    by 0x4E28EF: EndSort (sort.c:1066)
==18866==    by 0x4B9241: Processor (proces.c:431)
==18866==    by 0x436B1B: DoExecute (execute.c:838)
==18866==    by 0x44CD2D: ExecModule (module.c:274)
==18866==    by 0x4AEFCF: PreProcessor (pre.c:962)
==18866==    by 0x4E76F1: main (startup.c:1607)
==18866== 
==18866== Conditional jump or move depends on uninitialised value(s)
==18866==    at 0x4DF518: Compare1 (sort.c:2552)
==18866==    by 0x4E110A: MergePatches (sort.c:3889)
==18866==    by 0x4E28EF: EndSort (sort.c:1066)
==18866==    by 0x4B9241: Processor (proces.c:431)
==18866==    by 0x436B1B: DoExecute (execute.c:838)
==18866==    by 0x44CD2D: ExecModule (module.c:274)
==18866==    by 0x4AEFCF: PreProcessor (pre.c:962)
==18866==    by 0x4E76F1: main (startup.c:1607)
==18866== 
==18866== Conditional jump or move depends on uninitialised value(s)
==18866==    at 0x4DF543: Compare1 (sort.c:2930)
==18866==    by 0x4E110A: MergePatches (sort.c:3889)
==18866==    by 0x4E28EF: EndSort (sort.c:1066)
==18866==    by 0x4B9241: Processor (proces.c:431)
==18866==    by 0x436B1B: DoExecute (execute.c:838)
==18866==    by 0x44CD2D: ExecModule (module.c:274)
==18866==    by 0x4AEFCF: PreProcessor (pre.c:962)
==18866==    by 0x4E76F1: main (startup.c:1607)
==18866== 
Ran into precompressed term

@alonli1

alonli1 commented May 14, 2024

Hi all,

I'm new to FORM and I ran into a similar error:

7FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 1E125C bytes
Program terminating in thread 7 at tensorreduction Line 48 -->

And also:

1FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 1E125C bytes
Program terminating in thread 1 at expandmomenta Line 76 -->

I'm using tform. To be more specific:
TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 8 workers

I've been wondering whether you have any idea why this happens and whether there is a way to fix it.

Side note:
Another interesting error I've seen is:

Error while reading scratch file in GetTerm
Program terminating in thread 0 at tensorreduction Line 9 -->

Any idea what might be the cause?

@jodavies
Collaborator Author

Interesting, this one hasn’t come up for a while. How often do you see it? If it happens all the time for you, are you able to share the code which reproduces it?

@alonli1

alonli1 commented May 16, 2024

It happens all the time. See the attached files. The wcdimension.dat file that appears in the .frm file is empty.
formfiles.zip

@jodavies
Collaborator Author

This example is too heavy for debugging: it has run for >2 days (without any crash) so far. It is doing a lot of stagesorting, so maybe I can set up some stagesort-heavy tests to search for remaining bugs in the gzip system.

@alonli1

alonli1 commented May 22, 2024

Interesting. How many cores are you using?

@jodavies
Collaborator Author

I was using 8, since you pasted your version as TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 8 workers. How long does it take to crash for you?

I have since run some stagesort-heavy tests with some simple scripts which generate lots of terms, with artificially small buffer sizes, and did not see any issues.

@alonli1

alonli1 commented May 22, 2024

Some of the workers ran into this error after a couple of hours but some ran for around a day or so.

@jodavies
Collaborator Author

Do you see the problem with the debug build (tvorm)? If so, could you try either running with gdb and getting a stack trace at the crash, or running with this branch (you will need eu-addr2line installed)
https://github.com/jodavies/form/tree/backtrace

@jodavies
Collaborator Author

jodavies commented May 23, 2024

Also: what OS and compiler versions are you using?

Could you also define GZIPDEBUG in sort.c and compress.c, for your tests?

@alonli1

alonli1 commented May 23, 2024

> Do you see the problem with the debug build (tvorm)? If so, could you try either running with gdb and getting a stack trace at the crash, or running with this branch (you will need eu-addr2line installed) https://github.com/jodavies/form/tree/backtrace

I'll try.

@alonli1

alonli1 commented May 23, 2024

> Also: what OS and compiler versions are you using?
>
> Could you also define GZIPDEBUG in sort.c and compress.c, for your tests?

I'm using the HPC cluster of my university. The OS is Rocky Linux 8; I'm not sure about the compiler.
And yes, I can do that.

@jamievicary

Novice FORM user here, also getting this bug, in this case when my expression reached about 1 GB in size:

Time =    1309.76 sec
           JJJJJ         Terms active    =  154494505
                         Bytes used      =  910409679

Time =    1349.28 sec
           JJJJJ         Terms active    =  160670804
                         Bytes used      =  946051262

Time =    1387.71 sec
           JJJJJ         Terms active    =  166180424
                         Bytes used      =  977206982

Time =    1423.05 sec
           JJJJJ         Terms active    =  172187549
                         Bytes used      = 1015540137

Time =    1427.72 sec
           JJJJJ         Terms active    =  172799021
                         Bytes used      =  995310080
FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading F2BA8 bytes
Program terminating at p5a9.frm Line 62 --> 
  1427.75 sec out of 1431.18 sec

@jodavies
Collaborator Author

If you can reproduce this reliably, could you share your code and FORM settings, so that I can try to investigate?

@jamievicary

Yes, can do. Would you be able to email me, so that I can send the data? I'm at [email protected].

@jodavies
Collaborator Author

I took another look at #95 (comment) since I wanted to determine if it is something that can happen also in zlib mode (with different buffer sizes or expressions etc?).

This branch adds a bunch of debug prints which you can use to compare the running with and without zlib: https://github.com/jodavies/form/tree/issue-95

I trimmed the test to:

#:filepatches        4
#:largesize      25600
#:maxtermsize      200
#:smallsize      12800
#:termsinsmall      16

S x,k;
L F = sum_(k,1,323,x^k);
.end

This crashes with the "Ran into precompressed term" error as far back as FORM 3.3.

The test is fixed by changing the ncomp arg from 1 to 0 in this PutOut, but I don't know if this is a fix or a "workaround".
https://github.com/jodavies/form/blob/f17e16b6f89ea0576c12db7e93e5edb43f4da8d9/sources/sort.c#L1073

The particular situation is that the large buffer has been filled and there is a patch on the disk. Then there are some terms in the small buffer, but no large patches, and we finish generating terms (powers 321, 322 and 323 are in the small buffer).

EndSort then writes the small buffer terms into a file patch, before calling MergePatches to finish up. At this point the terms in the small buffer have been compressed already (by EndSort calling ComPress) and go through this PutOut above.

PutOut doesn't care that the terms are already compressed, since the output is not going to AR.outfile or AR.hidefile.

So far this is the same for zlib and non-zlib modes.

The difference is that, without zlib, PutOut writes the first term that came from the small buffer in compressed form, so that things go wrong when it is loaded again. This seems wrong: as I understand it, the first term of a patch should be complete, and only the following terms are compressed. Indeed, this is how the terms arrive from the compressed small buffer.

@jodavies
Collaborator Author

Now I think I have it:

form/sources/sort.c

Lines 1059 to 1060 in 92f3154

#ifdef WITHZLIB
*AR.CompressPointer = 0;

This reset of AR.CompressPointer is on the wrong side of the #ifdef.

Without zlib, this part of the code compresses the first term written out against whatever happened to be in the compression buffer previously.
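
The effect of the missing reset can be illustrated with a toy model of "compress against the previous term". This is not FORM's actual term format or code. The word lists, `write_patch`, `read_patch` and the `reset_first` flag are all hypothetical, with `reset_first` standing in for clearing the compression buffer before the first term of a patch is written.

```python
def common_prefix(a: list, b: list) -> int:
    """Number of leading words shared by two terms."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def write_patch(terms: list, stale: list, reset_first: bool) -> list:
    """Store each term as (shared word count, remaining words) vs. the previous term."""
    prev = [] if reset_first else list(stale)  # stale = leftover buffer content
    records = []
    for term in terms:
        shared = common_prefix(prev, term)
        records.append((shared, list(term[shared:])))
        prev = list(term)
    return records


def read_patch(records: list) -> list:
    """Rebuild the terms; the first record of a patch must be a complete term."""
    prev, terms = [], []
    for i, (shared, rest) in enumerate(records):
        if i == 0 and shared > 0:
            raise ValueError("Ran into precompressed term")
        term = prev[:shared] + rest
        terms.append(term)
        prev = term
    return terms


if __name__ == "__main__":
    terms = [[8, 1, 321], [8, 1, 322], [8, 1, 323]]  # hypothetical word lists
    stale = [8, 1, 320]                              # left over from earlier output
    print(read_patch(write_patch(terms, stale, reset_first=True)))
    try:
        read_patch(write_patch(terms, stale, reset_first=False))
    except ValueError as err:
        print("error:", err)
```

With the reset, the first record carries the complete term and the patch reads back correctly. Without it, the first term is stored relative to stale data that the reader never sees.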
