gzip errors #95
Hi Josh, For now it does not ring a bell. How often has this occurred to you? There is a still undocumented feature (I have not yet tested it. It was built by Ali Mirsoleimani) to use Jos
I have seen this maybe 5 times over the last few months. They were all in cases with rather large (O(TB)) files. It looks like the branch with the bzip code is too old to run forcer, so I guess that would need to be merged, or I could set up some tests with mincer. Out of interest, are the patches in the large buffer compressed, or just after sorting them and writing to disk? (Another thing one could try is to disable FORM's zlib compression entirely and leave it all to the btrfs filesystem.)
Hi Josh, gzip compression is only used for the sort files. In the large buffer only the standard form compression Jos
I have also seen, twice, errors like
It is possible that these are real hardware errors. I will try to investigate whether this is the case. Would you anticipate any issues with the following setup parameters? I run with a very large TermsInSmall, as otherwise the serial sections of the reduction become very slow, with much master sorting into a .sor file: a smaller TermsInSmall limit is hit very often and produces a lot of patches. I also allow very large numbers of LargePatches and FilePatches. For disk capacity reasons, it is important not to go into stage4 sorting. Perhaps at some point some of these parameters are multiplied together and could exceed a 32-bit int? I don't know how such errors would manifest.
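To make the 32-bit concern concrete, here is a minimal sketch in Python (not FORM code). The parameter values are hypothetical, and the wraparound shown is only what typically happens on two's-complement hardware; in C, signed overflow is formally undefined behaviour. Whether FORM multiplies these parameters anywhere is not established in this thread.

```python
# Toy check: does a product of setup parameters still fit in a signed 32-bit int?
# The parameter values below are hypothetical, chosen only for illustration.

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_in_int32(*factors: int) -> bool:
    """Return True if the exact product of the factors fits in a signed 32-bit int."""
    product = 1
    for factor in factors:
        product *= factor
    return INT32_MIN <= product <= INT32_MAX


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a value stored in 32 bits."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


large_patches = 4096        # hypothetical LargePatches
file_patches = 1_048_576    # hypothetical FilePatches

exact = large_patches * file_patches
print("fits in int32:", fits_in_int32(large_patches, file_patches))
print("exact product:", exact, "wrapped:", wrap_int32(exact))
```

A silently wrapped size or count like this would usually show up as a nonsensical buffer length or offset rather than as a clean error message.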
Hi Josh, A good setting for TermsInSmall is SmallSize/(typical term size). If you can find out more about these crashes, maybe we can sooner or later make them deterministic and Jos
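As a rough illustration of this rule of thumb, the following Python sketch computes SmallSize/(typical term size). The 2 GiB SmallSize and the 200-byte term size are assumed placeholder values, not recommended settings.

```python
# Rule of thumb from the comment above: TermsInSmall ~ SmallSize / (typical term size).
# Both input values are assumptions for illustration only.

def estimate_terms_in_small(small_size_bytes: int, typical_term_bytes: int) -> int:
    """Return SmallSize divided by the typical term size, rounded down."""
    if typical_term_bytes <= 0:
        raise ValueError("typical term size must be positive")
    return small_size_bytes // typical_term_bytes


small_size = 2 * 1024**3   # assumed SmallSize: 2 GiB
typical_term = 200         # assumed typical term size in bytes

print(estimate_terms_in_small(small_size, typical_term))  # prints 10737418
```

The typical term size depends on the problem, so it would have to be estimated from the actual expressions.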
I have exactly the same issue (with default FORM settings), and since it is reproducible for me I did a bisection between a good 4.1 tag and a bad head, which tracked it down to: "Fixed the compress/gzip/hide bug." (43a5b1e). Hope this helps! Cheers, Tobias
@burp, are you able to provide a script for your reproducible example, if it is small and crashes quickly? I have a calculation which has crashed the last 4 times I have run it, but it produces 2TB of scratch files and takes more than a day to crash.
It crashes within 1-2 minutes, that's why I was able to do a quick bisection. Unfortunately it is part of a bigger setup with tons of includes etc. I will try to boil it down to one simple script/file.
@burp , were you able to create a simple example which crashes like this in the end? I have had this crash on a second machine now, making genuine disk errors less likely. Should we also have |
@joedavies: Have not had time for it yet, maybe I can just send you the setup in private. You will have to adjust a few include paths etc., but it's probably simpler for me than trying to minimize the reproducible example.
That works, if it is OK with you! Probably it is best if you can send Jos a copy also, he is much more likely to be able to find possible bugs.
I had a quick look at this example. With GZIPDEBUG enabled at various places, during normal operation, one sees output like the following,
In the example which crashes, we find
It seems like FORM is trying to read from stream 0, which was already "closed"? The offending call of |
This crash does not occur if one changes MaxTermSize (and thus the buffersize passed to The crash scenario seems to be:
EDIT: there are no valgrind errors
@jodavies Is there any public test code to reproduce the bug?
I am using the code from @burp, of which I think Jos also has a copy. I have not been able to reduce it to a particularly minimal crashing example.
I have made a nice example which demonstrates this crash. The script is
I run this as an array task on a cluster, to scan over NTERMS. I grep the log file for "test = 0;" and keep a copy of failures. I am using the current git version of form, with GZIPDEBUG enabled in sort.c and compress.c. I have found the following crashing examples in the range 250K to 500K:
(generated by 254406 + 12500 n) with output like
I hope these "small-ish" and quickly running examples are useful.
Changing Line 4065 in 480a787
Line 3996 in 480a787
Perhaps it fixes my example "by accident"...
I have just tried running this script with a form binary compiled Again, I ran for
for EVERY value of As before, there are no valgrind errors before the crash. If I use a binary with or without zlib, but run with |
The commit 26793e4 is expected to fix the error like
which was caused by calling |
Excellent! I have re-tested the script in #95 (comment), and everything seems OK for NTERMS in the range 250K to 500K. Also, @burp's example now runs without crashing. The test of #95 (comment) still fails. I tested #214 also: still fails.
Thanks for the testing. Nice to hear that the GZIP crash is gone. The crash with |
I just tried to reduce numbers in #95 (comment). The following example crashes at
Somehow I had Valgrind errors:
Hi all, I'm new to FORM and I ran into a similar error:
7FillInputGZIP: Error in gzip handling of input. zerror = -3
And also:
1FillInputGZIP: Error in gzip handling of input. zerror = -3
I'm using tform. To be more specific:
I've been wondering whether you have any idea why this happens and whether there is some way to fix it.
Side note: Error while reading scratch file in GetTerm
Any idea what might be the cause?
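For reference, in zlib the return code -3 is Z_DATA_ERROR, meaning the compressed input was corrupted or not in the expected format. Assuming the reported zerror is the raw zlib return code, the following minimal Python sketch (using Python's zlib binding, not FORM's code) shows that damaging a compressed stream produces exactly this code. The payload is a made-up placeholder.

```python
import zlib

# Compress a made-up payload, then damage the last byte of the stream,
# which is part of the Adler-32 checksum that zlib verifies on inflate.
payload = b"term " * 1000
stream = bytearray(zlib.compress(payload))
stream[-1] ^= 0xFF


def try_inflate(data: bytes) -> str:
    """Return 'ok' on success, or zlib's error message on failure."""
    try:
        zlib.decompress(data)
        return "ok"
    except zlib.error as exc:
        return str(exc)


message = try_inflate(bytes(stream))
print(message)  # the message begins with "Error -3"
```

The code alone does not say where the corruption came from: a disk error, a buffer overwritten in memory, and a read from the wrong file position would all look the same to zlib.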
Interesting, this one hasn't come up for a while. How often do you see it? If it happens all the time for you, are you able to share the code which reproduces it?
It happens all the time. See attached files. The file wcdimension.dat, which appears in the .frm file, is empty.
This example is too heavy for debugging: it has run for >2 days so far without any crash. It is doing a lot of stagesorting, so maybe I can set up some stagesort-heavy tests to search for remaining bugs in the gzip system.
Interesting. How many cores are you using?
I was using 8, since you pasted your version as I have since run some stagesort-heavy tests with some simple scripts which generate lots of terms, with artificially small buffer sizes, and did not see any issues. |
Some of the workers ran into this error after a couple of hours, but some ran for around a day or so.
Do you see the problem with the debug build (tvorm)? If so, could you try either running with |
Also: what OS and compiler versions are you using? Could you also define GZIPDEBUG in sort.c and compress.c for your tests?
I'll try.
I'm using the HPC cluster of my university. The OS used is Linux Rocky 8, not sure about the compiler.
Novice FORM user here, also getting this bug, in this case when my term reached 1 GB in size:
If you can reproduce this reliably, could you share your code and FORM settings, so that I can try to investigate?
Yes, can do. Would you be able to email me, so that I can send the data? I'm at [email protected].
I took another look at #95 (comment) since I wanted to determine if it is something that can happen also in zlib mode (with different buffer sizes or expressions etc?). This branch adds a bunch of debug prints which you can use to compare the running with and without zlib: https://github.com/jodavies/form/tree/issue-95 I trimmed the test to:
This crashes with the
The test is fixed by changing the ncomp arg from 1 to 0 in this PutOut, but I don't know if this is a fix or a "workaround".
The particular situation is that the large buffer has been filled and there is a patch on the disk. Then there are some terms in the small buffer, but no large patches, and we finish generating terms (powers 321, 322 and 323 are in the small buffer). EndSort then writes the small buffer terms into a file patch, before calling MergePatches to finish up. At this point the terms in the small buffer have been compressed already (by EndSort calling ComPress) and go through this PutOut above. PutOut doesn't care that the terms are already compressed, since the output is not going to AR.outfile or AR.hidefile.
So far this is the same for zlib and non-zlib modes. The difference is that without zlib, PutOut writes the first term that came from the small buffer in a compressed form, so that when it is loaded again things go wrong. This seems wrong: as I understand it, the first term of the patch should be complete, and only the following terms are compressed. Indeed this is how the terms arrive from the compressed small buffer.
Now I think I have it: Lines 1059 to 1060 in 92f3154
This reset of AR.CompressPointer is on the wrong side of the #ifdef. Without zlib, this part of the code compresses the first term written out against whatever happened to be in the compression buffer previously.
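The failure mode described here can be illustrated with a toy model in Python. This is not FORM's actual term format or its ComPress/PutOut code: it is a simplified shared-prefix scheme, assumed only for illustration, in which each term is stored relative to the previous one. The `stale` argument is a hypothetical stand-in for a compression buffer that was not reset before the first term of a patch was written.

```python
from typing import List, Optional, Tuple

Term = Tuple[int, ...]
Packed = Tuple[int, Tuple[int, ...]]  # (number of words shared with previous term, remaining words)


def compress_patch(terms: List[Term], stale: Optional[Term] = None) -> List[Packed]:
    """Prefix-compress a patch; `stale` models leftover compression-buffer content."""
    previous = stale  # None means the buffer was properly reset
    packed = []
    for term in terms:
        shared = 0
        if previous is not None:
            limit = min(len(term), len(previous))
            while shared < limit and term[shared] == previous[shared]:
                shared += 1
        packed.append((shared, term[shared:]))
        previous = term
    return packed


def decompress_patch(packed: List[Packed]) -> List[Term]:
    """The reader assumes the first term of a patch is stored complete."""
    previous: Term = ()
    terms = []
    for shared, rest in packed:
        term = previous[:shared] + rest
        terms.append(term)
        previous = term
    return terms


terms = [(5, 1, 321), (5, 1, 322), (5, 1, 323)]
leftover = (5, 1, 999)  # hypothetical earlier content of the compression buffer

print(decompress_patch(compress_patch(terms)) == terms)                  # True
print(decompress_patch(compress_patch(terms, stale=leftover)) == terms)  # False
```

In the second case the first term is written relative to data the reader never sees, so every term in the patch is reconstructed incorrectly, even though no I/O error occurred.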
Several times, I have seen errors like
They are not reproducible: generally I can re-run the calculation and everything goes through without problems. It is hard to know what is to blame here; perhaps they are genuine read errors from the disk (although I have seen this issue on more than one machine).
I don't know whether there would be much point in retrying the inflate upon receiving an error from zlib?
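A minimal sketch of what such a retry could look like, written in Python with its zlib binding rather than in FORM's C code. The `read_block` callable and the flaky reader are hypothetical stand-ins for re-reading a compressed block from the sort file.

```python
import zlib
from typing import Callable


def inflate_with_retry(read_block: Callable[[], bytes], attempts: int = 3) -> bytes:
    """Re-read and inflate a block, retrying when zlib reports an error."""
    last_error = None
    for _ in range(attempts):
        try:
            # Re-read on every attempt, so a transient read error gets a second chance.
            return zlib.decompress(read_block())
        except zlib.error as exc:
            last_error = exc
    raise RuntimeError(f"inflate failed after {attempts} attempts") from last_error


# Hypothetical flaky reader: the first read returns a damaged copy, the second a good one.
good = zlib.compress(b"x" * 4096)
bad = good[:-1] + bytes([good[-1] ^ 0xFF])
reads = iter([bad, good])

result = inflate_with_retry(lambda: next(reads))
print(len(result))  # prints 4096
```

A retry of this kind only helps if the failure is a transient read error. If the data on disk is wrong, or the wrong region is being read, every attempt fails in the same way, which is itself a useful diagnostic for telling hardware problems from logic errors.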
Have you seen this before? Andreas says he does not recall ever seeing a crash like this. Not a very useful report I suppose, but I thought I would mention it in case you have any ideas.