valgrind errors when creating tablebases #318

Closed
jodavies opened this issue Aug 27, 2019 · 13 comments
Labels
bug Something isn't working

Comments

@jodavies
Collaborator

The following program produces some invalid reads and writes in PutTableNames in minos.c. I can't always get it to run successfully; it sometimes crashes.

Symbol x,y,z;
TableBase "test.tbl" create;
.sort

#do i = 1, 600
	#message `i'
	Table sparse, zerofill, inttab`i'(2);
	#do j = 1,100
		Fill inttab`i'(`j',1) = x;
	#enddo
	TableBase "test.tbl" addto inttab`i';
	.sort
#enddo
.end

I can't remember why I made this test file (some months ago -- I think someone asked me for help with crashes related to tablebases containing many different tables) but it seems that it sometimes (not always) creates something like "sparse" files, which claim to be much bigger than they really are.

The size seems to depend on the filesystem, but when this program misbehaves it seems to produce a 1.9TB file on ext4, and a 5.6EB file (!!!) on nfs. Needless to say, this caused some problems when rsync attempted to back this file up to a remote location. The true size of the file is just 20MB.
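
(A quick way to see whether such a file is really sparse is to compare its apparent size with the space actually allocated for it, e.g. ls -l versus du. A minimal C sketch of the same check, assuming test.tbl is the file produced by the example above:)

/* Sketch: compare a file's apparent size with the blocks actually
   allocated for it; a large gap indicates a sparse file.
   "test.tbl" is the tablebase from the example above. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if ( stat("test.tbl", &st) != 0 ) { perror("stat"); return 1; }
    printf("apparent size: %lld bytes\n", (long long)st.st_size);
    printf("allocated:     %lld bytes\n", (long long)st.st_blocks * 512LL);
    return 0;
}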

Perhaps the errors reported by valgrind are responsible for the occasional crazy files?

I attach a file with the output:
vg.log

Thanks,
Josh.

@tueda tueda added the bug Something isn't working label Aug 28, 2019
@tueda
Collaborator

tueda commented Aug 28, 2019

This looks like a bug in boilerplate code that implements a kind of std::vector<NAMESBLOCK>::reserve().

form/sources/minos.c

Lines 965 to 988 in 558b01f

    if ( d->info.numberofnamesblocks < numblocks ) {
/*
        We need more blocks. First make sure of the space for nblocks.
*/
        if ( ( nnew = (NAMESBLOCK **)Malloc1(sizeof(NAMESBLOCK *)*numblocks,
                "new names block") ) == 0 ) {
            return(-1);
        }
        for ( i = 0; i < d->info.numberofnamesblocks; i++ ) {
            nnew[i] = d->nblocks[i];
        }
        for ( ; i < numblocks; i++ ) {
            if ( ( d->nblocks[i] = (NAMESBLOCK *)Malloc1(sizeof(NAMESBLOCK),
                    "additional names blocks ") ) == 0 ) {
                FreeTableBase(d);
                return(-1);
            }
            d->nblocks[i]->previousblock = -1;
            d->nblocks[i]->position = -1;
            s = d->nblocks[i]->names;
            for ( j = 0; j < NAMETABLESIZE; j++ ) *s++ = 0;
        }
        d->info.numberofnamesblocks = numblocks;
    }

nnew is the newly extended buffer, but it is never actually used. Some of the writes to the buffer should go to nnew instead. But the code is a bit chaotic about how the table should be updated in this function, so for now I'm not sure at which point nnew should replace the old vector buffer pointer...

@jodavies
Collaborator Author

jodavies commented Aug 28, 2019

Yes, it looks like this is the problem.

free(d->nblocks); 
d->nblocks = nnew;

Adding this after the first for loop clears up all valgrind issues apart from the invalid write at minos.c:1008.

I guess that the fseek at minos.c:1033 was previously seeking to a ridiculous place in the file, causing these exabyte "sparse" files.
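
For reference, here is a minimal sketch of how the quoted fragment reads with this swap inserted after the copy loop. Only the snippet quoted above is included; types like NAMESBLOCK and helpers like Malloc1 and FreeTableBase are FORM's own, and whether this is the right place for the swap is exactly what is discussed below.

    if ( d->info.numberofnamesblocks < numblocks ) {
        if ( ( nnew = (NAMESBLOCK **)Malloc1(sizeof(NAMESBLOCK *)*numblocks,
                "new names block") ) == 0 ) {
            return(-1);
        }
        /* Copy the pointers to the existing blocks into the enlarged array. */
        for ( i = 0; i < d->info.numberofnamesblocks; i++ ) {
            nnew[i] = d->nblocks[i];
        }
        /* Proposed fix: install the enlarged array before the new slots are
           filled, so that the writes below stay inside the allocation. */
        free(d->nblocks);
        d->nblocks = nnew;
        for ( ; i < numblocks; i++ ) {
            if ( ( d->nblocks[i] = (NAMESBLOCK *)Malloc1(sizeof(NAMESBLOCK),
                    "additional names blocks ") ) == 0 ) {
                FreeTableBase(d);
                return(-1);
            }
            d->nblocks[i]->previousblock = -1;
            d->nblocks[i]->position = -1;
            s = d->nblocks[i]->names;
            for ( j = 0; j < NAMETABLESIZE; j++ ) *s++ = 0;
        }
        d->info.numberofnamesblocks = numblocks;
    }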

@tueda
Collaborator

tueda commented Aug 28, 2019

"after the first for loop"

I'm not sure that this is the right place, because the comment

form/sources/minos.c

Lines 989 to 991 in 558b01f

/*
Now look till where the new contents agree with the old.
*/

sounds like the pointer to the old buffer is still needed to do something, maybe for some checks and setup of the new blocks, until the end of the function.

@jodavies
Collaborator Author

Well, only d->nblocks is replaced with a larger array, and the *nblocks pointers are copied over. Everything else in d-> remains intact. Perhaps I misunderstand something, and something is certainly still wrong, given the invalid write later on.

At least, the example above no longer crashes or creates enormous output, and I can load and use the table in another script.

@tueda
Collaborator

tueda commented Aug 28, 2019

OK, I'm also not familiar with this minos code, and perhaps you are right.
Indeed, the fact that your example works is an improvement.

Do you want to make a commit and pull request for this? Then I will merge it.

@vermaseren
Owner

vermaseren commented Aug 28, 2019 via email

jodavies pushed a commit to jodavies/form that referenced this issue Aug 28, 2019
The newly created larger array was not swapped with the old array.
This caused lots of invalid array accesses and, sometimes, an fseek
to an undefined location resulting in incorrectly sized output files.
See Issue vermaseren#318 .

Also when comparing contents the block under consideration was not
correctly updated. This caused valgrind to report an Illegal write.
I have not found any actual runtime issues caused by this.

Could this be the cause of an old issue from the forum, where I
described how FORM writes a really enormous amount of data to disk
when creating relatively small tablebase files? For this reason it
is a very bad idea to create a tablebase directly in a network
directory.
@jodavies
Collaborator Author

My fork is a real mess, it seems. Are you able to cherry-pick a single commit, or would you rather just re-implement the change in this repo?

@tueda
Collaborator

tueda commented Aug 28, 2019

OK. Then I will cherry-pick it from your fork.

tueda pushed a commit that referenced this issue Aug 28, 2019
@tueda
Collaborator

tueda commented Aug 28, 2019

Thanks! This is done. (And hopefully, the CI will finish without any errors...)

@jodavies
Collaborator Author

Perhaps this warrants a separate issue, but here is some more detail on the commentary in that commit:

If you take the same test file from the first post here, and put

#system cat /proc/`PID_'/io

before .end, you can inspect the difference between wchar and write_bytes. In this example, running from a local disk, wchar is about 18GB and write_bytes is 19.4MB (the size of test.tbl).

This means, I think, that FORM has issued about 18GB of writes in total, overwriting the same data many times, and only 19.4MB was actually committed to disk in the end by the OS.

If you run the same example on a network storage location, these numbers roughly coincide, and if you look at some sort of network throughput monitor while FORM is running you will see that your machine really is transferring all of that 18GB to the file server.

I am not sure why this is, but for this reason I always create tablebases in a tmpfs, and transfer to the proper place afterwards.
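
To illustrate the difference between the two counters, here is a small standalone sketch (not FORM code; the file name scratch.bin is arbitrary, and it assumes a Linux kernel with per-task I/O accounting). It overwrites the same 1 MB region of a file 100 times and then dumps /proc/self/io: on a local filesystem the page cache coalesces the overwrites, so wchar grows to about 100 MB while write_bytes typically stays close to the file's real size, whereas on the NFS setup described above the two counters were observed to roughly coincide for FORM's writes.

/* Overwrite the same 1 MB region of a scratch file 100 times, then
   print this process's I/O counters from /proc/self/io. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    enum { MB = 1024*1024 };
    char *buf = malloc(MB);
    if ( buf == NULL ) return 1;
    memset(buf, 'x', MB);

    int fd = open("scratch.bin", O_CREAT|O_WRONLY, 0644);
    if ( fd < 0 ) { perror("open"); return 1; }
    for ( int i = 0; i < 100; i++ ) {          /* 100 x 1 MB of write() traffic */
        if ( pwrite(fd, buf, MB, 0) != MB ) {  /* always at offset 0 */
            perror("pwrite"); return 1;
        }
    }
    close(fd);
    free(buf);

    /* wchar counts bytes passed to write()/pwrite(); write_bytes counts
       bytes actually sent towards storage. */
    FILE *io = fopen("/proc/self/io", "r");
    if ( io == NULL ) { perror("fopen"); return 1; }
    char line[128];
    while ( fgets(line, sizeof(line), io) != NULL ) fputs(line, stdout);
    fclose(io);
    return 0;
}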

@vermaseren
Owner

vermaseren commented Aug 28, 2019 via email

@tueda
Collaborator

tueda commented Aug 28, 2019

The CI successfully finished, so I would like to close this issue for now.
(But of course we can still discuss the heavy-traffic problem here, or open a separate issue if we have ideas for performance improvements.)

@tueda tueda closed this as completed Aug 28, 2019
@jodavies
Collaborator Author

jodavies commented Nov 5, 2019

The examples in this issue all relate to tablebases containing hundreds of tables.

Just for information, 0b3ab5d also fixes problems with tablebases containing a single table with millions of entries.

In this case the tablebase was created without apparent errors, but upon applying it to an expression in another FORM script, we received the errors "Could not read rhs" and "Cannot uncompress element %s in file".
