
gromacs offsets files can't be checked for correct offsets #631

Closed
kain88-de opened this issue Jan 18, 2016 · 23 comments · Fixed by #656

Comments

@kain88-de
Member

While rewriting the xdrlib code I noticed that the checks for correct offsets for XTC and TRR files have been wrong. To be specific, the ctime and size checks are fine and work, but the actual frame offsets that are saved aren't checked.

To see if the offsets are correct, the old code only checked whether the current frame is readable. Since this mostly happened when I started reading from the beginning, it was effectively only testing whether frame 0 can be reached, which is trivial. The problem is that this only checks whether the current frame can be reached; it says nothing about the other frames.

So let's assume the correct offsets are 1 2 3 5 8 13 and we are at frame 3 (0-indexed). Now the loaded offsets are 1 2 3 5 9 13. Here the check would pass, but we still couldn't jump to the next frame.
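As a toy illustration in plain Python (just restating the numbers above; the variable names are mine):

```python
# Toy illustration: the old check only looks at the current frame's offset,
# so a wrong offset elsewhere goes unnoticed.
correct = [1, 2, 3, 5, 8, 13]   # true byte offsets of frames 0..5
loaded  = [1, 2, 3, 5, 9, 13]   # persisted offsets; frame 4 is wrong

current_frame = 3
print(loaded[current_frame] == correct[current_frame])  # True  -> old check passes
print(loaded[4] == correct[4])                          # False -> jumping to frame 4 breaks
```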

Since the frames are not equally sized due to compression, the only way to completely check the integrity of the offsets is to test every single one, which amounts to scanning the complete file again.

This is not a good solution since it would take a long time for most long/big trajectories. My suggestion would be to skip checking the offsets at object construction and just rely on ctime and size. When we then encounter a problem seeking/jumping through the file, we issue a warning to the user, reload the offsets and try the seek/jump again.

This might mean that a single jump in a session is slow, but the user should be able to continue without problems.
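A minimal sketch of that behaviour, with hypothetical `_seek_with_offsets()` and `_rebuild_offsets()` helpers standing in for the real reader internals (not the actual MDAnalysis API):

```python
import warnings

def seek_frame(reader, frame):
    """Seek using persisted offsets; on failure, rebuild them once and retry."""
    try:
        return reader._seek_with_offsets(frame)
    except (IOError, IndexError):
        warnings.warn("Stored offsets look stale; rescanning the trajectory "
                      "to rebuild them.")
        reader._rebuild_offsets()                 # one full rescan of the file
        return reader._seek_with_offsets(frame)   # retry with fresh offsets
```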

@richardjgowers
Member

I think the old code also seeked to the final frame to check?

Can you hash an XTC/TRR efficiently? Wouldn't saving the offsets against something stronger, like an MD5 hash, make this less likely?
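For reference, hashing a trajectory chunk-wise would look roughly like this (a minimal sketch; `file_md5` is an illustrative helper, not part of MDAnalysis), though as noted below it still has to read the whole file:

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Hex MD5 digest of a file, read in 1 MiB chunks to keep memory flat."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```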

@kain88-de
Member Author

@richardjgowers even going to the last frame and back won't be a 100% accurate check. It would only confirm that the offset for the last frame is correct.

I doubt that an MD5 would do much good here (besides, it will likely be slow for big files). The essential problem remains that we would need to check every single frame to be sure that the offsets are correct.

@richardjgowers
Member

Yeah, so I guess wrapping each read in a try/except and then rebuilding the offsets in the except makes sense.

@mnmelo
Member

mnmelo commented Jan 18, 2016

First of all, sorry for having been away from the xdrlib discussion. I'll try to test it and devote more time to checking it now.

As to the problem at hand, we discussed it in #208. Checking for the last frame is not that fragile since, as @richardjgowers commented, the filesize must also match. For an error to go through undetected, some frames before the last would have had to shrink, others grow, and the filesize stay constant. Anyway, since it's little overhead, wrapping everything in a try/except seems indeed the way to be perfectly sure.

And if we're going the safe way, I'd vote for removing the ctime check. I often copy trajectories around (from cluster to workstation, for instance), and if I'm not careful to preserve timestamps I get regeneration of offsets (ctime might even be impossible to preserve accurately when copying between different filesystems). Since we'll have the try/except check, ctime won't really matter, or am I missing some bug potential?

@kain88-de
Member Author

For an error to go through undetected, some frames before the last would have had to shrink, others grow, and the filesize stay constant.

This assumes that the offset file has been written by MDAnalysis and wasn't tampered with by external programs or fate. It might be overly paranoid, but my suggestion would remove that assumption and even work if a single offset is changed due to an error in copying the file.

I'd vote for removing the ctime check

Actually not a bad idea with regard to file copying. And since we reload offsets on failure, it should also work transparently.

@richardjgowers
Member

While we're at it, how XDR-specific is the offset stuff currently? Because there's still #314, where maybe we can put offsets into a mixin class and only have this headache once :D

@kain88-de
Member Author

It is still not super generic.

The _xdrfile class can use offsets to speed up reading, but it doesn't check them and shouldn't do that. The offsets can be retrieved and set.

_XDRBase is responsible for loading and storing the offsets to files. There we should also check whether access to arbitrary frames works. The arbitrary-frame check would only use functions defined in base.ProtoReader, but the Reader class needs to hold a reference to a trajectory-format file object that supports the usage of offsets internally.

Remark: I chose this design because it meant that I didn't need to override functions from base.ProtoReader.
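A rough sketch of that layering (the method names are hypothetical, only meant to show where each responsibility sits):

```python
class _xdrfile:
    """Low-level XTC/TRR file: uses offsets to seek fast, never validates them."""
    def set_offsets(self, offsets): ...
    def get_offsets(self): ...
    def seek(self, frame): ...        # may raise if an offset is wrong

class _XDRBase:                       # builds on base.ProtoReader
    """Reader: persists offsets next to the trajectory and decides when to trust them."""
    def _load_offsets(self): ...      # read stored offsets, check ctime/size
    def _save_offsets(self): ...      # write offsets after a full scan
```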

@dotsdl
Member

dotsdl commented Jan 18, 2016

And if we're going the safe way, I'd vote for removing the ctime check. I often copy trajectories around (from cluster to workstation, for instance), and if I'm not careful to preserve timestamps I get regeneration of offsets (ctime might even be impossible to preserve accurately when copying between different filesystems). Since we'll have the try/except check, ctime won't really matter, or am I missing some bug potential?

The ctime check is actually important. I'll quote myself:

Since this feature comes entirely as a convenience, it should only work when it can guarantee that the trajectory file hasn't changed since the offsets were produced. Using ctime seemed the more conservative of the time-based checks one could do, since using mtime would still allow the very small possibility of another file with the same mtime replacing the trajectory. But we could certainly add other criteria as well to make it safer.

As you noted @kain88-de, none of the checks on frame offsets, file size, etc., short of going through every frame (negating the point of persistent offsets completely), guarantees that the offsets aren't stale. Using the ctime, however, guarantees that the trajectory file hasn't changed since the offsets were persisted. Though it is also sensitive to file moves, I think for this convenience that is a feature, not a bug.

This assumes that the offset file has been written by MDAnalysis and wasn't tampered with by external programs or fate.

I think this is overly paranoid, because those files would have to be edited in a very particular way to not trigger an exception when they are deserialized. I don't think this is something we have to go to pains to guard against.

It might be overly paranoid, but my suggestion would remove that assumption and even work if a single offset is changed due to an error in copying the file.

I'm confused as to what the suggestion is; the sentence reads:

My suggestion would be to skip checking the offsets at object construction and just rely on ctime and size.

but I'm not sure what that means. Are you saying that we just ignore the persistent offsets? If so, there's little point in making them.

@dotsdl
Member

dotsdl commented Jan 18, 2016

Also, I think the current behavior is such that if loading offsets fails (as could happen if the file was edited), then offsets get regenerated and the file replaced (if write access exists in the directory).

Are you suggesting that the behavior should be such that after persistent offsets are loaded (after all the existing safety checks), if at any point loading a frame fails, offsets are rebuilt? That would be new and nice.

@kain88-de
Member Author

if at any point loading a frame fails, offsets are rebuilt? That would be new and nice.

That is my suggestion.

@dotsdl
Member

dotsdl commented Jan 21, 2016

Okay, that sounds good to me. Is there any possibility that wrong offsets could be used to read part of a trajectory and not trigger an exception? If not, then you're right that we can remove the checks.

@mnmelo
Member

mnmelo commented Jan 24, 2016

Well, there's a vanishingly small chance that a trajectory gets rewritten in such a way that the filesize is the same but it has frames half as large (say, your trajectory grows after calculating offsets, and then you write out a subset of atoms). I guess you could maliciously construct such a trajectory by hand.

In this case it could happen that the stale frame offsets still point to valid frame starts, but they'll be non-consecutive.

Checking for either ctime or number-of-atoms would protect against this. As per the points above, I'd vote for checking natoms, if we're really paranoid.
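A minimal sketch of that kind of staleness check, assuming the offsets were persisted together with ctime, size and n_atoms (the field names are illustrative, not the actual MDAnalysis format):

```python
import os

def offsets_look_valid(traj_path, stored, current_n_atoms):
    """Compare the trajectory on disk against metadata saved alongside the offsets."""
    st = os.stat(traj_path)
    return (st.st_ctime == stored["ctime"]
            and st.st_size == stored["size"]
            and current_n_atoms == stored["n_atoms"])
```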

(I guess one could also hand-craft a trajectory with matching natoms and fudge the coordinate magnitude and precision so that one gets twice the frames in half the size. But then again, a ctime is much easier to forge by hand...)

@kain88-de
Member Author

Number of atoms is a nice idea. I'll add a check for that as well.

@marscher

Is it even worth caching the offsets array? I experienced only very small delays, below 10 seconds, even for huge files (>20 GB), which is an absolutely acceptable slowdown in my opinion. And it is optional anyway, only needed for seeking or determining the number of frames.

@dotsdl
Member

dotsdl commented Jan 25, 2016

@marscher it isn't for single trajectories of that size, no. But for trajectories that are much larger (~500 GB), as well as for many smaller trajectories loaded with the ChainReader, the time it takes to get numframes can be rather long, on the order of 1-10 minutes. For a single file you typically won't really see the start-up time because the offsets aren't built until numframes is called.

See here for the original discussion on this point.

@marscher

Thanks for clarifying. We are also caching this quantity in PyEMMA's trajectory reader, so we should probably also add a cache for the offsets then.

@kain88-de
Member Author

I experienced only very small delays, below 10 seconds, even for huge files (>20 GB),

I would guess you are experiencing your hardware cache here. I already see more than 10 seconds for a 1 GB XTC on an SSD after a boot. The only thing that counts here is the first time you read the trajectory after a reboot; after that, most of the file will be in the in-RAM cache.

Modern OSes are pretty smart and use RAM that is not claimed by any program to cache HDD/SSD reads. This makes benchmarking things like this very hard. There are some commands on Linux to flush that cache, but I can't find them right now.
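For reference, a common way to get a cold-cache measurement on Linux is to sync and then write to /proc/sys/vm/drop_caches (a minimal sketch; needs root and is not MDAnalysis code):

```python
import subprocess

def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop page cache, dentries and inodes."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```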

@dotsdl
Member

dotsdl commented Jan 25, 2016

@marscher we're hoping to generalize this offset scheme to all our trajectory readers in the near-ish future, to enable fast random access on any format, even plaintext ones. For some of these, persistence might not be needed or desired, but using the ChainReader for a sequence of files shows that building the offsets each time gets more expensive with the number of files. So we'll see, but for big or numerous XTCs there's definitely gain to be had.

@mnmelo
Member

mnmelo commented Feb 6, 2016

@kain88-de is wrapping up his PR (#656) to address all the aspects discussed here. As the changes stand right now, the ctime check is staying in place. As I stated earlier, I'd vote for it to be ditched, but @dotsdl had concerns about it.
@dotsdl, do you still think we should keep the ctime check, even with the two new layers of control (matching n_atoms and rebuilding offsets on read error)? If so, then we keep the ctime check and that's it. If you now feel otherwise, then @kain88-de please remove the ctime check in your PR.

@dotsdl
Member

dotsdl commented Feb 6, 2016

If there is any possibility that wrong offsets could be used to read part of a trajectory and not trigger an exception, then we should leave the ctime check, because this guards against changes to the file at the filesystem level, but admittedly it's a big stick. If not, then I'm in favor of removing the ctime check.

@mnmelo
Member

mnmelo commented Feb 7, 2016

I'll concede that there's an infinitesimal chance some legitimate trajectory manipulation might cause offsets to skip frames while still being valid.

I really don't think it'll ever crop up (unless there's malicious tampering, in which case ctime is just as fragile), but I also don't want to be the one debugging it if it ever happens!
So, the ctime check stays.

@dotsdl
Member

dotsdl commented Feb 7, 2016

@mnmelo though moot since it's merged, would the scheme without ctime checking have worked if a user had appended to the trajectory? In this case there would be no read errors, but the file wouldn't be fully mapped by the persistent offsets.

I don't think we have to care about malicious tampering (I don't see why that's a reasonable case to fend against), but in normal use by a user this performance feature shouldn't break functionality.

@mnmelo
Member

mnmelo commented Feb 7, 2016

Appending would be caught by the change in filesize.

My concern is the following case:
a user concatenates several XTCs and, at the same time, lowers their precision. In the off chance that the filesize is kept and some offsets, including the last one, align with our cached ones, we may have frames read without error from the wrong place.
This schematizes it:

old.xtc
frames
    1      2      3      ...
|------|------|------|------|------| end

new.xtc
frames
 1   2   3  4   5   ...
|--|---|---|--|---|--|--|---|--|---| end

You can see that trying to directly access frame 3 of the new XTC using the offsets of the old one would succeed but actually read frame 5. As I said, extremely unlikely, but also very hard to detect and fix.
