
gromacs offsets files can't be checked for correct offsets #631

Closed
kain88-de opened this issue Jan 18, 2016 · 23 comments · Fixed by #656

Comments

@kain88-de
Member

While rewriting the xdrlib code I noticed that the checks for correct offsets for XTC and TRR files have been wrong. To be specific, the ctime and size checks are fine and work, but the actual frame offsets that are saved aren't checked.

To see if the offsets are correct, the old code only checked whether the current frame is readable. Since this mostly happened when I started reading from the beginning, it was effectively only testing whether frame 0 can be reached, which is trivial. The problem is that this only checks whether the current frame can be reached; it says nothing about the other frames.

So let's assume the correct offsets are 1 2 3 5 8 13 and we are at frame 3 (0-indexed). Now the loaded offsets are 1 2 3 5 9 13. Here the check would pass, but we still couldn't jump to the next frame.
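As a toy illustration in plain Python (just restating the numbers above; the variable names are mine):

```python
# Toy illustration: the old check only looks at the current frame's offset,
# so a wrong offset elsewhere goes unnoticed.
correct = [1, 2, 3, 5, 8, 13]   # true byte offsets of frames 0..5
loaded  = [1, 2, 3, 5, 9, 13]   # persisted offsets; frame 4 is wrong

current_frame = 3
print(loaded[current_frame] == correct[current_frame])  # True  -> old check passes
print(loaded[4] == correct[4])                          # False -> jumping to frame 4 breaks
```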

Since the frames are not equally sized due to compression, the only way to completely check the integrity of the offsets is to test every single one, which amounts to scanning the complete file again.

This is not a good solution since it would take a long time for most long/big trajectories. My suggestion would be to skip checking the offsets at object construction and just rely on ctime and size. When we then encounter a problem seeking/jumping through the file, we issue a warning to the user, reload the offsets and try the seek/jump again.

This might mean that a single jump in a session is slow, but the user should be able to continue without problems.
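A minimal sketch of that behaviour, with hypothetical `_seek_with_offsets()` and `_rebuild_offsets()` helpers standing in for the real reader internals (not the actual MDAnalysis API):

```python
import warnings

def seek_frame(reader, frame):
    """Seek using persisted offsets; on failure, rebuild them once and retry."""
    try:
        return reader._seek_with_offsets(frame)
    except (IOError, IndexError):
        warnings.warn("Stored offsets look stale; rescanning the trajectory "
                      "to rebuild them.")
        reader._rebuild_offsets()                 # one full rescan of the file
        return reader._seek_with_offsets(frame)   # retry with fresh offsets
```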

@richardjgowers
Member

I think the old code also seeked to the final frame to check?

Can you hash an XTC/TRR efficiently? Wouldn't saving the offsets against something stronger, like an MD5 hash, make this less likely?
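For reference, hashing a trajectory chunk-wise would look roughly like this (a minimal sketch; `file_md5` is an illustrative helper, not part of MDAnalysis), though as noted below it still has to read the whole file:

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Hex MD5 digest of a file, read in 1 MiB chunks to keep memory flat."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```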

@kain88-de
Member Author

@richardjgowers even going to the last frame and back won't be a 100% accurate check. It would only confirm that the offset for the last frame is correct.

I doubt that an MD5 would do much good here (besides, it will likely be slow for big files). The essential problem remains that we would need to check every single frame to be sure that the offsets are correct.

@richardjgowers
Member

Yeah, so I guess wrapping each read in a try/except and then rebuilding the offsets in the except makes sense.

@mnmelo
Member

mnmelo commented Jan 18, 2016

First of all, sorry for having been away from the xdrlib discussion. I'll try to test it and devote more time to checking it now.

As to the problem at hand, we discussed it in #208. Checking for the last frame is not that fragile since, as @richardjgowers commented, the filesize must also match. For an error to go through undetected, some frames before the last would have had to shrink, others grow, and the filesize stay constant. Anyway, since it's little overhead, wrapping everything in a try/except seems indeed the way to be perfectly sure.

And if we're going the safe way, I'd vote for removing the ctime check. I often copy trajectories around (from cluster to workstation, for instance), and if I'm not careful to preserve timestamps I get regeneration of offsets (ctime might even be impossible to preserve accurately when copying between different filesystems). Since we'll have the try/except check, ctime won't really matter, or am I missing some bug potential?

@kain88-de
Member Author

For an error to go through undetected, some frames before the last would have had to shrink, others grow, and the filesize stay constant.

This assumes that the offset file has been written by MDAnalysis and wasn't tampered with by external programs or fate. It might be overly paranoid, but my suggestion would remove that assumption and even work if a single offset is changed due to an error in copying the file.

I'd vote for removing the ctime check

Actually not a bad idea with regard to file copying. And since we reload offsets on failure, it should also work transparently.

@richardjgowers
Member

While we're at it, how XDR-specific is the offset stuff currently? Because there's still #314, where maybe we can put offsets into a mixin class and only have this headache once :D

@kain88-de
Member Author

It is still not super generic.

The _xdrfile class can use offsets to speed up reading, but it doesn't check them and shouldn't do that. The offsets can be retrieved and set.

_XDRBase is responsible for loading and storing the offsets to files. There we should also check whether access to arbitrary frames works. The arbitrary-frame check would only use functions defined in base.ProtoReader, but the Reader class needs to hold a reference to a trajectory-format file object that supports the usage of offsets internally.

Remark: I chose this design because it meant that I didn't need to override functions from base.ProtoReader.
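A rough sketch of that layering (the method names are hypothetical, only meant to show where each responsibility sits):

```python
class _xdrfile:
    """Low-level XTC/TRR file: uses offsets to seek fast, never validates them."""
    def set_offsets(self, offsets): ...
    def get_offsets(self): ...
    def seek(self, frame): ...        # may raise if an offset is wrong

class _XDRBase:                       # builds on base.ProtoReader
    """Reader: persists offsets next to the trajectory and decides when to trust them."""
    def _load_offsets(self): ...      # read stored offsets, check ctime/size
    def _save_offsets(self): ...      # write offsets after a full scan
```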

@dotsdl
Member

dotsdl commented Jan 18, 2016

And if we're going the safe way, I'd vote for removing the ctime check. I often copy trajectories around (from cluster to workstation, for instance), and if I'm not careful to preserve timestamps I get regeneration of offsets (ctime might even be impossible to preserve accurately when copying between different filesystems). Since we'll have the try/except check, ctime won't really matter, or am I missing some bug potential?

The ctime check is actually important. I'll quote myself:

Since this feature comes entirely as a convenience, it should only work when it can guarantee that the trajectory file hasn't changed since the offsets were produced. Using ctime seemed the more conservative of the time-based checks one could do, since using mtime would still allow the very small possibility of another file with the same mtime replacing the trajectory. But we could certainly add other criteria as well to make it safer.

As you noted @kain88-de, none of the checks on frame offsets, file size, etc., short of going through every frame (negating the point of persistent offsets completely), guarantees that the offsets aren't stale. Using the ctime, however, guarantees that the trajectory file hasn't changed since the offsets were persisted. Though it is also sensitive to file moves, I think for this convenience that is a feature, not a bug.

This assumes that the offset file has been written by MDAnalysis and wasn't tampered with by external programs or fate.

I think this is overly paranoid, because those files would have to be edited in a very particular way to not trigger an exception when they are deserialized. I don't think this is something we have to go to pains to guard against.

It might be overly paranoid, but my suggestion would remove that assumption and even work if a single offset is changed due to an error in copying the file.

I'm confused as to what the suggestion is; the sentence reads:

My suggestion would be to skip checking the offsets at object construction and just rely on ctime and size.

but I'm not sure what that means. Are you saying that we just ignore the persistent offsets? If so, there's little point in making them.

@dotsdl
Member

dotsdl commented Jan 18, 2016

Also, I think the current behavior is such that if loading offsets fails (as could happen if the file was edited), then offsets get regenerated and the file replaced (if write access exists in the directory).

Are you suggesting that the behavior should be such that after persistent offsets are loaded (after all the existing safety checks), if at any point loading a frame fails, offsets are rebuilt? That would be new and nice.

@kain88-de
Member Author

if at any point loading a frame fails, offsets are rebuilt? That would be new and nice.

That is my suggestion.

@dotsdl
Member

dotsdl commented Jan 21, 2016

Okay, that sounds good to me. Is there any possibility that wrong offsets could be used to read part of a trajectory and not trigger an exception? If not, then you're right that we can remove the checks.

@mnmelo
Member

mnmelo commented Jan 24, 2016

Well, there's a vanishingly small chance that a trajectory gets rewritten in such a way that the filesize is the same but it has frames half as large (say, your trajectory grows after calculating offsets, and then you write out a subset of atoms). I guess you could maliciously construct such a trajectory by hand.

In this case it could happen that the stale frame offsets still point to valid frame starts, but they'll be non-consecutive.

Checking for either ctime or number-of-atoms would protect against this. As per the points above, I'd vote for checking natoms, if we're really paranoid.
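A minimal sketch of that kind of staleness check, assuming the offsets were persisted together with ctime, size and n_atoms (the field names are illustrative, not the actual MDAnalysis format):

```python
import os

def offsets_look_valid(traj_path, stored, current_n_atoms):
    """Compare the trajectory on disk against metadata saved alongside the offsets."""
    st = os.stat(traj_path)
    return (st.st_ctime == stored["ctime"]
            and st.st_size == stored["size"]
            and current_n_atoms == stored["n_atoms"])
```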

(I guess one could also hand-craft a trajectory with matching natoms and fudge the coordinate magnitude and precision so that one gets twice the frames in half the size. But then again, a ctime is much easier to forge by hand...)

@kain88-de
Member Author

Number of atoms is a nice idea. I'll add a check for that as well.

@marscher

Is it even worth caching the offsets array? I experienced only very small delays, below 10 seconds, even for huge files (>20 GB), which is an absolutely acceptable slowdown in my opinion. And it is optional anyway, only needed for seeking or determining the number of frames.

@dotsdl
Member

dotsdl commented Jan 25, 2016

@marscher it isn't for single trajectories of that size, no. But for trajectories that are much larger (~500 GB), as well as for many smaller trajectories loaded with the ChainReader, the time it takes to get numframes can be rather long, on the order of 1-10 minutes. For a single file you typically won't really see the start-up time because the offsets aren't built until numframes is called.

See here for the original discussion on this point.

@marscher

Thanks for clarifying. We are also caching this quantity in PyEMMA's trajectory reader, so we should probably also add a cache for the offsets then.

@kain88-de
Member Author

I experienced only very small delays, below 10 seconds, even for huge files (>20 GB),

I would guess you are experiencing your hardware cache here. I already see more than 10 seconds for a 1 GB XTC on an SSD after a boot. The only thing that counts here is the first time you read the trajectory after a reboot; after that, most of the file will be in the in-RAM cache.

Modern OSes are pretty smart and use RAM that is not claimed by any program to cache HDD/SSD reads. This makes benchmarking things like this very hard. There are some commands on Linux to flush that cache, but I can't find them right now.
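For reference, a common way to get a cold-cache measurement on Linux is to sync and then write to /proc/sys/vm/drop_caches (a minimal sketch; needs root and is not MDAnalysis code):

```python
import subprocess

def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop page cache, dentries and inodes."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```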

@dotsdl
Member

dotsdl commented Jan 25, 2016

@marscher we're hoping to generalize this offset scheme to all our trajectory readers in the near-ish future, to enable fast random access on any format, even plaintext ones. For some of these, persistence might not be needed or desired, but using the ChainReader for a sequence of files shows that building the offsets each time gets more expensive with the number of files. So we'll see, but for big or numerous XTCs there's definitely gain to be had.

@mnmelo
Member

mnmelo commented Feb 6, 2016

@kain88-de is wrapping up his PR (#656) to address all the aspects discussed here. As the changes stand right now, the ctime check is staying in place. As I stated earlier, I'd vote for it to be ditched, but @dotsdl had concerns about it.
@dotsdl, do you still think we should keep the ctime check, even with the two new layers of control (matching n_atoms and rebuilding offsets on read error)? If so, then we keep the ctime check and that's it. If you now feel otherwise, then @kain88-de please remove the ctime check in your PR.

@dotsdl
Member

dotsdl commented Feb 6, 2016

If there is any possibility that wrong offsets could be used to read part of a trajectory and not trigger an exception, then we should leave the ctime check, because this guards against changes to the file at the filesystem level, but admittedly it's a big stick. If not, then I'm in favor of removing the ctime check.

@mnmelo
Member

mnmelo commented Feb 7, 2016

I'll concede that there's an infinitesimal chance some legitimate trajectory manipulation might cause offsets to skip frames while still being valid.

I really don't think it'll ever crop up (unless there's malicious tampering, in which case ctime is just as fragile), but I also don't want to be the one debugging it if it ever happens!
So, the ctime check stays.

@dotsdl
Member

dotsdl commented Feb 7, 2016

@mnmelo though moot since it's merged, would the scheme without ctime checking have worked if a user had appended to the trajectory? In this case there would be no read errors, but the file wouldn't be fully mapped by the persistent offsets.

I don't think we have to care about malicious tampering (I don't see why that's a reasonable case to fend against), but in normal use by a user this performance feature shouldn't break functionality.

@mnmelo
Member

mnmelo commented Feb 7, 2016

Appending would be caught by the change in filesize.

My concern is the following case:
a user concatenates several XTCs and, at the same time, lowers their precision. In the off chance that the filesize is kept and some offsets, including the last one, align with our cached ones, we may have frames read without error from the wrong place.
This schematizes it:

old.xtc
frames
    1      2      3      ...
|------|------|------|------|------| end

new.xtc
frames
 1   2   3  4   5   ...
|--|---|---|--|---|--|--|---|--|---| end

You can see that trying to directly access frame 3 of the new XTC using the offsets of the old one would succeed but actually read frame 5. As I said, extremely unlikely, but also very hard to detect and fix.
