gromacs offsets files can't be checked for correct offsets #631
I think the old code also seeked to the final frame to check? Can you hash an xtc/trr effectively? Wouldn't saving the offsets against something better like an md5 hash make this less common?
@richardjgowers even going to the last frame and back won't be a 100% accurate check; it would only confirm that the offset for the last frame is correct. I doubt that an md5 would do much good here (besides, it will likely be slow for big files). The essential problem is still that we would need to check every single frame to be sure that the offsets are correct.
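For context, a minimal sketch of what hashing the trajectory would look like (a hypothetical helper, not MDAnalysis code); note it has to read the whole file, which is roughly as expensive as rescanning it for offsets:

```python
# Illustrative sketch only: hashing an entire trajectory means reading every
# byte, so for multi-GB files this costs about as much as rebuilding offsets.
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of *path*, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```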
Yeah, so I guess wrapping each read in a try/except and then rebuilding the offsets in the except makes sense.
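A minimal sketch of that idea (the method and attribute names such as `_read_frame_with_offset` and `_rebuild_offsets` are placeholders, not the actual reader API): every read is attempted with the cached offset and, on failure, the offsets are regenerated from the file and the read is retried.

```python
# Hypothetical sketch of the try/except fallback discussed above.
import warnings

def read_frame(reader, i):
    try:
        return reader._read_frame_with_offset(reader._offsets[i])
    except (IOError, EOFError):
        warnings.warn("Cached offsets look stale; regenerating them.")
        reader._offsets = reader._rebuild_offsets()   # rescan the file once
        return reader._read_frame_with_offset(reader._offsets[i])
```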
First of all, sorry for having been away from the xdrlib discussion. I'll try to test it and devote more time now to checking it. As to the problem at hand, we had discussed it in #208. Checking for the last frame is not that fragile since, as @richardjgowers commented, the filesize must also match: for an error to go through undetected, some frames before the last would have had to shrink, others grow, and the filesize stay constant. Anyway, since it's little overhead, wrapping everything in a try/except seems fine.

And if we're going the safe way, I'd vote for removing the ctime check. Often I copy trajectories around (from cluster to workstation, for instance) and if I'm not careful to preserve timestamps I get regeneration of offsets (ctime might even be impossible to preserve accurately when copying between different filesystems). Since we'll have the try/except fallback anyway, stale offsets would simply get regenerated.
This assumes that the offset file has been written by MDAnalysis and wasn't tampered with by external programs or fate. It might be overly paranoid, but my suggestion would remove that assumption and even work if a single offset is changed due to an error in copying the file.
Actually not a bad idea with regard to file copying. And when we reload offsets on failure it should also work transparently.
While we're at it, how xdr-specific is the offset stuff currently? Because there's still #314, where maybe we can put offsets into a mixin class and only have this headache once :D
It is still not super generic. The _xdrfile class can use offsets to speed up reading, but it doesn't check them and shouldn't do that; the offsets can be retrieved and set. _XDRBase is responsible for loading and storing the offsets to files. There we should also check if access to arbitrary frames works; the arbitrary-frame check would only use functions defined in _xdrfile. Remark: I chose this design because it meant that I didn't need to override any functions.
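A rough sketch of the split being described (only the class names _xdrfile and _XDRBase come from the comment; the method names are placeholders):

```python
# Simplified illustration of the described separation, not the actual layout.
class _xdrfile:
    """Low-level file: can seek to a byte offset and read a single frame."""
    def read_frame_at(self, offset):
        ...  # seek + decode one frame; never validates offsets itself

class _XDRBase:
    """Reader base: loads, stores and sanity-checks the offsets."""
    def __init__(self, xdr, offsets_path):
        self._file = xdr
        self._offsets = self._load_or_build_offsets(offsets_path)

    def _load_or_build_offsets(self, path):
        ...  # use cached offsets if the metadata matches, otherwise rescan

    def __getitem__(self, i):
        # only goes through the public interface of the file object
        return self._file.read_frame_at(self._offsets[i])
```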
As you noted @kain88-de, none of the checks on frame offsets, file size, etc. (short of going through every frame, which negates the point of persistent offsets completely) guarantee that the offsets aren't stale.
I think this is overly paranoid, because those files would have to be edited in a very particular way to not trigger an exception when they are deserialized. I don't think this is something we have to go to pains to guard against.
I'm confused as to what the suggestion is; the sentence reads:

"My suggestion would be to not check the offsets at object construction and just rely on ctime and size."

but I'm not sure what that means. Are you saying that we just ignore the persistent offsets? If so there's little point in making them.
Also, I think the current behavior is such that if loading offsets fails (as could happen if the file was edited), then offsets get regenerated and the file replaced (if write access exists in the directory). Are you suggesting that the behavior should be such that after persistent offsets are loaded (after all the existing safety checks), if at any point loading a frame fails, offsets are rebuilt? That would be new and nice.
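A small sketch of the "replace the offsets file only if we may write there" part (a hypothetical helper, not the actual implementation):

```python
# Hypothetical: persist regenerated offsets only where we have write access.
import os
import numpy as np

def store_offsets(offsets_path, offsets):
    directory = os.path.dirname(os.path.abspath(offsets_path))
    if os.access(directory, os.W_OK):
        np.save(offsets_path, offsets)   # replace the stale offsets file
    # otherwise keep the freshly rebuilt offsets in memory only
```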
That is my suggestion.
Okay, that sounds good to me. Is there any possibility that wrong offsets could be used to read part of a trajectory and not trigger an exception? If not, then you're right that we can remove the checks.
Well, there's an impossibly small chance that a trajectory gets rewritten in such a way that the filesize is the same but it has frames half as large (say, your trajectory grows after calculating offsets, and then you write out a subset of atoms). I guess you could maliciously construct such a trajectory by hand. In this case it could happen that the stale frame offsets still point to valid frame starts, but they'll be non-consecutive. Checking for either ctime or number-of-atoms would protect against this. As per the points above, I'd vote for checking natoms, if we're really paranoid. (I guess one can also hand-make a trajectory with matching natoms, and fudge with the coordinate magnitude and precision so that one gets twice the frames in half the size. But then again, a ctime is much easier to forge by hand...)
Number of atoms is a nice idea. I'll add a check for that as well.
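For illustration, this is the kind of cheap metadata one could persist alongside the offsets and verify on load (field names are made up here, not the actual offsets file format):

```python
# Illustrative only; offsets_path is assumed to end in '.npz'.
import os
import numpy as np

def save_offsets(offsets_path, traj_path, offsets, n_atoms):
    np.savez(offsets_path, offsets=offsets, n_atoms=n_atoms,
             size=os.path.getsize(traj_path))

def load_offsets(offsets_path, traj_path, n_atoms):
    data = np.load(offsets_path)
    if data["size"] != os.path.getsize(traj_path) or data["n_atoms"] != n_atoms:
        return None            # cheap checks failed -> rebuild by rescanning
    return data["offsets"]     # still no guarantee; reads may need a fallback
```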
Is it even worth caching the offsets array? I experienced only very small delays (below 10 seconds) even for huge files (>20 GB), which is an absolutely acceptable slowdown in my opinion. And it is optional anyway; it is only needed for seeking or for determining the number of frames.
@marscher it isn't for single trajectories of that size, no. But for trajectories that are much larger (~500 GB), as well as for many smaller trajectories loaded together, the cost adds up. See here for the original discussion on this point.
Thanks for clarifying. We are also caching this quantity in PyEMMA's trajectory reader, so probably we should also add a cache for the offsets then.
I would guess you are experiencing your hardware cache here; I already see more than 10 seconds for a 1 GB xtc on an SSD after a boot. The only thing that counts here is the first time you read the trajectory after a reboot; after that, most of the file will be in the in-RAM cache. Modern OSes are pretty smart and use RAM that is not claimed by any program to cache HDD/SSD reads, which makes benchmarking things like this very hard. There are some commands on Linux to flush that cache, but I can't find them right now.
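For reference, on Linux the page cache can be dropped with `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` (needs root), which makes cold-cache numbers reproducible. A rough way to time the first random access (file names below are placeholders):

```python
# Rough timing sketch; 'topol.gro' and 'traj.xtc' are placeholder file names.
import time
import MDAnalysis as mda

start = time.time()
u = mda.Universe("topol.gro", "traj.xtc")
u.trajectory[-1]   # jumping to the last frame requires the frame offsets
print("first random access: %.1f s" % (time.time() - start))
```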
@marscher we're hoping to generalize this offset scheme to all our trajectory readers in the near-ish future, to enable fast random access on any format, even plaintext ones. For some of these, persistence might not be needed or desired.
@kain88-de is wrapping up his PR (#656) to address all the aspects discussed here.
If there is any possibility that wrong offsets could be used to read part of a trajectory and not trigger an exception, then we should leave the checks in.
I'll concede that there's an infinitesimal chance some legitimate trajectory manipulation might cause offsets to skip frames while still being valid. I really don't think it'll ever really crop up (unless malicious tampering is done).
@mnmelo though moot since it's merged, would the scheme without the ctime check still catch something like appending to a trajectory? I don't think we have to care about malicious tampering (I don't know why that's reasonably a case to fend against), but in normal use by a user this performance feature shouldn't break functionality.
Appending would be caught by the change in filesize. My concern is the following case:
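(A made-up illustration of the kind of scenario meant here; the specific byte sizes are placeholders, and real xtc frames are not constant-size.)

```python
# Made-up numbers: the rewritten xtc has smaller frames but the same total
# size (300 bytes), so the filesize check still passes.
old_frame_size, new_frame_size = 50, 30
old_offsets = [i * old_frame_size for i in range(6)]    # 6 frames of 50 bytes
new_offsets = [i * new_frame_size for i in range(10)]   # 10 frames of 30 bytes

# The cached offset of old frame 3 (byte 150) coincides with the start of
# frame 5 in the new file, so reading there succeeds but returns frame 5.
assert old_offsets[3] == new_offsets[5]
```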
You can see that trying to directly access frame 3 of the new xtc using the offsets of the old one would succeed, but actually read frame 5. As I said, almost impossibly unlikely, but also very hard to detect and fix.
While rewriting the xdrlib code I noticed that the checks for correct offsets for xtc and trr files have been wrong. To be specific, the ctime and size checks are OK and work; the actual frame offsets that are saved aren't checked.
To see if the offsets are correct, the old code only checked if the current frame is readable. Since this happened mostly when I started to read from the beginning, it was only testing whether I can reach frame 0, which is quite trivial. The problem is that this only checks whether the current frame can be reached; it says nothing about the other frames.
So let's assume the correct offsets are 1 2 3 5 8 13 and we are at frame 3 (0-indexed). Now the loaded offsets are 1 2 3 5 9 13. Here the check would pass, but we still couldn't jump to the next frame.

Since the frames are not equally sized due to the compression, the only way to completely check the offsets' integrity is by testing every single one, which is equal to scanning the complete file again.
This is not a good solution since it would take a long time for most long/big trajectories. My suggestion would be to not check the offsets at object construction and just rely on ctime and size. When we then encounter a problem seeking/jumping through the file, we issue a warning to the user, reload the offsets, and try the seek/jump again.

This might mean a single jump in a session is slow, but the user should be able to continue without problems.