-
Notifications
You must be signed in to change notification settings - Fork 657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
frame offset files for all Readers to universaly support random access #314
Comments
I think a good thing to do for this would be to move the offset logic (reading and saving these as made) out of TRR/XTC and maybe into a mixin class which other Readers can then use |
Could we get a checklist of formats that could do with an offsets mechanism? I'd be okay with working on pulling the offsets mechanism out of the xdrfiles |
TRZ doesn't need to store offsets as the frames are all of equal size, I assume this is how DCD works too? Uncompressed ASCII formats, (PDB, DL_Poly, XYZ, anything else?) can all save their offsets as they read (this is already done for DL_Poly, just it doesn't store them) Compressed ASCII might be trickier as Oliver said, but should be possible...... I think the annoying thing is the compressed/uncompressed both go through the same Reader class so any solution needs to handle them immediately. |
@richardjgowers : DCD frames are of fixed size. I think there's a complication with fixed coordinates and perhaps the first frame is different but judging from the comments in the code, fixed coordinates don't work anyway (and I've never used them). |
Note the discussion at #441 regarding the best format to store the offsets on disk. Currently the best solution (according to benchmarks) appears to be numpy.savez. |
Btw Numpy is on the same order of magnitude once you actually read the data. Just doing the dump doesn't read anything. But I'll stick with it because it can handle numpy arrays nicely https://gist.github.com/kain88-de/5487e9d9c8f7d0e4995b |
The new continuous chain reader code PR #1728 assumes that all readers can do random access, i.e., u.trajectory[frame] I vaguely remember that @kain88-de said in a recent discussion that all our readers can do random access but I cannot find his comment for linking and perhaps I misremember it. EDIT: PR #1086 fixed issue #1081 ("sliced iterators") and it seems it adds random access to any multiframe readers that were not able to do it. Before closing this issue I'd like to briefly hear from @MDAnalysis/coredevs if I am overlooking something here. Can all our readers do fast random access nowadays? Do we test this anywhere? And if it is true, we should make sure the docs mention it! (#1827) |
I think it's true yes. The Readers that didn't use to do this were the ascii based ones, but I hacked those together. I guess it wasn't added to the API because I didn't want it to become a hard requirement even if it was true at the time? We've also got stream readers, which I've never used, and I'm not sure they can do seeking, I wouldn't imagine so.. |
Just saw your comment while I updated mine with your PR #1086. |
I think we can then close this issue. We certainly don't need offset files for all formats. (We should require any file-based readers to be able to do fast random access, just so that we or rather users are not suddenly experiencing abysmal performance for one trajectory format. But we can discuss this separately.) |
I think the persistent frame offsets for Gromacs trajectories are a really neat addition and enable a lot of new analysis workflows that were not really feasible with XTC/TRR before. I'd like to see this framework extended to any of the other readers that do not support random access at the moment (notable exceptions are DCD and netcdf, which can do this automatically – any others?).
If random access is guaranteed to work then it becomes a lot more straight forward to parallelize over trajectory chunks, no matter what the underlying trajectory format is.
There are probably going to be problems with trajectory streams and/or compressed formats (e.g. xyz.bz2) but perhaps someone can think of a solution.
The text was updated successfully, but these errors were encountered: