Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frame offset files for all Readers to universaly support random access #314

Closed
orbeckst opened this issue Jun 16, 2015 · 10 comments
Closed

Comments

@orbeckst
Copy link
Member

I think the persistent frame offsets for Gromacs trajectories are a really neat addition and enable a lot of new analysis workflows that were not really feasible with XTC/TRR before. I'd like to see this framework extended to any of the other readers that do not support random access at the moment (notable exceptions are DCD and netcdf, which can do this automatically – any others?).

If random access is guaranteed to work then it becomes a lot more straight forward to parallelize over trajectory chunks, no matter what the underlying trajectory format is.

There are probably going to be problems with trajectory streams and/or compressed formats (e.g. xyz.bz2) but perhaps someone can think of a solution.

@richardjgowers
Copy link
Member

I think a good thing to do for this would be to move the offset logic (reading and saving these as made) out of TRR/XTC and maybe into a mixin class which other Readers can then use

@dotsdl
Copy link
Member

dotsdl commented Jun 16, 2015

Could we get a checklist of formats that could do with an offsets mechanism? I'd be okay with working on pulling the offsets mechanism out of the xdrfiles TrjReader and generalizing it in a mixin for other Readers later in the summer.

@richardjgowers
Copy link
Member

TRZ doesn't need to store offsets as the frames are all of equal size, I assume this is how DCD works too?

Uncompressed ASCII formats, (PDB, DL_Poly, XYZ, anything else?) can all save their offsets as they read (this is already done for DL_Poly, just it doesn't store them)

Compressed ASCII might be trickier as Oliver said, but should be possible...... I think the annoying thing is the compressed/uncompressed both go through the same Reader class so any solution needs to handle them immediately.

@orbeckst
Copy link
Member Author

@richardjgowers : DCD frames are of fixed size. I think there's a complication with fixed coordinates and perhaps the first frame is different but judging from the comments in the code, fixed coordinates don't work anyway (and I've never used them).

@orbeckst
Copy link
Member Author

Note the discussion at #441 regarding the best format to store the offsets on disk. Currently the best solution (according to benchmarks) appears to be numpy.savez.

@kain88-de
Copy link
Member

Btw Numpy is on the same order of magnitude once you actually read the data. Just doing the dump doesn't read anything. But I'll stick with it because it can handle numpy arrays nicely

https://gist.github.com/kain88-de/5487e9d9c8f7d0e4995b

@orbeckst
Copy link
Member Author

orbeckst commented Jul 10, 2018

The new continuous chain reader code PR #1728 assumes that all readers can do random access, i.e.,

u.trajectory[frame]

I vaguely remember that @kain88-de said in a recent discussion that all our readers can do random access but I cannot find his comment for linking and perhaps I misremember it. EDIT: PR #1086 fixed issue #1081 ("sliced iterators") and it seems it adds random access to any multiframe readers that were not able to do it.

Before closing this issue I'd like to briefly hear from @MDAnalysis/coredevs if I am overlooking something here. Can all our readers do fast random access nowadays? Do we test this anywhere?

And if it is true, we should make sure the docs mention it! (#1827)

@richardjgowers
Copy link
Member

I think it's true yes. The Readers that didn't use to do this were the ascii based ones, but I hacked those together. I guess it wasn't added to the API because I didn't want it to become a hard requirement even if it was true at the time?

We've also got stream readers, which I've never used, and I'm not sure they can do seeking, I wouldn't imagine so..

@orbeckst
Copy link
Member Author

Just saw your comment while I updated mine with your PR #1086.

@orbeckst
Copy link
Member Author

I think we can then close this issue. We certainly don't need offset files for all formats.

(We should require any file-based readers to be able to do fast random access, just so that we or rather users are not suddenly experiencing abysmal performance for one trajectory format. But we can discuss this separately.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants