Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy index building for XTC and TRR tajectories #3793

Open
jbarnoud opened this issue Aug 26, 2022 · 4 comments
Open

Lazy index building for XTC and TRR tajectories #3793

jbarnoud opened this issue Aug 26, 2022 · 4 comments

Comments

@jbarnoud
Copy link
Contributor

Is your feature request related to a problem?

Reading an XTC or TRR trajectory for the first time can be very slow, especially for large files. This is due to the building of the index that allows us to access random frames.

We experienced issues due to the indexing with very large files (#3789) or with read-only file systems where the index had to be rebuilt each time.

Describe the solution you'd like

In most cases, however, we do not randomly access random frames. Instead, we iterate over the trajectory in order. There we do not need the index, so the overhead due to the first index building is unnecessary. We could build the index as we go by saving the offset when we encounter a new frame as we read the file.

This could cause issues with the parallel reading of trajectories. It could also make the reading slower in some cases, which is already an issue (#3731).

It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.

Describe alternatives you've considered

  • Not change anything. The current method works in most cases. However, large files are becoming more and more common.
  • Not build the index. That would indeed save the overhead. It would also break some assumptions in the current code base and prevent users from accessing some features when the index is not built.

Additional context

@orbeckst
Copy link
Member

orbeckst commented Nov 2, 2023

It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.

Do XTC and TRR contain reliable information about the total number of frames (and total time) in the last (readable) frame?

@jbarnoud
Copy link
Contributor Author

jbarnoud commented Nov 4, 2023

I do not know how @Tsjerk's magic trick really works and how reliable it is. In my experience, it was fairly reliable, but I was only using it on very typical trajectories that had the same properties as the ones for which Tjerk wrote the script. I might still have the script somewhere in my archives somewhere?

@Tsjerk
Copy link

Tsjerk commented Nov 4, 2023

In an XTC file, the header contains the size of the coordinate section in bytes, and the header itself has a fixed size and starts with the magic number 1993. For indexing (and reversing) an XTC file, I use this to jump through the file with relative seek calls. This works well. I thought MDAnalysis had already adopted that trick. I have been playing around with guesstimating the positions of all frames from the first and last frame sizes. Whether this is reliable depends on the trajectory. If, e.g., PBC jumps are removed, then the drift in size of coordinates causes a drift in the size of the frames, making it hard to get proper guesstimates. For TRR the case is different. These have no compression, so sizes are fixed. However, a frame may contain coordinates, velocities, forces, in any possible combination. That means that frame sizes may be different throughout te file, and it is really hard to determine the frame positions in the file through a simple calculation. Again, this means that stepping through the TRR with fseeks is the most straightforward; the header contains all information required to determine the next frame position. For both XTC and TRR this thus means: find frame start, read header, determine position of next frame, fseek, repeat.

@Tsjerk
Copy link

Tsjerk commented Nov 4, 2023

Oh, reading through the original issue: 1. Finding the last frame is done by reading the first and saving the header tag (8 bytes: magic number and number of atoms), determine the size of a frame, then going to a position estimated 3/2 frames from the end, reading in everything up to the end and rfind the tag. 2. For parallel processing, the different nodes may be given filepointers spaced throughout the file, find the first frame/header and from there traverse up to the position where another worker started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants