Lazy index building for XTC and TRR tajectories #3793

jbarnoud · 2022-08-26T06:15:29Z

Is your feature request related to a problem?

Reading an XTC or TRR trajectory for the first time can be very slow, especially for large files. This is due to the building of the index that allows us to access random frames.

We experienced issues due to the indexing with very large files (#3789) or with read-only file systems where the index had to be rebuilt each time.

Describe the solution you'd like

In most cases, however, we do not randomly access random frames. Instead, we iterate over the trajectory in order. There we do not need the index, so the overhead due to the first index building is unnecessary. We could build the index as we go by saving the offset when we encounter a new frame as we read the file.

This could cause issues with the parallel reading of trajectories. It could also make the reading slower in some cases, which is already an issue (#3731).

It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.

Describe alternatives you've considered

Not change anything. The current method works in most cases. However, large files are becoming more and more common.
Not build the index. That would indeed save the overhead. It would also break some assumptions in the current code base and prevent users from accessing some features when the index is not built.

Additional context

orbeckst · 2023-11-02T19:46:08Z

It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.

Do XTC and TRR contain reliable information about the total number of frames (and total time) in the last (readable) frame?

jbarnoud · 2023-11-04T11:34:03Z

I do not know how @Tsjerk's magic trick really works and how reliable it is. In my experience, it was fairly reliable, but I was only using it on very typical trajectories that had the same properties as the ones for which Tjerk wrote the script. I might still have the script somewhere in my archives somewhere?

Tsjerk · 2023-11-04T15:09:42Z

In an XTC file, the header contains the size of the coordinate section in bytes, and the header itself has a fixed size and starts with the magic number 1993. For indexing (and reversing) an XTC file, I use this to jump through the file with relative seek calls. This works well. I thought MDAnalysis had already adopted that trick. I have been playing around with guesstimating the positions of all frames from the first and last frame sizes. Whether this is reliable depends on the trajectory. If, e.g., PBC jumps are removed, then the drift in size of coordinates causes a drift in the size of the frames, making it hard to get proper guesstimates. For TRR the case is different. These have no compression, so sizes are fixed. However, a frame may contain coordinates, velocities, forces, in any possible combination. That means that frame sizes may be different throughout te file, and it is really hard to determine the frame positions in the file through a simple calculation. Again, this means that stepping through the TRR with fseeks is the most straightforward; the header contains all information required to determine the next frame position. For both XTC and TRR this thus means: find frame start, read header, determine position of next frame, fseek, repeat.

Tsjerk · 2023-11-04T15:14:29Z

Oh, reading through the original issue: 1. Finding the last frame is done by reading the first and saving the header tag (8 bytes: magic number and number of atoms), determine the size of a frame, then going to a position estimated 3/2 frames from the end, reading in everything up to the end and rfind the tag. 2. For parallel processing, the different nodes may be given filepointers spaced throughout the file, find the first frame/header and from there traverse up to the position where another worker started.

jbarnoud added Format-Gromacs Component-Readers labels Aug 26, 2022

jbarnoud mentioned this issue Aug 26, 2022

hanging while reading in gromacs trr trajectory #3789

Open

orbeckst added usability performance labels Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lazy index building for XTC and TRR tajectories #3793

Lazy index building for XTC and TRR tajectories #3793

jbarnoud commented Aug 26, 2022

orbeckst commented Nov 2, 2023

jbarnoud commented Nov 4, 2023

Tsjerk commented Nov 4, 2023

Tsjerk commented Nov 4, 2023

Lazy index building for XTC and TRR tajectories #3793

Lazy index building for XTC and TRR tajectories #3793

Comments

jbarnoud commented Aug 26, 2022

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

orbeckst commented Nov 2, 2023

jbarnoud commented Nov 4, 2023

Tsjerk commented Nov 4, 2023

Tsjerk commented Nov 4, 2023