Trajectory format specification

1.  Release notes
    =============

General notes:
--------------

\* Currently the API is lacking many data retrieval and getter
functions. This will be fixed soon.

\* It might be problematic searching for a specific time if frames are
not consecutive. This problem is somewhat hypothetical, but might be
good to point out.

\* Currently we cannot track a specific molecule in a grand canonical
ensemble.

\* We should request comments on the specs from other groups - after v1.

Erik's comments:
----------------

\* Keep the number of calls small

\* Use a prefix for the calls, e.g. tng\_open, tng\_close, etc.; use
e.g. adv\_ prefix for advanced API

\* Help routine to return the full list of atom types

\* In general info:

- number of stride pointers

- for each pointer, the number of frame sets that it skips

=\> have 3 stride pointers with defaults 1, 100 and 10,000 frame sets

\* include zlib as a supported compressor

To do:
------

\* Make a drawing

\* Include API function for signing of the trajectory

\* Sanity checklist:

1.  Block header size
2.  Block contents size
3.  Compare hashes (currently a mismatch does not abort, it only prints
    a warning)

    1.  maybe add a flag to choose between abort/warning
    2.  empty hash is always accepted

Version 0.98
------------

\* Use signed int instead of unsigned.

\* Changed names from ‘trg’ to ‘tng’.

Version 0.97
------------

\* Added chains and residues to the molecule description.

Version 0.96
------------

\* Renamed “Trajectory Box Shape”, “Trajectory Positions” etc.

\* Renamed “Trajectory Index Block” to “Particle Mapping Block”

Version 0.95
------------

\* Changed name of ‘Atom name block’ to ‘Molecule block’ and included
connectivity in that block. The number of each molecule is also
specified there, unless using variable number of particles (grand
canonical ensemble).

\* Removed ‘variable number of particles’ and ‘variable number of
values’ flags from data blocks.

\* Updated the API and headers

Version 0.9
-----------

\* Removed endianness/string block

\* Added PGP signature to General Info block

\* Removed reserved ‘user data blocks’. Those are already handled as
normal ‘data blocks’. Reserved are only box, positions, velocities and
forces.

Version 0.8
-----------

\* Use only MD5 hashes

\* Changed time (in general info block) to 64 bit int.

Version 0.7
-----------

\* New hierarchy rules

\* Changed "Trajectory Info" to "General Info" block

\* Moved "BOX SHAPE" before the index and trajectory frame blocks

\* general, int., bin32, bin64 user data is explicitly not frame
dependent

\* Variable number of atoms will be supported in version 1

\* Made more clear the nesting in the "trajectory group blocks"

\* Allow header only files, i.e. no trajectory blocks

\* Removed number of frames from the "index block" description

\* Removed ASCII recommendation from the string description, i.e. [1]

\* Added molecule ID to “atom name block”

Version 0.6
-----------

Changed endianness block to endianness and string length block.

Changed order of hash specifications in the header block.

Added initial API specifications

Version 0.5
-----------

Included an additional header block that specifies the endianness. It
comes before all other blocks.

2.  Specifications
    ==============

General specifications are given below. Others are described in the
relevant block sections.

1.  The file contains a number of blocks.
2.  The order of the blocks follows the order specified in section 4
    “Description of blocks”
3.  All integer and floating point (float and double) values are
    stored using big endian byte ordering. Conversions to and from the
    native format of the computer are performed during reading and
    writing of numerical fields.
4.  MD5 hashes are used to verify the integrity of the data.
5.  Strings are limited to a max length of 1023 characters and are
    terminated by a null character (‘\\000’). If longer text data must
    be saved a data block containing multiple entries of general
    (character/string) data can be used.
6.  If a trajectory converter program encounters a block whose format
    it does not recognize, the block is to be written out unmodified as
    a binary object.
7.  Each group of blocks contains a table of contents that lists the
    blocks included in the group.^[[a]](#cmnt1)^
8.  No compression (in ver.1) for general data streams such as integers,
    floating point numbers, particle indices etc.
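
The encoding rules above (big endian numbers, null-terminated strings of limited length) can be sketched as follows. This is only an illustration, not part of the specification: the helper names are hypothetical, and the sketch assumes that the 1023 limit is counted in bytes of the UTF-8 encoding and that integers are signed (as adopted in version 0.98).

```python
import struct

# Limit taken from the specification; counting it in bytes is an assumption.
MAX_STRING_LENGTH = 1023


def pack_int64(value):
    """Encode a signed 64 bit integer using big endian byte ordering."""
    return struct.pack(">q", value)


def unpack_int64(buffer, offset=0):
    """Decode a big endian signed 64 bit integer starting at offset."""
    return struct.unpack_from(">q", buffer, offset)[0]


def pack_string(text):
    """Encode a string as UTF-8 followed by a terminating null byte."""
    encoded = text.encode("utf-8")
    if len(encoded) > MAX_STRING_LENGTH:
        raise ValueError("string exceeds the 1023 byte limit")
    if b"\x00" in encoded:
        raise ValueError("string must not contain an embedded null byte")
    return encoded + b"\x00"


def unpack_string(buffer, offset=0):
    """Return the decoded string and the offset just past its terminator."""
    end = buffer.index(b"\x00", offset)
    return buffer[offset:end].decode("utf-8"), end + 1
```

With these helpers, `pack_int64(-1)` yields eight `0xff` bytes, which matches the "not set" pointer value described in note [2].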

3.  Each block contains the following fields (header):
    ==================================================

1.  64 bit size of the header
2.  64 bit size of the block contents (except header)
3.  64 bit block type identifier
4.  16 byte MD5 hash (or 16 “\\0” bytes)
5.  name[1]
6.  64 bit version of the block ^[[b]](#cmnt2)^(allows addition of more
    fields in the future to existing blocks, although old fields should
    never be removed, so that older readers can read new files)
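
A minimal sketch of writing and checking such a header is given below. It rests on several assumptions that the text above does not settle: that the header size counts all six header fields including itself, that the MD5 hash covers only the block contents, and that an all-zero hash means "no hash" and is accepted (as suggested in the sanity checklist). The function names are hypothetical.

```python
import hashlib
import struct

EMPTY_HASH = b"\x00" * 16


def build_block(block_id, name, version, contents, with_hash=True):
    """Return header + contents for one block (layout assumptions above)."""
    name_bytes = name.encode("utf-8") + b"\x00"
    digest = hashlib.md5(contents).digest() if with_hash else EMPTY_HASH
    # Three 64 bit integers, 16 byte hash, name, 64 bit version.
    header_size = 8 + 8 + 8 + 16 + len(name_bytes) + 8
    header = (
        struct.pack(">qqq", header_size, len(contents), block_id)
        + digest
        + name_bytes
        + struct.pack(">q", version)
    )
    return header + contents


def parse_block(data):
    """Parse one block and report whether its hash matches the contents."""
    header_size, contents_size, block_id = struct.unpack_from(">qqq", data, 0)
    digest = data[24:40]
    name_end = data.index(b"\x00", 40)
    name = data[40:name_end].decode("utf-8")
    (version,) = struct.unpack_from(">q", data, name_end + 1)
    contents = data[header_size:header_size + contents_size]
    hash_ok = digest == EMPTY_HASH or digest == hashlib.md5(contents).digest()
    return {
        "id": block_id,
        "name": name,
        "version": version,
        "contents": contents,
        "hash_ok": hash_ok,
    }
```

Because the header stores its own size, a reader that does not know a newer block version can still skip to the contents, which is what makes adding fields later possible.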

4.  Description of blocks (each with a unique 64 bit identifier and a matching "name"):
    ===================================================================================

1.  info block (1) "GENERAL INFO" (required)
2.  molecules block (2) "MOLECULES" (optional)
3.  trajectory ids and names (4) “TRAJECTORY IDS AND NAMES” (optional)
4.  trajectory frames, box shape block (10000) "BOX SHAPE" (optional,
    can be present before the frame sets if it does not change or inside
    the frame sets if it varies)
5.  trajectory frame set block (5) "TRAJECTORY FRAME SET" (required)
    (multiple “trajectory frame sets” are allowed)

    1.  trajectory table of contents block (6) “BLOCK TABLE OF CONTENTS”
        (required)
    2.  trajectory frames, box shape block (10000) "BOX SHAPE" (optional,
        can be present before the frame sets if it does not change or
        inside the frame sets if it varies)
    3.  trajectory particle mapping block (7) "PARTICLE MAPPING" (required
        if there are trajectory frames blocks) (multiple particle mapping
        blocks with corresponding trajectory frames blocks are allowed,
        e.g. to allow parallel writes of different atom sets)

        1.  trajectory frames, positions, block (10001) "POSITIONS"
            (optional)
        2.  trajectory frames, velocities, block (10002) "VELOCITIES"
            (optional)
        3.  trajectory frames, forces, block (10003) "FORCES" (optional)

6.  ...other specified blocks, both non-trajectory and trajectory
    blocks, each with unique id & name

Data blocks can be used to store whatever data is needed. Data blocks
with IDs in the range 10000 to 10999 are reserved for standard data
(such as box shape, positions, velocities etc.), whereas IDs from 11000
and above can be used for any kind of user data.
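
The ID ranges can be expressed as a small check. This is a sketch only: the function name and the label used for IDs below 10000 are hypothetical and not terms from the specification.

```python
def classify_block_id(block_id):
    """Classify a block ID according to the ranges described above."""
    if 10000 <= block_id <= 10999:
        return "standard data"  # box shape, positions, velocities, ...
    if block_id >= 11000:
        return "user data"
    # IDs below 10000 are used by the non-data blocks listed in this
    # section, e.g. general info (1) or trajectory frame set (5).
    return "non-data block"
```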

5.  NOTES ABOUT BLOCK KINDS:
    ========================

There can be only one block at the beginning of the file in the standard
case with fixed charges throughout the simulation (that’s the case for
version 1 of the format). For simulations where charges vary each frame
will include a block with the values.

The trajectory frame blocks must be collected into frame sets, each
such frame set has as its first block the "trajectory frame set block".
Each frame set will contain (multiple) particle mapping blocks,
positions, velocities, forces etc.^[[c]](#cmnt3)^

6.  Requirements on block order:
    ============================

1.  The order follows section 4. “Description of blocks” with the
    corresponding nesting of multiple “trajectory frame sets” and
    “particle mapping blocks”.
2.  All non-trajectory frame blocks (e.g. user ones) must appear before
    the trajectory blocks
3.  Trajectory particle mapping blocks are optional. If they are
    present, they must appear before the corresponding trajectory data
    blocks. If there are multiple trajectory data blocks, the
    corresponding particle mapping blocks come right before them. I.e.
    ParticleMappingBlock1-\>DataBlock1-\>P.MappingBlock2-\>DataBlock2.
4.  Blocks within a group of blocks are ordered by their ID.

7.  Other requirements:
    ===================

1.  Most blocks are optional except for the “general info” blocks
2.  No limit on the number of times that trajectory related blocks are
    allowed to appear

8.  Specification of the block contents (all blocks have the same header as described above) for version 1 of each block type.
    ==========================================================================================================================

BLOCK: general info block
-------------------------

1.  name and version of the program used to perform the simulation (upon
    file creation)[1]
2.  name and version of the program used when finishing the file[1]
3.  name of the force field used to perform the simulation
    [1]^[[d]](#cmnt4)^
4.  name of the person who created the file [1]
5.  name of the person who last modified the file [1]
6.  64 bit time of initial file creation, seconds since 1970
7.  64 bit time of completing the simulation, seconds since 1970
8.  name of computer/other info where the file was created [1]
9.  name of computer/other info where the file was completed [1]
10. PGP signature (optional and 0 terminated string)
11. 8 bit flag: use variable number of atoms.
12. 64 bit number of frames in each frame set (this is the expected
    number of frames in each set, but it does not have to be constant,
    it is OK to have frame sets with fewer or more frames, e.g. after
    concatenating multiple trajectory files. This avoids the need to
    recompress all data after a concatenation, but it means that
    searching for a specific frame might need a few more steps between
    frame sets.). For simulations using a grand canonical ensemble it is
    best to set this to 1 so that the number of atoms in the frame sets
    can be updated regularly.
13. 64 bit pointer from the beginning of the info block to the beginning
    of the first trajectory frame set [2]
14. 64 bit pointer from the beginning of the info block to the beginning
    of the last trajectory frame set [2] (updated when finishing writing
    the trajectory file - otherwise set to -1)
15. 64 bit length of steps (number of “trajectory frame set blocks”) for
    long stride pointers (default 100 “trajectory frame set blocks”).

BLOCK: molecules block (optional)
---------------------------------

1.  64 bit number of molecules
2.  For each molecule:

    1.  64 bit Molecule ID
    2.  Molecule name [1]
    3.  64 bit quaternary structure, e.g. 1 means monomeric, 4 means
        tetrameric etc.
    4.  64 bit number of molecules of this kind - only if not using
        “variable number of atoms” in the “general info block”.
    5.  64 bit number of chains in the molecule
    6.  64 bit number of residues in the molecule
    7.  64 bit number of atoms in the molecule^[[e]](#cmnt5)^
    8.  For each chain:

        1.  64 bit Chain ID (unique in molecule)
        2.  Chain name [1]
        3.  64 bit number of residues in the chain
        4.  For each residue:

            1.  64 bit Residue ID (unique in the chain)
            2.  Residue name [1]
            3.  64 bit number of atoms in the residue
            4.  For each atom:

                1.  64 bit Atom ID (unique in the molecule)
                2.  Atom name [1]
                3.  Atom type [1]

    9.  64 bit number of bonds in the molecule
    10. For each bond:

        1.  64 bit integer From Atom ID.
        2.  64 bit integer To Atom ID.

BLOCK: trajectory frame set block
---------------------------------

1.  64 bit number of first frame (zero based numbering)
2.  64 bit number of frames (NF)
3.  Array of 64 bit integers specifying the count of each molecule type.
    The molecule types are listed in the “Molecules block” and should
    be listed in the same order here. This should only be present when
    the variable number of atoms flag in the “General info block” is set
    to TRUE. This is used for e.g. simulations using a grand canonical
    ensemble (in which case the number of frames in each frame set
    should be 1).
4.  64 bit pointer to the next “trajectory frame set block”.
5.  64 bit pointer to the previous “trajectory frame set block”.
6.  64 bit long stride pointer to the next e.g. 100th “trajectory frame
    set block”. (Stride length specified in “general info” block.)
7.  64 bit long stride pointer to the previous e.g. 100th “trajectory
    frame set block”.
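
To illustrate how the next and long stride pointers could be combined when searching for a frame, here is a sketch that works on an in-memory table instead of a file. Everything in it is an assumption made for illustration: the `FrameSet` class, the dictionary keyed by file offset, and the premise that frame sets are stored in increasing frame order. The pointer values -1 ("not set") and 0 ("end of list") follow note [2].

```python
from dataclasses import dataclass

NOT_SET = -1      # pointer value not yet known
END_OF_LIST = 0   # no further frame set


@dataclass
class FrameSet:
    first_frame: int
    n_frames: int
    next_ptr: int
    long_next_ptr: int


def is_valid(ptr):
    """True if the pointer refers to an actual frame set."""
    return ptr not in (NOT_SET, END_OF_LIST)


def find_frame_set(frame_sets, start_ptr, frame):
    """Return the offset of the frame set containing frame, or None.

    frame_sets maps a file offset to a FrameSet (hypothetical stand-in
    for reading the block at that offset).
    """
    ptr = start_ptr
    while is_valid(ptr):
        current = frame_sets[ptr]
        if current.first_frame <= frame < current.first_frame + current.n_frames:
            return ptr
        if frame < current.first_frame:
            return None  # passed the target; assumes increasing order
        long_ptr = current.long_next_ptr
        # Take the long stride only if it does not jump past the target.
        if is_valid(long_ptr) and frame_sets[long_ptr].first_frame <= frame:
            ptr = long_ptr
        else:
            ptr = current.next_ptr
    return None
```

Since frame sets may hold fewer or more frames than the expected number, the search compares against the stored first frame of each set instead of computing a position arithmetically.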

BLOCK: trajectory table of contents
-----------------------------------

1.  64 bit number of blocks

Contains a listing of all data blocks \_present\_ in the frame set. It
is possible to have multiple blocks with the same ID, but the ID is only
listed once in the “trajectory table of contents” block.

It includes for each block type:

1.  Block name [1]

BLOCK: data blocks
------------------

Frame dependent data blocks should come after the frame set block to
which they belong. Frame and particle dependent data blocks should come
after the relevant particle mapping block (if using any particle mapping
block).

1.  Char data type flag. 0 = character/string data, 1 = 64 bit integer
    data, 2 = float data (32 bit), 3 = double data (64 bit)
2.  Char dependency flag. 1 = frame dependent, 2 = particle dependent.
    Can be combined, i.e. 3 = frame and particle dependent.
3.  Char sparse data flag to signify if not all frames in the frame sets
    have data entries in this data block, e.g. energies and positions
    might be saved at different intervals meaning that at least one of
    them would be saved as sparse data. Only present if the data is
    frame dependent.
4.  64 bit number of values.
5.  64 bit id of the CODEC used to store the data
6.  Double (64 bit) multiplier for integers to obtain the appropriate
    floating point number, for compressed frames [3] [\*\*] (only
    present if the above CODEC id is \> 0 and if the data type is double
    or float)

If using sparse data the following fields are required:

1.  64 bit number of first frame containing data.
2.  64 bit number of frames between data points

Particle dependent data blocks contain the following fields:

1.  64 bit number of first particle (as stored in the trajectory, zero
    based numbering) (J); this must be the same as in the preceding
    trajectory particle mapping block, if present.
2.  64 bit number of particles in block; this must be the same as in the
    preceding trajectory particle mapping block, if present.

Example 1:

Box shape block (10000) in a frame set with frames 0-99:

1.  Data type: 3 (double)
2.  Dependency: 1 (frame dependent)
3.  Sparse data: 1
4.  Number of values: 9
5.  Codec ID: 0
6.  First frame containing data: 0
7.  Number of frames between data points: 50
8.  For each frame (2 frames with data in this block):

    1.  9 double (64 bit) values describing the shape of the box

Example 2:

Positions block (ID 10001) in a frame set with frames 1000-1099:

1.  Data type: 2 (float)
2.  Dependency: 3 (frame and particle dependent)
3.  Sparse data: 1
4.  Number of values: 3 (x, y and z)
5.  Codec ID: 0
6.  First frame containing data: 100
7.  Number of frames between data points: 10
8.  Number of first particle: 0
9.  Number of particles in block: 1000
10. For each frame (10 frames with data in this block):

    1.  For each particle (1000 particles):

        1.  32 bit float x coordinate
        2.  32 bit float y coordinate
        3.  32 bit float z coordinate

Example 3:

Forces block (ID 10003) in a frame set with frames 0-99:

1.  Data type: 2 (float)
2.  Dependency: 3 (frame and particle dependent)
3.  Sparse data: 0
4.  Number of values: 3 (x, y and z)
5.  Codec ID: 0
6.  Number of first particle: 0
7.  Number of particles in block: 100
8.  For each frame (100 frames with data in this block):

    1.  For each particle (100 particles):

        1.  32 bit float x component
        2.  32 bit float y component
        3.  32 bit float z component
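
The frame counts and data sizes in the three examples can be checked with a short calculation. The sketch below assumes uncompressed data (codec 0), treats the "first frame containing data" as an absolute frame number, and omits string data because its size is not fixed. The function names are hypothetical.

```python
# Bytes per value for the numeric data type flags
# (1 = 64 bit integer, 2 = float, 3 = double).
VALUE_SIZES = {1: 8, 2: 4, 3: 8}


def frames_with_data(first_frame, n_frames, sparse, first_data_frame=0, interval=1):
    """Count the frames of a frame set that have entries in a data block."""
    if not sparse:
        return n_frames
    last_frame = first_frame + n_frames - 1
    if first_data_frame >= first_frame:
        start = first_data_frame
    else:
        # First stored frame at or after the start of the frame set
        # (ceiling division via negated floor division).
        steps = -((first_data_frame - first_frame) // interval)
        start = first_data_frame + steps * interval
    if start > last_frame:
        return 0
    return (last_frame - start) // interval + 1


def payload_bytes(data_type, n_values, n_data_frames, n_particles=1):
    """Size in bytes of the uncompressed values in a data block."""
    return n_data_frames * n_particles * n_values * VALUE_SIZES[data_type]
```

Under these assumptions Example 1 stores 2 frames and 144 bytes of values, while Examples 2 and 3 both store 120000 bytes (10 frames of 1000 particles and 100 frames of 100 particles, respectively).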

BLOCK: particle mapping block
-----------------------------

1.  64 bit number of first particle (particle number as stored in the
    trajectory, zero based numbering) (J)
2.  64 bit number of particles in this particle mapping block (M)
3.  64 bit array of particle numbers^[[f]](#cmnt6)^ (M values):

    1.  Each value is the number of the real particle corresponding to
        the particle number as stored in the trajectory.

Should no particle mapping block be present, the mapping is the number
of the real particle == the particle number as stored in the trajectory.

It is possible to have several trajectory/velocities etc. frame blocks
within a frame set, e.g. when faster parallel writes or memory
considerations are needed. In that case a separate particle mapping
block is needed for each of the trajectory/velocities etc. blocks.
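
The lookup described above can be sketched as follows, with each particle mapping block represented as a pair of its first stored particle number (J) and its array of M real particle numbers. The representation and the function name are hypothetical, and raising an error for an index not covered by any block is a choice made for this sketch, since the specification does not define that case.

```python
def real_particle_index(stored_index, mapping_blocks):
    """Translate a particle number as stored in the trajectory.

    mapping_blocks is a list of (first_particle, mapping) pairs, one per
    particle mapping block in the frame set.
    """
    if not mapping_blocks:
        # No mapping block present: stored number == real number.
        return stored_index
    for first_particle, mapping in mapping_blocks:
        if first_particle <= stored_index < first_particle + len(mapping):
            return mapping[stored_index - first_particle]
    raise KeyError("no particle mapping block covers index %d" % stored_index)
```

For example, with `blocks = [(0, [5, 3, 4]), (3, [0, 1, 2])]` the stored particle 1 corresponds to real particle 3 and the stored particle 4 to real particle 1.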

Relation between trajectory blocks:
===================================

Particle mapping blocks contain the mapping between the actual particle
indices and the particle indices as they appear in the trajectory file. They are
optional. If they are not given, there is no remapping. All trajectory
blocks for the same set of particles must follow each other, i.e.
positions for particle 0-99, then velocities for particle 0-99, then
positions for particle 100-199, then velocities for particle 100-199.
All non-particle trajectory blocks must appear before any particle
containing trajectory blocks.

Limitations on the number of particles in trajectory frame blocks: In
order to be able to read and uncompress data there must be a limit on
the number of particles in each trajectory frame block, therefore most
trajectory frame sets will contain multiple particle mapping / positions
/ velocities / ... blocks. The limit on the number of particles per
trajectory frame blocks should be XXXX. This should be a good value, and
not allowed to be set by the user, since this may prevent reading of the
files on smaller memory machines.

CODEC specifications (id) "name"
================================

1.  uncompressed (0) "UNCOMPRESSED"
2.  XTC positions (1) "XTC"
3.  TNG (2) "TNG"
4.  …

[\*\*] Storage of compressed positions / velocities / ...: These are now
all converted to integers before being stored. In order to facilitate
recompression without loss of precision it is essential that these are
visible as integers. Therefore all compressed blocks must contain
a conversion factor from integer to float.
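
The role of the conversion factor can be illustrated with a simple fixed-precision scheme. This sketch only shows the integer/float relationship; it is not the XTC or TNG codec, and the function names are hypothetical.

```python
def quantize(values, multiplier):
    """Convert floating point values to integers for compressed storage."""
    return [round(value / multiplier) for value in values]


def dequantize(integers, multiplier):
    """Recover floating point values: integer * multiplier."""
    return [integer * multiplier for integer in integers]
```

With a multiplier of 0.001, a position of 1.2344 would be stored as the integer 1234 and read back as approximately 1.234. Keeping the integers visible means the data can be recompressed with another codec without a second rounding step.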

Notes
=====

[1] UTF-8 text string. Make all text strings zero terminated.

[2] 64 bit pointer format: -1UL (all ones), means "not set", which is
what should be written whenever a pointer needs to be written when the
appropriate value is not yet known, while 0 (all zeros), typically means
the end of the list.

[3] Floating point format is big-endian IEEE-754, float (32 bit) or
double (64 bit).

API
===

(The API should be separated into one high- and one low-level API, using
e.g. a tng\_low tag for the low-level functions.)

API documentation is generated using the -DTNG_BUILD_DOCUMENTATION=ON option
when running cmake. Requires a doxygen installation.


^[[g]](#cmnt7)^

[[a]](#cmnt_ref1)magnus.lundborg:

Currently sizes and offsets are not in the TOC block. I think it needs
further testing to decide if it is good or not.

* * * * *

Sander Pronk:

So how can you find out where the block is?

* * * * *

magnus.lundborg:

If the offsets are not listed in the TOC block you would have to read
the whole frame set, or at least all the block headers in the frame set,
which shouldn't be too bad.

[[b]](#cmnt_ref2)Roland Schulz:

I suggest not to use a version number. This is already a problem with
the tpx version and branches. Instead I suggest to have a bitvector
where each bit says whether a certain feature is present in this file.
Given a central registry of meaning of the bits, this allows different
groups/branches/software to add features. Which would be difficult with
a version number approach which has an inherently linear ordering. The
last bit probably should be reserved to signify whether the 64 bit
bitvector is extended by another 64 bits. The data should be stored in
the order of the bitvectors. A reader which doesn't support a certain
bit, cannot read any of the following data if the bit is on.

[[c]](#cmnt_ref3)magnus.lundborg:

We will have a problem if we want to add data to a frame set in a file.
All subsequent frame sets will need to be rewritten. One alternative
would be to have a list of pointers to each block of each block type in
the frame set table of contents block. But we will have a problem adding
rows to that block as well, which in turn could be fixed by having a
pointer from the frame set block to the "current" table of contents
block and just let the old one remain. We could actually have a flag in
block headers to show if the block is "up-to-date". But there is a risk
that these pointers will be slow - especially when it comes to writing.

[[d]](#cmnt_ref4)Rossen Apostolov:

somehow the name of the FF doesn't fit naturally with the rest of the
info here :)

How about including the simulation setup in the file in a separate
block? That will be needed if the file can be used for restarts too.

[[e]](#cmnt_ref5)magnus.lundborg:

This introduces a bit of redundancy, but helps keeping track of the
data.

[[f]](#cmnt_ref6)Roland Schulz:

might be good to make this optional. And if it isn't given then the
numbering is consecutive. This would still give the flexibility that one
can specify the first and no of particles which isn't possible without
index block.

* * * * *

Daniel Spångberg:

if this is made optional, the comment below the section can be removed.
and particle mapping blocks required, since it will not cost much extra
to have it.

[[g]](#cmnt_ref7)Rossen Apostolov:

We should think of a different name for the traj. group block, it's
confusing