topology keys datatype #455

jmbuhr · 2024-07-03T10:22:52Z

So.... I have the urge for a "small" refactor that nonetheless touches most parts of kimmdys code and need your (@KRiedmiller @ehhartmann ) input if this is a good idea.

Our RecipeSteps use atom ixs (=0-based, integers) internally (but accept ids (1-based, str) or ixs on creation) and also print out using ixs. This make it consistent with e.g. MDAnalysis. But our topology keys for atoms, bonds etc. are 1-based strings, which aligns with gromacs atom ids and made sense at the time. But as we found out later, those atom ids from the file can lie (fixed-width format...) and the actual atom id just depends on the line number. Using integers as keys will reduce the memory footprint of the topology considerably, making it a lot easier to work with a) very large systems like our big collagen systems or b) systems where the solvent needs to be reactive (i.e. also many atoms in the topology). To keep it consistent these keys should also be 0-based. And when you see a recipe like
kimmdy.3_apply_recipe INFO: Recipe: 283⚡285 22574⚡22575 22574⚡22576 22574➡283 22575➡285 22576➡285
in your log you can then use those numbers directly as ixs on the topology.

This is of course a bigger change than simply making the RecipeSteps print out ids instead of ixs, but it makes it overall more consistent.

The text was updated successfully, but these errors were encountered:

KRiedmiller · 2024-07-03T10:57:27Z

I'm in favor of making it consistent everywhere, using ints, and zero basen indexing. 👍
Is it strictly necessary atm? Did you come across a case where this became an issue?

jmbuhr · 2024-07-03T11:33:04Z

Not strictly, but if we do I would like to address it sooner rather than later, because it significantly reduces mental overhead and misunderstandings when writing and testing reaction plugins as well.

It's an issue with the very large collagen systems.

ehhartmann · 2024-07-03T12:40:36Z

With 0 based ints, one would have to convert every time something is written to the topology, right?
Could you explain a bit more how exactly you would refactor the atom identifiers?

jmbuhr · 2024-07-03T13:36:54Z

no, the ints would be in the keys, so the content like e.g. atom.nr is still the gromacs atom number and thus a string. Refactoring will be fairly straightforward, just change the dictionary types and follow the screaming typechecker :)

jmbuhr · 2024-07-03T13:57:21Z

The tradeoff will be at points where the location of an item is inferred from the content of another item.

e.g. Atom(nr='1', type='HC', resnr='828', residue='ACE', atom='HH31', cgnr='1', charge='0.11 23', mass='1.008', typeB=None, chargeB=None, massB=None, bound_to_nrs=['2'], is_radical=False) will be at ix 0 in top.atoms so in order to make it easier to find e.g. atoms of a bond it would need the ix versions of certain properties, that are used internally but not written to topology files. Like how a bond has the strings (for gromacs) ai and aj, which refer to one-based atom ids, would be at key (int(ai) - 1, int(aj) - 1) in top.bonds. So the potential for off-by-one errors is not gone after this change, rather moved to different spots. They should be easier to spot by type checking though.

I'll have to just try this to see what parts are actually affected.

jmbuhr · 2024-07-05T11:52:26Z

On nth thought, there is not solution to this that is entirely consistent (which is deeply unsatisfying) with all libraries involved and internally. Take the following example of iterating over bonds and e.g. checking properties of atoms involved in the bond:

        for b in top.bonds.values():
            ai = top.atoms[b.ai]
            aj = top.atoms[b.aj]

In a bond we have ai and aj as the atom number (=groamcs id, 1-based), because that is also used for writing the line out to a top file (like @ehhartmann said). This means they have to either

be gromacs ids, meaning they couldn't be used for 0-int-based keys
or be ixs, in which case they would have to be decremented/incremented and type-converted when parsing and writing gromacs top files
be 1-based integers, which would equally confuse all interfaces and make it impossible to distinguish what you are dealing with based on the type alone

jmbuhr · 2024-07-05T14:12:40Z

So maybe we just stick with 1-based-str for now and change the way RecipeSteps print out to also use those for consistency.

KRiedmiller · 2024-07-12T14:09:22Z

I like the second variant, setting the boarder between uncertainty and known consistent types as file vs in-memory.
The impact on read/write speed should be minimal I guess.

This way numbers behave like numbers should, and there is no ambiguity within our code.

However, this is probably not needed for release 2.0, as it should not change how the user interfaces with KIMMDY.

jmbuhr · 2024-07-12T14:17:19Z

Agreed, I like the clear interface.

KRiedmiller added the enhancement New feature or request label Jul 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

topology keys datatype #455

topology keys datatype #455

jmbuhr commented Jul 3, 2024

KRiedmiller commented Jul 3, 2024 •

edited

Loading

jmbuhr commented Jul 3, 2024

ehhartmann commented Jul 3, 2024

jmbuhr commented Jul 3, 2024 •

edited

Loading

jmbuhr commented Jul 3, 2024

jmbuhr commented Jul 5, 2024 •

edited

Loading

jmbuhr commented Jul 5, 2024 •

edited

Loading

KRiedmiller commented Jul 12, 2024

jmbuhr commented Jul 12, 2024

topology keys datatype #455

topology keys datatype #455

Comments

jmbuhr commented Jul 3, 2024

KRiedmiller commented Jul 3, 2024 • edited Loading

jmbuhr commented Jul 3, 2024

ehhartmann commented Jul 3, 2024

jmbuhr commented Jul 3, 2024 • edited Loading

jmbuhr commented Jul 3, 2024

jmbuhr commented Jul 5, 2024 • edited Loading

jmbuhr commented Jul 5, 2024 • edited Loading

KRiedmiller commented Jul 12, 2024

jmbuhr commented Jul 12, 2024

KRiedmiller commented Jul 3, 2024 •

edited

Loading

jmbuhr commented Jul 3, 2024 •

edited

Loading

jmbuhr commented Jul 5, 2024 •

edited

Loading

jmbuhr commented Jul 5, 2024 •

edited

Loading