Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

topology keys datatype #455

Open
jmbuhr opened this issue Jul 3, 2024 · 9 comments
Open

topology keys datatype #455

jmbuhr opened this issue Jul 3, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@jmbuhr
Copy link
Collaborator

jmbuhr commented Jul 3, 2024

So.... I have the urge for a "small" refactor that nonetheless touches most parts of kimmdys code and need your (@KRiedmiller @ehhartmann ) input if this is a good idea.

Our RecipeSteps use atom ixs (=0-based, integers) internally (but accept ids (1-based, str) or ixs on creation) and also print out using ixs. This make it consistent with e.g. MDAnalysis. But our topology keys for atoms, bonds etc. are 1-based strings, which aligns with gromacs atom ids and made sense at the time. But as we found out later, those atom ids from the file can lie (fixed-width format...) and the actual atom id just depends on the line number. Using integers as keys will reduce the memory footprint of the topology considerably, making it a lot easier to work with a) very large systems like our big collagen systems or b) systems where the solvent needs to be reactive (i.e. also many atoms in the topology). To keep it consistent these keys should also be 0-based. And when you see a recipe like
kimmdy.3_apply_recipe INFO: Recipe: 283⚡285 22574⚡22575 22574⚡22576 22574➡283 22575➡285 22576➡285
in your log you can then use those numbers directly as ixs on the topology.

This is of course a bigger change than simply making the RecipeSteps print out ids instead of ixs, but it makes it overall more consistent.

@KRiedmiller
Copy link
Collaborator

KRiedmiller commented Jul 3, 2024

I'm in favor of making it consistent everywhere, using ints, and zero basen indexing. 👍
Is it strictly necessary atm? Did you come across a case where this became an issue?

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jul 3, 2024

Not strictly, but if we do I would like to address it sooner rather than later, because it significantly reduces mental overhead and misunderstandings when writing and testing reaction plugins as well.

It's an issue with the very large collagen systems.

@ehhartmann
Copy link
Collaborator

With 0 based ints, one would have to convert every time something is written to the topology, right?
Could you explain a bit more how exactly you would refactor the atom identifiers?

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jul 3, 2024

no, the ints would be in the keys, so the content like e.g. atom.nr is still the gromacs atom number and thus a string. Refactoring will be fairly straightforward, just change the dictionary types and follow the screaming typechecker :)

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jul 3, 2024

The tradeoff will be at points where the location of an item is inferred from the content of another item.

e.g. Atom(nr='1', type='HC', resnr='828', residue='ACE', atom='HH31', cgnr='1', charge='0.11 23', mass='1.008', typeB=None, chargeB=None, massB=None, bound_to_nrs=['2'], is_radical=False) will be at ix 0 in top.atoms so in order to make it easier to find e.g. atoms of a bond it would need the ix versions of certain properties, that are used internally but not written to topology files. Like how a bond has the strings (for gromacs) ai and aj, which refer to one-based atom ids, would be at key (int(ai) - 1, int(aj) - 1) in top.bonds. So the potential for off-by-one errors is not gone after this change, rather moved to different spots. They should be easier to spot by type checking though.

I'll have to just try this to see what parts are actually affected.

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jul 5, 2024

On nth thought, there is not solution to this that is entirely consistent (which is deeply unsatisfying) with all libraries involved and internally. Take the following example of iterating over bonds and e.g. checking properties of atoms involved in the bond:

        for b in top.bonds.values():
            ai = top.atoms[b.ai]
            aj = top.atoms[b.aj]

In a bond we have ai and aj as the atom number (=groamcs id, 1-based), because that is also used for writing the line out to a top file (like @ehhartmann said). This means they have to either

  • be gromacs ids, meaning they couldn't be used for 0-int-based keys
  • or be ixs, in which case they would have to be decremented/incremented and type-converted when parsing and writing gromacs top files
  • be 1-based integers, which would equally confuse all interfaces and make it impossible to distinguish what you are dealing with based on the type alone

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jul 5, 2024

So maybe we just stick with 1-based-str for now and change the way RecipeSteps print out to also use those for consistency.

@KRiedmiller
Copy link
Collaborator

I like the second variant, setting the boarder between uncertainty and known consistent types as file vs in-memory.
The impact on read/write speed should be minimal I guess.

This way numbers behave like numbers should, and there is no ambiguity within our code.

However, this is probably not needed for release 2.0, as it should not change how the user interfaces with KIMMDY.

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jul 12, 2024

Agreed, I like the clear interface.

@KRiedmiller KRiedmiller added the enhancement New feature or request label Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants