Removing guess by default in TOPParser #2651

IAlibay · 2020-03-29T15:22:58Z

Is your feature request related to a problem?

So a mix between a bug and feature request.
In light of #2630, especially #2630 (comment) I went back and looked at the element parser for the TOPParser. I noticed that not only is it by default guessing elements (using guess_types if the ATOMIC_NUMBER record isn't provided), but it also reports bad elements for two of our test files (NA is reported instead of N for PRM and PRM7).

Describe the solution you'd like

I would like to remove the guess by default functionality, and make it so that elements are only assigned if a full ATOMIC_NUMBER record is present (possibly assigning "empty" records when some of the elements are partially missing [i.e. if the ATOMIC_NUMBER contains negative values].

Guessing can then be done a later user-defined point if necessary.

Additional context

I am complicit in this "bug" so I'm willing to fix this, but we probably need a consensus on what an "empty" element record would look like (this might be agreed upon and I've missed the discussion, pinging @lilyminium who would probably know better here).

Update

So the way I see it there are three things that probably need to be done here:

Remove guess by default for elements in the TOPParser (this should also partly fix EP handling in parm7 files #2449 [at least users will be told that it wasn't recognised as P]
Add use of the atomnum attribute to the TOPParser

The text was updated successfully, but these errors were encountered:

IAlibay · 2020-03-29T15:55:14Z

To add context here, I suddenly realised that AMBER does offer a united atom force field (ff03ua) [as far as I am aware no CG though?]. It looks like the default here is to set the ATOMIC_NUMBER to the element of the heavy atom (i.e. C in CH3). See the following parm7 snippet fo alanine dipeptide:

%FLAG ATOM_NAME
%FORMAT(20a4)
CH3 C   O   N   H   CA  HA  CB  C   O   N   H   CH3
%FLAG CHARGE
%FORMAT(5E16.8)
  1.04557735E+00  8.25646946E+00 -9.85343539E+00 -7.37589504E+00  5.36238555E+00
 -1.57422450E-01  1.91674907E+00 -2.08827558E-02  1.03907928E+01 -1.01157272E+01
 -8.77327211E+00  5.21844761E+00  4.10619486E+00
%FLAG ATOMIC_NUMBER
%FORMAT(10I8)
       6       6       8       7       1       6       1       6       6       8
       7       1       6
%FLAG MASS

I'm not a united atom FF user, so I don't know if using the ATOMIC_NUMBER to get the elements would be problematic in this case. Thoughts?

zemanj · 2020-03-29T16:29:22Z

@IAlibay Yeah that looks like a tricky one. I don't know whether there exists a convention on defining atomic numbers for united-atom groups. Doing so seems just blatantly wrong to me, but here's my two cents on the topic:

I think there is no more exact way of defining an element than by supplying its atomic number. Thus, the AMBER ff03ua file you pasted claims that CH3 groups are actually carbon atoms (which they clearly aren't). Nevertheless, whoever wrote this force field decided that it's a good idea to do so, and I think that is what matters. If I decide to feed a water topology to MDAnalysis and claim that all my atoms are Darmstadtium, I want MDA to happily follow along.

Atomic numbers are unambiguous information, so there is no need to guess anything here. Whether we want to offer consistency checks is probably a different issue.

lilyminium · 2020-03-29T22:29:36Z

I agree with @zemanj re: validation. PDB elements are validated in #2553 which is necessary for the MDAnalysis paradigm for now, but if/when the guessing system is overhauled, we should take the input as canon. If I've gone to the trouble of labelling some of my atoms 'X' in the element column, I would want that information in MDAnalysis as well. If you squash and replace invalid values with an empty record when you read the file in, then you inhibit later guessing/analysis/postprocessing because the guesser no longer has access to the original info, e.g. in the hypothetical situation where an atom number of 0 means something different (e.g. virtual sites) to an atom number of -1 (e.g. CG bead). Possibly we could put the atomnum attribute to better use here.

what an "empty" element record would look like

I think the MDAnalysis has been using the empty string so far.

- Fixes #2699 - Changes made in this Pull Request: - Removes MDAnalysis.topology._elements.py (Remove `NUMBER_TO_ELEMENTS`) - Switches `NUMBER_TO_ELEMENTS` to `Z2SYMB` in TOPParser - Adds a more thorough elements test and starts cleaning up tests in anticipation of #2651

IAlibay added Format-Amber Component-Topology labels Mar 29, 2020

IAlibay mentioned this issue May 27, 2020

Remove NUMBER_TO_ELEMENTS #2700

Merged

4 tasks

IAlibay mentioned this issue Jun 6, 2020

Removes element guessing for TOPParser #2714

Merged

4 tasks

lilyminium closed this as completed in #2714 Jun 12, 2020

IAlibay mentioned this issue Jun 15, 2020

Adding atomic number attributes to topology readers #2758

Open

IAlibay mentioned this issue Sep 2, 2020

Remove element guessing by default in ParmedParser #2933

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removing guess by default in TOPParser #2651

Removing guess by default in TOPParser #2651

IAlibay commented Mar 29, 2020 •

edited

Loading

IAlibay commented Mar 29, 2020

zemanj commented Mar 29, 2020

lilyminium commented Mar 29, 2020

Removing guess by default in TOPParser #2651

Removing guess by default in TOPParser #2651

Comments

IAlibay commented Mar 29, 2020 • edited Loading

Is your feature request related to a problem?

Describe the solution you'd like

Additional context

Update

IAlibay commented Mar 29, 2020

zemanj commented Mar 29, 2020

lilyminium commented Mar 29, 2020

IAlibay commented Mar 29, 2020 •

edited

Loading