Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add contexts to guess attributes better (especially elements and masses) #2630

Open
lilyminium opened this issue Mar 16, 2020 · 11 comments
Open
Labels

Comments

@lilyminium
Copy link
Member

lilyminium commented Mar 16, 2020

I started writing this as a comment in #2553 but it got, wow, really long. Let me know if I should move it there to carry on the discussion, or to #598.

Is your feature request related to a problem?

I think any effort to work with element-dependent stuff like finding hydrogens from H-bond donors (#2521) will be hampered by the element/type guesser, which is not good (...noting that I wrote the current version). Workarounds like looking at mass only work if the mass is not derived from an incorrectly guessed element. Few file formats directly provide element information, so it needs to be guessed from other information. In practice elements are only guessed from atom names, leading to all sorts of fun results.

Here I round up a bit of what has been discussed in the past and (re-)propose how it could be improved. This is more of an initial idea to gather suggestions than a real solution. Please let me know what you think!

Summary

The problem of guessing elements (and topology attributes in general) has come up many times (#598, #942, #1808, #2331, #2348, #2364, #2553). Some non-comprehensive history of discussion and current state of affairs:

In #2553 there has been interesting discussion of how to guess elements appropriately, and what we can infer from them (e.g. element <-> mass is no longer so straightforward, given HMR systems).

Describe the solution you'd like

We need a better element guesser (which then results in better guessing of related properties). With that solution, (imo) we should consider these:

And these would be nice to have:

  • be easily callable after Universe/topology creation from files
  • be easily modifiable to return results for atomistic vs coarse-grained vs HMR vs any other funky new system
  • modifications are easily saved to file / repeatable
  • doesn't give me a million warnings

Proposal

This is just @jbarnoud and @mnmelo's proposal but with more spitballing --

  • Add a Context class:

    • Basically a database for straightforward element-type-mass-name-radii-etc relationships. Something like a Pandas dataframe rather than dicts, to support looking up any attribute by any other attribute, e.g.

      Name Element Mass Type
      CA C 12 C
      CA Ca 40.087 Ca
    • Should be easily subclassed or otherwise modified by users. Users should be able to add new attributes (columns) to the table, and attribute combinations (rows), with minimal fuss

    • Should be read/writable to file

      • another thing for interoperability -- would be nice if we could read force fields from the MD software formats
    • Registered with the same metaclass trick as Parsers and Readers

  • Add more flexible guesser methods

    • one guesser method per attribute, but accepts arbitrary attribute information (i.e. not just names)
    • guessing of certain attributes (e.g. elements) is still done within individual parsers but users can pass in the context
    • Warnings are implemented inside the guesser methods so we don't have to remember to put them in each parser.
    • guesser methods guess a list of values (i.e. guess_elements vs. guess_element), so we only get the warning once, and it's probably faster

These could either be part of the Context class, be their own class, or (my favourite) be class methods of TopologyAttr. I don't think bundling them with Context is so helpful because most of the CG/HMR issues can be solved by just changing the values in the table. This also means we don't have to validate categorical values like elements ourselves, we just look them up in the table.

API

An API that matches add_TopologyAttr would be convenient for the user:

    >>> u = mda.Universe(PSF, DCD)  # <-- just a protein
    >>> u.guess_TopologyAttr('elements', from_attr=['names'], context='no-alpha-carbons')
    >>> u.atoms[u.atoms.names=='CA'].elements
    array(['Ca', 'Ca', ..., 'Ca'])
    >>> u.guess_TopologyAttr('elements', from_attr=['masses'], context='no-alpha-carbons')
    >>> u.atoms[u.atoms.names=='CA'].elements
    array(['C', 'C', ..., 'C'])

and developers when working with Parsers, which all contribute different information:

    el = Elements.guess_from(names=[...], resnames=[...])  # <-- no need to write guessed=True

Additional context

Below is an example implementation that could work well with that desired API. It is far from the real thing, just to show what I'm thinking.

Context class that contains the data and looks up close matches and stuff

  • able to look up any attribute from any combination of other attributes

  • can return close matches, again for any attribute, from any other attribute. e.g. get the element from a close match to the mass

  • would be nice: matching ranges of values (e.g. HE atom is H if mass is < 4)

  • may need different matching methods based on the type of the value passed in

class Context(metaclass=ContextMeta):
    
    df = pd.DataFrame(my_data)
    name = 'base'
    
    def lookup(self, attrname, strict=False, **from_attrs):
        """Return exact matches for variable number of attributes"""

        # get overlapping info
        info_attrs = [a for a in from_attrs if a in self.df.columns]
        if not info_attrs:
            raise ValueError('No valid guessing information given')

        # set up results
        values = np.array([from_attrs[a] for a in info_attrs]).T
        found = np.array([None]*values.shape[0], dtype=object)

        # group dataframe by info, get the first match for each combination
        columns = info_attrs+[attrname]
        first_df = self.df.groupby(*info_attrs, sort=False).first()
        first_arr = first_df[columns].values

        if strict:
            valid = np.ones_like(values, dtype=bool)  # match all exactly
        else:
            valid = (values!=None) & (first_arr!=None)  # or something like this
                                                        # that caters for charge==0 etc
        masked = values[valid]
        for i, row in enumerate(first_arr[valid]):
            matches = np.all(masked == row[:-1], axis=1)
            found[matches] = row[-1]
        return found


    def find_closest_match(self, attrname, value, return_attr=None, threshold=None):
        """Return close matches or None"""

        if return_attr is None:
            return_attr = attrname

        options = self.df[attrname].values
        if np.issubdtype(options.dtype, np.number):
            i = np.argmin(np.abs(options-value))
            # probably die if difference is greater than a % threshold

        elif np.issubdtype(options.dtype, np.character):
            found_idx = [[]]
            # try different capitalisation
            try_values = [value, value.capitalize(), value.upper(), value.lower()]

            # look for element names in the given value
            while len(found_idx[0]) == 0 and try_values:
                v = try_values.pop(0)
                start = np.char.find(options, value)
                found_idx = np.nonzero(start!=-1)
            if len(found_idx[0]) == 0:
                # not found, return None
                return

            # pick closest match (lowest number of other letters), then earliest
            remaining = np.str_len(options[found_idx]) - len(value)
            r, i_r, n_r = np.unique(remaining, return_counts=True, return_inverse=True)
            if n_r[0] > 1:
                lowest_indices = i_r[0]
                start_i = np.argmin(start[found_idx[lowest_indices]])
                i = found_idx[lowest_indices[start_i]]
            else:
                i = found_idx[i_r[0]]
        else:
            warnings.warn('What other dtypes does MDAnalysis have?')
            return

        return self.df[return_attr].values[i]

Example guessing method for Elements

This is actually pretty general and I guess there could be some base method where you pass in match_exact and match_similar. It takes both instantiated TopologyAttr objects and keyword-named arrays, and returns an Element instance.

@classmethod
def guess_from(cls, *args, context='base', **kwargs):
    info = dict((a.attrname, a.values) for a in args)
    for attrname, values in kwargs.items():
        info[attrname] = np.asarray(values)

    if not info:
        raise ValueError('Nothing to guess from')

    my_context = _CONTEXTS[context]

    n_atoms = len(next(iter(info.values())))
    values = np.array([None]*n_atoms, dtype=object)
    missing = np.ones_like(values, dtype=bool)
    info_used = set()

    # exact types > identifying masses/radii > names
    match_exact = [('types',),
                   ('masses', 'radii'),
                   ('names', 'resnames')]

    while np.any(missing) and match_exact:
        names = match_exact.pop(0)
        attr_info = dict((k, info[k][missing]) for k in names if k in info)
        if attr_info:
            found = my_context.lookup('elements', **attr_info)
            valid = found!=None
            if np.any(valid):
                values[missing][valid] = found[valid]
                info_used |= set(names)
                missing = values == None

    match_similar = ['names', 'types']

    while np.any(missing) and match_similar:
        name = match_similar.pop(0)
        if name in info:
            to_find, idx = np.unique(info[name][missing], return_inverse=True)
            matches = [my_context.find_closest_match('elements', n) for n in to_find]
            matches = np.array(matches)[idx]
            valid = matches!=None
            if np.any(valid):
                values[missing][valid] = matches[valid]
                info_used.add(name)
                missing = values == None

    warnings.warn(f'Guessed elements from {info_used}')  # one warning only
    if np.any(missing):
        warnings.warn(f'Some elements could not be found and are set to ""')
    return cls(values, guessed=True)
@lilyminium lilyminium changed the title Add contexts to guess attributes better (specifically elements and masses) Add contexts to guess attributes better (especially elements and masses) Mar 16, 2020
@jbarnoud
Copy link
Contributor

I did not look at the code, but I find the proposal sound. Guessers have always been what frustrates me the most in mdanalysis.

One thing to be aware of is that some contexts will require maintenance and careful versioning. I think especially about a context for the martini coarse grained force field : to guess masses from a gro file, you need your context to know the mapping for all the residues in your universe. Yet, those get updated from time to time. Similar issues will come with united atom force fields.

The context should be as easy as possible to create, extend, and modify without editing the code of mdanalysis.

@lilyminium
Copy link
Member Author

@jbarnoud Thanks for having a look!

One thing to be aware of is that some contexts will require maintenance and careful versioning.

Yes, this is why I think being able to read from a file (preferably straight from a force field file) would be important. In that scenario the user would ideally be to be able to override whatever masses are generated from the parser with:

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro')
>>> u.guess_TopologyAttr('masses', from_attrs=['names', 'resnames'], 
                         context='my_martini')

perhaps prompted by a helpful warning message.

I don't think it would be practical to try to keep up with force field updates.

@richardjgowers
Copy link
Member

I like this idea. To play devil's advocate, I'm not keen on adding more lines to load a Universe - part of what the package does is try and hide complexity, ie masses are magically guessed often. But maybe it's cleaner and more correct to not guess anything by default to force users to understand what is from their file and what is derived information.

@lilyminium
Copy link
Member Author

lilyminium commented Mar 16, 2020

@richardjgowers as masses is a mandatory attribute, parsers would guess it themselves. guess_TopologyAttr just overwrites whatever values previously existed. For magical guessing, if we write the mass guesser to also consider residue names when present:

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro', context='my_martini')

Under the hood, GROParser dumps all the available topology attributes into Masses.guess_from, and Masses.guess_from then chooses which ones to use and how. Probably in order:

-> match elements if valid, else
---> match types if valid, else
-----> match atom names [and resnames if present], else
-------> 0.0

If the user chooses not to load a new context, the result would wind up being the status quo with masses of 0.0 (assuming MARTINI particles aren't in base.)

>>> u = mda.Universe('martini.gro', context='base')

Edit: I agree that masses are important and should be guessed, and this may also apply to elements -- but I think a way to easily guess using different information to the tables in guessers, and to easily overwrite the attribute with a guess that the user manually makes, would be really convenient.

@richardjgowers
Copy link
Member

So would it make sense for the context to do the guessing?

ie

context = Context.from_files(...)
masses = context.guess_masses(topology)  # modifies topology in place?

Where the default .guess_masses is just our current function

@xiki-tempula
Copy link
Contributor

xiki-tempula commented Mar 16, 2020

MARTINI all have either 72 (which is the case for non-polarizable version) or 36 and 72 for (polarizable version). The particle which has a mass of 36 can be easily picked up by the presence of the name.

for residue in protein.select_atoms('name D').residues:
    if residue.resname == 'LYS':
       residue.atoms.select_atoms('name Qd D').masses = 36
    elif residue.resname == 'THR':
       residue.atoms.select_atoms('name N0 D').masses = 36

@jbarnoud
Copy link
Contributor

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro')
>>> u.guess_TopologyAttr('masses', from_attrs=['names', 'resnames'], 
                         context='my_martini')

I really like that. I would just advocate to ditch the name here, and to refer directly to the object. Just for the sake of reducing complexity.

@lilyminium
Copy link
Member Author

So would it make sense for the context to do the guessing?

It could! I left it as a classmethod because I like x.from(y), but it’s probably easier to subclass and modify as a Context method. However my method allows users to match values that are not in the topology by passing it in themselves, which I thought might matter in some way.

I would just advocate to ditch the name here, and to refer directly to the object. Just for the sake of reducing complexity.

Sure. I would leave the name as an option just because it follows the topology format and because MDAnalysis might like to offer some default contexts.

@jbarnoud
Copy link
Contributor

MARTINI all have either 72 (which is the case for non-polarizable version) or 36 and 72 for (polarizable version). The particle which has a mass of 36 can be easily picked up by the presence of the name.

for residue in protein.select_atoms('name D').residues:
    if residue.resname == 'LYS':
       residue.atoms.select_atoms('name Qd D').masses = 36
    elif residue.resname == 'THR':
       residue.atoms.select_atoms('name N0 D').masses = 36

This is not completely accurate. The "small" beads used (mostly) in rings have a mass of 45 amu. You need to know the topology of the residue to know which beads are small. Added to that, the upcoming version of the force field has 3 bead sizes with different masses and many more molecules. But it is just an example, other cg force fields do have different masses for each particles.

@zemanj
Copy link
Member

zemanj commented Mar 16, 2020

tl;dr: We should not guess anything by default.

I'd like to take up what @richardjgowers said:

But maybe it's cleaner and more correct to not guess anything by default to force users to understand what is from their file and what is derived information.

In my opinion, this is not a "maybe". It may appear like a nice convenience feature if properties can be guessed automatically. However, there is no scientifically sound way to get around actually providing the correct numbers to obtain reliable data. Results obtained with the help of MDAnalysis are regularly published in scientific papers. Any such result must not rely on educated guesses but on facts. And guessing masses is not "deriving". It is guessing. After all, we cannot know for sure which masses/charges/whatever were used in the simulation unless the user provides this information.
Assuring that people do their research properly is certainly not our responsibility as MDAnalysis developers. However, it should be our goal to enable people to do things correctly. In that respect,
guessers are a neat feature, but from a scientific standpoint they're rather dangerous.
Don't get me wrong, I think we should definitely have smart guessers. I just think they really should be disabled by default, and we should take great care letting the user know that guessed properties must be carefully checked.

@orbeckst
Copy link
Member

@lilyminium can we close this issue with #3704 merged (and raise individual ones for specific guessers) or should this remain open as an omnibus issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants