Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API for converters #2790

Closed
cbouy opened this issue Jun 25, 2020 · 10 comments · Fixed by #2882
Closed

API for converters #2790

cbouy opened this issue Jun 25, 2020 · 10 comments · Fixed by #2882

Comments

@cbouy
Copy link
Member

cbouy commented Jun 25, 2020

Is your feature request related to a problem?

Currently, converters are called with the convert_to method, e.g. u.atoms.convert_to('PARMED'). This requires users to:

  1. Check the docstring for which converters are available
  2. Use all caps for the converter name

Describe the solution you'd like

Add to_lib methods for all converters, e.g. u.atoms.to_parmed(). This allows users to use auto-completion to quickly check for which converters are available after typing to_.
Optionally, allowing u.atoms.convert_to('parmed') would be great as it feels more natural

Describe alternatives you've considered

u.atoms.parmed(), i.e. pytorch style, but it's less useful for auto-completion

Additional context

Discussed in PR #2775 (more technically here)
Currently concerns the ParmEd converter, and will also concern the RDKit converter in 2.0.0

Tagging @lilyminium @RMeli @richardjgowers @IAlibay

@orbeckst
Copy link
Member

According to the docs https://docs.mdanalysis.org/2.0.0-dev0/documentation_pages/converters.html , the argument of convert_to() is case-insensitive so atoms.convert_to("parmed") should work.

Should it be possible to do something like atoms.convert_to(mda.converters.ParmEd.ParmEdConverter), too (assuming that the converter actually lived in the converters package) for a fully object-oriented design?

I know that TAB-completion is nice but I am also worried about adding more and more methods onto AtomGroup. (We got rid of "instant selectors" #1377 which were nice for interactive work but more confusing and hard to maintain in the long run... although they also had other problems that warranted their removal.) My initial gut reaction is to not proliferate to_*() methods but making sure that the docs for convert_to() are complete so that convert_to? is useful for interactive exploration. However, that's just my opinion so please speak up.

@orbeckst
Copy link
Member

More generally speaking: laying down an API for the converters 👍 👍

@lilyminium
Copy link
Member

lilyminium commented Jun 25, 2020

Thanks for starting this @cbouy! @Luthaf not sure if you were planning on taking chemfiles in a converter direction; if so, you might also be interested in this discussion.

the argument of convert_to() is case-insensitive so atoms.convert_to("parmed") should work.

It should be (and it's an easy fix), but currently the function to get the right converter just passes in the string un-uppercased.

writer = _CONVERTERS[format]

making sure that the docs for convert_to() are complete so that convert_to? is useful for interactive exploration

I'm inclined to favour any method that doesn't rely on us remembering to do things manually.

One solution to this as proposed by @RMeli is to add on a bunch of to_ methods. I don't immediately see a lot of downsides to this other than cluttering up namespace, but it could have complications down the line if we want to convert things inside MDAnalysis (e.g. adding a u.atoms.to_ringgroup()). The normal cons of having to maintain separate methods with a consistent API are minimised because we could add them via the metaclass. One minor quibble is that if an object has a to_parmed() method I typically expect it to have a from_parmed() classmethod too; but you could say the same about convert_to()(and even atoms.write() vs atoms.read()).

Another solution is to stop hiding the registries in __init__. Then you can look at the available converters just with mda.CONVERTERS.keys(). This also solves similar issues with the available topology parsers and readers. You could even look at all available TOPOLOGY_ATTRS instead of bookmarking the user guide 😁 (which uses these registries to document them anyway). One downside of this is that users can mess with the registries with unforeseen consequences, but they could just reload the session to undo any unwise changes. This could also be a plus; chemfiles can read files directly, and users could customise which files they want to be read by chemfiles vs another parser by fiddling with the registry.

Finally a hacky solution is to some kind of wrapper class that we attach the to_lib methods too, i.e. atoms.convert.to_lib(). This mitigates the concerns about too many methods in AtomGroup. I'm not a fan of this idea though; it's inconsistent with atoms.write() and sounds complicated.

Personally my favourite direction is to stop hiding registries because it's easy and general, but would love to get everyone's thoughts.

@orbeckst
Copy link
Member

It seems that most people here would go for to_*() methods (numpy-style). ✅ fine with me, I can get behind the reasons (here and #2775 ) , especially if we can do it consistently with magic (I mean, metaclasses/registry).

If we define an API for the registries then we can make them public. If we don't have one/don't want to settle on one then we should keep them private. (I predict that even public registries are still going to be a feature for the small percentage of power users and the majority of users would rather just work with u.atoms.<TAB>....)

@jbarnoud
Copy link
Contributor

I did not see this issue before. I hope it is still time to chip in.

I am not a big fan of cluttering the atom group namespace. It is already very much cluttered with dynamically added attributes that end up not appearing in the doc. While the converters would most likely appear in the doc more easily, I do not think it would be wise to open the gate to even more attributes.

I think the ag.convert_to had the benefit of simplicity. Though I do understand the discoverability issue, it has my vote.

If having tab-completion is really a must-have, then I would go for the wrapper approach that @lilyminium suggested. ag.convert_to.parmed(...) has the benefit of not cluttering the AtomGroup namespace, and has the ability do be tab-completed. It also allows the converter to be grouped in a separate part of the documentation rather than having a messy phone book as the documentation of the AtomGroup.

@lilyminium suggested that the separate namespace was inconsistent with the write method. If really it is an issue, then we can have a atoms.write_to wrapper parallel to the write method. Then we can do things such as atoms.write_to.pdb(...).

Implementing the wrapper is not that complicated. The ConvertToNamespace class does not have to be much more than a namespace that knows about the AtomGroup that instantiated it. It also means that it has the responsibility of gathering the converters it has to expose, instead of having the AtomGroup do it on top of all it already has to do.

One more problem with cluttering the namespace of AtomGroup is that it makes tab-completion less and less useful. The more attribute we have, the less visible each attribute becomes.

Some time ago, we were discussing having all the box related operations under a atoms.pbc namespace, and all the centres of something under their own namespace. There too, the concern was by having everything flat under the AtomGroup we were making things less discoverable.

@orbeckst
Copy link
Member

orbeckst commented Jul 23, 2020

AtomGroup is becoming a bit like pandas.DataFrame so I looked there for other examples in the Python eco system.

pd.DataFrame has a gazillion of to_* methods such as .to_json(), .to_csv(), etc. so having the AtomGroup.to_* methods would be somewhat comparable (even though our file writers use AtomGroup.write() and not AtomGroup.to_file() (which we could add but sounds a bit silly to me... I like write() as a clear verb)).

On the other hand, their pd.DataFrame.plot method can specify the type of plot by either the kind="scatter" kwarg or through a wrapper-like approach pd.DataFrame.plot.scatter. This would be akin to AtomGroup.convert("rdkit") and AtomGroup.convert.rdkit co-existing.

From the thread above I would summarize that our UI should allow for the following (and maybe not all are possible)

  • be semantically clear
  • allow for TAB completion
  • do not clutter the AtomGroup namespace
  • generate documentation
  • be easily extensible with new convertors (i.e., make use of a registry mechanism or similar)

@orbeckst
Copy link
Member

In the process of adding parallel capabilities to the H5MD reader #2865 , @edisj brought up an important point that is relevant for the convertors: When you take a foreign object such as a h5py.File as input instead of a filename and you then do a Reader.reopen(), then you can loose information because during reopen() we will open the file ourselves and this might forget important arguments such as h5py.File(..., comm=MPI_COMMUNICATOR, driver="mpio") that fundamentally change how the object functions.

The broader question here is how the convertor API should handle reopen(): Should we make reopen() essentially rewind() == seek(0) if we didn't open a file ourselves?

@cbouy cbouy mentioned this issue Jul 30, 2020
4 tasks
@cbouy
Copy link
Member Author

cbouy commented Jul 31, 2020

@orbeckst isn't this for Readers/Parsers instead of converters ? or am I just too tired 😅 ?

@orbeckst
Copy link
Member

orbeckst commented Jul 31, 2020

I have been considering the fact that we can do

u = mda.Universe(thing)

where thing can be a PARMED or RDKIT or (soon) h5py.File object as part of the converter API.

So if you have a RDKIT thingy that generated 5 conformers (because you told RDKIT to so) then these conformers are now represented as a MDA trajectory. But when you hit u.trajectory.reopen() then suddenly everything is gone. Actually, I don't know how your RDKIT converter deals with reopen()... I might be using a flawed example but I hope you get the general idea that reopen() cannot in general mean the same thing as for file-based operations where MDAnalysis is actually controlling the file opening.

@orbeckst
Copy link
Member

You're right in that the original discussion centered on AtomGroup.convert_to. My point is that "the Converter API" is broader than just that. It might need a new issue but here was the best place to start a discussion.

IAlibay pushed a commit that referenced this issue May 10, 2021
Fixes #2790 

## Work done in this PR
* Creates a new converters API
  * Makes u.atoms.convert_to("PACKAGE") case insensitive
  * Allows passing kwargs to converters: u.atoms.convert_to("rdkit", NoImplicit=False)
  * Adds u.atoms.convert_to.package() methods automatically from the metaclass magic, with TAB completion support, and docstring from the converter
* Moves RDKit, Parmed, and OpenMM to MDAnalysis.converters

Authored-by: Cédric Bouysset <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants