Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ NEW: MSONableNode data type #4975

Closed
mbercx opened this issue Jun 6, 2021 · 11 comments · Fixed by #5017
Closed

✨ NEW: MSONableNode data type #4975

mbercx opened this issue Jun 6, 2021 · 11 comments · Fixed by #5017

Comments

@mbercx
Copy link
Member

mbercx commented Jun 6, 2021

Is your feature request related to a problem? Please describe

One of the challenges of developing for AiiDA is that in case you want to use a certain type of data, you need to be able to store it in the provenance as a node. This means that you can't just use a class representing a type of data in AiiDA without first writing a data type plugin. This can be a barrier to using AiiDA, since it requires extra work.

Below I describe an idea that I've been thinking about for a while to extend the number of data types in AiiDA with all MSONables, and explore some of the consequences for provenance etc. There may still be flaws and in the end we might abandon the idea, but I wanted to write this down so others could comment.

Describe the solution you'd like

Let's take pymatgen's Composition class as an example.

Most of the classes in pymatgen are MSONable, or Monty-JSON-serialisable. You can check the monty.json module for more details, but essentially every class that is MSONable needs to implement an as_dict() method, so that via the MontyEncoder and MontyDecoder, these classes can be JSON-serialized and stored in a file or a JSON-like database such as MongoDB (this is heavily used by Fireworks).

For the Composition example, this looks something like this:

In [1]: from pymatgen.core import Composition

In [2]: comp = Composition('LiCoO2')

In [3]: comp.to_json()
Out[3]: '{"Li": 1.0, "Co": 1.0, "O": 2.0, "@module": "pymatgen.core.composition", "@class": "Composition", "@version": "2020.12.31"}'

My suggestion is to consider if we can't leverage the JSON-serialisabilty of MSONable classes to extend the types of data we can store in the database with all MSONable ones. Since we can store JSON in the postgres database, this should be fairly straightforward. And since the module and class are stored in the dictionary returned by the to_json() method, we can reconstruct the original MSONable instance quite easily.

Methods, provenance and immutability

Obtaining the original class from the node should be fairly straightforward for the user:

In [1]: comp_node = load_node(<PK>)

In [2]: comp = comp_node.get_msonable()  # Or get_instance()?

The user would then obtain the original instance that was stored and be able to use it in whatever way they want. This I think is already a fine first implementation. They can adapt the instance and then store a node again afterwards.

Of course, these intermediate steps will not be stored in provenance. So maybe we can go one step further, and split up the methods of the class:

  • get methods that simply return data without modifying the Node instance. These should be easily transferable to the MSONableNode instance from the MSONable class. Continuing from the (hypothetical) example above:

    In [3]: comp.get_atomic_fraction('Li')
    Out[3]: 0.25
    
    In [4]: comp_node.get_atomic_fraction('Li')
    Out[4]: 0.25
  • set methods that change something about the instance. These obviously cannot just adapt the MSONableNode, since the nodes are immutable by design (and for good reason). However, we could have them return a new MSONableNode instance. Moreover, we could store the method used as a calcfunction, and add this operation to the provenance. Again, from the example:

    In [5] comp_node.add_charges_from_oxi_state_guesses()
    Out[5]: <MSONableNode[Composition]: uuid: b1bc0507-9e0b-464d-b8e7-27ec14f45a8a (pk: 5)>

    This call would have been added to the provenance as a calcfunction, so:

    $ verdi process list -a
      PK  Created    Process label                       Process State     Process status
    ----  ---------  ----------------------------------  ----------------  ----------------
     656  5s ago     add_charges_from_oxi_state_guesses  ⏹ Finished [0]

I'm not sure how difficult it is to check if a method is a get or set one. Just off the top of my head we'd probably load the origin MSONable instance, apply the method and then see if the as_dict() methods of the instance before and after are different. I think it should be feasible, but maybe I'm missing something.

Describe alternatives you've considered

Another thing I've thought about, but I'm not sure if it would be possible, is to write a MSONable -> AiiDA data type converter. I'd have to explore this some more to see if it's feasible (even partially, to get a template that can then be adapted). However, this might be an inferior idea, since that means that any change in e.g. the Composition class would have to be updated in the corresponding data type class. On the other hand, that may be desirable to avoid things breaking with a breaking change in pymatgen.

EDIT: Additional idea

An additional idea would be to extend the MSONable concept to AiiDA nodes, creating a e.g. a NodeAble class that can be added to any class to make instances of that class storable in an AiiDA database. I guess it would be a subclass of MSONable, where we rely on the as_dict() method for the JSON stored in the database and perhaps an to_repository() method that details how/what needs to be stored in the repository. Would be very cool if you could make any Python class storeable in an AiiDA database by simply:

  • Adding NodeAble as a parent class.
  • Implementing the as_dict() method.
  • Implementing the to_repository() method.

I think this would be easier and more maintainable than writing separate AiiDA data types for everything.


Maybe I'm missing some very important detail, and this is just a pipe dream. 🙃 Comments are most welcome!

@chrisjsewell
Copy link
Member

So I met @mbercx in the kitchen after seeing this. First I said "shut up marnik, you don't know what you are talking about", and he had a little cry. But then we started brainstorming (because I do see where he is coming from), and possibly the aim here, is to be able to wrap a "pure python" class, such that when you call methods on it, which mutate the data, it stores this to the database as: node -> calcfuncnode -> node

A very rough possible implementation is:

from functools import wraps

class PurePythonClass:
    def to_json(self) -> dict:
        return {}

    def mymethod(self):
        """Something"""
        return "hi"


def method_wrapper(f, wrapper: 'Wrapper'):
    """A method wrapper that stores data mutations to the aiida database."""
    @wraps(f)
    def func(*args, **kwds):
        output = f(*args, **kwds)
        new_state = wrapper._to_json(wrapper._wrapped)
        if new_state != wrapper._state:
            new_dict = Dict(dict=new_state)
            # now create a calcfunction and link it as incoming to the new_dict
            new_dict.store()
            # memoize this on wrapper class
        return output
    return func


class Wrapper:
    def __init__(self, wrapped, to_json):
        self._wrapped = wrapped
        self._to_json = to_json
        self._state = to_json(wrapped)

    def __getattr__(self, name: str):
        """Interject in the method call."""
        method = getattr(self._wrapped, name)
        if not method:
            raise AttributeError(name)
        return method_wrapper(method, self)

    def __dir__(self):
        """Return method/attribute names for the wrapped class, to allow tab completion."""
        return dir(self._wrapped)

pure_inst = PurePythonClass()
inst = Wrapper(pure_inst, lambda inst: inst.to_json())
new_inst = inst.mymethod()

again, very rough, but thats my quick two cents

@mbercx
Copy link
Member Author

mbercx commented Jun 6, 2021

So I met @mbercx in the kitchen after seeing this. First I said "shut up marnik, you don't know what you are talking about", and he had a little cry.

Hey, I thought we agreed you wouldn't mention this in the issue! 😭 😉

@mbercx
Copy link
Member Author

mbercx commented Jun 6, 2021

Another note: a user might want to track "get" methods which do not mutate the instance that's stored in the wrapper in the provenance as well. An example here could be the StructureData.get_primitive_structure() method. Since it might be difficult to identify which methods should or should not be stored in the provenance, we may have to store all methods in the provenance by default and offer some way of disabling some. This would lead to a lot of extra nodes though, a concern also raised by @chrisjsewell this morning as he was crushing my dreams.

@chrisjsewell
Copy link
Member

Hey, I thought we agreed you wouldn't mention this in the issue!

Never trust an englishman! No jokes aside, it's good to have a mixture of "blue sky thinking" and practicality, and see where we can meet in the middle

@mbercx
Copy link
Member Author

mbercx commented Jun 6, 2021

One more thing to consider: If we also store "get" methods in the provenance, it's possible that the output class is different from the class we're calling the method from. Consider again the Composition example:

In [4]: comp_node.get_atomic_fraction('Li')
Out[4]: 0.25

Getting an output node here is simple: a float can be converted into a Float. However, what if the output is some entirely different class that doesn't have methods to convert it into an AiiDA node?

@mbercx
Copy link
Member Author

mbercx commented Jun 7, 2021

Even better would be if we could also use the constructor of a PurePythonClass to generate a node based on other PurePythonClasses. Imagine the following example, based on pymatgen's SpaceGroupAnalyzer:

  1. I obtain a Structure wrapped in a node wrapper from a geometry optimization.
  2. I use this structure in combination with a Float for sym_prec and angle_tolerance. This node creation step __init__ is stored as a calcfunction in the provenance.
  3. I then use the SpaceGroupAnalyzer.get_conventional_standard_structure() method to obtain a new structure. This calculation step is once again stored in the provenance with its in and outputs.

@sphuber
Copy link
Contributor

sphuber commented Jun 7, 2021

We have thought about this a lot in the past and the tricky part is indeed mutability. Getting a generic class that can wrap anything as long as it implements an as_dict (or comparable) method would probably be feasible. The problem indeed arises when people continue operating on these objects as they are used to by simply changing its state, even after storing. If you make the interface with AiiDA so transparent, it will be surprising if this doesn't work. Now you can indeed just silently return a copy of the node, unstored with the changes made, but this will lose provenance as you mentioned and may be a worse surprise later on. The only solution is to automatically create new nodes upon mutation through a calcfunction that is created on the fly, however, that runs into the problem of creating lots of nodes very easily for a sequence of trivial modifications. This may be in the end really what kills the usefulness and practicality of this transparent interface. One final thing that I thought of once, but never worked out and so don't know if this could potentially solve it, is to have a context manager within which mutations will be "cached" in memory and only upon exit will one single calcfunction be created to represent it. Sketch:

node = Structure()

with node.mutating_context():
    for site in node.sites:
        site.coords += [1, 1, 1]

I think it is possible to get the source code that is within a context manager so we can store that on the CalcFunctionNode that would be created. The problem I see though is that it is not clear how to define the inputs. In principle we should take all the arguments that are passed to setattr and setitem as an input. But that means they have to be storable and they probably aren't, but so should be automatically convertible. Alternatively, we could say that we don't actually record the inputs explicitly as nodes, but it is there implicitly in the source code. At least you can then track changes to nodes without creating explicit calcfunctions, however, the code is not reproducible. You cannot "rerun" the provenance and reproduce the results, which you can do in principle with actual explicit calcfunctions.

@mbercx
Copy link
Member Author

mbercx commented Jun 7, 2021

that runs into the problem of creating lots of nodes very easily for a sequence of trivial modifications. This may be in the end really what kills the usefulness and practicality of this transparent interface.

Although I agree this is an issue if used in a loop as above, I'm not sure if it kills the idea if the feature is well-documented and the user understands what is going on. It could in fact be very powerful and easy to use as in the example of my previous comment. The problem here is that the SpaceGroupAnalyzer class isn't MSONable, and so we wouldn't be able to store it in the database automatically.

One final thing that I thought of once, but never worked out and so don't know if this could potentially solve it, is to have a context manager within which mutations will be "cached" in memory and only upon exit will one single calcfunction be created to represent it.

Indeed, @chrisjsewell had a similar idea, i.e. to store all modifications executed in one calcfunction node. Not sure how he was planning to do this though. ^^

@chrisjsewell
Copy link
Member

So I was planning to add this before; it is an iteration on #4975 (comment), whereby rather than automatically storing modifications, it simply stores them in a list, which the user can then call the store() method to actually store them to the database.
This helps with the "not creating loads of nodes and links for trivial modifications".
It doesn't fully solve if you "transition" from one class type to another, e.g. different_cls_inst = cls_inst.method(), but it does propogate mutations when you do same_cls_inst_new = cls_inst.method().

from functools import wraps

class PurePythonClass:
    def to_json(self) -> dict:
        return {}

    def mymethod(self):
        """Something"""
        return "hi"


def method_wrapper(f, name: str, wrapper: 'Wrapper'):
    """A method wrapper that records data mutations."""
    @wraps(f)
    def func(*args, **kwds):
        output = f(*args, **kwds)
        new_state = wrapper._to_json(wrapper._wrapped)
        if new_state != wrapper._state:
            # maybe also store other data here about the mutation
		    wrapper._mutations.append({"method": name, "incoming": wrapper._state, "outgoing": new_state})
        wrapper._state = new_state
        if isinstance(output, type(wrapper._wrapped):
             # wrap it and store the mutations list
             output = ProvenanceWrapper(output, wrapper._to_json)
             output._mutations = wrapper.mutations
        return output
    return func


class ProvenanceWrapper:
    """A wrapper class than remembers mutation of the wrapped class instance."""

    def __init__(self, wrapped, to_json):
        self._wrapped = wrapped
        self._to_json = to_json
        self._state = to_json(wrapped)
        self._mutations = []

    @property
    def mutations(self):
        """Current (unstored) mutations."""
        return self._mutations[:]

    def store(self):
       """Store the mutations to the AiiDA profile."""
       # more logic here...
       new_dict = Dict(dict=new_state)
       new_dict.store()
       # now create a calcfunction and link it as incoming to the new_dict, etc
       # either record all mutations in a single calcfunction, or have one per mutation

       # reset mutations
       self._mutations = []

    def __getattr__(self, name: str):
        """Interject in the method call."""
        method = getattr(self._wrapped, name)
        if not method:
            raise AttributeError(name)
        return method_wrapper(method, name, self)

    def __dir__(self):
        """Return method/attribute names for the wrapped class, to allow tab completion."""
        return dir(self._wrapped) + ["mutations", "store"]

pure_inst = PurePythonClass()
inst = Wrapper(pure_inst, lambda inst: inst.to_json())
new_inst = inst.mymethod()

@sphuber
Copy link
Contributor

sphuber commented Jun 8, 2021

Quick implementation for the MsonableData plugin type to wrap any MSONable object and make it storable:

import importlib

from aiida.orm import Data
from monty.json import MSONable


class MsonableData(Data):
    """Data plugin that allows to easily wrap objects that are MSONable.

    .. note:: As the ``MSONable`` mixin class requires, the wrapped object needs to implement the methods ``as_dict``
        and ``from_dict``.

    """

    def __init__(self, obj, *args, **kwargs):
        """Construct the node from the pymatgen object."""
        if obj is None:
            raise TypeError('the `obj` argument cannot be `None`.')

        if not isinstance(obj, MSONable):
            raise TypeError('the `obj` argument needs to implement the ``MSONable`` class.')

        for method in ['as_dict', 'from_dict']:
            if not hasattr(obj, method) or not callable(getattr(obj, method)):
                raise TypeError(f'the `obj` argument does not have the required `{method}` method.')

        super().__init__(*args, **kwargs)

        self._obj = obj
        self.set_attribute_many(obj.as_dict())

    def _get_object(self):
        """Return the cached wrapped MSONable object.

        .. note:: If the object is not yet present in memory, for example if the node was loaded from the database,
            the object will first be reconstructed from the state stored in the node attributes.

        """
        try:
            return self._obj
        except AttributeError:
            attributes = self.attributes
            class_name = attributes['@class']
            module_name = attributes['@module']

            try:
                module = importlib.import_module(module_name)
            except ImportError:
                raise ImportError(f'the objects module `{module_name}` can not be imported.')

            try:
                cls = getattr(module, class_name)
            except AttributeError:
                raise ImportError(f'the objects module `{module_name}` does not contain the class `{cls}`.')

            self._obj = cls.from_dict(attributes)
            return self._obj

    @property
    def obj(self):
        """Return the wrapped MSONable object."""
        return self._get_object()

Even without the whole mutability and provenance question, maybe this would already be useful to add to aiida-core as it allows any MSONable object to easily be wrapped and become storable as a node without having to create a dedicated data plugin.

@zhubonan
Copy link
Contributor

zhubonan commented Jul 7, 2021

This sounds very useful and requires not much (fingers crossed) effort to be added to the core!
For my use case, very often I need to import data into AiiDA that have been processed using fireworks or importing data from the material project, both are MSONable, so it would be useful to have the original records (immutable) stored in addition to the curated data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants