✨ NEW: `MSONableNode` data type #4975

mbercx · 2021-06-06T09:37:23Z

Is your feature request related to a problem? Please describe

One of the challenges of developing for AiiDA is that in case you want to use a certain type of data, you need to be able to store it in the provenance as a node. This means that you can't just use a class representing a type of data in AiiDA without first writing a data type plugin. This can be a barrier to using AiiDA, since it requires extra work.

Below I describe an idea that I've been thinking about for a while to extend the number of data types in AiiDA with all MSONables, and explore some of the consequences for provenance etc. There may still be flaws and in the end we might abandon the idea, but I wanted to write this down so others could comment.

Describe the solution you'd like

Let's take pymatgen's Composition class as an example.

Most of the classes in pymatgen are MSONable, or Monty-JSON-serialisable. You can check the monty.json module for more details, but essentially every class that is MSONable needs to implement an as_dict() method, so that via the MontyEncoder and MontyDecoder, these classes can be JSON-serialized and stored in a file or a JSON-like database such as MongoDB (this is heavily used by Fireworks).

For the Composition example, this looks something like this:

In [1]: from pymatgen.core import Composition

In [2]: comp = Composition('LiCoO2')

In [3]: comp.to_json()
Out[3]: '{"Li": 1.0, "Co": 1.0, "O": 2.0, "@module": "pymatgen.core.composition", "@class": "Composition", "@version": "2020.12.31"}'

My suggestion is to consider if we can't leverage the JSON-serialisabilty of MSONable classes to extend the types of data we can store in the database with all MSONable ones. Since we can store JSON in the postgres database, this should be fairly straightforward. And since the module and class are stored in the dictionary returned by the to_json() method, we can reconstruct the original MSONable instance quite easily.

Methods, provenance and immutability

Obtaining the original class from the node should be fairly straightforward for the user:

In [1]: comp_node = load_node(<PK>)

In [2]: comp = comp_node.get_msonable()  # Or get_instance()?

The user would then obtain the original instance that was stored and be able to use it in whatever way they want. This I think is already a fine first implementation. They can adapt the instance and then store a node again afterwards.

Of course, these intermediate steps will not be stored in provenance. So maybe we can go one step further, and split up the methods of the class:

get methods that simply return data without modifying the Node instance. These should be easily transferable to the MSONableNode instance from the MSONable class. Continuing from the (hypothetical) example above:
```
In [3]: comp.get_atomic_fraction('Li')
Out[3]: 0.25

In [4]: comp_node.get_atomic_fraction('Li')
Out[4]: 0.25
```
set methods that change something about the instance. These obviously cannot just adapt the MSONableNode, since the nodes are immutable by design (and for good reason). However, we could have them return a new MSONableNode instance. Moreover, we could store the method used as a calcfunction, and add this operation to the provenance. Again, from the example:
```
In [5] comp_node.add_charges_from_oxi_state_guesses()
Out[5]: <MSONableNode[Composition]: uuid: b1bc0507-9e0b-464d-b8e7-27ec14f45a8a (pk: 5)>
```
This call would have been added to the provenance as a calcfunction, so:
```
$ verdi process list -a
  PK  Created    Process label                       Process State     Process status
----  ---------  ----------------------------------  ----------------  ----------------
 656  5s ago     add_charges_from_oxi_state_guesses  ⏹ Finished [0]
```

I'm not sure how difficult it is to check if a method is a get or set one. Just off the top of my head we'd probably load the origin MSONable instance, apply the method and then see if the as_dict() methods of the instance before and after are different. I think it should be feasible, but maybe I'm missing something.

Describe alternatives you've considered

Another thing I've thought about, but I'm not sure if it would be possible, is to write a MSONable -> AiiDA data type converter. I'd have to explore this some more to see if it's feasible (even partially, to get a template that can then be adapted). However, this might be an inferior idea, since that means that any change in e.g. the Composition class would have to be updated in the corresponding data type class. On the other hand, that may be desirable to avoid things breaking with a breaking change in pymatgen.

EDIT: Additional idea

An additional idea would be to extend the MSONable concept to AiiDA nodes, creating a e.g. a NodeAble class that can be added to any class to make instances of that class storable in an AiiDA database. I guess it would be a subclass of MSONable, where we rely on the as_dict() method for the JSON stored in the database and perhaps an to_repository() method that details how/what needs to be stored in the repository. Would be very cool if you could make any Python class storeable in an AiiDA database by simply:

Adding NodeAble as a parent class.
Implementing the as_dict() method.
Implementing the to_repository() method.

I think this would be easier and more maintainable than writing separate AiiDA data types for everything.

Maybe I'm missing some very important detail, and this is just a pipe dream. 🙃 Comments are most welcome!

The text was updated successfully, but these errors were encountered:

chrisjsewell · 2021-06-06T12:21:04Z

So I met @mbercx in the kitchen after seeing this. First I said "shut up marnik, you don't know what you are talking about", and he had a little cry. But then we started brainstorming (because I do see where he is coming from), and possibly the aim here, is to be able to wrap a "pure python" class, such that when you call methods on it, which mutate the data, it stores this to the database as: node -> calcfuncnode -> node

A very rough possible implementation is:

from functools import wraps

class PurePythonClass:
    def to_json(self) -> dict:
        return {}

    def mymethod(self):
        """Something"""
        return "hi"


def method_wrapper(f, wrapper: 'Wrapper'):
    """A method wrapper that stores data mutations to the aiida database."""
    @wraps(f)
    def func(*args, **kwds):
        output = f(*args, **kwds)
        new_state = wrapper._to_json(wrapper._wrapped)
        if new_state != wrapper._state:
            new_dict = Dict(dict=new_state)
            # now create a calcfunction and link it as incoming to the new_dict
            new_dict.store()
            # memoize this on wrapper class
        return output
    return func


class Wrapper:
    def __init__(self, wrapped, to_json):
        self._wrapped = wrapped
        self._to_json = to_json
        self._state = to_json(wrapped)

    def __getattr__(self, name: str):
        """Interject in the method call."""
        method = getattr(self._wrapped, name)
        if not method:
            raise AttributeError(name)
        return method_wrapper(method, self)

    def __dir__(self):
        """Return method/attribute names for the wrapped class, to allow tab completion."""
        return dir(self._wrapped)

pure_inst = PurePythonClass()
inst = Wrapper(pure_inst, lambda inst: inst.to_json())
new_inst = inst.mymethod()

again, very rough, but thats my quick two cents

mbercx · 2021-06-06T13:11:24Z

So I met @mbercx in the kitchen after seeing this. First I said "shut up marnik, you don't know what you are talking about", and he had a little cry.

Hey, I thought we agreed you wouldn't mention this in the issue! 😭 😉

mbercx · 2021-06-06T14:03:42Z

Another note: a user might want to track "get" methods which do not mutate the instance that's stored in the wrapper in the provenance as well. An example here could be the StructureData.get_primitive_structure() method. Since it might be difficult to identify which methods should or should not be stored in the provenance, we may have to store all methods in the provenance by default and offer some way of disabling some. This would lead to a lot of extra nodes though, a concern also raised by @chrisjsewell this morning as he was crushing my dreams.

chrisjsewell · 2021-06-06T14:55:09Z

Hey, I thought we agreed you wouldn't mention this in the issue!

Never trust an englishman! No jokes aside, it's good to have a mixture of "blue sky thinking" and practicality, and see where we can meet in the middle

mbercx · 2021-06-06T18:07:40Z

One more thing to consider: If we also store "get" methods in the provenance, it's possible that the output class is different from the class we're calling the method from. Consider again the Composition example:

In [4]: comp_node.get_atomic_fraction('Li')
Out[4]: 0.25

Getting an output node here is simple: a float can be converted into a Float. However, what if the output is some entirely different class that doesn't have methods to convert it into an AiiDA node?

mbercx · 2021-06-07T07:05:28Z

Even better would be if we could also use the constructor of a PurePythonClass to generate a node based on other PurePythonClasses. Imagine the following example, based on pymatgen's SpaceGroupAnalyzer:

I obtain a Structure wrapped in a node wrapper from a geometry optimization.
I use this structure in combination with a Float for sym_prec and angle_tolerance. This node creation step __init__ is stored as a calcfunction in the provenance.
I then use the SpaceGroupAnalyzer.get_conventional_standard_structure() method to obtain a new structure. This calculation step is once again stored in the provenance with its in and outputs.

sphuber · 2021-06-07T10:48:10Z

We have thought about this a lot in the past and the tricky part is indeed mutability. Getting a generic class that can wrap anything as long as it implements an as_dict (or comparable) method would probably be feasible. The problem indeed arises when people continue operating on these objects as they are used to by simply changing its state, even after storing. If you make the interface with AiiDA so transparent, it will be surprising if this doesn't work. Now you can indeed just silently return a copy of the node, unstored with the changes made, but this will lose provenance as you mentioned and may be a worse surprise later on. The only solution is to automatically create new nodes upon mutation through a calcfunction that is created on the fly, however, that runs into the problem of creating lots of nodes very easily for a sequence of trivial modifications. This may be in the end really what kills the usefulness and practicality of this transparent interface. One final thing that I thought of once, but never worked out and so don't know if this could potentially solve it, is to have a context manager within which mutations will be "cached" in memory and only upon exit will one single calcfunction be created to represent it. Sketch:

node = Structure()

with node.mutating_context():
    for site in node.sites:
        site.coords += [1, 1, 1]

I think it is possible to get the source code that is within a context manager so we can store that on the CalcFunctionNode that would be created. The problem I see though is that it is not clear how to define the inputs. In principle we should take all the arguments that are passed to setattr and setitem as an input. But that means they have to be storable and they probably aren't, but so should be automatically convertible. Alternatively, we could say that we don't actually record the inputs explicitly as nodes, but it is there implicitly in the source code. At least you can then track changes to nodes without creating explicit calcfunctions, however, the code is not reproducible. You cannot "rerun" the provenance and reproduce the results, which you can do in principle with actual explicit calcfunctions.

mbercx · 2021-06-07T18:32:46Z

that runs into the problem of creating lots of nodes very easily for a sequence of trivial modifications. This may be in the end really what kills the usefulness and practicality of this transparent interface.

Although I agree this is an issue if used in a loop as above, I'm not sure if it kills the idea if the feature is well-documented and the user understands what is going on. It could in fact be very powerful and easy to use as in the example of my previous comment. The problem here is that the SpaceGroupAnalyzer class isn't MSONable, and so we wouldn't be able to store it in the database automatically.

One final thing that I thought of once, but never worked out and so don't know if this could potentially solve it, is to have a context manager within which mutations will be "cached" in memory and only upon exit will one single calcfunction be created to represent it.

Indeed, @chrisjsewell had a similar idea, i.e. to store all modifications executed in one calcfunction node. Not sure how he was planning to do this though. ^^

chrisjsewell · 2021-06-07T21:02:52Z

So I was planning to add this before; it is an iteration on #4975 (comment), whereby rather than automatically storing modifications, it simply stores them in a list, which the user can then call the store() method to actually store them to the database.
This helps with the "not creating loads of nodes and links for trivial modifications".
It doesn't fully solve if you "transition" from one class type to another, e.g. different_cls_inst = cls_inst.method(), but it does propogate mutations when you do same_cls_inst_new = cls_inst.method().

from functools import wraps

class PurePythonClass:
    def to_json(self) -> dict:
        return {}

    def mymethod(self):
        """Something"""
        return "hi"


def method_wrapper(f, name: str, wrapper: 'Wrapper'):
    """A method wrapper that records data mutations."""
    @wraps(f)
    def func(*args, **kwds):
        output = f(*args, **kwds)
        new_state = wrapper._to_json(wrapper._wrapped)
        if new_state != wrapper._state:
            # maybe also store other data here about the mutation
		    wrapper._mutations.append({"method": name, "incoming": wrapper._state, "outgoing": new_state})
        wrapper._state = new_state
        if isinstance(output, type(wrapper._wrapped):
             # wrap it and store the mutations list
             output = ProvenanceWrapper(output, wrapper._to_json)
             output._mutations = wrapper.mutations
        return output
    return func


class ProvenanceWrapper:
    """A wrapper class than remembers mutation of the wrapped class instance."""

    def __init__(self, wrapped, to_json):
        self._wrapped = wrapped
        self._to_json = to_json
        self._state = to_json(wrapped)
        self._mutations = []

    @property
    def mutations(self):
        """Current (unstored) mutations."""
        return self._mutations[:]

    def store(self):
       """Store the mutations to the AiiDA profile."""
       # more logic here...
       new_dict = Dict(dict=new_state)
       new_dict.store()
       # now create a calcfunction and link it as incoming to the new_dict, etc
       # either record all mutations in a single calcfunction, or have one per mutation

       # reset mutations
       self._mutations = []

    def __getattr__(self, name: str):
        """Interject in the method call."""
        method = getattr(self._wrapped, name)
        if not method:
            raise AttributeError(name)
        return method_wrapper(method, name, self)

    def __dir__(self):
        """Return method/attribute names for the wrapped class, to allow tab completion."""
        return dir(self._wrapped) + ["mutations", "store"]

pure_inst = PurePythonClass()
inst = Wrapper(pure_inst, lambda inst: inst.to_json())
new_inst = inst.mymethod()

sphuber · 2021-06-08T07:30:31Z

Quick implementation for the MsonableData plugin type to wrap any MSONable object and make it storable:

import importlib

from aiida.orm import Data
from monty.json import MSONable


class MsonableData(Data):
    """Data plugin that allows to easily wrap objects that are MSONable.

    .. note:: As the ``MSONable`` mixin class requires, the wrapped object needs to implement the methods ``as_dict``
        and ``from_dict``.

    """

    def __init__(self, obj, *args, **kwargs):
        """Construct the node from the pymatgen object."""
        if obj is None:
            raise TypeError('the `obj` argument cannot be `None`.')

        if not isinstance(obj, MSONable):
            raise TypeError('the `obj` argument needs to implement the ``MSONable`` class.')

        for method in ['as_dict', 'from_dict']:
            if not hasattr(obj, method) or not callable(getattr(obj, method)):
                raise TypeError(f'the `obj` argument does not have the required `{method}` method.')

        super().__init__(*args, **kwargs)

        self._obj = obj
        self.set_attribute_many(obj.as_dict())

    def _get_object(self):
        """Return the cached wrapped MSONable object.

        .. note:: If the object is not yet present in memory, for example if the node was loaded from the database,
            the object will first be reconstructed from the state stored in the node attributes.

        """
        try:
            return self._obj
        except AttributeError:
            attributes = self.attributes
            class_name = attributes['@class']
            module_name = attributes['@module']

            try:
                module = importlib.import_module(module_name)
            except ImportError:
                raise ImportError(f'the objects module `{module_name}` can not be imported.')

            try:
                cls = getattr(module, class_name)
            except AttributeError:
                raise ImportError(f'the objects module `{module_name}` does not contain the class `{cls}`.')

            self._obj = cls.from_dict(attributes)
            return self._obj

    @property
    def obj(self):
        """Return the wrapped MSONable object."""
        return self._get_object()

Even without the whole mutability and provenance question, maybe this would already be useful to add to aiida-core as it allows any MSONable object to easily be wrapped and become storable as a node without having to create a dedicated data plugin.

zhubonan · 2021-07-07T11:26:18Z

This sounds very useful and requires not much (fingers crossed) effort to be added to the core!
For my use case, very often I need to import data into AiiDA that have been processed using fireworks or importing data from the material project, both are MSONable, so it would be useful to have the original records (immutable) stored in addition to the curated data.

mbercx added the type/feature request status undecided label Jun 6, 2021

mbercx self-assigned this Jun 6, 2021

mbercx mentioned this issue Jun 6, 2021

👌 IMPROVE: Hide dev/non-public API methods #4976

Closed

mbercx added topic/data-types priority/nice-to-have labels Jun 6, 2021

sphuber mentioned this issue Jul 9, 2021

Add the JsonableData data plugin #5017

Merged

mbercx mentioned this issue Aug 11, 2021

Add the ElectronicBandsData data plugin #5033

Closed

sphuber closed this as completed in #5017 Aug 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨ NEW: `MSONableNode` data type #4975

✨ NEW: `MSONableNode` data type #4975

mbercx commented Jun 6, 2021 •

edited

Loading

chrisjsewell commented Jun 6, 2021

mbercx commented Jun 6, 2021

mbercx commented Jun 6, 2021

chrisjsewell commented Jun 6, 2021

mbercx commented Jun 6, 2021

mbercx commented Jun 7, 2021

sphuber commented Jun 7, 2021 •

edited

Loading

mbercx commented Jun 7, 2021

chrisjsewell commented Jun 7, 2021

sphuber commented Jun 8, 2021

zhubonan commented Jul 7, 2021 •

edited

Loading

✨ NEW: MSONableNode data type #4975

✨ NEW: MSONableNode data type #4975

Comments

mbercx commented Jun 6, 2021 • edited Loading

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Methods, provenance and immutability

Describe alternatives you've considered

EDIT: Additional idea

chrisjsewell commented Jun 6, 2021

mbercx commented Jun 6, 2021

mbercx commented Jun 6, 2021

chrisjsewell commented Jun 6, 2021

mbercx commented Jun 6, 2021

mbercx commented Jun 7, 2021

sphuber commented Jun 7, 2021 • edited Loading

mbercx commented Jun 7, 2021

chrisjsewell commented Jun 7, 2021

sphuber commented Jun 8, 2021

zhubonan commented Jul 7, 2021 • edited Loading

✨ NEW: `MSONableNode` data type #4975

✨ NEW: `MSONableNode` data type #4975

mbercx commented Jun 6, 2021 •

edited

Loading

sphuber commented Jun 7, 2021 •

edited

Loading

zhubonan commented Jul 7, 2021 •

edited

Loading