Ensure that `ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs #5801
4ae09fc
to
7a730b4
Compare
@zhubonan you might be interested in this. Would be good to have your feedback if you are still using AiiDA. |
Thanks, this looks great to me! This is a functionality that I have been waiting for for a long time. I can see the get_builder_restart method is removed - I guess this is moved to the plumpy upstream? Just one comment - in the rare case that the user supplies some inputs that are not JSON serializable, would it except immediately? As for storing duplicated data, I think it is not a problem because before this the only way to make |
No, it is still there, it is defined on the
Yes, if the port is also |
Thanks, I see!
Yes, it would be useful to emit a warning, I think. |
2f6277e
to
571499b
Compare
@ltalirz maybe you are interested in reviewing this? This is the first step required to a PoC I am working on that allows to rerun any workflow run in AiiDA and perfectly reproduce it. The second step is to support Docker containers (which I also have running) and final step is a function that, given a completed workflow node, can recreate the original builder exactly and relaunch it, automatically creating all the external codes necessary and rerunning in docker containers. As you have said before, this would be a very powerful demonstrator, and after this we are almost there. |
Thanks a lot @sphuber, very nice to see that we are now on the road towards this project!
I guess one comment I would have is that, after this change, the AiiDA API is effectively lying to the user, right? The user explicitly requests an input to be `non_db`, but AiiDA simply ignores this and stores the information in the database nevertheless.
If there was a reason for the user not to store this information in the database (e.g. sensitive / size / ...), this could be problematic.
It might be worth discussing this at the AiiDA meeting, i.e. including both the original rationale behind making some fields `non_db`, and what a future name for this field could be.
aiida/engine/processes/process.py (Outdated)
    result[key] = non_db_value

    return result
You could also change `return result` to `return result or None` here, unless you want the distinction between `{}` and `None`.
This won't work, since `None` is technically a value and so I could get a return value:
    {
        'sub': {
            'namespace': {
                'a': None
            }
        }
    }
And now the `prune` call to prune empty mappings will no longer work, since technically the `sub.namespace.a` namespace is no longer empty.
Scratch that, this case is handled because `None` values are automatically handled in the same recursive loop and ignored. So I think your suggestion also works, but effectively there is no change in behavior.
The original rationale of |
Ok, I see. I just browsed the changes quickly and it looked like the storing of this information was new, but I guess this applies only to the From your explanation, I guess the meaning is closer to |
Indeed, for the That being said, if a user defined a custom process with an input port designated as
Kind of yes, because they are not all stored in the same way. For example, the So although it may be possible in principle to generalize all this, this would require quite some discussions, data migrations and other laborious stuff that may not be worth it.
Absolutely, that is a given. |
248f9c5
to
fd31f2b
Compare
@ltalirz as promised I discussed with the team and we agree with you that it is best to not change the behavior of The only question remaining is that I haven't documented the new |
fd31f2b
to
38c9bbe
Compare
Thanks @sphuber !
That seems like a good solution to me.
I understand that `non_db` is not deprecated for the moment and the idea is to have both `non_db` and `metadata` supported for the time being? (fine with me)
aiida/engine/processes/ports.py (Outdated)
        return self._explicitly_set

    @property
    def metadata(self) -> bool:
The name was discussed, and the preference was for `metadata` over `is_metadata`, correct?
We (@chrisjsewell and @giovannipizzi) agreed on `metadata` but I have to say we didn't consider a lot of other options. `is_metadata` wasn't brought up, for example. Happy to think about the naming if others think this is also a better option.
Brought it up during today's meeting and we agree that `is_metadata` is a better name. Have updated the code and commits.
# Store JSON-serializable values of ``metadata`` ports in the node's attributes. Note that instead of passing in
# the ``metadata`` inputs directly, the entire namespace of raw inputs is passed. The reason is that although
# currently in ``aiida-core`` all input ports that set ``metadata=True`` in the port specification are located
# within the ``metadata`` port namespace, this may not always be the case. The ``_filter_serializable_metadata``
you might want to mention how you plan to use this rather than just that it could be used.
(or just remember to update this comment when you add your next PR)
This essentially pertains to my last comment regarding the last remaining open question. Should we allow other plugins to add `metadata` ports and have them store data in the process node's attributes? If so, then this is necessary. If we don't, then we may still want to keep this although it can be done differently. I am not sure what the answer to the question is though.
The `_recursive_merge` method could raise a `KeyError` for a process builder that contains a dynamic port namespace. In this case it would be possible to merge in a value that contains a nested dictionary, which would try to access `dictionary[key]` directly, where `dictionary` is the namespace of the builder, and so hit a `KeyError` since this nested namespace didn't exist yet in the dynamic namespace. The solution is to explicitly check if the `key` already exists in the builder's namespace, and if not, to assign the entire `value` to it, as there is nothing to recurse into.
The method is a generic utility to operate on mappings and is not specific to the `ProcessBuilder`. It will soon be needed elsewhere, so it is moved to a standalone function in a new utility module `aiida.engine.processes.utils`.
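The fix described in this commit message can be illustrated with a minimal sketch on plain dictionaries. This is not the code from the PR: the function name `recursive_merge` and the example inputs are assumptions made for illustration, and the real utility operates on the builder's port namespaces.

```python
from collections.abc import Mapping, MutableMapping


def recursive_merge(target, update):
    """Merge ``update`` into ``target`` in place (illustrative sketch only)."""
    for key, value in update.items():
        if key in target and isinstance(target[key], MutableMapping) and isinstance(value, Mapping):
            # The nested namespace already exists, so recurse into it.
            recursive_merge(target[key], value)
        else:
            # The key does not exist yet (e.g. in a dynamic namespace) or is not
            # a mapping: assign the entire value, there is nothing to recurse into.
            target[key] = value


# Hypothetical builder namespace without the ``dynamic`` sub-namespace.
builder_inputs = {'metadata': {'label': 'relax'}}
recursive_merge(builder_inputs, {
    'metadata': {'description': 'restart'},
    'dynamic': {'nested': {'x': 1}},
})
```

Without the `key in target` check, the access `target['dynamic']` would raise the `KeyError` described above.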
8c18f86
to
59ebc13
Compare
In the engine redesign from `v0.x` to `v1.0` the process interface needed a way to distinguish inputs that are linked to the process node as nodes themselves from those that are stored directly on the process node, for example as an attribute. To this end the `non_db` keyword was added to the `InputPort`. This keyword was used for the `metadata` input namespace, to designate that these inputs would not be `Data` instances. The name is quite unfortunate, however, since the inputs of these ports actually do get stored in the database, contrary to what the keyword suggests.
The most straightforward solution would be to rename the keyword. However, for better or worse, the keyword has been adopted by (a limited number of) plugin packages. A known application is to pass in `Group` instances, which are not storable as nodes but can be passed through a `non_db` port. Hypothetically, users may also have used the port to pass in sensitive data that should never be stored. Renaming the port or changing its behavior is therefore likely to break existing code.
Instead, the `is_metadata` keyword is added to the `InputPort` and `PortNamespace` through the `WithMetadata` mixin. When set to `True` this keyword serves the original intention of the `non_db` flag: an `is_metadata` port signals that its value will be stored in the database, but directly on the `ProcessNode` instead of being linked as a `Data` node. The naming makes sense, because the only use of this keyword is by the `metadata` input namespace that all `Process` classes have. The inputs in this namespace are stored, through custom logic in the `Process` and `CalcJob` class, in the attributes of the node or dedicated columns of the node database model, such as the label and the description. This addition leaves the behavior of `non_db` inputs unchanged, so plugin packages that use this keyword should continue to function as before.
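As a rough illustration of how a mixin can add the `is_metadata` flag next to `non_db`, consider the sketch below. Only the names `WithMetadata`, `InputPort`, `is_metadata` and `non_db` come from the commit message; the `Port` base class and the constructor signatures are hypothetical stand-ins and do not reproduce the actual `aiida-core` or `plumpy` classes.

```python
class Port:
    """Hypothetical minimal port, standing in for the real port classes."""

    def __init__(self, name, non_db=False):
        self.name = name
        self.non_db = non_db


class WithMetadata:
    """Sketch of a mixin that adds an ``is_metadata`` flag to a port."""

    def __init__(self, *args, **kwargs):
        # Pop the keyword before delegating, so the base class never sees it.
        self._is_metadata = kwargs.pop('is_metadata', False)
        super().__init__(*args, **kwargs)

    @property
    def is_metadata(self) -> bool:
        """Return whether the value of this port is stored directly on the process node."""
        return self._is_metadata


class InputPort(WithMetadata, Port):
    """Input port that supports both ``non_db`` and ``is_metadata``."""


label_port = InputPort('label', is_metadata=True)
group_port = InputPort('group', non_db=True)
```

In this sketch the two flags are independent, which mirrors the stated goal of leaving the behavior of existing `non_db` ports untouched.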
The promise of AiiDA's provenance is that all inputs to a `Process` are stored as nodes in the provenance graph, linked to a node representing the process. From the very beginning, however, there needed to be exceptions for inputs to processes that were not nodes. Notable examples were the various "options" set for calculation jobs. Even in AiiDA v0.x these "settings", as they were called back then, were stored in the attributes of the node. In the AiiDA v1.0 redesign, where all inputs of a process are defined through the process spec, these non-node inputs were implemented by allowing input ports to be made non-database storable, indicated by the `non_db` argument in the port declaration. These inputs would be passed to the `Process` instance and would be available during its lifetime, but would not be stored in the database. Once again there are exceptions, as certain inputs defined by `aiida-core` are stored on the node, but in various places. Notable examples are the `label` and `description` of the process, and the `metadata.options` of the `CalcJob` class.
A direct result of this historical decision is that it is difficult, if not impossible in certain cases, to reconstruct the exact input dictionary that was used to run a `Process` from the data stored on the `ProcessNode`. From a provenance point of view, this is a huge weak point and is what is being corrected here.
The input ports marked `non_db=True` on the base process classes provided by `aiida-core`, `Process` and `CalcJob`, were changed to use the new `is_metadata` keyword instead in the previous commit, to remove the inconsistency between the naming and behavior. In this commit, all inputs that correspond to `is_metadata` ports and are JSON-serializable are stored in the attributes of the process node under the key `metadata_inputs`. All `is_metadata` input ports that are defined on process base classes by `aiida-core`, such as `Process` and `CalcJob`, *are* JSON-serializable. However, plugin packages can implement process classes with ports that accept inputs that are not JSON-serializable, which is why this additional condition has to be added. But all inputs defined by `aiida-core` should be covered.
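The JSON-serializability condition can be sketched as follows. This is a simplified illustration and not the implementation from the PR: the function name `filter_serializable` is an assumption, and unlike the real logic it does not consult the port specification to check which ports are `is_metadata`; it only filters a nested dictionary of values.

```python
import json
from collections.abc import Mapping


def filter_serializable(inputs):
    """Return a nested dict with only the JSON-serializable values of ``inputs``."""
    result = {}
    for key, value in inputs.items():
        if isinstance(value, Mapping):
            nested = filter_serializable(value)
            if nested:
                # Empty namespaces are dropped rather than stored.
                result[key] = nested
            continue
        if value is None:
            # ``None`` values are ignored.
            continue
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            # Not JSON-serializable, e.g. a custom object passed by a plugin.
            continue
        result[key] = value
    return result


# Hypothetical raw metadata inputs of a process.
raw_inputs = {
    'metadata': {
        'label': 'relax',
        'description': None,
        'options': {'resources': {'num_machines': 1}},
        'custom': object(),
    }
}
metadata_inputs = filter_serializable(raw_inputs)
```

Here the `custom` and `description` entries are left out, and the remainder is what would end up under the `metadata_inputs` attribute in this sketch.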
The `get_builder_restart` method on the `ProcessNode` base class would return a `ProcessBuilder` with the inputs set to those attached to that node instance. The `CalcJobNode` would override this to add the metadata options as well. This would be ok for `CalcJobNode`s, but if a restart builder was created from a `WorkChainNode` that calls a `CalcJobNode`, the workchain would also have received options for that calculation job, and those would not be restored. One could think that when calling `get_builder_restart` on a `WorkChainNode` we can just go down the call stack, find all the `CalcJobNode`s and set the options in the respective input namespaces. But this would not exactly reproduce the original inputs, as the options that a calculation job has received could have been a modified version of the original options passed to the workchain, changed in the logic of the workchain. Instead, now that the exact metadata inputs for each process are stored in the attributes of the node, added in the previous commit, it is this dictionary that is used to restore the exact original inputs. It not only addresses the problem of incorrect `CalcJob` options, but it also restores any other metadata inputs such as the `label` and `description`.
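The difference between the two approaches can be shown with plain dictionaries. Everything in this sketch is hypothetical: the attribute layouts, the `calc` namespace name, the option values and the helper `restored_metadata` are assumptions chosen for illustration; only the `metadata_inputs` key comes from the commit messages.

```python
import copy

# Hypothetical attributes of a workchain node: the options as the user passed them.
workchain_attributes = {
    'metadata_inputs': {
        'metadata': {'label': 'relax workchain'},
        'calc': {'metadata': {'options': {'max_wallclock_seconds': 3600}}},
    }
}

# Hypothetical attributes of the called calculation job: the workchain logic
# modified the options before launching it.
calcjob_attributes = {
    'metadata_inputs': {
        'metadata': {'options': {'max_wallclock_seconds': 7200}},
    }
}


def restored_metadata(attributes):
    """Return a copy of the stored metadata inputs, or an empty dict if absent."""
    return copy.deepcopy(attributes.get('metadata_inputs', {}))


# Reading from the workchain node itself recovers the original options ...
from_workchain = restored_metadata(workchain_attributes)['calc']['metadata']['options']
# ... whereas going down the call stack would recover the modified ones.
from_call_stack = restored_metadata(calcjob_attributes)['metadata']['options']
```

In this example `from_workchain` holds 3600 seconds while `from_call_stack` holds 7200, which is why the dictionary stored on each node, rather than the call stack, is the source to restore from.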
59ebc13
to
d46d77b
Compare
thanks @sphuber ! no further comments from my side
`ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs (previously: "including `non_db` ones")
Fixes #4089
Note that this requires a new version of `plumpy` to fix a critical bug. The updated version of `plumpy` has been released and the requirements have been updated.
This essentially makes it possible to call `get_builder_restart` for any completed `ProcessNode` and actually get a builder that perfectly matches the inputs that were used for the creation of the node. Up till now this was not possible, especially not for `WorkChain`s. This is the first step to making it possible and easy to, given a provenance graph, rerun any part of it to reproduce it. The one and only downside of this is that part of the input information is duplicated in the database. I feel that this duplication is worth the gain though.