-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distinct instances of the same node do not compare equal #1917
Comments
@sphuber You mention here that there may be reasons for not implementing the behaviour asked for in this issue. |
Mmm not sure. However, I cannot think now to a real problem why implementing comparison by PK (or even better by UUID - in case in the future we support loading at the same time multiple profiles, and different nodes might have the same PK). I suggest to implement it unless we realise there's a problem. |
I think this would be mostly very difficult to implement correctly and be sure it always behaves correctly.
Since the UUID is what we consider to uniquely identify a node in our data model, I think it is OK to base the equality operation on this, so I am also in favor implementing this.
The only thing I can think of is where someone in code may change the UUID of an unstored node to an existing one manually and then do a comparison, but I don't think this should pose a problem. This example is outlandish and storing the node would fail now that there is a uniqueness constraint on the UUID column. |
@ltalirz @giovannipizzi @sphuber One thing I just discovered with this, is that currently there are some data nodes that use a different criteria to establish the concept of equality. I don't know the full extent of this, but I think this originates from the This now poses a series of problems:
I'm unsure on how to proceed. I am inclined towards going for uuid in all cases and just biting the bullet on 1 and 4, but I wanted to hear what you guys think. Maybe we should discuss this in the next aiida meeting... |
Very good catch. This does indeed complicate things. I definitely agree that we should have one or the other and not a mix of behavior that depends on the type. However, this would lead us naturally to only ever allowing to do comparison based on UUIDs. This would mean we just need to deprecate the current implementations of |
Yeah, I have to say I'm quite opposed to changing the equality behavior of the numeric types. That can break existing code in quite complicated and hard-to-find cases. What's the harm in comparing by UUID just as a fallback, for the types that don't have a custom comparison function? After all, that is e.g. how the built-in Python types do it (compare by value, and otherwise by |
Well, the risk is that if it is not consistently implemented by (all) data plugin types, it may give unintuitive behavior. For example, the Of course we could then implement this for Anyway, not really sure which would be more frustrating: not having all data plugins behave the same, or not being able to compare base data types based on their value. |
I think most frustrating would be the behavior for the Yeah I agree inconsistency in which |
Hmm, fair enough. Maybe the fallback on UUID is fine then. Let's see what others think |
Fallback to UUID when specific equality method is missing sounds good to me. |
So it might be true that Python itself does this, but I'm not sure I would think of these kinds of things as python "fortes". I mean, I understand the complexities behind the behavior and perhaps this was the right choice for them, but I'm not super convinced this means it is the best for us. Like, I already thought that the For example, one thing to consider is that if we set this fallback behavior to uuid equality, if we then decide that another node type would benefit from having a way to easily evaluate content equality (checking if you already have a pseudo for example), the change is no longer a simple feature addition but a backwards compatibility breaking change. And I feel that this kind of breaking is specially annoying to the user when there is not a strong reason for the original criteria nor for the change (particularly since there is a high probability they will be "clashing" with these things, i.e. realizing these differences mostly through errors when trying to use them). I think that if we are going to decide there is more worth in the concept of "equality of content", we need to be consistent and have this be the expected behavior for all node types. Actually, now that I think of it, this might even make more sense, since "equality of uuid" is just evaluating |
I think you make a valid point. The UUID fallback would mean that it becomes difficult to interpret the meaning of Since raising an exception would also be a breaking change, I'm just wondering: don't we already have a better fallback in the hashing mechanism? Could a comparison of the node hash be used as a fallback instead, if a specific method is not implemented? |
I agree it's not a critical feature, but one nice thing this enables is using nodes in built-in data structures, e.g.
I have a feeling this could have some even more puzzling corner cases 😅 Off the top of my head:
So.. it's not quite as easy as it seems, and the design considerations going into the hashing don't translate well to this problem I think.
It definitely shouldn't raise an exception - the Thing To Do™ in such a situation would be |
I think going into the reasons for this might be useful:
Both operations are "default to In that light So you can thing of Here, we have three different equivalence relations (in order of strongest to weakest):
If two nodes fulfill (a), they will fulfill (b) and (c). If they fulfill (b) they will also fulfill (c), but not necessarily (a). Currently, we implement (c) and fall back to (a) where that's not available. The question here is if we want to fall back to (b) instead. |
Ok, yes, that last comment gave some things to think about. I'm still wary about the unreliability of where it will do (c) or (a/b), but I guess that if you present it like that ("use this when you want to discern repeated content, with the caveat that you might get repetitions when we don't know how to check that content") it could be ok. I still need to process it a bit. |
Very interesting thread, and thanks for all useful discussions. BTW, just to stay on the topic of what python does, just for fun: # a bit weird to me, but ok...
3 == 3.00 # True
# Very good
3 == 2.999 # False
# For floating-point lovers ;-)
3 == 2.9999999999999999999999999999999 # True
# Luckily this is not javascript!
3 == "3" # False
# numpy makes things very strange (but turns out to be very useful in practice)
numpy.array([1]) == [1] # array([ True]) So the list of things that can happen is quite varied. Getting back to serious stuff. In [17]: bool(numpy.array([1, 2]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-c43772ca109d> in <module>
----> 1 bool(numpy.array([1, 2]))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() Now, I see reasons for the various suggestions:
With this suggestion, I would still wait 2.0 (or even 2.1?) to implement then the equality of nodes based on UUID only (what I think is more intuitive). Comments welcome, also from workflow writers who might be using the overloaded classes a lot maybe? (But again I think there it's in the end still better/easier to know the type and use |
Yeah, I think I can get on board with dropping the numeric behavior of Since there's no "strong" reason to drop it, maybe it would also be good to get plugin developers' feedback. Some points I'd like to make:
|
I'd also like to add that as far as operator overloading goes, |
My biggest worry with all this is that we're introducing backwards-incompatible changes not for any particularly strong technical reason, but for reasons where reasonable minds (ourselves in the past, and maybe in the future) could come to the opposite conclusion. That's not to say that I completely oppose this change (as mentioned above), but I think it should be well-considered. Maybe the relationship our (built-in) data nodes have to their Python relatives needs to be noted down in an AEP. Here we're discussing removing behavior that is present in the Python built-ins from the data node, but in #4351 we're actually going the opposite way. I'd also warn against the mentality that "we're anyway doing 2.0, let's just change that also" [1]. In my opinion, we should always strive to be as backwards-compatible as possible. Sometimes that isn't possible without compromising the functionality, so then semantic versioning keeps track of that. But each backwards-incompatible change should by itself have a strong enough reason for existing, irrespective of version numbers. [1] Sorry for the straw-man argument @giovannipizzi - I know that's not what you said above. |
Hi, no worries @greschd - good to hear all opinions! In the end, if we get a reasonable number of people agree that the == behaviour defined by the plugins and falling back to UUIDs is intuitive enough for most people, I'm ok doing it. E.g.: a = Int(1).store()
b = Int(1).store()
a == b # What to you expect?
a = Dict({}).store()
b = Dict({}).store()
a == b # What to you expect? and so on, including cases where the equality would be true at the moment and cases where it wouldn't. And of course, this needs to be well documented! |
Since this came up again in #4736, I think we agree that nodes with the same UUID should compare equal. We could implement that, and punt the question of whether we should also compare by content (by keeping the current behavior). Or does someone want to take up doing that user study, and fix it "properly"? |
Sounds like a good way to make progress. Fully support this. |
In this approach, what should be the behaviour for Int nodes, for instance?
I need to re-read the whole thread, but maybe the easiest is to say that Anyway, I still think that it would be great if a volunteer ;-) could come up with a set of examples that we want to then check with users, to get a sense of what people intuitively expect. If we prepare the list here (both which examples to give, and which answers to get: I don't know, e.g. "I expect True, and would be very confused if I get False", "I expect True, but would understand if I get False", "I expect False, and would be very confused if I get True", "I expect False, but would understand if I get True", "I don't know what to expect", ...), we can then just convert it into a Google Forms and send it to the mailing list (maybe after testing it first among developers to see if the questions are well posed). |
I'm proposing a simple change to the current behavior: As a last fallback, compare by UUID instead of just returning I feel the UUID issue is important enough to fix soon (as evidenced by multiple issue cropping up over time), that we can leave the can of worms unopened right now. Probably we should then open a separate issue where we condense the options discussed in this thread. |
Currently there is an inconsistency in how the base data type node instances compare equality. All base types compare based on the content of the node, whereas `Dict` instances rely on the UUID fallback introduced in aiidateam#4753. After a long discussion started by aiidateam#1917, it was finally decided that the best way forward is to make the equality comparison consitent among the base types (see aiidateam#5187). Here we adapt the `__eq__` method of the `Dict` class to compare equality by content instead of relying on the fallback comparison of the UUIDs.
Currently there is an inconsistency in how the base data type node instances compare equality. All base types compare based on the content of the node, whereas `Dict` instances rely on the UUID fallback introduced in #4753. After a long discussion started by #1917, it was finally decided that the best way forward is to make the equality comparison consitent among the base types (see #5187). Here we adapt the `__eq__` method of the `Dict` class to compare equality by content instead of relying on the fallback comparison of the UUIDs.
Example input:
Output:
Tested with types
UpfData
,WorkCalculation
,PwCalculation
.EDIT: also tested with more "basic" types
aiida.orm.data.str.Str
andaiida.orm.data.int.Int
. For these, some more discussion is possibly required: should the PK/UUID be compared, or the actual data themselves?The text was updated successfully, but these errors were encountered: