Performing metadata equality via different methods provides different results in Iris v3.0.1 #4506
@ehogan Under the hood, for metadata […]. The fact that you can put scalar […]. This is why […]. Perhaps it might be more helpful to the user if the […].

You might reasonably ask why oh why do we care about such a difference? ... one concrete example is cube arithmetic, when the metadata for two operand cubes are combined into one resultant cube. Given your example above, which version of the attribute should the resultant cube adopt? Note that I can't imagine that we'll subclass a […].

At the very least this behaviour should be highlighted and clarified... but do note that you are highlighting a special case here: a numpy array stored within the attributes dict:

```python
>>> import numpy as np
>>> a = dict(values=np.arange(3))
>>> b = dict(values=[0, 1, 2])
>>> a == b
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```

Also, note that metadata […]
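A minimal sketch of a comparison that treats an array and a list holding the same values as equal (plain `numpy` only; this is illustrative, not Iris's internal implementation):

```python
import numpy as np

def attrs_equal(a, b):
    """Compare attribute dicts, treating an array and a list with the
    same values as equal (a sketch, not Iris's internal logic)."""
    if a.keys() != b.keys():
        return False
    # np.array_equal coerces both sides, compares element-wise, and
    # reduces the result to a single bool.
    return all(np.array_equal(a[k], b[k]) for k in a)

a = dict(values=np.arange(3))
b = dict(values=[0, 1, 2])
print(attrs_equal(a, b))  # True, even though the value types differ
```

Here `np.array_equal` performs the element-wise comparison and collapses it to a single bool, which is exactly the step that the bare `a == b` above trips over.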
And here is a view from someone who all too frequently (to say the least...) has to deal with all sorts of metadata (and data) imperfections. Model data that is supposed to be one homogeneous set, but was produced in different batches, on different machines, quality controlled by different people, using different [versions of] software tools, and goodness knows what else lies behind the imperfections. In the best of worlds such problems would be non-existent, or the data simply rejected/fixed by the producers. But I am afraid that I do not live in that world. Last year we processed something like 10,000-15,000 files from a large online database, and something like 60% had metadata (or data) imperfections. Most of these were easy to spot and could be fixed with a few steps of preprocessing. That left something like 500 files making up about 30-50 supposedly homogeneous dataset units. Homogeneous in the sense that […]

So, what conclusions to draw from this with any bearing on iris? I think that steps in this direction will be immensely useful to anyone and everyone dealing with data from various sources.
@larsbarring have you looked at cube_helper? I've not used it myself but I think it was written for CMIP ensembles. I think we're all agreed that Iris's own […]
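If the goal is to level differing attributes across a set of cubes before merging, Iris ships `iris.util.equalise_attributes` (in `iris.util` as of Iris 3.0; the cubes and attribute values below are made-up stand-ins):

```python
import numpy as np
import iris.cube
from iris.util import equalise_attributes

# Two cubes that disagree in one attribute (hypothetical values).
cube1 = iris.cube.Cube(np.zeros(1), attributes={"source": "model-A"})
cube2 = iris.cube.Cube(np.zeros(1), attributes={"source": "model-B"})

# Remove any attribute that is not identical across all the cubes,
# so they can subsequently be merged/concatenated.
equalise_attributes([cube1, cube2])
print(cube1.attributes, cube2.attributes)  # both now empty
```

Any attribute that differs between the cubes is removed from each of them, which side-steps the equality question entirely at the cost of discarding the conflicting metadata.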
Thanks @bjlittle :) I'm happy with the discrimination between […]. Could the output be: […] or similar? Also, would it be worth adding to the documentation that […]?
In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity. If this issue is still important to you, then please comment on this issue and the stale label will be removed. Otherwise this issue will be automatically closed in 28 days' time.
This issue is still important to me 😊
🐛 Bug Report

I have two cubes. When I use `cube1.metadata.difference(cube2.metadata)`, my metadata shows differences, even though the values look the same. However, when I use `assert cube1.attributes == cube2.attributes`, I get no `AssertionError`.

I discovered this issue because:

- Using `scitools/default-previous`, I can run `assert cube1.metadata == cube2.metadata` and I get no `AssertionError`.
- Using `scitools/default-current`, I get an `AssertionError` when I run `assert cube1.metadata == cube2.metadata`.
.How To Reproduce
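The original reproduction steps did not survive in this copy of the issue. A minimal sketch consistent with the report (the attribute name `values` and the cube contents are illustrative assumptions):

```python
import numpy as np
import iris.cube

# Two cubes whose single attribute *looks* the same in both, but is an
# ndarray in one and a plain list in the other.
cube1 = iris.cube.Cube(np.zeros(1), attributes={"values": np.arange(3)})
cube2 = iris.cube.Cube(np.zeros(1), attributes={"values": [0, 1, 2]})

# In the reporter's Iris 3.0.1 this reports a difference...
print(cube1.metadata.difference(cube2.metadata))

# ...while comparing the attributes directly raises no AssertionError.
assert cube1.attributes == cube2.attributes
```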
Expected behaviour

I'm not sure what would be best? Either:

- `cube1.metadata.difference(cube2.metadata)` shows the types in the difference, so it's clear where the difference is (because the values look the same and this took longer than I care to admit to figure out) 😝
- Or `assert cube1.attributes == cube2.attributes` fails (a sketch of such a strict comparison follows below).
- Or both?
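For the second option, a strict comparison that also fails on container type is straightforward to sketch (a hypothetical helper, not an existing Iris API):

```python
import numpy as np

def attrs_strict_equal(a, b):
    """True only if keys, value types and values all match (hypothetical)."""
    if a.keys() != b.keys():
        return False
    return all(
        type(a[k]) is type(b[k]) and np.array_equal(a[k], b[k])
        for k in a
    )

assert not attrs_strict_equal(
    {"values": np.arange(3)},  # ndarray
    {"values": [0, 1, 2]},     # list: same values, different type
)
```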