
fix regression in time-like check when decoding masked data #8277

Merged
merged 4 commits on Oct 18, 2023

Conversation

kmuehlbauer
Contributor

@kmuehlbauer kmuehlbauer commented Oct 6, 2023

@kmuehlbauer
Contributor Author

This fix handles timedelta units by checking for equality with any of the unit strings (e.g. days). For datetime units, the presence of since is checked in addition to the presence of any of the unit strings (e.g. nanoseconds).

There might be other unit strings possible, like days since system setup, which will still be detected as time-like. To catch these cases as well, we would need to parse the whole units string into its three parts ("UNIT since DATE").

@spencerkclark
Member

Thanks for tracking this down @kmuehlbauer -- looks good. Could you maybe just add a simple test?

There might be other unit strings possible, like days since system setup, which will still be detected as time-like. To catch these cases as well, we would need to parse the whole units string into its three parts ("UNIT since DATE").

Right, yeah, I think this is something we have been living with for a while within our datetime decoding logic:

```python
if isinstance(units, str) and "since" in units:
```

In that case (regardless of what happens during masking) an error will eventually be raised related to the fact that what comes after "since" is not a valid date, which I think is OK.

@kmuehlbauer
Contributor Author

Thanks @spencerkclark, I'll add a test along the lines of the example in #8269 and check that variables with non time-like units are not treated as time-like.

@spencerkclark
Member

spencerkclark commented Oct 6, 2023

Sounds good, thanks!

In that case (regardless of what happens during masking) an error will eventually be raised related to the fact that what comes after "since" is not a valid date, which I think is OK.

Hmm as I think about this more, I guess we might need to worry about the case when one uses decode_times=False when opening a file. In that case what happens during masking could matter, so maybe we do need some more careful checking there after all. I think running the units through coding.times._unpack_netcdf_time_units in the case that "since" is present and ensuring that no error is raised might be sufficient.

@kmuehlbauer
Contributor Author

I think running the units through coding.times._unpack_netcdf_time_units in the case that "since" is present and ensuring that no error is raised might be sufficient.

Yes, that makes sense. So far I've avoided importing from coding.times into coding.variables. We might have to move that functionality over to coding.variables to avoid circular dependencies. Or is there another solution to this?

@spencerkclark
Member

Maybe we actually take things a step further and not apply this special time-masking behavior at all in the case that decode_times or decode_timedelta are False. Similar to the use_cftime parameter, we could add those parameters to the constructor for the CFMaskCoder and use them during decoding.

What are your thoughts on that? I guess I'm wondering whether this special masking behavior (replacing integer missing values with the minimum integer) should apply if we are not ultimately decoding values as times.
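The suggestion above could look roughly like this. The constructor signature and method are hypothetical, illustrating the idea of threading `decode_times`/`decode_timedelta` through the mask coder, not xarray's actual `CFMaskCoder` API:

```python
# Hypothetical sketch: let CFMaskCoder know whether times/timedeltas
# will actually be decoded, so it can skip the special time masking
# (integer-minimum fill) when they won't be.

class CFMaskCoder:
    def __init__(self, decode_times=True, decode_timedelta=True):
        self.decode_times = decode_times
        self.decode_timedelta = decode_timedelta

    def _apply_time_masking(self, units):
        # Only treat the variable as time-like if we will actually
        # decode it as a datetime or timedelta.
        if "since" in units:
            return self.decode_times
        return self.decode_timedelta
```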

@kmuehlbauer
Contributor Author

I've wrestled with that many times without arriving at a robust and clean solution. Sometimes we can't cleanly separate the different coding steps.

Consider the following example. If we do not keep time-like values at int64 resolution in CFMaskCoder, it's not possible to do something like this (with packed data):

```python
ds = xr.open_dataset(filename, decode_times=False)
ds = ds.pipe(fix_time_units)
ds = xr.decode_cf(ds, decode_times=True)
```

without losing nanosecond resolution (because of the conversion to floating point).

Yes, we could instead advise users to drop CF decoding entirely in those cases, rather than just dropping time decoding.

@spencerkclark
Member

Good point, I guess it really is a coupled problem...one way or the other you will need to completely drop CF decoding to fix something (either remove the time-like units to prevent special decoding of time-like missing values, or fix the time units without losing nanosecond precision).

On balance as a user I might be more surprised to see time-like units playing a role in how fields were masked in the case that decode_times=False or decode_timedelta=False than if nanosecond precision was lost after fixing the time units. I'm open to opinions from you and others though -- maybe @dcherian also has thoughts?

@spencerkclark
Member

Maybe we can punt on the question of whether to apply the special time masking when decode_times or decode_timedelta are False, since answering it is not required to address #8269? It would be good to get at least this fix in.

We might have to move that functionality over to coding.variables to avoid circular dependencies. Or is there another solution to this?

Ah gotcha, for this, one alternative is to import coding.times._unpack_netcdf_time_units within the _is_time_like function.

@kmuehlbauer
Contributor Author

Thanks Spencer! I'm currently traveling and can pick up work next week. Feel free to push this forward.

Member

@spencerkclark spencerkclark left a comment


Thanks @kmuehlbauer!

Review comments on xarray/tests/test_conventions.py were resolved.
@dcherian dcherian merged commit 8f3a302 into pydata:main Oct 18, 2023
28 checks passed

Successfully merging this pull request may close these issues.

open_dataset with engine='zarr' changed from '2023.8.0' to '2023.9.0'