
fix regression in time-like check when decoding masked data #8277

Merged
merged 4 commits on Oct 18, 2023

Conversation

kmuehlbauer
Contributor

@kmuehlbauer kmuehlbauer commented Oct 6, 2023

@kmuehlbauer
Contributor Author

This fix handles timedelta units by checking for equality with any of the unit strings (e.g. days). For datetime units, the presence of since is checked in addition to the presence of any of the unit strings (e.g. nanoseconds).

There might be other unit strings possible, like days since system setup, which will still be detected as time-like. To catch these cases as well, we would need to parse the whole units string into its three parts ("UNIT since DATE").

@spencerkclark
Member

Thanks for tracking this down @kmuehlbauer -- looks good. Could you maybe just add a simple test?

There might be other unit strings possible, like days since system setup, which will still be detected as time-like. To catch these cases as well, we would need to parse the whole units string into its three parts ("UNIT since DATE").

Right, yeah, I think this is something we have been living with for a while within our datetime decoding logic:

```python
if isinstance(units, str) and "since" in units:
```

In that case (regardless of what happens during masking) an error will eventually be raised related to the fact that what comes after "since" is not a valid date, which I think is OK.

@kmuehlbauer
Contributor Author

Thanks @spencerkclark, I'll add a test along the lines of the example in #8269 and check that variables with non time-like units are not treated as time-like.

@spencerkclark
Member

spencerkclark commented Oct 6, 2023

Sounds good, thanks!

In that case (regardless of what happens during masking) an error will eventually be raised related to the fact that what comes after "since" is not a valid date, which I think is OK.

Hmm as I think about this more, I guess we might need to worry about the case when one uses decode_times=False when opening a file. In that case what happens during masking could matter, so maybe we do need some more careful checking there after all. I think running the units through coding.times._unpack_netcdf_time_units in the case that "since" is present and ensuring that no error is raised might be sufficient.

@kmuehlbauer
Contributor Author

I think running the units through coding.times._unpack_netcdf_time_units in the case that "since" is present and ensuring that no error is raised might be sufficient.

Yes, that makes sense. So far I've avoided importing from coding.times into coding.variables. We might have to move that functionality over to coding.variables to avoid circular dependencies. Or is there another solution to this?

@spencerkclark
Member

Maybe we actually take things a step further and not apply this special time-masking behavior at all in the case that decode_times or decode_timedelta are False. Similar to the use_cftime parameter, we could add those parameters to the constructor for the CFMaskCoder and use them during decoding.

What are your thoughts on that? I guess I'm wondering whether this special masking behavior (replacing integer missing values with the minimum integer) should apply if we are not ultimately decoding values as times.
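The suggestion above could look roughly like this. The constructor signature and method are hypothetical, illustrating the idea of threading `decode_times`/`decode_timedelta` through the mask coder, not xarray's actual `CFMaskCoder` API:

```python
# Hypothetical sketch: let CFMaskCoder know whether times/timedeltas
# will actually be decoded, so it can skip the special time masking
# (integer-minimum fill) when they won't be.

class CFMaskCoder:
    def __init__(self, decode_times=True, decode_timedelta=True):
        self.decode_times = decode_times
        self.decode_timedelta = decode_timedelta

    def _apply_time_masking(self, units):
        # Only treat the variable as time-like if we will actually
        # decode it as a datetime or timedelta.
        if "since" in units:
            return self.decode_times
        return self.decode_timedelta
```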

@kmuehlbauer
Contributor Author

I've wrestled with that many times without arriving at a robust and clean solution. Sometimes we can't cleanly separate the different coding steps.

Consider the following example. If we do not keep time-like values at int64 resolution in CFMaskCoder, it's not possible to do something like this (with packed data):

```python
ds = xr.open_dataset(filename, decode_times=False)
ds = ds.pipe(fix_time_units)
ds = xr.decode_cf(ds, decode_times=True)
```

without losing nanosecond resolution (because of the conversion to floating point).

Yes, we could instead advise users to drop CF decoding entirely in those cases, rather than just dropping time decoding.

@spencerkclark
Member

Good point, I guess it really is a coupled problem...one way or the other you will need to completely drop CF decoding to fix something (either remove the time-like units to prevent special decoding of time-like missing values, or fix the time units without losing nanosecond precision).

On balance as a user I might be more surprised to see time-like units playing a role in how fields were masked in the case that decode_times=False or decode_timedelta=False than if nanosecond precision was lost after fixing the time units. I'm open to opinions from you and others though -- maybe @dcherian also has thoughts?

@spencerkclark
Member

Maybe we can punt on the question of whether to apply the special time masking when decode_times or decode_timedelta are False, since answering it is not required to address #8269? It would be good to get at least this fix in.

We might have to move that functionality over to coding.variables to avoid circular dependencies. Or is there another solution to this?

Ah gotcha, for this, one alternative is to import coding.times._unpack_netcdf_time_units within the _is_time_like function.

@kmuehlbauer
Contributor Author

Thanks Spencer! I'm currently traveling and can pick up work next week. Feel free to push this forward.

Member

@spencerkclark spencerkclark left a comment


Thanks @kmuehlbauer!

Review comments on xarray/tests/test_conventions.py were resolved.
@dcherian dcherian merged commit 8f3a302 into pydata:main Oct 18, 2023
28 checks passed

Successfully merging this pull request may close these issues.

open_dataset with engine='zarr' changed from '2023.8.0' to '2023.9.0'