[Task]: Investigate xcdat handling floating point data types and missing data #275

pochedls · 2022-05-27T23:35:27Z

pochedls
May 27, 2022
Collaborator

Describe the task

Much of the geospatial data that we use has missing / masked values. This issue is intended to make sure that missing data is properly handled by xcdat. In particular, we need to make sure that xcdat handles missing data correctly for:

spatial averages
temporal averages
regridding

Notes on weighted averaging

Spatial and temporal averages are weighted averages. Spatial averages are typically weighted by the area in each grid cell and temporal averages are weighted by the length of time for each time interval. In general, a weighted average is just:

WA(x) = Σ(w(x) * v(x)) / Σ(w(x))

where WA is the weighted average at time/location, x, for a given weight, w, and value, v.

If a value is missing, its weight should be set to zero. For example, if I have arrays v = [99, 80, 77, 92, 87] and w = [10, 10, 10, 10, 30], I will get WA = 87.0 (think a weighted grade average with homeworks worth 10 points and a quiz worth 30).

Now suppose the teacher says that I can miss one homework, then I have arrays v = [99, 80, np.nan, 92, 87] and w = [10, 10, 10, 10, 30], I will get WA = 76.0 [or nan if I don't use np.nansum]. This is impossible, because I don't have any grades below 87 – so I can't get an average below 87. This is because the quiz that has a nan value is still being weighted. I need to zero out that weight (w = [10, 10, np.nan, 10, 30]), yielding WA = 88.67.

The take home message is that we need to ensure that values that are missing / masked are zero-ed out for spatial and temporal averaging.

The current spatial and temporal averages utilize the xarray .weighted().mean() API, which generally appear to handle missing data appropriately, though special attention may be needed for groupby averaging operations (used in temporal averaging). One question is whether a group of values that include a missing value (e.g., May, June, July temperature) should return a NaN or weighted average of the available data.

These notes are subsetted from a conversation with @tomvothecoder

taylor13 · 2022-06-03T23:25:29Z

taylor13
Jun 3, 2022

Regarding the question of a single missing month in a seasonal average, it may not make much practical difference. But when computing an annual mean, if all the winter or all the summer months were missing, then you could get a quite biased result (for things like temperature, at least).

For the annual mean calculation in CDAT we came up with two criteria (set by the user) that would determine whether a mean was recorded or the value was set to missing: 1) the threshold fraction of samples required to compute the mean, and 2) how far the "centroid" of weights computed from available samples differed from the centroid of weights computed assuming no missing samples. If you consider monthly data as numbers on a clock, then for no missing data the centroid lies at the axis of the clock hands. Similarly, if data were only available for 4 months but equally distributed (say, January, April, July, and October), the centroid would still be at the center. But if the 4 months all occurred in one half of the year, then the centroid would be offset. When computing an annual mean, the user would specify what minimum number of the months was required and how close the centroid should be to the center of the clock.

I would be happy to discuss further.

0 replies

taylor13 · 2022-06-03T23:33:03Z

taylor13
Jun 3, 2022

Another suggestion: Sometimes carrying both the "unmasked" weights and a second array of "masking" factors can be useful when constructing algorithms. When computing means, you would then calculate the sum-over-samples(wts x msk x data) and divide by sum-over-samples(wts x msk). In general "msk" would be a fraction (set to 0 for missing values). When regridding conservatively, the "wts" would be set to the area of each grid cell and the "msk" would indicate the fraction of each grid cell that was unmasked. The output of the regridder would give the area of each target grid cell and the unmasked fraction of each target cell, along with the regridded field itself.

0 replies

tomvothecoder · 2022-07-06T16:05:05Z

tomvothecoder
Jul 6, 2022
Maintainer

I am going to convert this issue to a discussion to make it easier to follow individual threads.

0 replies

tomvothecoder · 2022-07-06T16:07:55Z

tomvothecoder
Jul 6, 2022
Maintainer

@pochedls continued investigating the floating point differences in spatial averaging between CDAT vs. xCDAT.

Per Steve on 6/22/22 over email:

I think I am (kind of) understanding these small differences in spatial averaging. The mystery is that if I perform a weighted average myself, I match CDAT, but I differ from xcdat (order 0.002 differences for values of ~270 – small, but I would have thought we would be slightly more accurate).

A spatial average is just a weighted average where:

WA = sum(DA * W) / sum(W)

where DA is the dataarray, W is a weighting matrix, and WA is the resulting weighted average. We can write this as:

WA = num / den

where num = sum(DA * W) and den = sum(W). I found that my own numerator and denominator values were not matching xarray (here), which are ultimately implemented as two separate xarray dot products (here) via the _reduce function (here).

The underlying data I am looking at is type float (it shows up as float32 in the dataarray). Weirdly, I can match CDAT and my own calculations if I specify dtype=float64:

numerator_xr = xr.dot(da, weights, dims=['lat', 'lon'], dtype=np.float64)
denominator_xr = xr.dot(mask, weights, dims=['lat', 'lon'], dtype=np.float64)

I couldn’t find anything about numpy operations being performed at double precision (even if data is single precision), but I think this suggests that these differences may be okay.

....here is my spatial averaging code.

This yields the same result as CDAT within 7E-5 whereas xcdat is off by 2E-3 for this file (note that I specified an offset since this is an anomaly dataset with values close to zero and I wanted to get around some of the absolute/relative tolerance issues we discussed today).

# %% explicit method
fn = '/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc'
offset = 270.
 
# open dataset
ds = xcdat.open_dataset(fn)
ds[v] = np.squeeze(ds[v]) + offset
 
# get weights
weights = ds.spatial.get_weights(axis=['Y', 'X'], lat_bounds=None, lon_bounds=None)
 
# reshape to match the data array
ntime = len(ds['time'])
weights = np.tile(np.expand_dims(weights, axis=0), (ntime, 1, 1))
 
# zero out NaNs
weights = np.where(~np.isnan(ds[v]), weights, 0.)
 
# calculate spatial average
numerator = np.sum(weights*ds[v], axis=(1, 2))
denominator = np.sum(weights, axis=(1, 2))
ts_explicit = np.array(numerator / denominator)
 
# tidy up
ds.close()

2 replies

tomvothecoder Jul 6, 2022
Maintainer

@tomvothecoder follow up (6/28/22)

I did some investigation to figure out if the dtypes are changing in CDAT and/or xCDAT.

What is the input dtype for the variable he was working with (“tempanomaly”)?

The input dtype is float32

What is the output dtype of ds.spatial.average()?

ds.spatial.average() maintains a dtype of float32, which is CORRECT.

What is the output dtype of cdutil.averager()/genutil.averager()?

cdutil.averager()/genutil.averager() is typecasting the dtype to float64, which is INCORRECT.

Since the output dtypes are different, CDAT and xCDAT produce larger floating point diffs than what we expect or want.
We are more confident that xarray/xCDAT is doing the right thing when handling floating points types and values.

@jasonb5's follow up (7/1/22)

`np.sum` vs. `np.einsum`

The difference between Steve’s code and xCDAT’s code appears to be how the numerator and denominator are calculated.

For instance, in Steve’s code the numerator is calculated as sum(data*weights, axis=[1, 2]).
Diving through xCDAT and xArray’s code I found the numerator is calculated with xr.dot(data, weights, dims=[“lon”, “lat”]). xArray says their dot method is the equivalent of (data * weights).sum([1, 2]) (https://github.com/pydata/xarray/blob/aef5377a5af33a635a651123af7caa7435c24931/xarray/core/weighted.py#L206-L227), but comparing Steve’s numerator output and the output of xr.dot there are small differences e.g., ~2.14e-5.
Looking further into xArrays dot function (https://github.com/pydata/xarray/blob/aef5377a5af33a635a651123af7caa7435c24931/xarray/core/computation.py#L1625-L1767) they’re utilizing NumPy’s einsum function to perform the computation.

Still looking into why einsum vs sum(data*weights, axis=[1, 2]) produces differences.

Differences in weights (dtypes)

I’m also seeing a difference in the weights generated for averaging, the mean difference is ~4.25e-2 between xCDAT and CDAT.

Here’s the relevant CDAT code for generating weights (https://github.com/CDAT/cdms/blob/3f8c7baa359f428628a666652ecf361764dc7b7a/Lib/grid.py#L449-L463 and https://github.com/CDAT/genutil/blob/59517ab54e65c03098502f63434270f96020f3eb/Lib/averager.py#L237-L239).

It looks like the float64 is coming from the weights that CDAT generates.

I’ve traced it back to the function that generates bounds (https://github.com/CDAT/cdms/blob/3f8c7baa359f428628a666652ecf361764dc7b7a/Lib/axis.py#L2211-L2214).
This produces an array of float64 values, which propagates up to the output of the averager. Creating a new array with np.array(0.5).dtype confirmed the default on my machine was float64.

Interesting note numpy will pass through a dtype e.g., np.array(np.array(0.5, dtype=np.float32)).dtype == np.dtype32 but as in the function above if any computation is performed using a python float the dtype gets set as float64 e.g., np.array(0.5*np.array(0.5, dtype=np.float32)).dtype == np.float64.

As an additional test I took the weights generated by xCDAT whose dtype is float32 and passed them to the CDAT average. This resulted in an output whose dtype was float32. This also brought the mean difference down to ~2.57e-7.

durack1 Aug 19, 2022
Collaborator

@pochedls just an FYI, in reproducibility testing within the PMP (with changing versions of CDAT/CDMS2 underneath) reproducibility to precision was an ongoing pain, so I would caution dependence on the CDAT values without firm reasoning (you can reproduce this following xCDAT code that you've checked)

taylor13 · 2022-07-07T20:19:46Z

taylor13
Jul 7, 2022

For acceptable precision, the numerator and denominator that are needed to compute the weighted average must be computed with accumulators that are double precision, even if the underlying data is single precision. Is that the case in all the numpy, cdat, and xcdat calculations?

1 reply

pochedls Jul 7, 2022
Collaborator Author

This surprises me. I think my expectation was that if you cast the data to higher precision you would get a result that looked more precise but the results didn't actually have more precision. There is a type change with CDAT, though I'm not sure that it was intentional (I believe Jason found that the weights are double precision and the data array is cast to double precision during the numerator dot product). If we cast everything to double is the answer more precise, accurate, or both?

taylor13 · 2022-07-07T22:16:15Z

taylor13
Jul 7, 2022

it's just the accumulator that needs to be higher precision. Consider 100 numbers, all equal to 0.1, which are represented with a precision of 1 digit and which are summed in a loop. If you use an accumulator with precision 1 digit, you get:

the sum of the first 10 numbers = 1
but then when you add additional numbers to the accumulator you get
accum = accum +.1 = 1 + .1 = 1.1, but you only have a single digit of precision so this truncates to 1
adding subsequent values continues to result in a sum of 1.

The final sum = 1 when it should be 100*0.1 = 10.

If you had used an accumulator that was double the precision (i.e., 2 digits), then the correct answer would be obtained.

I see no reason to cast the original data in double precision, unless it's the only way to trick the python summing algorithm to use an accumulator that is double precision.

[For a more realistic case, consider adding 100,000 numbers, all about the same size. If the accumulator has a precision of 6 digits, the sum will be off by as much as 1%, while if the accumulator is double precision (12 digits), the sum will be off by only a tiny amount.]

0 replies

pochedls · 2022-08-18T01:24:13Z

pochedls
Aug 18, 2022
Collaborator Author

@lee1043 - would you be willing to look at how the temporal functions handle missing data (since you had outlined this functionality)?

I have never used CDAT's temporal averaging and so I don't have ready cases to compare CDAT versus xCDAT. I think this deserves a careful look, because I found an issue today (#319). I have created a draft fix, but I worry that other temporal operators may be problematic, too.

0 replies

lee1043 · 2022-08-18T02:42:11Z

lee1043
Aug 18, 2022
Collaborator

@pochedls thanks for pinging me. Yes, temporal average is one of routinely used capability in PMP, which is planned to be replaced by xCDAT. I will investigate the temporal functions.

0 replies

lee1043 · 2022-08-30T18:26:11Z

lee1043
Aug 30, 2022
Collaborator

I have compared xCDAT vs CDAT for yearly temporal average. I am planning to repeat the similar test for seasonal average as well.

It is interesting to see even the loaded number is different between them. See printed data value below, where xCDAT loaded -0.02 and CDAT loaded -0.019999999552965164, maybe that is because they use different floating points?

xCDAT

xCDAT version: latest main as of today (8/30/2022), after #320 merged

import xcdat
# open dataset
fn = '/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc' 
ds = xcdat.open_dataset(fn)
# fix missing calendar
time_encoding = ds.time.encoding
time_encoding['calendar'] = 'standard'
ds.time.encoding = time_encoding
# select grid cell in subsetted dataset
dss = ds.isel(lat=[11], lon=[0])
print('data:', dss.tempanomaly[0:12, 0, 0].values)
# yearly average
ds_yearly = dss.temporal.group_average('tempanomaly', freq='year', weighted=True)
print('ave:', ds_yearly.tempanomaly[0, 0, 0].values)

data: [  nan -0.02 -0.02   nan   nan   nan   nan   nan   nan   nan   nan   nan]
ave: -0.020000000018626448

CDAT

CDAT provides options on how to handle missing data in temporal average throughout criteriaarg, as below (copied and condensed from the UV-CDAT document).

criteriaarg (passed as a list) has 2 arguments:

The first argument represents the minimum fraction of time that is required to compute the temporal mean. So you can pass a fractional value between 0.0 and 1.0 (including both extremes) or even a representation such as 3.0/4.0 (in case you need at least 3 out of 4 months of data).
The second argument in the criteriaarg is None. This implies no “centroid function” is used. In other cases, this argument represents the maximum value of the “centroid function”. A value between 0 and 1 represents the spread of values across the mean time. The centroid value of 0.0 represents a full even distribution of data across the time interval.

By the way the document says "Default behaviour i.e criteriaarg=[0.5, None]" for the temporal average, which I don't think it is correct. Need further investigation on the cdutil's times.py code.

import cdms2
import cdutil
# open dataset
f = cdms2.open('/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc')
d = f('tempanomaly')
# select grid cell in subsetted dataset
ds = d[:, 11, 0]
print('data:', ds[0:12])

#
# yearly average
#
# Default behaviour
d_yearly_1 = cdutil.YEAR(ds)
print('ave (default):', d_yearly_1[0])

# Criteria to say compute annual average for any number of months.
d_yearly_2 = cdutil.YEAR(ds, criteriaarg = [0., None])
print('ave (any number):', d_yearly_2[0])

# Criteria 0.5 (which is described as default in the document but maybe not correct)
d_yearly_3 = cdutil.YEAR(ds, criteriaarg = [0.5, None])
print('ave (criteria 0.5):', d_yearly_3[0])

# Criteria for computing annual average based on the minimum number of months (8 out of 12).
d_yearly_4 = cdutil.YEAR(ds, criteriaarg = [8./12., None])
print('ave (criteria min 8 out of 12):', d_yearly_4[0])

# Same criteria as in 3, but we account for the fact that a year is cyclical i.e Dec and Jan are adjacent months.
# So the centroid is computed over a circle where Dec and Jan are contiguous.
d_yearly_5 = cdutil.YEAR(ds, criteriaarg = [8./12., 0.1, 'cyclical'])
print('ave (criteria min 8 out of 12, with cyclical):', d_yearly_5[0])

data: [-- -0.019999999552965164 -0.019999999552965164 -- -- -- -- -- -- -- -- --]
ave (default): -0.01999999955296516
ave (any number): -0.01999999955296516
ave (criteria 0.5): --
ave (criteria min 8 out of 12): --
ave (criteria min 8 out of 12, with cyclical): --

Difference between xCDAT and CDAT

The difference seems to be negligible, I think.

# Difference between xcdat and cdat results
diff_value = abs(d_yearly_1[0] - ds_yearly.tempanomaly[0, 0, 0].values)
diff_percent = (diff_value / ds_yearly.tempanomaly[0, 0, 0].values) * 100.
print("diff_value: {:.20f}".format(diff_value))
print("diff_percent: {:.8f} %".format(diff_percent))

diff_value: 0.00000000046566128731
diff_percent: -0.00000233 %

0 replies

lee1043 · 2022-08-30T18:43:20Z

lee1043
Aug 30, 2022
Collaborator

xCDAT vs CDAT, similar as above but for DJF climatology

import xcdat
# open dataset
fn = '/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc' 
ds = xcdat.open_dataset(fn)
# fix missing calendar
time_encoding = ds.time.encoding
time_encoding['calendar'] = 'standard'
ds.time.encoding = time_encoding
# select grid cell in subsetted dataset
dss = ds.isel(lat=[11], lon=[0])
# get DJF climatology
ds_clim_season = dss.temporal.climatology('tempanomaly', freq='season', weighted=True)
print('xcdat_djf_clim:', ds_clim_season.tempanomaly[0, 0, 0].values)

import cdms2
import cdutil
# open dataset
f = cdms2.open('/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc')
d = f('tempanomaly')
# select grid cell in subsetted dataset
ds = d[:, 11, 0]
# get DJF climatology
d_clim_season = cdutil.DJF.climatology(ds)
print('cdat_djf_clim:', d_clim_season[0])

# Difference between xcdat and cdat results
diff_value = abs(d_clim_season[0] - ds_clim_season.tempanomaly[0, 0, 0].values)
diff_percent = (diff_value / abs(ds_clim_season.tempanomaly[0, 0, 0].values)) * 100.
print("diff_value: {:.20f}".format(diff_value))
print("diff_percent: {:.8f} %".format(diff_percent))

xcdat_djf_clim: -0.10927703728339967
cdat_djf_clim: -0.10937315202547197
diff_value: 0.00009611474207230075
diff_percent: 0.08795511 %

7 replies

pochedls Aug 30, 2022
Collaborator Author

This is also a simple sanity check – if large differences exist, then they can be investigated to determine why they exist (which is what we have painstakingly done for many functions).

taylor13 Aug 31, 2022

There might be one exception I can think of. In CMIP a single precision (32 bit, I think) floating point representation of 1.e20 is saved in the netCDF files to represent "missing" data. It is stored both as the "missing_value" attribute in the file "header" and also within arrays where data are missing. If these two data values are not handled similarly by xcdat (e.g., one is caste to 64 bit while the other isn't, I think it likely that the testing for missing values will go awry. ("is array(i) = missing?" will not return the correct answer unless both values are exactly the same.)

pochedls Aug 31, 2022
Collaborator Author

xarray replaces the _FillValue and missing_value with NaN values when it load in the data: https://docs.xarray.dev/en/stable/generated/xarray.open_dataarray.html (see mask_and_scale: https://docs.xarray.dev/en/stable/generated/xarray.open_dataarray.html).

taylor13 Aug 31, 2022

should be o.k. then. thanks.

pochedls Aug 31, 2022
Collaborator Author

Thank you for thinking this through with us!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Task]: Investigate xcdat handling floating point data types and missing data #275

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 10 comments 10 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[Task]: Investigate xcdat handling floating point data types and missing data #275

pochedls May 27, 2022 Collaborator

Describe the task

Replies: 10 comments · 10 replies

taylor13 Jun 3, 2022

taylor13 Jun 3, 2022

tomvothecoder Jul 6, 2022 Maintainer

tomvothecoder Jul 6, 2022 Maintainer

tomvothecoder Jul 6, 2022 Maintainer

@tomvothecoder follow up (6/28/22)

What is the input dtype for the variable he was working with (“tempanomaly”)?

What is the output dtype of ds.spatial.average()?

What is the output dtype of cdutil.averager()/genutil.averager()?

@jasonb5's follow up (7/1/22)

np.sum vs. np.einsum

Differences in weights (dtypes)

durack1 Aug 19, 2022 Collaborator

taylor13 Jul 7, 2022

pochedls Jul 7, 2022 Collaborator Author

taylor13 Jul 7, 2022

pochedls Aug 18, 2022 Collaborator Author

lee1043 Aug 18, 2022 Collaborator

lee1043 Aug 30, 2022 Collaborator

xCDAT

CDAT

Difference between xCDAT and CDAT

lee1043 Aug 30, 2022 Collaborator

pochedls Aug 30, 2022 Collaborator Author

taylor13 Aug 31, 2022

pochedls Aug 31, 2022 Collaborator Author

taylor13 Aug 31, 2022

pochedls Aug 31, 2022 Collaborator Author

pochedls
May 27, 2022
Collaborator

Replies: 10 comments 10 replies

taylor13
Jun 3, 2022

taylor13
Jun 3, 2022

tomvothecoder
Jul 6, 2022
Maintainer

tomvothecoder
Jul 6, 2022
Maintainer

tomvothecoder Jul 6, 2022
Maintainer

`np.sum` vs. `np.einsum`

durack1 Aug 19, 2022
Collaborator

taylor13
Jul 7, 2022

pochedls Jul 7, 2022
Collaborator Author

taylor13
Jul 7, 2022

pochedls
Aug 18, 2022
Collaborator Author

lee1043
Aug 18, 2022
Collaborator

lee1043
Aug 30, 2022
Collaborator

lee1043
Aug 30, 2022
Collaborator

pochedls Aug 30, 2022
Collaborator Author

pochedls Aug 31, 2022
Collaborator Author

pochedls Aug 31, 2022
Collaborator Author