[Task]: Investigate xcdat handling floating point data types and missing data #275
-
Regarding the question of a single missing month in a seasonal average, it may not make much practical difference. But when computing an annual mean, if all the winter or all the summer months were missing, then you could get quite a biased result (for things like temperature, at least). For the annual mean calculation in CDAT we came up with two criteria (set by the user) that would determine whether a mean was recorded or the value was set to missing: 1) the threshold fraction of samples required to compute the mean, and 2) how far the "centroid" of weights computed from available samples differed from the centroid of weights computed assuming no missing samples. If you consider monthly data as numbers on a clock, then for no missing data the centroid lies at the axis of the clock hands. Similarly, if data were only available for 4 months but equally distributed (say, January, April, July, and October), the centroid would still be at the center. But if the 4 months all occurred in one half of the year, then the centroid would be offset. When computing an annual mean, the user would specify what minimum number of months was required and how close the centroid should be to the center of the clock. I would be happy to discuss further.
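A rough sketch of how those two criteria could be checked for a single year of monthly data (illustrative only, not the CDAT implementation; the function name, defaults, and equal-month weighting are assumptions):

```python
# Illustrative sketch only (not the CDAT implementation): the function name,
# defaults, and equal-month weighting below are assumptions for demonstration.
import numpy as np

def annual_mean_or_missing(values, min_fraction=0.5, max_centroid_offset=0.1):
    """Return an annual mean, or NaN if the missing-data criteria fail.

    values: one year of monthly values, with np.nan marking missing months.
    min_fraction: criterion 1 -- minimum fraction of months that must be present.
    max_centroid_offset: criterion 2 -- maximum allowed distance of the centroid
        of the available months from the center of the "clock" (0 = balanced).
    """
    values = np.asarray(values, dtype=np.float64)
    available = ~np.isnan(values)

    # Criterion 1: enough samples to compute the mean?
    if available.sum() / values.size < min_fraction:
        return np.nan

    # Criterion 2: are the available months evenly distributed around the year?
    # Place each month at an angle on a unit circle; with no missing months (or
    # with months missing symmetrically, e.g. only Jan/Apr/Jul/Oct present) the
    # centroid of the available months sits at the origin.
    angles = 2 * np.pi * np.arange(values.size) / values.size
    centroid = np.array([np.cos(angles)[available].mean(),
                         np.sin(angles)[available].mean()])
    if np.hypot(centroid[0], centroid[1]) > max_centroid_offset:
        return np.nan

    # Equal-month weighting for simplicity; CDAT would weight by month length.
    return values[available].mean()
```

With this formulation, a year where only January through June are present passes the sample-fraction test at 0.5 but fails the centroid test, since all the available months sit in one half of the "clock".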
-
Another suggestion: Sometimes carrying both the "unmasked" weights and a second array of "masking" factors can be useful when constructing algorithms. When computing means, you would then calculate the sum-over-samples(wts x msk x data) and divide by sum-over-samples(wts x msk). In general "msk" would be a fraction (set to 0 for missing values). When regridding conservatively, the "wts" would be set to the area of each grid cell and the "msk" would indicate the fraction of each grid cell that was unmasked. The output of the regridder would give the area of each target grid cell and the unmasked fraction of each target cell, along with the regridded field itself.
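A tiny numpy sketch of that bookkeeping (the variable names and values here are illustrative): the "unmasked" weights `wts` stay fixed, and all missing-data handling is folded into the `msk` fractions:

```python
# Illustrative only: variable names and values are made up for this example.
import numpy as np

data = np.array([280.0, 285.0, np.nan, 290.0])  # one missing sample
wts = np.array([1.0, 2.0, 2.0, 1.0])            # "unmasked" weights, e.g. cell areas
msk = np.array([1.0, 1.0, 0.0, 1.0])            # unmasked fraction (0 = fully missing)

d = np.nan_to_num(data)                         # masked points contribute nothing
mean = np.sum(wts * msk * d) / np.sum(wts * msk)
print(mean)                                     # weighted mean of the unmasked samples
```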
-
I am going to convert this issue to a discussion to make it easier to follow individual threads.
-
@pochedls continued investigating the floating point differences in spatial averaging between CDAT and xCDAT. Per Steve on 6/22/22 over email:

I think I am (kind of) understanding these small differences in spatial averaging. The mystery is that if I perform a weighted average myself, I match CDAT, but I differ from xcdat (order 0.002 differences for values of ~270 – small, but I would have thought we would be slightly more accurate).

A spatial average is just a weighted average: WA = sum(DA * W) / sum(W), where DA is the dataarray, W is a weighting matrix, and WA is the resulting weighted average. We can write this as WA = num / den, where num = sum(DA * W) and den = sum(W). I found that my own numerator and denominator values were not matching xarray (here), which are ultimately implemented as two separate xarray dot products (here) via the `_reduce` function (here).

The underlying data I am looking at is type float (it shows up as float32 in the dataarray). Weirdly, I can match CDAT and my own calculations if I specify dtype=float64: `numerator_xr = xr.dot(da, weights, dims=['lat', 'lon'], dtype=np.float64)`. I couldn't find anything about numpy operations being performed at double precision (even if the data is single precision), but I think this suggests that these differences may be okay.

...here is my spatial averaging code. This yields the same result as CDAT within 7E-5, whereas xcdat is off by 2E-3 for this file (note that I specified an offset since this is an anomaly dataset with values close to zero and I wanted to get around some of the absolute/relative tolerance issues we discussed today).
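To make the accumulator effect concrete, here is a small synthetic sketch (random numbers, not the file from the email) comparing float32 versus float64 accumulation of the numerator sum(DA * W); `np.einsum` is used as a self-contained stand-in for the dot product xarray performs internally:

```python
# Synthetic illustration of accumulator precision in the weighted-average
# numerator; the data and weights here are made up, not from the thread.
import numpy as np

rng = np.random.default_rng(0)
da = (270 + rng.standard_normal(1_000_000)).astype(np.float32)  # values near 270
w = rng.uniform(0, 1, da.size).astype(np.float32)               # arbitrary weights

num32 = np.einsum('i,i->', da, w)                     # accumulates in float32
num64 = np.einsum('i,i->', da, w, dtype=np.float64)   # accumulates in float64

print(num32 / w.sum(dtype=np.float32))  # single-precision weighted average
print(num64 / w.sum(dtype=np.float64))  # double-precision weighted average
```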
-
For acceptable precision, the numerator and denominator that are needed to compute the weighted average must be computed with accumulators that are double precision, even if the underlying data is single precision. Is that the case in all the numpy, cdat, and xcdat calculations?
-
It's just the accumulator that needs to be higher precision. Consider 100 numbers, all equal to 0.1, which are represented with a precision of 1 digit and which are summed in a loop. If you use an accumulator with a precision of 1 digit, the sum of the first 10 numbers is 1, and the final sum is also 1, when it should be 100 * 0.1 = 10 (once the accumulator reaches 1, adding 0.1 rounds back to 1 at 1-digit precision). If you had used an accumulator with double the precision (i.e., 2 digits), then the correct answer would be obtained. I see no reason to cast the original data to double precision, unless it's the only way to trick the python summing algorithm into using an accumulator that is double precision. [For a more realistic case, consider adding 100,000 numbers, all about the same size. If the accumulator has a precision of 6 digits, the sum will be off by as much as 1%, while if the accumulator is double precision (12 digits), the sum will be off by only a tiny amount.]
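A numerical analogue of this toy example, using float16 (roughly 3 decimal digits) so the effect shows up with real floating-point arithmetic:

```python
# Numerical analogue of the example above, using float16 to exaggerate the effect.
import numpy as np

values = np.full(10_000, 0.1, dtype=np.float16)  # 10,000 copies of ~0.1

acc16 = np.float16(0.0)
for v in values:              # low-precision accumulator
    acc16 += v

acc64 = np.float64(0.0)
for v in values:              # double-precision accumulator, same float16 data
    acc64 += v

print(acc16)  # far below 1000: once the sum is large, adding 0.1 is rounded away
print(acc64)  # close to 1000 (0.1 is not exactly representable in float16)
```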
-
@lee1043 - would you be willing to look at how the temporal functions handle missing data (since you had outlined this functionality)? I have never used CDAT's temporal averaging and so I don't have ready cases to compare CDAT versus xCDAT. I think this deserves a careful look, because I found an issue today (#319). I have created a draft fix, but I worry that other temporal operators may be problematic, too.
-
@pochedls thanks for pinging me. Yes, temporal averaging is one of the routinely used capabilities in PMP, which is planned to be replaced by xCDAT. I will investigate the temporal functions.
-
I have compared xCDAT vs CDAT for the yearly temporal average. I am planning to repeat a similar test for the seasonal average as well. It is interesting to see that even the loaded numbers differ between them. See the printed results below.

**xCDAT** (version: latest main as of today, 8/30/2022, after #320 merged)

```python
import xcdat
# open dataset
fn = '/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc'
ds = xcdat.open_dataset(fn)
# fix missing calendar
time_encoding = ds.time.encoding
time_encoding['calendar'] = 'standard'
ds.time.encoding = time_encoding
# select grid cell in subsetted dataset
dss = ds.isel(lat=[11], lon=[0])
print('data:', dss.tempanomaly[0:12, 0, 0].values)
# yearly average
ds_yearly = dss.temporal.group_average('tempanomaly', freq='year', weighted=True)
print('ave:', ds_yearly.tempanomaly[0, 0, 0].values)
```

**CDAT**

CDAT provides options for how missing data is handled throughout its temporal averaging.
By the way, the document says "Default behaviour i.e criteriaarg=[0.5, None]" for the temporal average, which I don't think is correct. This needs further investigation.

```python
import cdms2
import cdutil
# open dataset
f = cdms2.open('/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc')
d = f('tempanomaly')
# select grid cell in subsetted dataset
ds = d[:, 11, 0]
print('data:', ds[0:12])
#
# yearly average
#
# Default behaviour
d_yearly_1 = cdutil.YEAR(ds)
print('ave (default):', d_yearly_1[0])
# Criteria to say compute annual average for any number of months.
d_yearly_2 = cdutil.YEAR(ds, criteriaarg = [0., None])
print('ave (any number):', d_yearly_2[0])
# Criteria 0.5 (which is described as default in the document but maybe not correct)
d_yearly_3 = cdutil.YEAR(ds, criteriaarg = [0.5, None])
print('ave (criteria 0.5):', d_yearly_3[0])
# Criteria for computing annual average based on the minimum number of months (8 out of 12).
d_yearly_4 = cdutil.YEAR(ds, criteriaarg = [8./12., None])
print('ave (criteria min 8 out of 12):', d_yearly_4[0])
# Same criteria as in 3, but we account for the fact that a year is cyclical i.e Dec and Jan are adjacent months.
# So the centroid is computed over a circle where Dec and Jan are contiguous.
d_yearly_5 = cdutil.YEAR(ds, criteriaarg = [8./12., 0.1, 'cyclical'])
print('ave (criteria min 8 out of 12, with cyclical):', d_yearly_5[0])
```

**Difference between xCDAT and CDAT**

The difference seems to be negligible, I think.

```python
# Difference between xcdat and cdat results
diff_value = abs(d_yearly_1[0] - ds_yearly.tempanomaly[0, 0, 0].values)
diff_percent = (diff_value / ds_yearly.tempanomaly[0, 0, 0].values) * 100.
print("diff_value: {:.20f}".format(diff_value))
print("diff_percent: {:.8f} %".format(diff_percent))
```
-
xCDAT vs CDAT, similar to the above but for the DJF climatology:

```python
import xcdat
# open dataset
fn = '/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc'
ds = xcdat.open_dataset(fn)
# fix missing calendar
time_encoding = ds.time.encoding
time_encoding['calendar'] = 'standard'
ds.time.encoding = time_encoding
# select grid cell in subsetted dataset
dss = ds.isel(lat=[11], lon=[0])
# get DJF climatology
ds_clim_season = dss.temporal.climatology('tempanomaly', freq='season', weighted=True)
print('xcdat_djf_clim:', ds_clim_season.tempanomaly[0, 0, 0].values)
import cdms2
import cdutil
# open dataset
f = cdms2.open('/p/user_pub/climate_work/pochedley1/surface/gistemp1200_GHCNv4_ERSSTv5.nc')
d = f('tempanomaly')
# select grid cell in subsetted dataset
ds = d[:, 11, 0]
# get DJF climatology
d_clim_season = cdutil.DJF.climatology(ds)
print('cdat_djf_clim:', d_clim_season[0])
# Difference between xcdat and cdat results
diff_value = abs(d_clim_season[0] - ds_clim_season.tempanomaly[0, 0, 0].values)
diff_percent = (diff_value / abs(ds_clim_season.tempanomaly[0, 0, 0].values)) * 100.
print("diff_value: {:.20f}".format(diff_value))
print("diff_percent: {:.8f} %".format(diff_percent))
```
-
Describe the task
Much of the geospatial data that we use has missing / masked values. This issue is intended to make sure that missing data is properly handled by xcdat. In particular, we need to make sure that xcdat handles missing data correctly for:
Notes on weighted averaging
Spatial and temporal averages are weighted averages. Spatial averages are typically weighted by the area in each grid cell and temporal averages are weighted by the length of time for each time interval. In general, a weighted average is just:
WA(x) = Σ(w(x) * v(x)) / Σ(w(x))
where WA is the weighted average at time/location, x, for a given weight, w, and value, v.
If a value is missing, its weight should be set to zero. For example, if I have arrays `v = [99, 80, 77, 92, 87]` and `w = [10, 10, 10, 10, 30]`, I will get `WA = 87.0` (think of a weighted grade average with homeworks worth 10 points and a quiz worth 30). Now suppose the teacher says that I can miss one homework. Then I have arrays `v = [99, 80, np.nan, 92, 87]` and `w = [10, 10, 10, 10, 30]`, and I will get `WA = 76.0` [or `nan` if I don't use `np.nansum`]. This is impossible: my lowest remaining grade is 80, so I can't get an average below 80. It happens because the homework that has a `nan` value is still being weighted. I need to zero out that weight (`w = [10, 10, np.nan, 10, 30]`), yielding `WA = 88.67`.

The take-home message is that we need to ensure that the weights of values that are missing / masked are zeroed out for spatial and temporal averaging.
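The same grade arithmetic in numpy, showing both the incorrect and the corrected denominator (a quick sanity check, not xcdat code):

```python
# Reproduction of the grade example above with numpy (nan marks the missing value).
import numpy as np

v = np.array([99, 80, np.nan, 92, 87], dtype=float)
w = np.array([10, 10, 10, 10, 30], dtype=float)

# Wrong: the missing value still carries its weight in the denominator.
wrong = np.nansum(v * w) / np.sum(w)                 # 76.0

# Right: zero out the weight wherever the value is missing.
w_masked = np.where(np.isnan(v), 0.0, w)
right = np.nansum(v * w_masked) / np.sum(w_masked)   # ~88.67

print(wrong, right)
```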
The current spatial and temporal averages utilize the xarray `.weighted().mean()` API, which generally appears to handle missing data appropriately (see the quick check below), though special attention may be needed for groupby averaging operations (used in temporal averaging). One question is whether a group of values that includes a missing value (e.g., May, June, July temperature) should return a NaN or a weighted average of the available data.

These notes are subsetted from a conversation with @tomvothecoder
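For what it's worth, xarray's weighted reductions do mask the weights of missing values by default, which can be verified on the grade example above (a quick check, not code from the original discussion):

```python
# Check that xarray's .weighted().mean() excludes the weights of missing values.
import numpy as np
import xarray as xr

v = xr.DataArray([99, 80, np.nan, 92, 87], dims='hw')
w = xr.DataArray([10, 10, 10, 10, 30], dims='hw')

print(v.weighted(w).mean('hw').item())  # ~88.67, not 76.0
```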