Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent calculation of standard deviation between preprocessors #1024

Closed
stefsmeets opened this issue Feb 26, 2021 · 4 comments
Closed

Comments

@stefsmeets
Copy link
Contributor

stefsmeets commented Feb 26, 2021

Describe the bug

Hi everyone, I noticed that iris.analysis.STD_DEV returns a different result than np.ma.std. Both are used throughout the code, but the result is inconsistent.

Although iris uses the same underlying function (np.ma.std) to calculate the standard deviation, they use a default setting for ddof=1, instead of 0, which is the default in numpy. More info: https://numpy.org/doc/stable/reference/generated/numpy.std.html

For example, the implementation of the multimodel_statistics uses uses np.ma.std, wheras climate_statistics uses np.iris.analysis.STD_DEV. This means that they will return a different value when the user specifies operator: std in their recipe, which may be confusing.

I would just like some input on what to do with this in view of #968. Since we seem to be aligning our code more and more with iris, do we also align our implementation of std with the iris defaults?

Example:

In [1]: import numpy as np

In [2]: data = np.ma.array(
    data=[[  1,   1,   1],
          [999,   5,   5],
          [  9,   9, 999]],
    mask=[[False, False, False],
          [ True, False, False],
          [False, False,  True]])

In [3]: # numpy default behaviour

In [4]: np.ma.std(data, axis=0)
Out[4]:
masked_array(data=[4.0, 3.26, 2.0],
             mask=[False, False, False],
       fill_value=1e+20)

In [5]: import iris.analysis

In [6]: # iris default behaviour

In [7]: iris.analysis.STD_DEV.aggregate(data, axis=0)  
Out[7]:
masked_array(data=[5.65, 4.0, 2.82],
             mask=[False, False, False],
       fill_value=1e+20)

In [8]: # behaviour reproduced in numpy (with `ddof=1`, the default in iris 3.0.1 `STD_DEV`)

In [9]: np.ma.std(data, axis=0, ddof=1)
Out[9]:
masked_array(data=[5.65, 4.0, 2.82],
             mask=[False, False, False],
       fill_value=1e+20)
@valeriupredoi
Copy link
Contributor

wait - you saying that there could be differences of about 50% in stdev calculations? You should probably raise this first with iris folk eg @bjlittle and see why they are not using the default degs of freedom, they rewrite freedom in another way 🇨🇳 😁

@schlunma
Copy link
Contributor

schlunma commented Mar 1, 2021

@valeriupredoi you need to be careful here. it's hard to define a "default" setting in this context.

To quote numpy here: "In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se."

So if you want to estimate a population's standard deviation with only a sample, you need this "Bessel correction". In my opinion this feels like the correct choice when we deal with a multi-model ensemble of climate models, so I agree with iris here.

@valeriupredoi
Copy link
Contributor

and that is a very good explanation @schlunma - many thanks for it! Sounds to me like we should follow Chairman Iris in this case then 😁

@schlunma
Copy link
Contributor

I think we can close this now, after the refactoring of multi_model_statistics I can't find a reference to np.std or np.ma.std in the code anymore. Please re-open if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants