analysis.stats.pearsonr and masked arrays #1534

Closed
rcomer opened this issue Jan 22, 2015 · 4 comments
@rcomer
Member

rcomer commented Jan 22, 2015

Suppose you have two matching sets of data, but with missing values in different places:

import iris
import numpy as np
import numpy.ma as npma

mask1 = np.zeros(7, dtype=bool)
mask2 = np.zeros(7, dtype=bool)
mask1[2] = True
mask2[4] = True

cube1 = iris.cube.Cube(npma.MaskedArray(range(7), mask=mask1),
                       dim_coords_and_dims=[
                           (iris.coords.DimCoord(range(7), long_name='blah'), 0)])

cube2 = iris.cube.Cube(npma.MaskedArray(range(7), mask=mask2),
                       dim_coords_and_dims=[
                           (iris.coords.DimCoord(range(7), long_name='blah'), 0)])

The correlation should be 1, but you get different values depending on which function you use:

import iris.analysis.stats as istats
import scipy.stats.mstats as spsm

print(istats.pearsonr(cube1, cube2).data)
print(spsm.pearsonr(cube1.data, cube2.data))
print(npma.corrcoef(cube1.data, cube2.data))

The npma function gives 1.0, but the iris and scipy functions both give 0.963... The scipy function already has a lot of discussion over here: scipy/scipy#3645.

@esc24
Member

esc24 commented Jan 22, 2015

@niallrobinson added the pearsonr functionality to Iris. I'd be interested in his thoughts.

@niallrobinson
Contributor

It's not something that I remember being aware of. My initial thought is that your statement

"The correlation should be 1"

isn't as obvious as it first sounds. Pearson's r depends on the length of the arrays, which is ambiguous in this case.

That said, the expected behaviour is probably what you describe, and what they settled on in the scipy discussion, i.e. your effective datasets are arrays A and B, both masked with maskA OR maskB. I'll make a PR.
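
To make the "maskA OR maskB" idea concrete, here is a minimal NumPy-only sketch. It is not the Iris or scipy implementation; `pearson_r` is a hypothetical helper written for illustration, and the data mirror the 7-element example from the original report. It shows one way a value below 1 can arise when each array keeps its own mask, and how applying a common mask first restores r = 1.

```python
import numpy as np
import numpy.ma as ma

# Same layout as the example in the report: identical data,
# with a missing value at a different index in each array.
a = ma.MaskedArray(np.arange(7.0), mask=[0, 0, 1, 0, 0, 0, 0])
b = ma.MaskedArray(np.arange(7.0), mask=[0, 0, 0, 0, 1, 0, 0])


def pearson_r(x, y):
    """Hypothetical helper: textbook Pearson's r on masked arrays.

    Means and sums of squares are taken over each array's own unmasked
    points, while the cross term only covers points valid in both.
    """
    x_anom = x - x.mean()
    y_anom = y - y.mean()
    cross = (x_anom * y_anom).sum()
    return float(cross / np.sqrt((x_anom ** 2).sum() * (y_anom ** 2).sum()))


# Each array keeps its own mask: the terms are computed over different
# sets of points, so the result falls below 1.
r_separate = pearson_r(a, b)

# Common mask: a point is dropped if it is missing in either array.
common_mask = ma.getmaskarray(a) | ma.getmaskarray(b)
a_common = ma.MaskedArray(a.data, mask=common_mask)
b_common = ma.MaskedArray(b.data, mask=common_mask)
r_common = pearson_r(a_common, b_common)

print(r_separate)  # below 1
print(r_common)    # 1.0 up to floating-point rounding
```

With the common mask, both arrays reduce to the same five points, so the array length that enters the calculation is no longer ambiguous.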

@rcomer
Member Author

rcomer commented Sep 29, 2015

Now that #1748 is merged I guess we can close this?

@ajdawson
Member

Agreed.
