
Re-write pearsonr to allow broadcasting and other aggregator keyword arguments #1714

Merged
ajdawson merged 1 commit into SciTools:master from rcomer:weight-pearsonr on Jul 7, 2015

Conversation

@rcomer (Member) commented Jun 26, 2015

Enables specification of weights keyword in iris.analysis.stats.pearsonr, to calculate a weighted correlation. I want to use it for spatial correlations.

In the unit tests, I changed the "perfect correlation" tests from assertArrayEqual to assertArrayAlmostEqual because the result only matched 1 to seven decimal places.
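
For context, a minimal usage sketch of the proposed weights keyword (not code from the PR; the cube names are hypothetical placeholders):

import iris.analysis.cartography
import iris.analysis.stats

# cube_a and cube_b are assumed to be cubes on the same latitude/longitude
# grid, with bounded horizontal coordinates so area weights can be computed.
weights = iris.analysis.cartography.area_weights(cube_a)
corr = iris.analysis.stats.pearsonr(cube_a, cube_b,
                                    corr_coords=['latitude', 'longitude'],
                                    weights=weights)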

@@ -111,7 +120,8 @@ def pearsonr(cube_a, cube_b, corr_coords=None):
For example providing two time/altitude/latitude/longitude
cubes and corr_coords of 'latitude' and 'longitude' will result
in a time/altitude cube describing the latitude/longitude
(i.e. pattern) correlation at each time/altitude point.
(i.e. pattern) correlation at each time/altitude point. Area
weights may be set using :func:`iris.analysis.cartography.area_weights`
Member (inline review comment):

Whilst this is true, it makes it sound like that is the only way weights might be generated. It would be more appropriate to just mention that weights may be provided if required. If you like, you can mention area weighting as an example where the weights keyword itself is documented.

@ajdawson (Member):

The change itself seems reasonable. However, it would be really nice if the testing could be done according to the new unit testing guidelines: http://scitools.org.uk/iris/docs/latest/developers_guide/tests.html.

I don't want to put you off, but I'll explain (what I think is) the current position: new tests should not be added to the old unit tests, so you would at least need to add the new tests in lib/iris/tests/unit/analysis/stats/test_pearsonr.py. If you have trouble with this we'd be happy to help.

(For bonus points you can move the existing pearsonr tests there too!)

@rcomer (Member, Author) commented Jun 26, 2015

Thanks @ajdawson. Since test_stats.py currently only tests one function, is it simply a matter of moving and renaming the file, and changing the docstring? Or is there something more involved? I suspect there ought to be some tests for _get_calc_view too, but I'm not sure I want to tackle that. Should I remove or revert the old test_stats.py file?

Everything else you've suggested is straightforward.

@ajdawson (Member):

It might be easier to just move the lot. Should be a fairly simple rename of file and docstring + whatever the developer guide suggests for class/method names, as you suggested.

Don't worry too much about a new test class for the helper, as long as we don't lose test coverage in the move I'll be happy with it.

@ajdawson (Member):

Ah, I've only just spotted a fairly big issue (not your fault @rcomer; the problem existed before you touched the code). The test loads data from the iris-sample-data repository. This is a no-no: all tests that use data should use data from the iris-test-data repository.

As for how to approach this I'm not sure right now, I'll need to have a think about it. We could just copy the relevant files into the iris-test-data repository, but it might be better if there was already some test data that could be used in place of the GloSea data.

@ajdawson (Member):

You'll also need to add a lib/iris/tests/unit/analysis/stats/__init__.py file, otherwise the tests inside won't be run.

@ajdawson (Member):

but it might be better if there was already some test data that could be used in place of the GloSea data.

Or even just some dummy data without relying on a file at all. I'll have a think.

@rcomer (Member, Author) commented Jun 27, 2015

Having given this some more thought, I'm wondering if a complete re-write of pearsonr might be more sensible. Along these lines:

import numpy as np

import iris.analysis
import iris.analysis.maths


def pearsonr(cube_a, cube_b, corr_coords=None, weights=None):
    # Deviations from the (weighted) mean over the correlation coordinates.
    sa = cube_a - cube_a.collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    sb = cube_b - cube_b.collapsed(corr_coords, iris.analysis.MEAN, weights=weights)

    # (Weighted) covariance and variances over the same coordinates.
    covar = (sa * sb).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    var_a = (sa ** 2).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    var_b = (sb ** 2).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)

    denom = iris.analysis.maths.apply_ufunc(np.sqrt, var_a * var_b,
                                            new_unit=covar.units)
    corr_cube = covar / denom
    corr_cube.rename("Pearson's r")

    return corr_cube

With the collapsed method doing all the complicated data-rearranging stuff, there is far less that could go wrong.

@ajdawson (Member):

Having given this some more thought, I'm wondering if a complete re-write of pearsonr might be more sensible.

Well, I can't argue with the readability improvement that would bring. Would it solve the weighting issue too? I guess a legitimate concern might be performance: do you know whether there would be any significant change in how quickly this would compute the correlation?

@rcomer (Member, Author) commented Jun 29, 2015

It does appear to be slower. I haven't tested properly, but running test_pearsonr.py a few times on the current version of my branch takes ~0.8s each time. Running it on the re-write takes ~1s each time.

Edit to add: yes, it would solve the weighting issue.

My last commit failed with TimedOutException. Not sure why.

@ajdawson (Member):

My last commit failed with TimedOutException. Not sure why.

I've restarted it, hopefully it will be fine this time.

@ajdawson (Member):

It does appear to be slower. I haven't tested properly, but running test_pearsonr.py a few times on the current version of my branch takes ~0.8s each time. Running it on the re-write takes ~1s each time.

What size of input is that, out of interest? I'm wondering if it might still be worth it.

@ajdawson (Member):

Also, an added benefit of this is that I'm pretty sure it allows broadcasting, so I could perform a correlation over time of two cubes with shapes [time] and [time, lat, lon], and the result should be the correlation map [lat, lon]... This is a big win in my eyes.
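
For illustration, a hypothetical sketch of that broadcasting case (not code from the PR; cube names are placeholders):

import iris.analysis
import iris.analysis.stats

# field_cube has dimensions [time, latitude, longitude]; index_cube has [time].
index_cube = field_cube.collapsed(['latitude', 'longitude'], iris.analysis.MEAN)

# Correlate over time; broadcasting should give a [latitude, longitude] map.
corr_map = iris.analysis.stats.pearsonr(field_cube, index_cube,
                                        corr_coords='time')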

@rcomer (Member, Author) commented Jun 29, 2015

The GloSea4 data used in the test are 6x145x192 (time x lat x lon).

I had a look at the broadcasting, and it seems to work beautifully. I need to think a bit more about how it would work with weights though.

@ajdawson (Member):

I think the broadcasting is actually quite a big feature and I'd be happy to sacrifice a small amount of performance to accommodate it.

For handling missing values, the appropriate approach may be debatable. In this proposed implementation the mean and variances for each input would be calculated using all available data, whereas the covariance term would be computed using only the data locations that are non-missing in both inputs. Whilst this may seem inconsistent (and may in rare cases result in absolute correlations greater than 1), I'd argue that it is acceptable, since you should use the best possible estimate you can for the mean and variance of each variable. This echoes the advice given in Chatfield [The Analysis of Time Series: An Introduction, Chapman and Hall/CRC] regarding how to compute the statistics of each input to a lagged correlation (i.e. you compute the mean using all N elements of the time series even though only N - k values are actually used in the correlation, where N is the length of the original time series and k is the lag). If this is acceptable then I think we might be home and dry already!
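
For reference, a textbook form of the lagged-correlation estimator alluded to above (my paraphrase, not quoted from the PR), with the mean taken over all N values:

    r_k = \frac{\sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{N} (x_t - \bar{x})^2}

Only N - k pairs contribute to the numerator, but \bar{x} and the denominator use the full series, which is the same spirit as computing each input's mean and variance from all available data here.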

@rcomer (Member, Author) commented Jun 29, 2015

Updated branch with re-write. This works for the test examples at least.

Lines 68-73 and 81-90 in stats.py are there purely to catch the case when both broadcasting and weights are used. It would be nice to find a neater way if possible. I'm also not sure what to do for cases where a cube has anonymous dimensions (though I don't think the original pearsonr would have handled that either).

@ajdawson (Member):

Fast work, I'm struggling to keep up! I have a handful of minor comments which I will add later. For now I think the only major things to think about are:

  1. Perhaps allowing other collapsed keywords to be passed through to pearsonr; I'm thinking of mdtol specifically (see the sketch after this list).
  2. Testing this without using the sample data (although I am pleased to see the existing tests pass with this refactor).
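
A minimal sketch of point 1 (hypothetical, not the merged implementation): mdtol could be accepted by pearsonr and forwarded, along with weights, to every collapsed call.

import numpy as np

import iris.analysis
import iris.analysis.maths


def pearsonr(cube_a, cube_b, corr_coords=None, weights=None, mdtol=1.):
    # Forward the aggregator keywords to every collapsed call.
    agg_kwargs = dict(weights=weights, mdtol=mdtol)
    sa = cube_a - cube_a.collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    sb = cube_b - cube_b.collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    covar = (sa * sb).collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    var_a = (sa ** 2).collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    var_b = (sb ** 2).collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    denom = iris.analysis.maths.apply_ufunc(np.sqrt, var_a * var_b,
                                            new_unit=covar.units)
    corr_cube = covar / denom
    corr_cube.rename("Pearson's r")
    return corr_cube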

@rcomer (Member, Author) commented Jun 29, 2015

Is there some documentation listing/describing what's in the test data files?

@ajdawson (Member):

Good question. I think the answer is no, though. I just used lscubes to have a quick scan; it looks like there might be something appropriate in test_data/NetCDF/global/xyt/SMALL_total_column_co2.nc at first glance, but you might find something better.

@@ -22,108 +22,40 @@
from __future__ import (absolute_import, division, print_function)

import numpy as np
import numpy.ma as ma
Member (inline review comment):

Unused import, I think.

@rcomer (Member, Author) commented Jul 1, 2015

Thanks for the tips, I think I've covered everything now.

I take your point about the treatment of missing data being arguable. I guess mask-matching could be added in as an option, making it a user decision. Think I'd rather leave that for a future PR though!

@ajdawson (Member) commented Jul 1, 2015

I guess mask-matching could be added in as an option, making it a user decision. Think I'd rather leave that for a future PR though!

Absolutely, it is out of scope for this PR.

Cubes should be the same shape and have the
same dimension coordinates.
Between which the correlation field will be calculated. Cubes should
be the same shape and have the same dimension coordinates or cube_b
Member (inline review comment):

Is there an easy way to remove the ordering restriction? Where do you see the error if these are the wrong way round?

@rcomer (Member, Author) replied:

Given cubes a and b, where a has the smaller dimensionality, a * b fails in iris.analysis.maths._assert_compatible with ValueError: The array operation would increase the dimensionality of the cube. The new cubes data would have had to become: [dimensions of b].

Member (inline review comment):

Can we just find the smaller one and swap the inputs if necessary? A temporary assignment just creates a reference, so doing something like:

if cube_a.ndim < cube_b.ndim:
    a = cube_b
    b = cube_a
else:
    a = cube_a
    b = cube_b

(I don't care about the naming, this is just an example) and then using a and b instead of cube_a and cube_b won't actually create copies of cube_a and cube_b, so there are no extra memory issues to account for.

@rcomer (Member, Author) replied:

OK, that's easy enough. Done.

@ajdawson (Member) commented Jul 2, 2015

@rcomer - I've made a few more minor comments. The documentation comments may refer to things you didn't write originally, but I think since this is a total rewrite it might be worth addressing them now. I also want to reassure you that although this is taking a long time, the work is of very high quality and I'm very happy that you put in the time to do it all!

@ajdawson changed the title from "Weights keyword for pearsonr" to "Re-write pearsonr to allow broadcasting and other aggregator keyword arguments" on Jul 2, 2015
@ajdawson (Member) commented Jul 2, 2015

@rcomer - I changed the PR title to reflect the new scope of the PR.

Pinging @niallrobinson in case you are interested in this.

@ajdawson self-assigned this on Jul 3, 2015
@rcomer (Member, Author) commented Jul 3, 2015

Thanks @ajdawson, I'm happy with your suggestions and have made those changes.

@ajdawson (Member) commented Jul 3, 2015

OK this looks in nice shape now. I'd like you to squash this into 1 commit (with a commit message that reflects the new scope of the change) before it is merged @rcomer.

If you haven't done this before it is simply a case of using the rebase command in interactive mode, instructions here: http://scitools.org.uk/iris/docs/latest/developers_guide/gitwash/development_workflow.html?highlight=rebase#rewriting-commit-history. Making a backup branch first is never a bad idea. If you need help please ask.

@rhattersley (Member):

Making a backup branch first is never a bad idea.

If you've not done it (much) before I'd say "Making a backup branch first is an excellent idea."

@rcomer force-pushed the weight-pearsonr branch from eefaca9 to 137c2fc on July 3, 2015 at 15:37
@rcomer (Member, Author) commented Jul 3, 2015

I'd never done it before so thanks for the warnings! I think it worked...

@ajdawson (Member) commented Jul 7, 2015

@rcomer - Sorry for the delay, I was keen to test this thoroughly since it is a major change to an existing function. So far I think I'm convinced the original functionality is reproduced here; the only potential difference is metadata. This new version essentially has more of it, since it captures changes to AuxCoords due to collapsing. This has an outside chance of causing someone a headache, but I don't expect it to be a problem that needs a solution now, especially not given the added benefits of your method (I just discovered it works for ORCA grids too, which is very cool! We couldn't do that before.). Therefore, no further action required.

However, since #1700 was just merged, this branch will need to be rebased against the current master as it can no longer be merged automatically. The test files will also need to be modified since they are new. #1700 just adds some imports to the top of most files to make sure they are Python2 and 3 compatible.

The following should be at the top of the new test files directly below the from __future__ ... imports:

from six.moves import (filter, input, map, range, zip)  # noqa

I'd suggest you make these changes first and commit them by re-writing the previous commit, since there is no need to see the change as a separate commit:

$ git commit -a --amend

You then need to do a rebase to integrate your changes to stats.py with the change from #1700. A rebase would probably look something like this:

$ git checkout weight-pearsonr
$ git fetch upstream
$ git rebase upstream/master

You might get a message about a conflict, which will need to be resolved by opening the file(s) indicated by git, looking for the sections marked as in conflict, and editing these so they make sense (essentially just so that the new import is at the top). You then stage the resolved file(s) with git add and use

$ git rebase --continue

If there are no conflicts then the last step won't be required. As before, do make a backup of your branch first in case of emergency (it also gives you that nice fuzzy feeling of knowing it doesn't matter if you make a mistake, so you can learn new git features without any stress!).

You may notice I used a reference to something called upstream earlier, which assumes you have a remote named upstream which refers to the SciTools iris repository, you can check with

git remote -v

and if you don't have any remote referring to github.com/SciTools/iris then you can create one with

git remote add upstream git://github.com/SciTools/iris.git

(you may need to use the http protocol instead of git:// depending on your network configuration at work)

As always, please ask if you need help. These things sound a little daunting at first but once you figure it out it isn't so bad.

@ajdawson (Member) commented Jul 7, 2015

This also fixes #1598!

@rcomer force-pushed the weight-pearsonr branch from 137c2fc to 3ffd02a on July 7, 2015 at 10:38
@rcomer (Member, Author) commented Jul 7, 2015

Thanks @ajdawson for the detailed instructions. The only conflict was the existence of the old test_stats.py file, so it was pretty straightforward.

if weights.shape != cube_2.shape:
    raise ValueError("weights array should have dimensions {}".
                     format(cube_2.shape))
dims_1_common = [i for i in xrange(cube_1.ndim) if
Member (inline review comment):

I think this needs to be just range rather than xrange for Python 3 compatibility.

@ajdawson (Member) commented Jul 7, 2015

I'm so sorry @rcomer, but I need you to make a few more very quick changes! I dropped the ball here; I should have spotted these earlier. We're trying to make iris Python 2 and 3 compatible, but we don't yet have the test suite ready for Python 3. In the meantime we have to manually check for things that are not Python 3 compatible, and I missed these ones previously. If I let these slide @QuLogic will be 😞

@rcomer force-pushed the weight-pearsonr branch from 3ffd02a to 9c522d7 on July 7, 2015 at 12:09
ajdawson added a commit that referenced this pull request Jul 7, 2015
Re-write pearsonr to allow broadcasting and other aggregator keyword arguments
@ajdawson merged commit 2791388 into SciTools:master on Jul 7, 2015
@ajdawson (Member) commented Jul 7, 2015

This PR represents a significant improvement to functionality, and a particularly strong effort from @rcomer. Good work! 🎉

@rcomer (Member, Author) commented Jul 7, 2015

Done, and learned something about Python3 along the way!

@rcomer deleted the weight-pearsonr branch on July 7, 2015 at 12:40
@rhattersley (Member):

This PR represents a significant improvement to functionality, and a particularly strong effort from @rcomer.

Well put! 👍 Top work @rcomer, and thanks for guiding this through @ajdawson.

@QuLogic (Member) commented Jul 7, 2015

If I let these slide @QuLogic will be 😞

Thanks @ajdawson 😄
