
Re-write pearsonr to allow broadcasting and other aggregator keyword arguments #1714

Merged
ajdawson merged 1 commit into SciTools:master from rcomer:weight-pearsonr on Jul 7, 2015

Conversation

@rcomer (Member) commented Jun 26, 2015

Enables specification of weights keyword in iris.analysis.stats.pearsonr, to calculate a weighted correlation. I want to use it for spatial correlations.

In the unit tests, I changed the "perfect correlation" tests from assertArrayEqual to assertArrayAlmostEqual because the result only matched 1 to seven decimal places.
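
For context, a minimal usage sketch of the proposed weights keyword (not code from the PR; the cube names are hypothetical placeholders):

import iris.analysis.cartography
import iris.analysis.stats

# cube_a and cube_b are assumed to be cubes on the same latitude/longitude
# grid, with bounded horizontal coordinates so area weights can be computed.
weights = iris.analysis.cartography.area_weights(cube_a)
corr = iris.analysis.stats.pearsonr(cube_a, cube_b,
                                    corr_coords=['latitude', 'longitude'],
                                    weights=weights)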

@@ -111,7 +120,8 @@ def pearsonr(cube_a, cube_b, corr_coords=None):
For example providing two time/altitude/latitude/longitude
cubes and corr_coords of 'latitude' and 'longitude' will result
in a time/altitude cube describing the latitude/longitude
(i.e. pattern) correlation at each time/altitude point.
(i.e. pattern) correlation at each time/altitude point. Area
weights may be set using :func:`iris.analysis.cartography.area_weights`
Member (inline review comment):

Whilst this is true, it makes it sound like that is the only way weights might be generated. It would be more appropriate to just mention that weights may be provided if required. If you like, you can mention area weighting as an example where the weights keyword itself is documented.

@ajdawson (Member):

The change itself seems reasonable. However, it would be really nice if the testing could be done according to the new unit testing guidelines: http://scitools.org.uk/iris/docs/latest/developers_guide/tests.html.

I don't want to put you off, but I'll explain (what I think is) the current position: new tests should not be added to the old unit tests, so you would at least need to add the new tests in lib/iris/tests/unit/analysis/stats/test_pearsonr.py. If you have trouble with this we'd be happy to help.

(For bonus points you can move the existing pearsonr tests there too!)

@rcomer (Member, Author) commented Jun 26, 2015

Thanks @ajdawson. Since test_stats.py currently only tests one function, is it simply a matter of moving and renaming the file, and changing the docstring? Or is there something more involved? I suspect there ought to be some tests for _get_calc_view too, but I'm not sure I want to tackle that. Should I remove or revert the old test_stats.py file?

Everything else you've suggested is straightforward.

@ajdawson (Member):

It might be easier to just move the lot. Should be a fairly simple rename of file and docstring + whatever the developer guide suggests for class/method names, as you suggested.

Don't worry too much about a new test class for the helper, as long as we don't lose test coverage in the move I'll be happy with it.

@ajdawson (Member):

Ah, I've only just spotted a fairly big issue (not your fault @rcomer; the problem existed before you touched the code). The test loads data from the iris-sample-data repository. This is a no-no: all tests that use data should use data from the iris-test-data repository.

As for how to approach this I'm not sure right now, I'll need to have a think about it. We could just copy the relevant files into the iris-test-data repository, but it might be better if there was already some test data that could be used in place of the GloSea data.

@ajdawson (Member):

You'll also need to add a lib/iris/tests/unit/analysis/stats/__init__.py file, otherwise the tests inside won't be run.

@ajdawson (Member):

but it might be better if there was already some test data that could be used in place of the GloSea data.

Or even just some dummy data without relying on a file at all. I'll have a think.

@rcomer (Member, Author) commented Jun 27, 2015

Having given this some more thought, I'm wondering if a complete re-write of pearsonr might be more sensible. Along these lines:

import numpy as np

import iris.analysis
import iris.analysis.maths


def pearsonr(cube_a, cube_b, corr_coords=None, weights=None):
    # Deviations from the (weighted) mean over the correlation coordinates.
    sa = cube_a - cube_a.collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    sb = cube_b - cube_b.collapsed(corr_coords, iris.analysis.MEAN, weights=weights)

    # (Weighted) covariance and variances over the same coordinates.
    covar = (sa * sb).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    var_a = (sa ** 2).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    var_b = (sb ** 2).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)

    denom = iris.analysis.maths.apply_ufunc(np.sqrt, var_a * var_b,
                                            new_unit=covar.units)
    corr_cube = covar / denom
    corr_cube.rename("Pearson's r")

    return corr_cube

With the collapsed method doing all the complicated data-rearranging stuff, there is far less that could go wrong.

@ajdawson (Member):

Having given this some more thought, I'm wondering if a complete re-write of pearsonr might be more sensible.

Well, I can't argue with the readability improvement that would bring. Would it solve the weighting issue too? I guess a legitimate concern might be performance: do you know whether there would be any significant change in how quickly this would compute the correlation?

@rcomer (Member, Author) commented Jun 29, 2015

It does appear to be slower. I haven't tested properly, but running test_pearsonr.py a few times on the current version of my branch takes ~0.8s each time. Running it on the re-write takes ~1s each time.

Edit to add: yes, it would solve the weighting issue.

My last commit failed with TimedOutException. Not sure why.

@ajdawson (Member):

My last commit failed with TimedOutException. Not sure why.

I've restarted it, hopefully it will be fine this time.

@ajdawson (Member):

It does appear to be slower. I haven't tested properly, but running test_pearsonr.py a few times on the current version of my branch takes ~0.8s each time. Running it on the re-write takes ~1s each time.

What size of input is that, out of interest? I'm wondering if it might still be worth it.

@ajdawson (Member):

Also, an added benefit of this is that I'm pretty sure it allows broadcasting, so I could perform a correlation over time of two cubes with shapes [time] and [time, lat, lon], and the result should be the correlation map [lat, lon]... This is a big win in my eyes.
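
For illustration, a hypothetical sketch of that broadcasting case (not code from the PR; cube names are placeholders):

import iris.analysis
import iris.analysis.stats

# field_cube has dimensions [time, latitude, longitude]; index_cube has [time].
index_cube = field_cube.collapsed(['latitude', 'longitude'], iris.analysis.MEAN)

# Correlate over time; broadcasting should give a [latitude, longitude] map.
corr_map = iris.analysis.stats.pearsonr(field_cube, index_cube,
                                        corr_coords='time')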

@rcomer (Member, Author) commented Jun 29, 2015

The GloSea4 data used in the test are 6x145x192 (time x lat x lon).

I had a look at the broadcasting, and it seems to work beautifully. I need to think a bit more about how it would work with weights though.

@ajdawson (Member):

I think the broadcasting is actually quite a big feature and I'd be happy to sacrifice a small amount of performance to accommodate it.

For handling missing values, the appropriate approach may be debatable. In this proposed implementation the mean and variances for each input would be calculated using all available data, whereas the covariance term would be computed using only the data locations that are non-missing in both inputs. Whilst this may seem inconsistent (and may in rare cases result in absolute correlations greater than 1), I'd argue that it is acceptable, since you should use the best possible estimate you can for the mean and variance of each variable. This echoes the advice given in Chatfield [The Analysis of Time Series: An Introduction, Chapman and Hall/CRC] regarding how to compute the statistics of each input to a lagged correlation (i.e. you compute the mean using all N elements of the time series even though only N - k values are actually used in the correlation, where N is the length of the original time series and k is the lag). If this is acceptable then I think we might be home and dry already!
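
For reference, a textbook form of the lagged-correlation estimator alluded to above (my paraphrase, not quoted from the PR), with the mean taken over all N values:

    r_k = \frac{\sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{N} (x_t - \bar{x})^2}

Only N - k pairs contribute to the numerator, but \bar{x} and the denominator use the full series, which is the same spirit as computing each input's mean and variance from all available data here.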

@rcomer (Member, Author) commented Jun 29, 2015

Updated branch with re-write. This works for the test examples at least.

Lines 68-73 and 81-90 in stats.py are there purely to catch the case when both broadcasting and weights are used. It would be nice to find a neater way if possible. I'm also not sure what to do for cases where a cube has anonymous dimensions (though I don't think the original pearsonr would have handled that either).

@ajdawson (Member):

Fast work, I'm struggling to keep up! I have a handful of minor comments which I will add later. For now I think the only major things to think about are:

  1. Perhaps allowing other collapsed keywords to be passed through to pearsonr; I'm thinking of mdtol specifically (see the sketch after this list).
  2. Testing this without using the sample data (although I am pleased to see the existing tests pass with this refactor).
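
A minimal sketch of point 1 (hypothetical, not the merged implementation): mdtol could be accepted by pearsonr and forwarded, along with weights, to every collapsed call.

import numpy as np

import iris.analysis
import iris.analysis.maths


def pearsonr(cube_a, cube_b, corr_coords=None, weights=None, mdtol=1.):
    # Forward the aggregator keywords to every collapsed call.
    agg_kwargs = dict(weights=weights, mdtol=mdtol)
    sa = cube_a - cube_a.collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    sb = cube_b - cube_b.collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    covar = (sa * sb).collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    var_a = (sa ** 2).collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    var_b = (sb ** 2).collapsed(corr_coords, iris.analysis.MEAN, **agg_kwargs)
    denom = iris.analysis.maths.apply_ufunc(np.sqrt, var_a * var_b,
                                            new_unit=covar.units)
    corr_cube = covar / denom
    corr_cube.rename("Pearson's r")
    return corr_cube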

@rcomer (Member, Author) commented Jun 29, 2015

Is there some documentation listing/describing what's in the test data files?

@ajdawson (Member):

Good question. I think the answer is no, though. I just used lscubes to have a quick scan; it looks like there might be something appropriate in test_data/NetCDF/global/xyt/SMALL_total_column_co2.nc at first glance, but you might find something better.

@@ -22,108 +22,40 @@
from __future__ import (absolute_import, division, print_function)

import numpy as np
import numpy.ma as ma
Member (inline review comment):

Unused import, I think.

@rcomer (Member, Author) commented Jul 1, 2015

Thanks for the tips, I think I've covered everything now.

I take your point about the treatment of missing data being arguable. I guess mask-matching could be added in as an option, making it a user decision. Think I'd rather leave that for a future PR though!

@ajdawson (Member) commented Jul 1, 2015

I guess mask-matching could be added in as an option, making it a user decision. Think I'd rather leave that for a future PR though!

Absolutely, it is out of scope for this PR.

Cubes should be the same shape and have the
same dimension coordinates.
Between which the correlation field will be calculated. Cubes should
be the same shape and have the same dimension coordinates or cube_b
Member (inline review comment):

Is there an easy way to remove the ordering restriction? Where do you see the error if these are the wrong way round?

@rcomer (Member, Author) replied:

Given cubes a and b, where a has the smaller dimensionality, a * b fails in iris.analysis.maths._assert_compatible with ValueError: The array operation would increase the dimensionality of the cube. The new cubes data would have had to become: [dimensions of b].

Member (inline review comment):

Can we just find the smaller one and swap the inputs if necessary? A temporary assignment just creates a reference, so doing something like:

if cube_a.ndim < cube_b.ndim:
    a = cube_b
    b = cube_a
else:
    a = cube_a
    b = cube_b

(I don't care about the naming, this is just an example) and then using a and b instead of cube_a and cube_b won't actually create copies of cube_a and cube_b, so there are no extra memory issues to account for.

@rcomer (Member, Author) replied:

OK, that's easy enough. Done.

@ajdawson (Member) commented Jul 2, 2015

@rcomer - I've made a few more minor comments. The documentation comments may refer to things you didn't write originally, but I think since this is a total rewrite it might be worth addressing them now. I also want to reassure you that although this is taking a long time, the work is of very high quality and I'm very happy that you put in the time to do it all!

@ajdawson changed the title from "Weights keyword for pearsonr" to "Re-write pearsonr to allow broadcasting and other aggregator keyword arguments" on Jul 2, 2015
@ajdawson (Member) commented Jul 2, 2015

@rcomer - I changed the PR title to reflect the new scope of the PR.

Pinging @niallrobinson in case you are interested in this.

@ajdawson self-assigned this on Jul 3, 2015
@rcomer (Member, Author) commented Jul 3, 2015

Thanks @ajdawson, I'm happy with your suggestions and have made those changes.

@ajdawson (Member) commented Jul 3, 2015

OK this looks in nice shape now. I'd like you to squash this into 1 commit (with a commit message that reflects the new scope of the change) before it is merged @rcomer.

If you haven't done this before it is simply a case of using the rebase command in interactive mode, instructions here: http://scitools.org.uk/iris/docs/latest/developers_guide/gitwash/development_workflow.html?highlight=rebase#rewriting-commit-history. Making a backup branch first is never a bad idea. If you need help please ask.

@rhattersley (Member):

Making a backup branch first is never a bad idea.

If you've not done it (much) before I'd say "Making a backup branch first is an excellent idea."

@rcomer force-pushed the weight-pearsonr branch from eefaca9 to 137c2fc on July 3, 2015 at 15:37
@rcomer (Member, Author) commented Jul 3, 2015

I'd never done it before so thanks for the warnings! I think it worked...

@ajdawson (Member) commented Jul 7, 2015

@rcomer - Sorry for the delay, I was keen to test this thoroughly since it is a major change to an existing function. So far I think I'm convinced the original functionality is reproduced here; the only potential difference is metadata. This new version essentially has more of it, since it captures changes to AuxCoords due to collapsing. This has an outside chance of causing someone a headache, but I don't expect it to be a problem that needs a solution now, especially not given the added benefits of your method (I just discovered it works for ORCA grids too, which is very cool! We couldn't do that before.). Therefore, no further action required.

However, since #1700 was just merged, this branch will need to be rebased against the current master as it can no longer be merged automatically. The test files will also need to be modified since they are new. #1700 just adds some imports to the top of most files to make sure they are Python2 and 3 compatible.

The following should be at the top of the new test files directly below the from __future__ ... imports:

from six.moves import (filter, input, map, range, zip)  # noqa

I'd suggest you make these changes first and commit them by re-writing the previous commit, since there is no need to see the change as a separate commit:

$ git commit -a --amend

You then need to do a rebase to integrate your changes to stats.py with the change from #1700. A rebase would probably look something like this:

$ git checkout weight-pearsonr
$ git fetch upstream
$ git rebase upstream/master

You might get a message about a conflict, which will need to be resolved by opening the file(s) indicated by git, looking for the sections marked as in conflict, and editing these so they make sense (essentially just so that the new import is at the top). You then stage the resolved file(s) with git add and use

$ git rebase --continue

If there are no conflicts then the last step won't be required. As before, do make a backup of your branch first in case of emergency (it also gives you that nice fuzzy feeling of knowing it doesn't matter if you make a mistake, so you can learn new git features without any stress!).

You may notice I used a reference to something called upstream earlier, which assumes you have a remote named upstream which refers to the SciTools iris repository, you can check with

git remote -v

and if you don't have any remote referring to github.com/SciTools/iris then you can create one with

git remote add upstream git://github.com/SciTools/iris.git

(you may need to use the http protocol instead of git:// depending on your network configuration at work)

As always, please ask if you need help. These things sound a little daunting at first but once you figure it out it isn't so bad.

@ajdawson (Member) commented Jul 7, 2015

This also fixes #1598!

@rcomer force-pushed the weight-pearsonr branch from 137c2fc to 3ffd02a on July 7, 2015 at 10:38
@rcomer (Member, Author) commented Jul 7, 2015

Thanks @ajdawson for the detailed instructions. The only conflict was the existence of the old test_stats.py file, so it was pretty straightforward.

if weights.shape != cube_2.shape:
    raise ValueError("weights array should have dimensions {}".
                     format(cube_2.shape))
dims_1_common = [i for i in xrange(cube_1.ndim) if
Member (inline review comment):

I think this needs to be just range rather than xrange for Python 3 compatibility.

@ajdawson (Member) commented Jul 7, 2015

I'm so sorry @rcomer, but I need you to make a few more very quick changes! I dropped the ball here; I should have spotted these earlier. We're trying to make iris Python 2 and 3 compatible, but we don't yet have the test suite ready for Python 3. In the meantime we have to manually check for things that are not Python 3 compatible, and I missed these ones previously. If I let these slide @QuLogic will be 😞

@rcomer force-pushed the weight-pearsonr branch from 3ffd02a to 9c522d7 on July 7, 2015 at 12:09
ajdawson added a commit that referenced this pull request Jul 7, 2015
Re-write pearsonr to allow broadcasting and other aggregator keyword arguments
@ajdawson merged commit 2791388 into SciTools:master on Jul 7, 2015
@ajdawson (Member) commented Jul 7, 2015

This PR represents a significant improvement to functionality, and a particularly strong effort from @rcomer. Good work! 🎉

@rcomer (Member, Author) commented Jul 7, 2015

Done, and learned something about Python3 along the way!

@rcomer deleted the weight-pearsonr branch on July 7, 2015 at 12:40
@rhattersley (Member):

This PR represents a significant improvement to functionality, and a particularly strong effort from @rcomer.

Well put! 👍 Top work @rcomer, and thanks for guiding this through @ajdawson.

@QuLogic (Member) commented Jul 7, 2015

If I let these slide @QuLogic will be 😞

Thanks @ajdawson 😄
