Re-write pearsonr to allow broadcasting and other aggregator keyword arguments #1714
Conversation
```diff
@@ -111,7 +120,8 @@ def pearsonr(cube_a, cube_b, corr_coords=None):
     For example providing two time/altitude/latitude/longitude
     cubes and corr_coords of 'latitude' and 'longitude' will result
     in a time/altitude cube describing the latitude/longitude
-    (i.e. pattern) correlation at each time/altitude point.
+    (i.e. pattern) correlation at each time/altitude point. Area
+    weights may be set using :func:`iris.analysis.cartography.area_weights`
```
Whilst this is true, it makes it sound like that is the only way weights might be generated. It would be more appropriate to just mention that weights may be provided if required. You can mention area weighting as an example where the `weights` keyword itself is documented, if you like.
The change itself seems reasonable. However, it would be really nice if the testing could be done according to the new unit testing guidelines: http://scitools.org.uk/iris/docs/latest/developers_guide/tests.html. I don't want to put you off, but I'll explain (what I think is) the current position: new tests should not be added to the old unit tests, so you would at least need to add the new tests under the new unit test layout. (For bonus points you can move the existing tests across too.)
Thanks @ajdawson. Everything else you've suggested is straightforward.
It might be easier to just move the lot. Should be a fairly simple rename of file and docstring, plus whatever the developer guide suggests for class/method names, as you suggested. Don't worry too much about a new test class for the helper; as long as we don't lose test coverage in the move I'll be happy with it.

Ah, I've only just spotted a fairly big issue (but not your fault @rcomer, the problem existed before you touched the code). The test loads data from the iris-sample-data repository. This is a no-no; all tests using data should use data in the iris-test-data repository. As for how to approach this I'm not sure right now, I'll need to have a think about it. We could just copy the relevant files into the iris-test-data repository, but it might be better if there was already some test data that could be used in place of the GloSea data.

You'll also need to add an …

Or even just some dummy data without relying on a file at all. I'll have a think.
Having given this some more thought, I'm wondering if a complete re-write of `pearsonr` might be worthwhile, something like:

```python
def pearsonr(cube_a, cube_b, corr_coords=None, weights=None):
    sa = cube_a - cube_a.collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    sb = cube_b - cube_b.collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    covar = (sa * sb).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    var_a = (sa ** 2).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    var_b = (sb ** 2).collapsed(corr_coords, iris.analysis.MEAN, weights=weights)
    denom = iris.analysis.maths.apply_ufunc(np.sqrt, var_a * var_b,
                                            new_unit=covar.units)
    corr_cube = covar / denom
    corr_cube.rename("Pearson's r")
    return corr_cube
```

With the …
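For concreteness, here is a rough sketch of how the proposed signature could be used for an area-weighted pattern correlation. The cube construction, grid sizes and random data below are invented purely for illustration; only `pearsonr(cube_a, cube_b, corr_coords=..., weights=...)` and `iris.analysis.cartography.area_weights` come from the discussion above.

```python
import numpy as np
from iris.coords import DimCoord
from iris.cube import Cube
from iris.analysis.cartography import area_weights
from iris.analysis.stats import pearsonr  # assumes the proposed signature above


def make_cube(data):
    # Build a small time/latitude/longitude cube; the grid here is illustrative.
    nt, nlat, nlon = data.shape
    time = DimCoord(np.arange(nt, dtype=np.float64), standard_name='time',
                    units='days since 1970-01-01')
    lat = DimCoord(np.linspace(-87.5, 87.5, nlat), standard_name='latitude',
                   units='degrees')
    lon = DimCoord(np.linspace(0.0, 355.0, nlon), standard_name='longitude',
                   units='degrees', circular=True)
    lat.guess_bounds()  # area_weights needs bounded latitude/longitude
    lon.guess_bounds()
    return Cube(data, dim_coords_and_dims=[(time, 0), (lat, 1), (lon, 2)])


rng = np.random.RandomState(0)
cube_a = make_cube(rng.standard_normal((4, 36, 72)))
cube_b = make_cube(rng.standard_normal((4, 36, 72)))

weights = area_weights(cube_a)  # same shape as the cubes: (4, 36, 72)
corr = pearsonr(cube_a, cube_b, corr_coords=['latitude', 'longitude'],
                weights=weights)
print(corr)  # a time-series cube of area-weighted pattern correlations
```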
Well, I can't argue with the readability improvement that would bring. Would it solve the weighting issue too? I guess legitimate concerns might be about performance; do you know if there would be any significant change in how quickly this would compute the correlation?

It does appear to be slower. I haven't tested properly, but running … Edit to add: yes, it would solve the weighting issue. My last commit failed with …

I've restarted it, hopefully it will be fine this time.

What size input is that, out of interest? I'm wondering if it might still be worth it.

Also, an added benefit of this is that I'm pretty sure it allows broadcasting, so I can perform correlation over time of two cubes [time] and [time, lat, lon] and the result should be the correlation map [lat, lon]... This is a big win in my eyes.

The GloSea4 data used in the test are 6x145x192 (time x lat x lon). I had a look at the broadcasting, and it seems to work beautifully. I need to think a bit more about how it would work with weights though.
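To make the broadcasting point concrete, here is a minimal sketch with toy data. It assumes the proposed re-write above and that iris cube arithmetic broadcasts a [time] cube against a [time, lat, lon] cube as reported in this thread; the coordinates and values are made up for the example.

```python
import numpy as np
from iris.coords import DimCoord
from iris.cube import Cube
from iris.analysis.stats import pearsonr  # proposed re-write above

nt, nlat, nlon = 6, 5, 8
time = DimCoord(np.arange(nt, dtype=np.float64), standard_name='time',
                units='days since 1970-01-01')
lat = DimCoord(np.linspace(-60.0, 60.0, nlat), standard_name='latitude',
               units='degrees')
lon = DimCoord(np.linspace(0.0, 315.0, nlon), standard_name='longitude',
               units='degrees')

# A 1-d index series and a 3-d field sharing the time coordinate.
index = Cube(np.arange(nt, dtype=np.float64),
             dim_coords_and_dims=[(time, 0)])
field = Cube(np.random.RandomState(0).standard_normal((nt, nlat, nlon)),
             dim_coords_and_dims=[(time.copy(), 0), (lat, 1), (lon, 2)])

# Correlate over time; the result is a latitude/longitude correlation map.
corr_map = pearsonr(field, index, corr_coords=['time'])
print(corr_map)  # a [latitude, longitude] cube named "Pearson's r"
```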
I think the broadcasting is actually quite a big feature and I'd be happy to sacrifice a small amount of performance to accommodate it. For handling missing values, the appropriate way to do it may be debatable. In this proposed implementation the mean and variances for each input would be calculated using all available data, whereas the covariance terms would be computed with only the data locations that are non-missing in both inputs. Whilst this may seem inconsistent (and may in a rare case result in absolute correlations greater than 1), I'd argue that this is acceptable, since you should use the best possible estimate you can for the mean and variance of each variable. This echoes the advice given in Chatfield [The Analysis of Time Series: An Introduction, Chapman and Hall/CRC] regarding how to compute the statistics of each input into a lagged correlation (i.e. you compute the mean using all N elements of the time series even though only N - k values are actually used in the correlation, where N is the length of the original time series and k is the lag). If this is acceptable then I think we might be home and dry already!
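As a small illustration of that missing-data convention (plain NumPy masked arrays rather than cubes; the numbers are made up), the mean and variance of each series use all of their own available points, while the covariance is automatically restricted to points present in both series, because masks propagate through the element-wise product:

```python
import numpy as np
import numpy.ma as ma

a = ma.masked_invalid([1.0, 2.0, np.nan, 4.0, 5.0])
b = ma.masked_invalid([2.0, np.nan, 6.0, 8.0, 10.0])

# Each mean/variance uses all available values of its own series ...
mean_a, mean_b = a.mean(), b.mean()
var_a = ((a - mean_a) ** 2).mean()
var_b = ((b - mean_b) ** 2).mean()

# ... but the covariance only sees points unmasked in *both* series,
# since masked values propagate through the product.
covar = ((a - mean_a) * (b - mean_b)).mean()

r = covar / np.sqrt(var_a * var_b)
print(r)  # can, in rare cases, stray slightly outside [-1, 1]
```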
Updated branch with re-write. This works for the test examples at least. Lines 68-73 and 81-90 in …
Fast work, I'm struggling to keep up! I have a handful of minor comments which I will add later. For now I think the only major things to think about are: …
Is there some documentation listing/describing what's in the test data files?

Good question. I think the answer is no though. I just used `lscubes` to have a quick scan; looks like there might be something appropriate in …
```diff
@@ -22,108 +22,40 @@
 from __future__ import (absolute_import, division, print_function)

 import numpy as np
 import numpy.ma as ma
```
Unused import I think.
Thanks for the tips, I think I've covered everything now. I take your point about the treatment of missing data being arguable. I guess mask-matching could be added in as an option, making it a user decision. Think I'd rather leave that for a future PR though!

Absolutely, it is out of scope for this PR.
```diff
-    Cubes should be the same shape and have the
-    same dimension coordinates.
+    Between which the correlation field will be calculated. Cubes should
+    be the same shape and have the same dimension coordinates or cube_b
```
Is there an easy way to remove the ordering restriction? Where do you see the error if these are the wrong way round?
Given cubes `a` and `b`, where `a` has the smaller dimensionality, `a * b` fails in `iris.analysis.maths._assert_compatible` with `ValueError: The array operation would increase the dimensionality of the cube. The new cubes data would have had to become: [dimensions of b]`.
Can we just find the smaller one and swap the inputs if necessary? If you make a temporary assignment it just creates a reference, so doing something like:

```python
if cube_a.ndim < cube_b.ndim:
    a = cube_b
    b = cube_a
else:
    a = cube_a
    b = cube_b
```

(don't care about the naming, just an example) and then using `a` and `b` instead of `cube_a` and `cube_b` won't actually create copies of `cube_a` and `cube_b`, so there are no extra memory issues to account for.
OK, that's easy enough. Done.
@rcomer - I've made a few more minor comments. The documentation comments may refer to things you didn't write originally, but I think since this is a total rewrite it might be worth addressing them now. I also want to reassure you that although this is taking a long time, the work is of very high quality and I'm very happy that you put in the time to do it all!

@rcomer - I changed the PR title to reflect the new scope of the PR. Pinging @niallrobinson in case you are interested in this.

Thanks @ajdawson, I'm happy with your suggestions and have made those changes.
OK, this looks in nice shape now. I'd like you to squash this into one commit (with a commit message that reflects the new scope of the change) before it is merged, @rcomer. If you haven't done this before it is simply a case of using the rebase command in interactive mode; instructions here: http://scitools.org.uk/iris/docs/latest/developers_guide/gitwash/development_workflow.html?highlight=rebase#rewriting-commit-history. Making a backup branch first is never a bad idea. If you need help please ask.

If you've not done it (much) before I'd say "Making a backup branch first is an excellent idea."

I'd never done it before, so thanks for the warnings! I think it worked...
@rcomer - Sorry for the delay, I was keen to test this thoroughly since it is a major change to an existing function. So far I think I'm convinced the original functionality is duplicated here; the only potential difference is metadata. This new version has more of it, essentially, since it captures changes to AuxCoords due to collapsing. This has an outside chance of causing a headache to someone, but I don't expect it to be a problem that needs a solution now, especially not given the added benefits of your method (just discovered it works for ORCA grids too, which is very cool! We couldn't do that before.). Therefore, no further action required.

However, since #1700 was just merged, this branch will need to be rebased against the current master as it can no longer be merged automatically. The test files will also need to be modified since they are new. #1700 just adds some imports to the top of most files to make sure they are Python 2 and 3 compatible. The following should be at the top of the new test files, directly below the `from __future__ import ...` line:

```python
from six.moves import (filter, input, map, range, zip)  # noqa
```

I'd suggest you make these changes first and commit it by re-writing the previous commit, since there is no need to see the change as a separate commit.

You then need to do a rebase to integrate your changes to `stats.py` with the change from #1700. A rebase would probably look something like this: …

You might get a message about a conflict which will need to be resolved by opening the file(s) indicated by git and looking for the sections marked as in conflict, then editing these so they make sense (essentially just so that the new import is at the top). You then commit the result and continue the rebase.

If there are no conflicts then the last step won't be required. As before, do make a backup of your branch first in case of emergency (it also gives you that nice fuzzy feeling of knowing it doesn't matter if you make a mistake, so you can learn new git features without any stress!). You may notice I used a reference to something called `upstream` earlier, which assumes you have a remote named `upstream` referring to the SciTools iris repository; you can check with … and if you don't have any remote referring to … (you may need to use the http protocol instead of git:// depending on your network configuration at work). As always, please ask if you need help. These things sound a little daunting at first but once you figure it out it isn't so bad.
This also fixes #1598!

Thanks @ajdawson for the detailed instructions. The only conflict was the existence of the old …
```python
    if weights.shape != cube_2.shape:
        raise ValueError("weights array should have dimensions {}".
                         format(cube_2.shape))

    dims_1_common = [i for i in xrange(cube_1.ndim) if
```
I think this needs to be just `range` rather than `xrange` for Python 3 compatibility.
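A tiny standalone illustration of the point (not the actual iris code): with the `six.moves` import mentioned earlier, the same comprehension runs unchanged on Python 2 and Python 3, whereas `xrange` only exists on Python 2.

```python
from six.moves import range  # xrange on Python 2, the built-in range on Python 3

ndim = 3
dims_common = [i for i in range(ndim)]
print(dims_common)  # [0, 1, 2] on both Python 2 and Python 3
```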
I'm so sorry @rcomer, but I need you to make a few more very quick changes! I dropped the ball here; I should have spotted them earlier. We're trying to make iris Python 2 and 3 compatible, but currently we don't have the test suite ready for Python 3. In the meantime we have to manually check for things that are not Python 3 compatible, and I missed these ones previously. If I let these slide @QuLogic will be 😞
This PR represents a significant improvement to functionality, and a particularly strong effort from @rcomer. Good work! 🎉

Done, and learned something about Python 3 along the way!
Enables specification of the `weights` keyword in `iris.analysis.stats.pearsonr`, to calculate a weighted correlation. I want to use it for spatial correlations.

In the unit tests, I changed the "perfect correlation" tests from `assertArrayEqual` to `assertArrayAlmostEqual` because I only got 1 to seven decimal places.
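As an illustration of that last point, here is a sketch of the kind of change described; the class and test names are invented, and only `assertArrayEqual`/`assertArrayAlmostEqual` are the real iris test helpers. Floating-point arithmetic gives a self-correlation of 1 only to about seven decimal places, so an exact array comparison is too strict.

```python
import numpy as np
import iris.tests as tests


class TestPerfectCorrelation(tests.IrisTest):
    def test_self_correlation_is_one(self):
        # Stand-in for the data of pearsonr(cube, cube): equal to 1 only
        # to ~7 decimal places because of floating-point rounding.
        result = np.array([0.99999998, 1.00000002])
        # self.assertArrayEqual(result, np.ones(2))      # would fail
        self.assertArrayAlmostEqual(result, np.ones(2))   # passes


if __name__ == '__main__':
    tests.main()
```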