Skip to content

Issues with R's cov pairwise complete

Vince Buffalo edited this page Mar 1, 2019 · 3 revisions

I cannot tell if this is a bug, or something about the pairwise complete implementation I'm missing. An MRE:

a <- matrix(c(4, 1, 3, NA, 1, 8, NA, 3), byrow=TRUE, ncol=2)
a
     [,1] [,2]
[1,]    4    1
[2,]    3   NA
[3,]    1    8
[4,]   NA    3
cov(a, use='pairwise.complete')
           [,1]  [,2]
[1,]   2.333333 -10.5
[2,] -10.500000  13.0

Now, let's do this by hand — we calculate the covariance by mean-centering the elements and taking the inner product, and dividing by 1 (N - 1 for an unbiased estimator, since we lose a df on the mean, and the number of pairwise complete cases is 2):

b = scale(a, scale=FALSE); sum(b[, 1] * b[, 2], na.rm=TRUE)
[1] -10.66667

What's going on? Note that I get this same result with my Python numpy implementation.

Note the variances are as expected:

> b = scale(a, scale=FALSE); sum(b[, 1] * b[, 1], na.rm=TRUE) / 2
[1] 2.333333
> b = scale(a, scale=FALSE); sum(b[, 2] * b[, 2], na.rm=TRUE) / 2
[1] 13

One possibility is that this is due to numeric stability issues. Let's look at the calculation:

4/3 * -3 + -5/3 * 4
-12/3 - 20/3
-32/3 = -10.666...

which matches my result.

Solution

From Wrathematics:

In your (@vsbuffalo) example, you should calculate the means as (4+1)/2 and (1+8)/2. Then ((4-2.5)(1-4.5)+(1-2.5)(8-4.5))/1 = 10.5.

This surprises me a bit, but does make sense. It seems we get a better estimate of the mean with all data, but this could introduce bias. Will have to tinker with the underlying estimators later.

What's right?

Code here: https://gist.github.com/4218bafee0a651f3f5e5408500516b9e

Caveats — mean absolute difference is possible not the best matrix (for reasons unclear to me, sum absolute difference returns the opposite trend!). However note that with this much missing data, there are some very pathological cases!

Clone this wiki locally