Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

t-test and p-value for partial and semi-partial correlation #171

Closed
anthonybritto opened this issue May 21, 2021 · 7 comments
Closed

t-test and p-value for partial and semi-partial correlation #171

anthonybritto opened this issue May 21, 2021 · 7 comments
Assignees
Labels
bug 💥 Something isn't working invalid 🚩 This doesn't seem right

Comments

@anthonybritto
Copy link

As per the documentation of the R package ppcor : https://dx.doi.org/10.5351%2FCSAM.2015.22.6.665, the correct degrees of freedom for the calculation of the t-statistic and p-values is n - g- 2 where g is the number of covariates.

I believe that the subtraction of g from df has not been implemented in pingouin, as the three screenshots below show.

Here, I simply pull the p-value from partial_corr to show that I haven’t altered anything.
p_val_pingouin

I now compute what I believe to be the correct p-value, with df=n-(k-1)-2 since I control for all predictors but one.

p_val_cohen

Finally, I show how by using the incorrect df=n-2, I am able to reproduce pingouin’s results.
p_val_pingouin_cohen_equal

@raphaelvallat raphaelvallat self-assigned this May 21, 2021
@raphaelvallat raphaelvallat added bug 💥 Something isn't working invalid 🚩 This doesn't seem right labels May 21, 2021
@raphaelvallat
Copy link
Owner

Thanks @anthonybritto for spotting this critical issue. I need to fix this ASAP. Please see below for more details.

The problem

pingouin.partial_corr does not adjust the degrees of freedom by the number of covariates, which means that the p-value, confidence interval, power and Bayes factor are wrong. The correlation coefficient is unaffected by this issue. The plot below shows how the p-value is affected as a function of the correlation coefficient, sample size and number of covariates. In short, the p-values currently returned by Pingouin are smaller compared to the exact p-values, and this gets worse with a low sample size and/or high number of covariates and/or weak correlation coefficient. The smaller the sample size, the worse the problem is. If the sample size is large enough, and/or the correlation is high, then the p-value is largely unaffected.

image

How to fix / testing

The calculation of the p-value, confidence interval, power must be done within the pingouin.partial_corr, and not by calling pingouin.corr as is currently the case.

  • The correlation coefficient and p-value of a (semi)-partial correlation can be calculated with the ppcor package.
  • The formula to calculate the confidence intervals of a partial correlation can be found here
  • I believe the power of the correlation can be calculated by just subtracting the number of covariates to n before calling pingouin.power_corr(r, n)
  • The adjusted R^2 should be calculated with: adj_r2 = 1 - (((1 - r2) * (n - 1)) / (n - k - 2)) where k is the number of covariates (reference?).
  • A formula for the calculation of the Bayes Factor of a partial correlation is given in the appendix of Wetzels and Wagenmakers 2012. An R implementation can also be found here, though it seems to use a different method. Note that the default method for calculating the Bayes Factor of a correlation coefficient in Pingouin is based on the analytical solution provided by Ly 2016, but I did not find any mention of partial or semi-partial correlation in the paper. I am therefore not sure whether we can just subtract the number of covariates to n to get the correct BF.

@anthonybritto
Copy link
Author

anthonybritto commented May 25, 2021

Your fixes look good, far as I can tell! I unfortunately don't know anything about the Bayes Factor. I will look at the sources you reference at some point this week to see if I can help.

@raphaelvallat
Copy link
Owner

@anthonybritto I just released a new stable version of Pingouin (v0.3.12) which fixes the p-value and confidence intervals of the partial correlation. Please see the full list of modifications here: https://pingouin-stats.org/changelog.html

  • P-value
  • Confidence intervals
  • Statistical power (DISABLED)
  • Bayes Factor (DISABLED)
  • Adjusted R^2 (DISABLED)

@raphaelvallat
Copy link
Owner

FYI I have just pushed a commit on the develop branch (81d1aaf) to re-implement the partial correlation using the same method as ppcor:

# Calculate the partial corrrelation matrix - similar to pingouin.pcorr()
if method == "spearman":
# Convert the data to rank, similar to R cov()
V = data.rank(na_option='keep').cov()
else:
V = data.cov()
Vi = np.linalg.pinv(V) # Inverse covariance matrix
Vi_diag = Vi.diagonal()
D = np.diag(np.sqrt(1 / Vi_diag))
pcor = -1 * (D @ Vi @ D) # Partial correlation matrix
if covar is not None:
r = pcor[0, 1]
else:
# Semi-partial correlation matrix
with np.errstate(divide='ignore'):
spcor = pcor / \
np.sqrt(np.diag(V))[..., None] / \
np.sqrt(np.abs(Vi_diag - Vi ** 2 / Vi_diag[..., None])).T
if y_covar is not None:
r = spcor[0, 1] # y_covar is removed from y
else:
r = spcor[1, 0] # x_covar is removed from x
if np.isnan(r):
# Correlation failed. Return NaN. When would this happen?
return pd.DataFrame({'n': n, 'r': np.nan, 'CI95%': np.nan,
'p-val': np.nan}, index=[method])
# Compute the two-sided p-value and confidence intervals
# https://online.stat.psu.edu/stat505/lesson/6/6.3
pval = _correl_pvalue(r, n, k)
ci = compute_esci(
stat=r, nx=(n - k), ny=(n - k), eftype='r', decimals=6)
# Create dictionnary
stats = {
'n': n,
'r': r,
'CI95%': [ci],
'p-val': pval if tail == 'two-sided' else .5 * pval,
}

We're now getting exactly the same results as ppcor for partial/semi-partial correlations (either Pearson or Spearman). I think this will be included in the next stable release of Pingouin.

@anthonybritto
Copy link
Author

anthonybritto commented Jul 20, 2021

@raphaelvallat Thanks for keeping me updated.

I see that you have updated the p-value calculation to use the beta-distribution: I am not familiar with this approach, so I'm afraid I can't comment.

If you are however reproducing the ppcor results, I should reiterate that the p-value of the semi-partial correlation is incorrect. For the record, I attach a preprint of the communication that I wish to submit to the journal in which ppcor was published; it includes what I believe are the correct formulas for the t-statistics of the betas, and the partial and semi-partial correlations .

ppcor_correction.pdf

@raphaelvallat
Copy link
Owner

Hi @anthonybritto,

  1. The formula for the p-value calculation can be found on Wikipedia and is used in the scipy.stats.pearsonr function. It gives similar results to the method based on the T distribution, but faster.

  2. Did you hear back from the authors of the ppcor package? When the paper is peer-reviewed and accepted, I think it would be great to make a PR to integrate your suggested edits. For now however, I'd prefer to keep it consistent with the R output as it is what most users would expect.

Thanks,
Raphael

@raphaelvallat
Copy link
Owner

Closing this issue but feel free to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 💥 Something isn't working invalid 🚩 This doesn't seem right
Projects
None yet
Development

No branches or pull requests

2 participants