Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Varying anova outputs #224

Closed
Tracked by #236
bgunasekera opened this issue Jan 20, 2022 · 7 comments
Closed
Tracked by #236

Varying anova outputs #224

bgunasekera opened this issue Jan 20, 2022 · 7 comments
Assignees
Labels
IMPORTANT❗ invalid 🚩 This doesn't seem right

Comments

@bgunasekera
Copy link

bgunasekera commented Jan 20, 2022

Hi,

When I run the following code for an anova:

aov2 = pg.anova(dv='RT', between=['drug', 'Salient'], ss_type=3, data=plb_cbd_RT)
pg.print_table(aov2)

I get a certain output

When I run a different pingouin anova model first, such as:
aov1 = pg.rm_anova(dv='RT',
within=['drug', 'Salient'],
subject='id_upd', data=plb_cbd_RT)

and then immediately after run the first model (aov2) again I get a different output for aov2. Although P values are similar, degrees of freedom, F statistic and other values vary massively.

Interestingly, I re-ran the anova on SPSS as a reference and I got the exact same output as the second aov2 model (running aov1 first and then aov2 immediately after)

What could be the reason for this?

Please see outputs of model and code used attached

Best wishes,
Brandon

code
spss
run2
run1

@raphaelvallat raphaelvallat self-assigned this Jan 20, 2022
@raphaelvallat raphaelvallat added the invalid 🚩 This doesn't seem right label Jan 20, 2022
@raphaelvallat
Copy link
Owner

Hi @bjgunasekera,

Thanks for opening the issue. Are you using the latest version of Pingouin? My worry is that there could be an in-place modification of the data during the first call to the ANOVA. Could you check after each anova that plb_cbd_RT has not changed? For example you can make a copy with plb_cbd_RT_backup = plb_cbd_RT.copy() and then compare with pandas.DataFrame.equals.

Also, as good practice, when you create df_rt from df and then plb_cbd_RT from df_rt, I would always recommend that you use copy, otherwise any modifications that you make to one dataframe might end up being applied to the original dataframe (df or df_rt).

Lastly, are there any missing values in plb_cbd_RT?

Thanks,
Raphael

@bgunasekera
Copy link
Author

bgunasekera commented Jan 21, 2022

Hi @raphaelvallat

Many thanks for getting back.

I updated my version to 0.5.0. This raised AssertionError: Factor must have at least two levels for the aov1 model. However, that is okay. The purpose of my query isn't to have the aov1 model working, rather, to understand why the outputs of the two way anova model (aov2) are different.

Despite the error, the outputs of the two way (aov2) remained different.

I used plb_cbd_RT_backup = plb_cbd_RT.copy() and then compared with pandas.DataFrame.equals as you suggested. After inspection, running the aov1 model (rm_anova) changed the drug variable from a category to an object. My independent & dependent variables Salient and RT remained as integers and floats respectively.

I investigated further by removing the aov1 (rm_anova) model and converting the independent variables to categories. This gave the original result. I then converted the independent variables to objects. This then gave the second result.

Therefore, the rm_anovaand mixed_anova (which I also tested) are performing this type conversion, from category to object. Is there a reason for this? Also, the first anova model has a massive F statistic for drug ~15000. This doesn't seem right? The output with the variables as objects seems more appropriate

Best wishes,
Brandon

@raphaelvallat
Copy link
Owner

Hi @bgunasekera,

This is related to #127. Pandas categorical are very much error prone, because sometimes you have a category with no values, but it is still included when using functions such as pandas.DataFrame.groupby, unless specifying observed=True.

To avoid this, some functions of Pingouin automatically converts the categorical variables to string, e.g. for rm_anova:

# Convert Categorical columns to string
# This is important otherwise all the groupby will return different results
# unless we specify .groupby(..., observed = True).
for c in [subject, within]:
if data[c].dtype.name == 'category':
data[c] = data[c].astype(str)

However, there are two issues here:

  1. The first is that this conversion is not applied in the pg.anova function. This is something that I should add asap to avoid such errors.

  2. The second is that the conversion categorical --> object should not be applied in-place — i.e. modifying your original dataframe — but instead on a copy of the dataframe.

Does that make sense? I am a little confused with the example that you gave, mostly because I do not have the data. By any chance, could you perhaps create and share here a simple toy dataset to reproduce the error?

Thanks for your help on this, I appreciate it.

Raphael

@raphaelvallat
Copy link
Owner

Quick precision: the error here only concerns N-way ANOVA and not 1-way ANOVA since we do use observed=True for the latter:

# Calculate sums of squares
grp = data.groupby(between, observed=True)[dv]
# Between effect
ssbetween = ((grp.mean() - data[dv].mean())**2 * grp.count()).sum()
# Within effect (= error between)
# = (grp.var(ddof=0) * grp.count()).sum()
sserror = grp.apply(lambda x: (x - x.mean())**2).sum()

@bgunasekera
Copy link
Author

bgunasekera commented Jan 23, 2022

Hi @raphaelvallat,

Many thanks.

I had read in another thread that you prefer data being sent in csv format. I converted the data from spss to csv and interestingly noticed that the anova outputs were corrected, despite the code being near identical.

I suspect this is because within the spss file, the independent variables were encoded as categorical variables. Therefore, pd.read_spss also translated this to categorical variables within the data frame. Therefore, as you previously mentioned, given the instability of pandas categorical, this gave rise to the incorrect anova output. Given that within the csv file there was no pre-encoding of the variables, pd.read_csv simply registered the independent variables as integers.

Nevertheless, more than happy to share the data/ code. Unfortunately .sav files (the format of spss data) are not supported on github. Am I okay to email you all the above named files? (x2 datasets in spss+csv format & x2 scripts)

Also, I was unaware that pandas categorical was unstable. Do you have a reference for this for me to read more?

Best wishes,
Brandon

@raphaelvallat
Copy link
Owner

Hi @bgunasekera,

Thank you for the detailed explanation. I do not have SPSS so I will not be able to read the files. You could perhaps try to export your data as a Parquet file (pd.DataFrame.to_parquet) which I think should save the categorical dtype.

"Unstable" is a strong word — what I meant is that the fact that some categories can be hidden (no actual value in DataFrame) can easily lead to mistakes when using groupby operations (which I use a lot!). I have not access to the full article, but this blogpost may also be relevant.

Best,
Raphael

raphaelvallat added a commit that referenced this issue Feb 12, 2022
@raphaelvallat
Copy link
Owner

Hi @bgunasekera,

I have just pushed a commit to address this (257d216). You were right that the pg.rm_anova function did change the dtypes of categorical columns to object IN-PLACE:

import numpy as np
import pandas as pd
import pingouin as pg
df = pg.read_dataset("rm_anova_wide")
# melt and convert to categorical
df_piv = df.melt(ignore_index=False, var_name="time", value_name="score").reset_index()
df_piv['time'] = df_piv['time'].astype('category')
print(df_piv.info()) # "time" is categorical
pg.rm_anova(data=df_piv, dv="score", within="time", subject="index")
print(df_piv.info())  # "time" is now an object!

I have now disabled this behavior. Instead of automatically converting categorical to dtypes, we are now just making sure to use observed=True whenever applying a pandas.groupby or pandas.pivot_table. This ensures that only categories with actual, observed values are taken into account when computing the ANOVA.

If you want, please feel free to share your data and code. I would like to make sure that this fixed the issue in real-world data. Alternatively, fork Pingouin, git pull, switch to the branch v0.5.1 and re-run your code: the bug should have disappeared.

Thanks,
Raphael

@raphaelvallat raphaelvallat mentioned this issue Feb 12, 2022
12 tasks
raphaelvallat added a commit that referenced this issue Feb 20, 2022
* Flake8

* Explicit error when y is an empty list in pg.ttest

#222

* Add keyword arguments in homoscedasticity function

#218

* Bugfix rm_anova and mixed_anova changed the dtypes of categorical columns + added observed=True to all groupby

#224

* Update version number in init and setup

* Use np.isclose for test_pearson == 1

#195

* Coverage for try..except scipy fallback

* Fix set_option for pandas 1.4

* Upgraded dependencies for seaborn and statsmodels

* Added Jarque-Bera test in pg.normality

#216

* Coverage scipy import error

* Use pd.concat instead of frame.append to avoid FutureWarning

* Remove add_categories(inplace=True) to avoid FutureWarning

* GH Discussions instead of Gitter

* Minor doc fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IMPORTANT❗ invalid 🚩 This doesn't seem right
Projects
None yet
Development

No branches or pull requests

2 participants