Varying anova outputs #224

bgunasekera · 2022-01-20T13:45:42Z

Hi,

When I run the following code for an anova:

aov2 = pg.anova(dv='RT', between=['drug', 'Salient'], ss_type=3, data=plb_cbd_RT)
pg.print_table(aov2)

I get a certain output

When I run a different pingouin anova model first, such as:
aov1 = pg.rm_anova(dv='RT',
within=['drug', 'Salient'],
subject='id_upd', data=plb_cbd_RT)

and then immediately after run the first model (aov2) again I get a different output for aov2. Although P values are similar, degrees of freedom, F statistic and other values vary massively.

Interestingly, I re-ran the anova on SPSS as a reference and I got the exact same output as the second aov2 model (running aov1 first and then aov2 immediately after)

What could be the reason for this?

Please see outputs of model and code used attached

Best wishes,
Brandon

raphaelvallat · 2022-01-20T17:47:36Z

Hi @bjgunasekera,

Thanks for opening the issue. Are you using the latest version of Pingouin? My worry is that there could be an in-place modification of the data during the first call to the ANOVA. Could you check after each anova that plb_cbd_RT has not changed? For example you can make a copy with plb_cbd_RT_backup = plb_cbd_RT.copy() and then compare with pandas.DataFrame.equals.

Also, as good practice, when you create df_rt from df and then plb_cbd_RT from df_rt, I would always recommend that you use copy, otherwise any modifications that you make to one dataframe might end up being applied to the original dataframe (df or df_rt).

Lastly, are there any missing values in plb_cbd_RT?

Thanks,
Raphael

bgunasekera · 2022-01-21T10:08:45Z

Hi @raphaelvallat

Many thanks for getting back.

I updated my version to 0.5.0. This raised AssertionError: Factor must have at least two levels for the aov1 model. However, that is okay. The purpose of my query isn't to have the aov1 model working, rather, to understand why the outputs of the two way anova model (aov2) are different.

Despite the error, the outputs of the two way (aov2) remained different.

I used plb_cbd_RT_backup = plb_cbd_RT.copy() and then compared with pandas.DataFrame.equals as you suggested. After inspection, running the aov1 model (rm_anova) changed the drug variable from a category to an object. My independent & dependent variables Salient and RT remained as integers and floats respectively.

I investigated further by removing the aov1 (rm_anova) model and converting the independent variables to categories. This gave the original result. I then converted the independent variables to objects. This then gave the second result.

Therefore, the rm_anovaand mixed_anova (which I also tested) are performing this type conversion, from category to object. Is there a reason for this? Also, the first anova model has a massive F statistic for drug ~15000. This doesn't seem right? The output with the variables as objects seems more appropriate

Best wishes,
Brandon

raphaelvallat · 2022-01-22T01:47:54Z

Hi @bgunasekera,

This is related to #127. Pandas categorical are very much error prone, because sometimes you have a category with no values, but it is still included when using functions such as pandas.DataFrame.groupby, unless specifying observed=True.

To avoid this, some functions of Pingouin automatically converts the categorical variables to string, e.g. for rm_anova:

pingouin/pingouin/parametric.py

Lines 511 to 516 in dcfdc82

    
           # Convert Categorical columns to string 
        
           # This is important otherwise all the groupby will return different results 
        
           # unless we specify .groupby(..., observed = True). 
        
           for c in [subject, within]: 
        
               if data[c].dtype.name == 'category': 
        
                   data[c] = data[c].astype(str)

However, there are two issues here:

The first is that this conversion is not applied in the pg.anova function. This is something that I should add asap to avoid such errors.
The second is that the conversion categorical --> object should not be applied in-place — i.e. modifying your original dataframe — but instead on a copy of the dataframe.

Does that make sense? I am a little confused with the example that you gave, mostly because I do not have the data. By any chance, could you perhaps create and share here a simple toy dataset to reproduce the error?

Thanks for your help on this, I appreciate it.

Raphael

raphaelvallat · 2022-01-22T01:52:41Z

Quick precision: the error here only concerns N-way ANOVA and not 1-way ANOVA since we do use observed=True for the latter:

pingouin/pingouin/parametric.py

Lines 947 to 953 in dcfdc82

    
           # Calculate sums of squares 
        
           grp = data.groupby(between, observed=True)[dv] 
        
           # Between effect 
        
           ssbetween = ((grp.mean() - data[dv].mean())**2 * grp.count()).sum() 
        
           # Within effect (= error between) 
        
           #  = (grp.var(ddof=0) * grp.count()).sum() 
        
           sserror = grp.apply(lambda x: (x - x.mean())**2).sum()

bgunasekera · 2022-01-23T15:00:06Z

Hi @raphaelvallat,

Many thanks.

I had read in another thread that you prefer data being sent in csv format. I converted the data from spss to csv and interestingly noticed that the anova outputs were corrected, despite the code being near identical.

I suspect this is because within the spss file, the independent variables were encoded as categorical variables. Therefore, pd.read_spss also translated this to categorical variables within the data frame. Therefore, as you previously mentioned, given the instability of pandas categorical, this gave rise to the incorrect anova output. Given that within the csv file there was no pre-encoding of the variables, pd.read_csv simply registered the independent variables as integers.

Nevertheless, more than happy to share the data/ code. Unfortunately .sav files (the format of spss data) are not supported on github. Am I okay to email you all the above named files? (x2 datasets in spss+csv format & x2 scripts)

Also, I was unaware that pandas categorical was unstable. Do you have a reference for this for me to read more?

Best wishes,
Brandon

raphaelvallat · 2022-01-25T22:06:56Z

Hi @bgunasekera,

Thank you for the detailed explanation. I do not have SPSS so I will not be able to read the files. You could perhaps try to export your data as a Parquet file (pd.DataFrame.to_parquet) which I think should save the categorical dtype.

"Unstable" is a strong word — what I meant is that the fact that some categories can be hidden (no actual value in DataFrame) can easily lead to mistakes when using groupby operations (which I use a lot!). I have not access to the full article, but this blogpost may also be relevant.

Best,
Raphael

…umns + added observed=True to all groupby #224

raphaelvallat · 2022-02-12T18:31:13Z

Hi @bgunasekera,

I have just pushed a commit to address this (257d216). You were right that the pg.rm_anova function did change the dtypes of categorical columns to object IN-PLACE:

import numpy as np
import pandas as pd
import pingouin as pg
df = pg.read_dataset("rm_anova_wide")
# melt and convert to categorical
df_piv = df.melt(ignore_index=False, var_name="time", value_name="score").reset_index()
df_piv['time'] = df_piv['time'].astype('category')
print(df_piv.info()) # "time" is categorical
pg.rm_anova(data=df_piv, dv="score", within="time", subject="index")
print(df_piv.info())  # "time" is now an object!

I have now disabled this behavior. Instead of automatically converting categorical to dtypes, we are now just making sure to use observed=True whenever applying a pandas.groupby or pandas.pivot_table. This ensures that only categories with actual, observed values are taken into account when computing the ANOVA.

If you want, please feel free to share your data and code. I would like to make sure that this fixed the issue in real-world data. Alternatively, fork Pingouin, git pull, switch to the branch v0.5.1 and re-run your code: the bug should have disappeared.

Thanks,
Raphael

* Flake8 * Explicit error when y is an empty list in pg.ttest #222 * Add keyword arguments in homoscedasticity function #218 * Bugfix rm_anova and mixed_anova changed the dtypes of categorical columns + added observed=True to all groupby #224 * Update version number in init and setup * Use np.isclose for test_pearson == 1 #195 * Coverage for try..except scipy fallback * Fix set_option for pandas 1.4 * Upgraded dependencies for seaborn and statsmodels * Added Jarque-Bera test in pg.normality #216 * Coverage scipy import error * Use pd.concat instead of frame.append to avoid FutureWarning * Remove add_categories(inplace=True) to avoid FutureWarning * GH Discussions instead of Gitter * Minor doc fix

raphaelvallat self-assigned this Jan 20, 2022

raphaelvallat added the invalid 🚩 This doesn't seem right label Jan 20, 2022

raphaelvallat added a commit that referenced this issue Feb 12, 2022

Bugfix rm_anova and mixed_anova changed the dtypes of categorical col…

257d216

…umns + added observed=True to all groupby #224

raphaelvallat mentioned this issue Feb 12, 2022

Release v0.5.1 #236

Merged

12 tasks

raphaelvallat added the IMPORTANT❗ label Feb 12, 2022

raphaelvallat closed this as completed Feb 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Varying anova outputs #224

Varying anova outputs #224

bgunasekera commented Jan 20, 2022 •

edited

Loading

raphaelvallat commented Jan 20, 2022

bgunasekera commented Jan 21, 2022 •

edited

Loading

raphaelvallat commented Jan 22, 2022

raphaelvallat commented Jan 22, 2022

bgunasekera commented Jan 23, 2022 •

edited

Loading

raphaelvallat commented Jan 25, 2022

raphaelvallat commented Feb 12, 2022

Varying anova outputs #224

Varying anova outputs #224

Comments

bgunasekera commented Jan 20, 2022 • edited Loading

raphaelvallat commented Jan 20, 2022

bgunasekera commented Jan 21, 2022 • edited Loading

raphaelvallat commented Jan 22, 2022

raphaelvallat commented Jan 22, 2022

bgunasekera commented Jan 23, 2022 • edited Loading

raphaelvallat commented Jan 25, 2022

raphaelvallat commented Feb 12, 2022

bgunasekera commented Jan 20, 2022 •

edited

Loading

bgunasekera commented Jan 21, 2022 •

edited

Loading

bgunasekera commented Jan 23, 2022 •

edited

Loading