Varying anova outputs #224
Hi @bjgunasekera,
Thanks for opening the issue. Are you using the latest version of Pingouin? My worry is that there could be an in-place modification of the data during the first call to the ANOVA. Could you check after each anova that Also, as good practice, when you create Lastly, are there any missing values in
Thanks,
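One way to test for the suspected in-place modification is to snapshot the column dtypes before a call and compare them afterwards. The following is a minimal sketch under stated assumptions: `dtype_changes` and `mutating_analysis` are hypothetical helpers (the latter only stands in for an analysis call that converts a column in place), and the data are made-up toy values, not the data from this issue.

```python
import pandas as pd


def dtype_changes(before: pd.Series, after: pd.Series) -> dict:
    """Return {column: (old_dtype, new_dtype)} for columns whose dtype changed."""
    return {
        col: (str(before[col]), str(after[col]))
        for col in before.index
        if col in after.index and before[col] != after[col]
    }


def mutating_analysis(data: pd.DataFrame) -> None:
    # Hypothetical stand-in for an analysis function that
    # converts a categorical column to strings in place.
    data["drug"] = data["drug"].astype(str)


# Toy data (assumed values, for illustration only).
df = pd.DataFrame({
    "drug": pd.Categorical(["placebo", "cbd", "placebo", "cbd"]),
    "RT": [512.0, 498.0, 530.0, 476.0],
})

before = df.dtypes.copy()          # snapshot of dtypes before the call
mutating_analysis(df)              # the call under suspicion
changes = dtype_changes(before, df.dtypes)
print(changes)                     # lists any column whose dtype was altered
```

Passing `df.copy()` to the analysis call instead of `df` would be a simple way to protect the original DataFrame while investigating.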
Many thanks for getting back. I updated my version to Despite the error, the outputs of the two way ( I used I investigated further by removing the Therefore, the
Best wishes,
Hi @bgunasekera,
This is related to #127. Pandas categoricals are very error-prone, because sometimes you have a category with no values, but it is still included when using functions such as pandas.DataFrame.groupby, unless specifying To avoid this, some functions of Pingouin automatically convert the categorical variables to string, e.g. for pingouin/pingouin/parametric.py Lines 511 to 516 in dcfdc82
However, there are two issues here:
Does that make sense? I am a little confused with the example that you gave, mostly because I do not have the data. By any chance, could you perhaps create and share here a simple toy dataset to reproduce the error?
Thanks for your help on this, I appreciate it.
Raphael
Quick clarification: the error here concerns only N-way ANOVA, not one-way ANOVA, since we do use pingouin/pingouin/parametric.py Lines 947 to 953 in dcfdc82
Hi @raphaelvallat,
Many thanks. I had read in another thread that you prefer data being sent in CSV format. I converted the data from SPSS to CSV and interestingly noticed that the ANOVA outputs were corrected, despite the code being nearly identical. I suspect this is because within the SPSS file, the independent variables were encoded as categorical variables. Therefore, Nevertheless, more than happy to share the data/code. Unfortunately .sav files (the format of SPSS data) are not supported on GitHub. Am I okay to email you all the above-named files? (x2 datasets in SPSS+CSV format & x2 scripts)
Also, I was unaware that pandas categorical was unstable. Do you have a reference for this for me to read more?
Best wishes,
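The observation that the CSV version behaved differently is consistent with how CSV works: it stores only text, so a categorical dtype (including any unused categories) does not survive a round trip. A small sketch of this, using made-up toy data and an in-memory buffer in place of a real file:

```python
import io

import pandas as pd

# Toy data (assumed values); "other" is a declared category with no rows.
df = pd.DataFrame({
    "drug": pd.Categorical(
        ["placebo", "cbd", "placebo"],
        categories=["placebo", "cbd", "other"],
    ),
    "RT": [512.0, 498.0, 530.0],
})

# Round trip through CSV using an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_csv = pd.read_csv(buffer)

print(df["drug"].dtype)       # categorical before the round trip
print(df_csv["drug"].dtype)   # plain string-like dtype afterwards

# Converting back infers categories from the values that are present,
# so the unused "other" category is not restored.
df_csv["drug"] = df_csv["drug"].astype("category")
print(list(df_csv["drug"].cat.categories))
```

This does not show what happened in the actual dataset; it only illustrates why loading the same data from CSV and from a format that keeps categoricals can lead to different dtypes reaching the analysis.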
Hi @bgunasekera,
Thank you for the detailed explanation. I do not have SPSS so I will not be able to read the files. You could perhaps try to export your data as a Parquet file (pd.DataFrame.to_parquet) which I think should save the categorical dtype.
"Unstable" is a strong word; what I meant is that some categories can be hidden (no actual value in the DataFrame), which can easily lead to mistakes when using groupby operations (which I use a lot!). I do not have access to the full article, but this blogpost may also be relevant.
Best,
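The "hidden category" pitfall can be reproduced with pandas alone. The sketch below uses made-up toy data with one declared but unused category and compares `groupby` with `observed=False` and `observed=True`; it is an illustration of the pandas behaviour, not of Pingouin's internal code.

```python
import pandas as pd

# Toy data (assumed values); "unused" is a category with no rows.
df = pd.DataFrame({
    "drug": pd.Categorical(
        ["placebo", "cbd", "placebo", "cbd"],
        categories=["placebo", "cbd", "unused"],
    ),
    "RT": [512.0, 498.0, 530.0, 476.0],
})

# observed=False keeps every declared category, even those without data.
all_levels = df.groupby("drug", observed=False)["RT"].count()

# observed=True keeps only categories that actually occur in the data.
seen_levels = df.groupby("drug", observed=True)["RT"].count()

print(all_levels)
print(seen_levels)
print(len(all_levels), len(seen_levels))
```

Any quantity derived from the number of groups (for example, a count of factor levels) would differ between the two results, which is one plausible way an empty category could distort downstream statistics.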
…umns + added observed=True to all groupby #224
Hi @bgunasekera,
I have just pushed a commit to address this (257d216). You were right that the
import numpy as np
import pandas as pd
import pingouin as pg
df = pg.read_dataset("rm_anova_wide")
# melt and convert to categorical
df_piv = df.melt(ignore_index=False, var_name="time", value_name="score").reset_index()
df_piv['time'] = df_piv['time'].astype('category')
print(df_piv.info()) # "time" is categorical
pg.rm_anova(data=df_piv, dv="score", within="time", subject="index")
print(df_piv.info()) # "time" is now an object!
I have now disabled this behavior. Instead of automatically converting categorical to dtypes, we are now just making sure to use If you want, please feel free to share your data and code. I would like to make sure that this fixed the issue in real-world data. Alternatively, fork Pingouin, git pull, switch to the branch
Thanks,
* Flake8
* Explicit error when y is an empty list in pg.ttest #222
* Add keyword arguments in homoscedasticity function #218
* Bugfix rm_anova and mixed_anova changed the dtypes of categorical columns + added observed=True to all groupby #224
* Update version number in init and setup
* Use np.isclose for test_pearson == 1 #195
* Coverage for try..except scipy fallback
* Fix set_option for pandas 1.4
* Upgraded dependencies for seaborn and statsmodels
* Added Jarque-Bera test in pg.normality #216
* Coverage scipy import error
* Use pd.concat instead of frame.append to avoid FutureWarning
* Remove add_categories(inplace=True) to avoid FutureWarning
* GH Discussions instead of Gitter
* Minor doc fix
Hi,
When I run the following code for an anova:
aov2 = pg.anova(dv='RT', between=['drug', 'Salient'], ss_type=3, data=plb_cbd_RT)
pg.print_table(aov2)
I get a certain output
When I run a different pingouin anova model first, such as:
aov1 = pg.rm_anova(dv='RT',
                   within=['drug', 'Salient'],
                   subject='id_upd', data=plb_cbd_RT)
and then immediately afterwards run the first model (aov2) again, I get a different output for aov2. Although the p-values are similar, the degrees of freedom, F statistic and other values vary massively.
Interestingly, I re-ran the ANOVA in SPSS as a reference and got exactly the same output as the second aov2 model (running aov1 first and then aov2 immediately after).
What could be the reason for this?
Please see outputs of model and code used attached
Best wishes,
Brandon