Decouple df.head() from the Cramer's computation #1179

glemaitre · 2024-12-05T18:08:32Z

I was somewhat surprised by the display I got from the following:

import skrub
from sklearn.datasets import fetch_california_housing

skrub.patch_display()
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X.head()

It took me some time to understand that the cause was the X.head() call and that, in this case, the result made sense.

I'm wondering whether you should avoid computing all the different values when one calls X.head(), instead of showing statistics computed on only a few lines. It can be misleading.

An alternative is to compute the statistics on the full dataset even if the user only requests the .head(). However, if you call .head(), it might simply be because you are interested in seeing the first couple of lines of the dataframe, without checking any other statistics.

@jeromedockes WDYT?

Vincent-Maladiere · 2024-12-06T07:06:15Z

How could the TableReport access the full dataframe if you pass .head()?

glemaitre · 2024-12-06T07:31:48Z

I did not think about how it is implemented. So it seems that the most reasonable solution is to avoid computing some of the statistics when the sample size is really small (< 10)?

jeromedockes · 2024-12-06T09:51:55Z

not computing the associations under a certain sample size makes sense.
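To make the small-sample problem concrete, here is a minimal pure-Python sketch, not skrub's actual implementation. It assumes that continuous columns are binned into categories before the statistic is computed, so that five rows can end up with five distinct categories per column. The `min_samples` parameter and its default of 10 are hypothetical, taken from the "< 10" suggestion above.

```python
import math
from collections import Counter


def cramers_v(xs, ys, min_samples=10):
    """Cramér's V for two categorical sequences, or None below `min_samples`.

    `min_samples` is a hypothetical threshold, not an existing skrub option.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < min_samples:
        return None  # too few rows: skip the computation entirely
    row_counts = Counter(xs)
    col_counts = Counter(ys)
    cell_counts = Counter(zip(xs, ys))
    k = min(len(row_counts), len(col_counts))
    if k < 2:
        return None  # V is undefined when a column is constant
    chi2 = 0.0
    for r, n_r in row_counts.items():
        for c, n_c in col_counts.items():
            expected = n_r * n_c / n
            observed = cell_counts.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Five rows whose (binned) values are all distinct, in unrelated order.
a = ["a1", "a2", "a3", "a4", "a5"]
b = ["b3", "b1", "b5", "b2", "b4"]

v_unguarded = cramers_v(a, b, min_samples=0)  # close to 1.0: looks "perfectly" associated
v_guarded = cramers_v(a, b)  # None: skipped because n < 10
```

With all values distinct, every pair of columns reaches the maximum V regardless of any real relationship, which is why a threshold on the number of rows is a cheap fix.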

or we could also change the conditions under which we show the red "warning". Cramér's V is an estimate of an effect size, but it does not say anything about significance. By computing it we also get a chi-square statistic and thus a p-value. I wouldn't show the p-value to the user because it is not reliable, as the hypotheses of the test are not verified, etc., but I guess we could still rely on it to decide whether it is worth calling the user's attention to this pair of columns or not.
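A rough sketch of that idea, assuming SciPy is available: `scipy.stats.chi2_contingency` returns both the chi-square statistic and its p-value, so the same call can give the effect size and an internal "worth flagging" decision. The function name `association_with_flag` and the `alpha=0.05` cutoff are illustrative assumptions, not part of skrub.

```python
import numpy as np
from scipy.stats import chi2_contingency


def association_with_flag(table, alpha=0.05):
    """Return (Cramér's V, flag) for a contingency table of counts.

    The p-value is used only internally to decide whether to highlight
    the pair of columns; `alpha` is a hypothetical cutoff.
    """
    table = np.asarray(table)
    # correction=False disables Yates' correction, which only affects 2x2 tables.
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape)
    v = float(np.sqrt(chi2 / (n * (k - 1))))
    return v, bool(p_value < alpha)


# Same association pattern, observed on 5 rows and on 100 rows.
small = np.eye(5, dtype=int)
large = 20 * np.eye(5, dtype=int)

v_small, flag_small = association_with_flag(small)
v_large, flag_large = association_with_flag(large)
```

Both tables give V close to 1, but only the 100-row table has a p-value small enough to be flagged, so the five-row `.head()` case would no longer trigger the warning.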

also, we may want to implement the bias correction of the Cramér's V statistic (see Wikipedia)
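For reference, a minimal sketch of the bias-corrected estimator (Bergsma, 2013) as it is usually stated. It takes a contingency table as a list of rows and assumes more than one observation and no empty rows or columns. The function name is hypothetical, and returning 0.0 when the corrected denominator vanishes is a choice made here, not something prescribed by the formula.

```python
import math


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    r, c = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    phi2 = chi2 / n
    # Subtract the expected value of phi2 under independence, clipped at 0.
    phi2_corr = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
    # Shrink the table dimensions accordingly.
    r_corr = r - (r - 1) ** 2 / (n - 1)
    c_corr = c - (c - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, c_corr - 1)
    if denom <= 0:
        return 0.0  # degenerate small-sample case (assumption of this sketch)
    return math.sqrt(phi2_corr / denom)


# 5 rows with all-distinct categories versus the same pattern on 100 rows.
small = [[1 if i == j else 0 for j in range(5)] for i in range(5)]
large = [[20 if i == j else 0 for j in range(5)] for i in range(5)]

v_small = cramers_v_corrected(small)  # 0.0 instead of the uncorrected 1.0
v_large = cramers_v_corrected(large)  # still close to 1.0
```

On the five-row table the correction removes the spurious association entirely, while the 100-row table keeps its strong value.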
