-
-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving binned_residuals()
#382
Comments
@bbolker regarding the smooth line: does it make much sense to add such a line? Even for models with a larger number of observations, due to binning, there are still just a few data points. library(lme4)
#> Loading required package: Matrix
library(HSAUR3)
#> Loading required package: tools
library(performance)
gm2 <- glmer(outcome ~ treatment * visit + (1 | patientID),
data = toenail,
family = binomial)
nobs(gm2)
#> [1] 1908
binned_residuals(gm2) |> plot() + see::theme_lucid() Created on 2022-02-25 by the reprex package (v2.0.1) |
(it would look like something like this) library(lme4)
#> Loading required package: Matrix
library(HSAUR3)
#> Loading required package: tools
library(performance)
gm2 <- glmer(outcome ~ treatment * visit + (1 | patientID),
data = toenail,
family = binomial)
nobs(gm2)
#> [1] 1908
binned_residuals(gm2) |>
plot() +
see::theme_lucid() +
ggplot2::stat_smooth(ggplot2::aes(y = ybar, colour = group), se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x' binned_residuals(gm2) |>
plot() +
see::theme_lucid() +
ggplot2::stat_smooth(ggplot2::aes(y = ybar), se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x' Created on 2022-02-25 by the reprex package (v2.0.1) |
|
Don't know. I think I like the smoothed line, but don't have super-strong feelings. |
Maybe tell it to always use gam() with thin-plate splines and pretty tight penalty instead of loess? |
And we could add an argument that lets users optionally add such a smooth-line. |
I mean we should have a smooth line I think, but a more penalized one |
How to to this with |
I just saw that library(ggplot2)
library(performance)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
result <- binned_residuals(model)
plot(result) + geom_errorbar(aes(ymin = CI_low, ymax = CI_high, colour = group)) aa <- broom::augment(model)
gg1 <- ggplot(aa, aes(.fitted, .resid)) +
stat_summary_bin(fun.data = mean_cl_boot) +
geom_hline(yintercept = 0, lty = 2) +
geom_smooth()
gg1
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 11 rows containing missing values (geom_segment). Created on 2022-02-28 by the reprex package (v2.0.1) |
Another example: library(ggplot2)
library(performance)
library(HSAUR3)
#> Loading required package: tools
gm2 <- glm(outcome ~ treatment * visit,
data = toenail,
family = binomial)
nobs(gm2)
#> [1] 1908
binned_residuals(gm2) |>
plot() +
stat_smooth(method = "gam", ggplot2::aes(y = ybar), colour = "grey70", se = FALSE) +
geom_errorbar(aes(ymin = CI_low, ymax = CI_high, colour = group)) +
see::theme_lucid()
#> `geom_smooth()` using formula 'y ~ s(x, bs = "cs")' Created on 2022-02-28 by the reprex package (v2.0.1) |
I think the CI_low and CI_high values might be wrong as currently written: d <- suppressWarnings(lapply(1:n_bins, function(.x) {
items <- (1:length(pred))[model.binned == .x]
model.range <- range(pred[items], na.rm = TRUE)
xbar <- mean(pred[items], na.rm = TRUE)
ybar <- mean(y[items], na.rm = TRUE)
n <- length(items)
sdev <- stats::sd(y[items], na.rm = TRUE)
data.frame(
xbar = xbar,
ybar = ybar,
n = n,
x.lo = model.range[1],
x.hi = model.range[2],
se = 2 * sdev / sqrt(n)
)
}))
d <- do.call(rbind, d)
d <- d[stats::complete.cases(d), ]
# CIs
d$CI_low <- d$ybar - stats::qnorm(.975) * d$se
d$CI_high <- d$ybar + stats::qnorm(.975) * d$se The computation of |
Let me look into the Gelman/Hill book, I'm, not sure, but I think I roughly adopted the code from that book, or the arm-package. |
I think you're right, this looks better to me: library(ggplot2)
library(performance)
library(HSAUR3)
#> Loading required package: tools
#> Loading required package: tools
gm2 <- glm(outcome ~ treatment * visit,
data = toenail,
family = binomial)
binned_residuals(gm2) |>
plot() +
stat_smooth(method = "gam", aes(y = ybar), colour = "grey70", se = FALSE) +
geom_errorbar(aes(ymin = CI_low, ymax = CI_high, colour = group)) +
see::theme_lucid()
#> `geom_smooth()` using formula 'y ~ s(x, bs = "cs")' Created on 2022-03-01 by the reprex package (v2.0.1) |
Yeah, Gelman tends to approximate the critical value for 95% intervals as "2", so that's what jumped out to me in the code |
The range of the CI is then identical to the range of error bound per bin - it's just that due to the point estimate, error bars of the point estimates cross the error bounds. So, the "se" refers to the error bounds, probably not to the point estimates (unless these are the same) |
They are the same. The SE is the standard error of the point estimate (it's the typical standard error of the mean SD / sqrt(N) ). It's just a question of whether the error bands are drawn around the point estimate or the null hypothesis value. I think I like it with both the error band around zero and the error bars on the points. The error bars help to interpret the smooth line I think. What do you think? |
Yes, I'd also say we should keep the error bars. What about the smooth line, you were suggesting using stat_smooth(method = "gam", aes(y = ybar), colour = "grey70", se = FALSE) |
btw, when the library(ggplot2)
library(performance)
library(HSAUR3)
#> Loading required package: tools
gm2 <- glm(outcome ~ treatment * visit,
data = toenail,
family = binomial)
binned_residuals(gm2, term = "visit") |>
plot() Created on 2022-03-01 by the reprex package (v2.0.1) |
I think that looks good. Should we make the line color the same as in the other plots? (Green I think?) |
(its already green) |
library(performance)
library(HSAUR3)
#> Loading required package: tools
gm2 <- glm(outcome ~ treatment * visit,
data = toenail,
family = binomial)
binned_residuals(gm2) |> plot() Created on 2022-03-01 by the reprex package (v2.0.1) |
Looks good to me! Let's add theme_modern() to be line the others in check_model() |
Yeah, this was the standalone-plot. for check_model, this is the default: Don't forget #376 (comment) :-) |
Thanks for the reminder |
Also going to make a better PP check plot for categorical responses |
I might like to reopen this. The first plot is from It would also be nice to have an option to do the binned-residuals plot on a logit-logit scale, as that would show nonlinear patterns on the link scale more reliably ... library(performance)
library(ggplot2); theme_set(theme_bw())
data("kyphosis", package = "rpart")
kyphosis <- transform(
kyphosis,
Kyphosis = as.numeric(factor(Kyphosis))-1
)
m <- glm(
Kyphosis~Age+Start+Number,
data=kyphosis,
family = binomial(link = "logit")
)
check_model(m)
plot(binned_residuals(m)) |
The missing points and error bars might be due to them going out of bounds? Could you elaborate on the details of what you're suggesting with logit-logit? |
At least for the x axis, it would be nice to be able to plot on the log-odds rather than the probability scale. Not sure whether this makes sense for the y-axis or not (which shows residuals, rather than predictions — not sure what kinds of residuals these are, or whether it makes sense to think about them on the link or the linear predictor scale; if they're Pearson, then it's probably OK). |
Dot issue should be fixed: library(performance)
library(ggplot2)
theme_set(theme_bw())
data("kyphosis", package = "rpart")
kyphosis <- transform(
kyphosis,
Kyphosis = as.numeric(factor(Kyphosis)) - 1
)
m <- glm(
Kyphosis ~ Age + Start + Number,
data = kyphosis,
family = binomial(link = "logit")
)
check_model(m) plot(binned_residuals(m)) check_model(m, show_dots = FALSE) plot(binned_residuals(m, show_dots = FALSE)) # or
plot(binned_residuals(m), show_dots = FALSE) You need: packageVersion("see")
#> [1] '0.8.0.6'
packageVersion("performance")
#> [1] '0.10.6.2' (should be available via |
How would we achieve using a log-odds scale? Binned residuals are based on fitted values, and the "mean" of 0 and 1 for a certain range of the predictors, i.e. everything is on the probability scale by default. |
I agree that putting the y-axis on the log-odds scale isn't practical (and if the machinery is using Pearson residuals — I haven't checked — that's probably fine). Thanks for fixing the other issue. The main other adjustment I think would be worthwhile is something to handle the failure of the Gaussian approximation - either bootstrap or shifted exact CIs (what's going on in this case with those lowest four bins is that there are 0/9 successes). Looks like I've been hacking around on my fork. Bootstrapped residuals didn't make much difference. I'm not sure whether shifted the exact binomial CI makes sense, but it seems to ... Note that the example given in Here's what I get with my hacked version (shifted exact CIs and a log-odds scale y axis) (there are a few issues left: (1) I haven't adjusted the envelope; (2) I'm not sure what's up with the last bin where the range doesn't overlap the points, maybe shifting doesn't always work? (3) I get "Warning in eval_tidy(x[[2]], data, env) : restarting interrupted promise evaluation" warnings for reasons I don't understand ... |
I have opened a PR (#641) - I first tried to open a PR based on your fork, but this somehow did not work. Regarding:
This is due to stats:::binom.test(sum(y0[items]), n)$conf.int - fv
I think this is also the reason for
Maybe you can directly commit to #641? |
What about this? library(performance)
data("kyphosis", package = "rpart")
kyphosis <- transform(
kyphosis,
Kyphosis = as.numeric(factor(Kyphosis)) - 1
)
m <- glm(
Kyphosis ~ Age + Start + Number,
data = kyphosis,
family = binomial(link = "logit")
)
plot(binned_residuals(m, ci_type = "exact"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "gaussian"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "boot"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "exact", residuals = "response"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "exact", residuals = "deviance"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "exact", residuals = "pearson"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "gaussian", residuals = "response"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "gaussian", residuals = "deviance"), show_dots = TRUE) plot(binned_residuals(m, ci_type = "gaussian", residuals = "pearson"), show_dots = TRUE) Created on 2023-10-24 with reprex v2.0.2 |
@bbolker We should decide on sensible defaults. For now, I used deviance residuals and exact CI. |
This isn't the place, but I may have some Opinions about
binned_model
(I would like a smoothed line added, binomial CIs on the points; I would rather have binomial CIs on the individual points and a confidence interval on the smooth rather than have a dichotomizing region that indicates 'good' and 'bad' points ...). I see that this is copying whatarm::binnedplot
does though ...Originally posted by @bbolker in #376 (comment)
The text was updated successfully, but these errors were encountered: