geom_histogram: boundary default + minor issue + doc improvement #2323
Comments
Okay, so giving up with |
If I'm reading things correctly from the email notification version of your comment, above, @ptoche, I think you just needed to load ggplot2 (reprexes are self-contained, so you need to attach libraries, include your data, etc. Jenny Bryan's slide deck is my personal favourite, quick outline). Running the code from your OP as a reprex (plus library(ggplot2)):

library(ggplot2)

## data
set.seed(1)
df = data.frame(x = runif(100000, min = 0, max = 1))

## default histogram
ggplot(data = df, aes(x = x)) +
  geom_histogram(color = "black", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## basic ggplot structure:
p <- ggplot(data = df, aes(x = x)) +
  scale_x_continuous(breaks = seq(0, 1, 0.2))

## play around with boundary
# default histogram very similar to this:
p + geom_histogram(boundary = 0.5,
                   color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# boundary = 0 leaves trailing line on rhs
p + geom_histogram(boundary = 0,
                   color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# boundary = 1 leaves trailing line on lhs
p + geom_histogram(boundary = 1,
                   color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-07-19 by the reprex package (v0.2.0.9000).
@batpigandme, thanks. Well, I've managed to use reprex in the past. I did forget to load the package once, then other attempts failed for other reasons. Thanks! 👍
A quick fix to this is to set Mentioning that the scale |
Setting the scale limits doesn't remove the boundary issue; in fact, it means that the upper and lower bins are simply removed:

library(ggplot2)
x <- runif(10000)
ggplot() + geom_histogram(aes(x)) + scale_x_continuous(limits=c(0, 1))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 2 rows containing missing values (geom_bar).

ggplot() + geom_histogram(aes(x), boundary=0) + scale_x_continuous(limits=c(0, 1))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I'm sorry, I have lost track of what the issue actually is here. It is definitely possible to just manually define the bins you want, via the

library(tidyverse)
set.seed(1234)
a <- tibble(num = runif(1000, -1, 1))
ggplot(data=a) +
  geom_histogram(aes(x=num), breaks = seq(-1, 1, by = 2/6))

Created on 2020-04-24 by the reprex package (v0.3.0)

In practice, though, I find that

library(tidyverse)
set.seed(1234)
a <- tibble(num = runif(1000, -1, 1))
ggplot(data=a) +
  geom_histogram(aes(x=num), binwidth = 1, center = 0.5)

Created on 2020-04-24 by the reprex package (v0.3.0)
@Fealthas you get that weird bin because the bins are right-open, left-closed (i.e. the last bin is [0.80, 1]). You can't cherry-pick one bad example and claim the default is "super bad". The histogram algorithm is surprisingly complex and it's difficult to get exactly the behaviour you desire in every possible case. See |
I don't doubt that you are right, but I guess the real issue here is that there is no easy/intuitive way to turn off the "centering" behavior that it uses by default to create the bins. Ofc you can set the center parameter, but that is a little convoluted to calculate the right value. It would be ideal if you could bind the min/max to edges of the outer bins and calculate the rest from there.
I'm not sure what you mean by this. ggplot_build()$data claims otherwise. The last 'weird' bin is [.99,1.24]
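One way to approximate the "bind the min/max to the edges of the outer bins" idea from the comment above is to build the breaks directly from the data range and pass them via the breaks argument. This is only a minimal sketch: edge_breaks() is a hypothetical helper written for illustration, not a ggplot2 function, and the plotting step assumes ggplot2 is installed.

```r
# Hypothetical helper (not part of ggplot2): returns bins + 1 equally
# spaced break points whose first and last values are min(x) and max(x).
edge_breaks <- function(x, bins = 30) {
  rng <- range(x, na.rm = TRUE)
  seq(rng[1], rng[2], length.out = bins + 1)
}

set.seed(1234)
x <- runif(1000, -1, 1)
breaks <- edge_breaks(x, bins = 6)

# Base R check: every observation falls into one of the 6 bins.
bin_counts <- table(cut(x, breaks, include.lowest = TRUE))

# Build the plot only if ggplot2 is available; print(p) to display it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(num = x)) +
    geom_histogram(aes(x = num), breaks = breaks)
}
```

Because the outer edges sit exactly on the observed minimum and maximum, the breaks change with every sample, which may or may not be what you want when the theoretical support (here [-1, 1]) is known.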
After reading the entire issue from top to bottom, I believe it is resolved at this time, and I'm going to close it. Below follows a summary and justification. Three points were made originally:
As far as I can see, the documentation issue (point 1) has been resolved.

For point 2, we have had a discussion now and the answer is there is never going to be a default that works in all cases. If somebody can propose a concrete way to improve the default algorithm without degrading other cases, they are welcome to open an issue for that point.

For point 3, the code originally posted still produces the trailing near-empty bins, but there are ways to specify the bins that avoid this problem. If somebody can create a dataset for which it is not possible to set reasonable bins manually, then also please open a new issue for that problem.

library(ggplot2)
set.seed(1)
df <- data.frame(x = runif(100000, min = 0, max = 1))

# half-filled bins
ggplot(data = df, aes(x = x)) +
  geom_histogram(color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# trailing line on right
ggplot(data = df, aes(x = x)) +
  geom_histogram(boundary = 0, color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# trailing line on left
ggplot(data = df, aes(x = x)) +
  geom_histogram(boundary = 1, color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# explicit breaks work
ggplot(data = df, aes(x = x)) +
  geom_histogram(breaks = seq(0, 1, by = 1/30), color = "blue", fill = "white")

# setting boundary and binwidth works
ggplot(data = df, aes(x = x)) +
  geom_histogram(boundary = 0, binwidth = 1/30, color = "blue", fill = "white")

# setting center and binwidth works
ggplot(data = df, aes(x = x)) +
  geom_histogram(center = 1/60, binwidth = 1/30, color = "blue", fill = "white")

Created on 2020-04-25 by the reprex package (v0.3.0)
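The values 1/30 and 1/60 used in the working examples follow from simple arithmetic on the interval and the number of bins: the bin width is the interval length divided by the bin count, and the first bin's center sits half a bin width above the lower edge. A small sketch of that calculation, where bin_spec() is a hypothetical helper introduced only for illustration (it is not part of ggplot2):

```r
# Hypothetical helper (not a ggplot2 function): for data supported on
# [lower, upper] and a desired number of bins, compute the binwidth plus
# a boundary and a center that align the outer bin edges with the interval.
bin_spec <- function(lower, upper, bins) {
  binwidth <- (upper - lower) / bins
  list(
    binwidth = binwidth,
    boundary = lower,               # an edge between two bins
    center   = lower + binwidth / 2 # midpoint of the first bin
  )
}

spec <- bin_spec(0, 1, 30)
spec$binwidth  # 1/30
spec$center    # 1/60
```

The result can then be passed as geom_histogram(boundary = spec$boundary, binwidth = spec$binwidth) or geom_histogram(center = spec$center, binwidth = spec$binwidth); only one of boundary and center should be supplied.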
1. In geom_histogram, is the default boundary value sensible?

Has this been the subject of some discussions? There are, presumably, so many use cases that it must be difficult to find a common ground. However, in the case of a continuous distribution defined on a numeric interval, e.g. the uniform distribution on [0, 1], the default behaviour is probably not the most desirable. A more suitable histogram may be drawn by setting the option boundary=0 or boundary=1.

Is there a case for setting boundary=0 as a default? This looks like what hist(df$x) does.

However, there is a minor issue when doing this: the base of the bins extends beyond where it is supposed to. Not sure if this is a minor bug or a by-product of my misusing the boundary option.

See screenshot: just above the value 1.0 the horizontal line extends too far.

Still, better than the default, where the first and last bins are "wrong":

Moreover, is there a case for allowing users to set boundary = c(a,b), that is a left-right vector of values, to cater for situations where the data is restricted to [a,b], e.g. with the uniform distribution below?

In addition:

2. The documentation for geom_histogram here

http://ggplot2.tidyverse.org/reference/geom_histogram.html

refers to a width argument which, if I'm not mistaken, is now obsolete:

boundary | A boundary between two bins. As with center, things are shifted when boundary is outside the range of the data. For example, to center on integers, use width = 1 and boundary = 0.5, even if 0.5 is outside the range of the data. At most one of center and boundary may be specified.

Using width = 1, as suggested, throws an error.
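The comparison with hist(df$x) in point 1 can be inspected without drawing anything, since base R's hist() returns the break points it chose. A minimal sketch using the same simulated data as the rest of the thread (no ggplot2 needed):

```r
set.seed(1)
df <- data.frame(x = runif(100000, min = 0, max = 1))

# plot = FALSE returns the histogram object instead of drawing it.
h <- hist(df$x, plot = FALSE)

# The outer break points enclose the data range, so no observation is
# dropped and there is no half-filled bin hanging past the data.
range(h$breaks)
range(df$x)
sum(h$counts) == nrow(df)
```

hist() picks "pretty" break points that cover the data range, which is why its outer bin edges land on round values instead of being centered on the minimum and maximum.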