geom_histogram: boundary default + minor issue + doc improvement #2323

Closed
ptoche opened this issue Nov 2, 2017 · 16 comments
ptoche commented Nov 2, 2017

1.

In geom_histogram is the default boundary value sensible?

Has this been discussed before? There are presumably so many use cases that it must be difficult to find common ground. However, for a continuous distribution defined on a numeric interval, e.g. the uniform distribution on [0, 1], the default behaviour is probably not the most desirable. A more suitable histogram may be drawn by setting boundary = 0 or boundary = 1.

Is there a case for setting boundary=0 as a default? This looks like what hist(df$x) does.

However, there is a minor issue when doing this: the base of the bins extends beyond where it is supposed to. Not sure if this is a minor bug or a by-product of my misusing the boundary option.

See screenshot: just above the value 1.0 the horizontal line extends too far.

[image: rplot]

Still, better than the default, where the first and last bins are "wrong":

[image: rplot01]

Moreover, is there a case for allowing users to set boundary = c(a, b), that is, a vector of left and right values, to cater for situations where the data are restricted to [a, b], e.g. the uniform distribution below?

## data
set.seed(1)
df = data.frame(x = runif(100000, min = 0, max = 1))

## default histogram
ggplot(data = df, aes(x = x)) + 
    geom_histogram(color = "black", fill = "white")

## basic ggplot structure:
p <- ggplot(data = df, aes(x = x)) + 
    scale_x_continuous(breaks = seq(0, 1, 0.2))

## play around with boundary

# default histogram very similar to this:
p + geom_histogram(boundary = 0.5,
    color = "blue", fill = "white") 

# boundary = 0 leaves trailing line on rhs
p + geom_histogram(boundary = 0,
    color = "blue", fill = "white") 

# boundary = 1 leaves trailing line on lhs
p + geom_histogram(boundary = 1,
    color = "blue", fill = "white") 

## feature suggestion: boundary = c(0, 1)

## play around with bins or binwidth to try to fix problem

bins = 10
binwidth = (max(df$x)-min(df$x))/bins
p + geom_histogram(binwidth = binwidth, boundary = 0,
    color = "blue", fill = "white") 

p + geom_histogram(bins = bins, boundary = 0,
    color = "blue", fill = "white") 

In addition:

2. The documentation for geom_histogram here
http://ggplot2.tidyverse.org/reference/geom_histogram.html

refers to a width argument which, if I'm not mistaken, is now obsolete:

boundary | A boundary between two bins. As with center, things are shifted when boundary is outside the range of the data. For example, to center on integers, use width = 1 and boundary = 0.5, even if 0.5 is outside the range of the data. At most one of center and boundary may be specified.

Using width = 1, as suggested, throws an error.

@hadley hadley closed this as completed in 47c3f75 Nov 6, 2017

@clauswilke clauswilke reopened this Jul 11, 2018
@tidyverse tidyverse deleted a comment from alanocallaghan Jul 11, 2018
@ptoche
Author

ptoche commented Jul 19, 2018

Okay, so giving up with reprex(), as it doesn't seem to work for me. However I can confirm that all the issues are still currently observed.

@batpigandme
Contributor

> Okay, so giving up with reprex(), as it doesn't seem to work for me. However I can confirm that all the issues are still currently observed.

If I'm reading things correctly from the email notification version of your comment above, @ptoche, I think you just needed to load ggplot2. Reprexes are self-contained, so you need to attach libraries, include your data, etc. (Jenny Bryan's slide deck is my personal favourite quick outline.)

Running the code from your OP as a reprex (plus library(ggplot2)), here's what I get:

library(ggplot2)

## data
set.seed(1)
df = data.frame(x = runif(100000, min = 0, max = 1))

## default histogram
ggplot(data = df, aes(x = x)) + 
  geom_histogram(color = "black", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## basic ggplot structure:
p <- ggplot(data = df, aes(x = x)) + 
  scale_x_continuous(breaks = seq(0, 1, 0.2))

## play around with boundary

# default histogram very similar to this:
p + geom_histogram(boundary = 0.5,
                   color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# boundary = 0 leaves trailing line on rhs
p + geom_histogram(boundary = 0,
                   color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# boundary = 1 leaves trailing line on lhs
p + geom_histogram(boundary = 1,
                   color = "blue", fill = "white") 
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-07-19 by the reprex package (v0.2.0.9000).

@ptoche
Author

ptoche commented Jul 19, 2018

@batpigandme, thanks. Well I've managed to use reprex in the past. I did forget to load the package once, then other attempts failed for other reasons. Thanks! 👍

@paleolimbot
Member

A quick fix for this is to set limits = c(0, 1) in scale_x_continuous(). The issue is that the last bin runs from roughly 0.999999 to 1.000001 and does contain some observations, and without explicitly setting the limits there is no way for bin_breaks_bins() to know that the upper edge of the last bin should be 1.

I think mentioning that the scale limits matter when calculating the bins is a reasonable way to close this issue. The other easy option would be to allow the user to override the x_range (falling back to scale$dimension(), which is what currently happens). If @clauswilke has an opinion on which is best, I think we can add this as a tidy-dev-day issue.
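
To illustrate the point about the data range, here is a minimal base-R sketch. It is not ggplot2's actual code: `sketch_breaks()` is a hypothetical helper, and the width rule `diff(range(x)) / (bins - 1)` is an assumption chosen to mimic a width derived from the observed range rather than from the true support [0, 1].

```r
# Hypothetical helper (not part of ggplot2): derive the bin width from the
# observed data range, then lay out breaks starting at `boundary`.
sketch_breaks <- function(x, bins = 30, boundary = 0) {
  width <- diff(range(x)) / (bins - 1)            # assumed width rule
  n_needed <- ceiling((max(x) - boundary) / width) # bins needed to reach max(x)
  boundary + width * seq(0, n_needed)
}

set.seed(1)
x <- runif(100000, min = 0, max = 1)

breaks <- sketch_breaks(x, bins = 30, boundary = 0)
counts <- hist(x, breaks = breaks, plot = FALSE)$counts

# The observed range is slightly narrower than [0, 1], so the regular bins stop
# just short of max(x) and one extra, nearly empty bin is needed to cover it.
print(tail(counts, 3))
```

Under these assumptions the final count is tiny compared with the others, which is the kind of sliver bin that shows up as a trailing line in the plots.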

@alanocallaghan

Setting the scale limits doesn't remove the boundary issue; in fact, it means that the upper and lower bins are simply removed:

library(ggplot2)
x <- runif(10000)
ggplot() + geom_histogram(aes(x)) + scale_x_continuous(limits=c(0, 1))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 2 rows containing missing values (geom_bar).

ggplot() + geom_histogram(aes(x), boundary=0) + scale_x_continuous(limits=c(0, 1))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

@Fealthas

I ran into this problem too.
I found a fix: set boundary to min(data) and set binwidth manually.

library(tidyverse)
a <- tibble(num=runif(1000,-1,1))
numBins <- 6
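# bin width = full data range / number of bins; diff(range()) also works when all values share a sign
plot <- ggplot(data = a) +
  geom_histogram(aes(x = num), binwidth = diff(range(a$num)) / numBins, boundary = min(a$num))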
plot
ggplot_build(plot)$data

Doing this, it finally renders correctly.
image

But yeah, the behavior of geom_histogram is super broken. I'm not sure why it seems to care so much about the center of bins rather than the edges like every other plotting tool ever.

Why can't we manually set limits for the bins? I.e. set limit to c(-1,1).

@clauswilke
Member

I'm sorry, I have lost track of what the issue actually is here. It is definitely possible to just manually define the bins you want, via the breaks argument.

library(tidyverse)

set.seed(1234)
a <- tibble(num = runif(1000, -1, 1))

ggplot(data=a) + 
  geom_histogram(aes(x=num), breaks = seq(-1, 1, by = 2/6))

Created on 2020-04-24 by the reprex package (v0.3.0)

In practice, though, I find that binwidth and center are usually more helpful.

library(tidyverse)

set.seed(1234)
a <- tibble(num = runif(1000, -1, 1))

ggplot(data=a) + 
  geom_histogram(aes(x=num), binwidth = 1, center = 0.5)

Created on 2020-04-24 by the reprex package (v0.3.0)
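
When the support of the data is known in advance, the explicit-breaks approach can be wrapped up as in the sketch below. `support_breaks()` is a hypothetical helper, not a ggplot2 function, and using `length.out` rather than `by` is a choice made here so that both endpoints are taken directly from `lo` and `hi`. The ggplot2 call is guarded so the arithmetic can be checked with base R alone.

```r
# Hypothetical helper: n_bins equal-width bins spanning [lo, hi].
support_breaks <- function(lo, hi, n_bins) {
  seq(lo, hi, length.out = n_bins + 1)
}

set.seed(1234)
num <- runif(1000, -1, 1)

breaks <- support_breaks(-1, 1, n_bins = 6)
counts <- hist(num, breaks = breaks, plot = FALSE)$counts
print(data.frame(lower = head(breaks, -1), upper = breaks[-1], count = counts))

# The same breaks can be handed to geom_histogram() if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(num = num), ggplot2::aes(x = num)) +
    ggplot2::geom_histogram(breaks = breaks)
}
```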

@Fealthas

This is the default behavior which is super bad.

a <- tibble(num=runif(10000,0,1))
ggplot(data=a) + geom_histogram(aes(x=num),bins=5)

image
See how bins 1 and 5 are half allocated to empty space? The data only goes from 0 to 1. This is a uniform distribution, so it's supposed to be flat. The default behavior gives an incorrect interpretation of the data, making it seem like bins 2, 3 and 4 have more points than the rest.

Data about the bin boundaries:
image

For some reason the first bin starts at a negative number, despite it being below the minimum value in the data.

@Fealthas

Fealthas commented Apr 24, 2020

Trying to fix it by setting a boundary has weird behavior.

p2 <- ggplot(data=a) + geom_histogram(aes(x=num),bins=5,boundary=0)
p2
ggplot_build(p2)$data

image
It tries to create this odd bin with 1
image

There seems to be no way to have a normal histogram in ggplot2 without lots of tweaking or custom break stuff. It should be a simple param change.

@hadley
Member

hadley commented Apr 24, 2020

@Fealthas you get that weird bin because the bins are right-open, left-closed (i.e. the last bin is [0.80, 1]).

You can't cherry-pick one bad example and claim the default is "super bad". The histogram algorithm is surprisingly complex and it's difficult to get exactly the behaviour you desire in every possible case.

See hist(1:3) and hist(1:4) for hist() edge cases which are undesirable.
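
The two base-R calls mentioned above can be inspected without drawing anything. This is only a way to look at the breaks and counts that hist() chooses; which results count as undesirable is left to the reader.

```r
# Compute the histograms without plotting and tabulate breaks and counts.
h3 <- hist(1:3, plot = FALSE)
h4 <- hist(1:4, plot = FALSE)

bin_table <- function(h) {
  data.frame(lower = head(h$breaks, -1), upper = h$breaks[-1], count = h$counts)
}

print(bin_table(h3))
print(bin_table(h4))
```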

@Fealthas

I don't doubt that you are right, but I guess the real issue here is that there is no easy/intuitive way to turn off the "centering" behavior that it uses by default to create the bins.

Of course you can set the center parameter, but calculating the right value is a little convoluted. It would be ideal if you could bind the min/max to the edges of the outer bins and calculate the rest from there.

> you get that weird bin because the bins are right-open, left-closed (i.e. the last bin is [0.80, 1]).

I'm not sure what you mean by this. ggplot_build()$data claims otherwise. The last 'weird' bin is [0.99, 1.24].

@clauswilke
Member

After reading the entire issue from top to bottom, I believe it is resolved at this time, and I'm going to close it. Below follows a summary and justification.

Three points were made originally:

  1. The documentation has a problem.

  2. Default bins are awkward for uniform distributions.

  3. Trying to set the bins manually for uniform distributions leads to trailing near-empty bins on left or right.

As far as I can see, the documentation issue (point 1) has been resolved. For point 2, we have now had a discussion, and the answer is that there is never going to be a default that works in all cases. If somebody can propose a concrete way to improve the default algorithm without degrading other cases, they are welcome to open an issue for that point. For point 3, the code originally posted still produces the trailing near-empty bins, but there are ways to specify the bins that avoid this problem. If somebody can create a dataset for which it is not possible to set reasonable bins manually, then please also open a new issue for that problem.

library(ggplot2)

set.seed(1)
df <- data.frame(x = runif(100000, min = 0, max = 1))

# half-filled bins
ggplot(data = df, aes(x = x)) + 
  geom_histogram(color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# trailing line on right
ggplot(data = df, aes(x = x)) + 
  geom_histogram(boundary = 0, color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# trailing line on left
ggplot(data = df, aes(x = x)) + 
  geom_histogram(boundary = 1, color = "blue", fill = "white")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# explicit breaks work
ggplot(data = df, aes(x = x)) + 
  geom_histogram(breaks = seq(0, 1, by = 1/30), color = "blue", fill = "white")

# setting boundary and binwidth works
ggplot(data = df, aes(x = x)) + 
  geom_histogram(boundary = 0, binwidth = 1/30, color = "blue", fill = "white")

# setting center and binwidth works
ggplot(data = df, aes(x = x)) + 
  geom_histogram(center = 1/60, binwidth = 1/30, color = "blue", fill = "white")

Created on 2020-04-25 by the reprex package (v0.3.0)
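
The last two calls describe the same bins, since a bin centre sits half a bin width above a bin boundary: 1/60 = 0 + (1/30)/2. A small base-R sketch of that relationship follows; `center_from_boundary()` is a hypothetical helper, not a ggplot2 function.

```r
# Hypothetical helper: centre of the bin whose left edge is `boundary`.
center_from_boundary <- function(boundary, binwidth) {
  boundary + binwidth / 2
}

binwidth <- 1 / 30
center <- center_from_boundary(boundary = 0, binwidth = binwidth)

# Breaks over [0, 1] implied by each specification.
k <- 0:30
breaks_from_boundary <- 0 + k * binwidth
breaks_from_center <- (center - binwidth / 2) + k * binwidth

print(all.equal(center, 1 / 60))
print(all.equal(breaks_from_boundary, breaks_from_center))
```

Going through the boundary may be the easier route when the edges of the support are known, because it avoids working out the half-width offset by hand.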
