Controlling which ggplot2 aes(size) values to show in the legend #9

Open
lcolladotor opened this issue Oct 9, 2024 · 0 comments

Question

The original question was posted at https://courseplus.jhu.edu/core/index.cfm/go/bbs:topic.view/bbsTopicID/187436/coid/21836/ by @manamiueshima in relation to Project 3, specifically these steps:

8. Using `ggplot2`, create a scatter plot of the average sentiment score for each album (y-axis) and the album release data along the x-axis. Make the size of each point the album sales in millions.
9. Add a horizontal line at y-intercept=0.
10. Write 2-3 sentences interpreting the plot answering the question "How has the sentiment of Taylor Swift's albums have changed over time?". Add a title, subtitle, and useful axis labels.

I've reproduced the question here:

For the graph in Part 2E (8-10), although my table shows that the album Lover has sales in the 1-million range, and the graph correctly displays a point of that size, the legend for Album Sales (in millions) only shows sizes from 2 to 7. Even when using scale_size_continuous(breaks = c(1, 2, 3, 4, 5, 6, 7)), the situation does not change. Could you please explain what might be causing this problem?

Exploration and solution

Using the data from Project 3, I was able to reproduce the issue described by @manamiueshima, so I experimented with the arguments of ggplot2::scale_size_continuous(), documented at https://ggplot2.tidyverse.org/reference/scale_size.html. The short answer is that, by default, ggplot2 does not show every unique value of aes(size) in the legend. When we want to control which values appear, we can, but it takes two arguments of ggplot2::scale_size_continuous(): limits and breaks, as you'll see further below.

Here's the R code with some comments, but you might want to scroll down to see the reprex::reprex() output.

Best,
Leo

library("ggplot2")
library("magrittr") ## provides the %>% pipe used below

set.seed(20241009)
df <- data.frame(
    sales = c(1.1, 2.5, 4.3, 7.21),
    released = as.Date(c(
        "2020-01-01", "2021-07-01", "2023-10-01", "2024-08-01"
    )),
    sentiment_score = rnorm(4, 2, 4)
)

## Basic plot with automatic range, limits, and breaks for the "sales"
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0)

## Let's try to specify breaks at the unique (rounded down) sales we do have
sort(floor(df$sales))
# [1] 1 2 4 7
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(breaks = c(1, 2, 4, 7))

## It didn't work: the legend still omits 1. Let's also try to set the range.
range(df$sales)
# [1] 1.10 7.21
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(range  = range(df$sales), breaks = c(1, 2, 4, 7))

## Or use the limits instead of the range
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(limits = range(df$sales), breaks = c(1, 2, 4, 7))

## It doesn't work yet. But what if we round down the lowest value of "sales"
## and then round up the highest value of "sales"?
##
## Voilà! This worked!
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(
        limits = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7)
    )

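## Using the above strategy with "range" instead of "limits" doesn't work,
## because "range" sets the minimum and maximum size of the drawn points,
## not the data values covered by the scale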
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(range = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7))

## Here are some other "solutions" though they include code we don't need.
##
## For example, here the line about "range" is not needed to fix the legend:
## it only changes the sizes at which the points are drawn
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(
        range  = range(df$sales),
        limits = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7)
    )

## Similarly in this case, the code for "range" is also not needed
df %>%
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(
        range  = c(floor(min(df$sales)), ceiling(max(df$sales))),
        limits = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7)
    )


## R Session info
options(width = 120)
sessioninfo::session_info()
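
The pattern in the attempts above can be checked without drawing a plot. The following is a minimal sketch in base R, assuming that a continuous scale drops any break that falls outside its limits, which is consistent with the results above. The helper breaks_within() is hypothetical and only mimics that filtering; it is not a ggplot2 function.

```r
sales <- c(1.1, 2.5, 4.3, 7.21)
breaks <- c(1, 2, 4, 7)

## Hypothetical helper: keep only the breaks that fall inside the limits
breaks_within <- function(breaks, limits) {
    breaks[breaks >= limits[1] & breaks <= limits[2]]
}

## Limits taken from the data: 1 is smaller than 1.1, so it is dropped
default_limits <- range(sales)
breaks_within(breaks, default_limits)
#> [1] 2 4 7

## Limits rounded outwards: now every requested break fits
widened_limits <- c(floor(min(sales)), ceiling(max(sales)))
widened_limits
#> [1] 1 8
breaks_within(breaks, widened_limits)
#> [1] 1 2 4 7
```

This is why limits = range(df$sales) was not enough on its own: the lower limit of 1.10 still excludes the break at 1.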

reprex::reprex() output

library("ggplot2")

set.seed(20241009)
df <- data.frame(
    sales = c(1.1, 2.5, 4.3, 7.21),
    released = as.Date(c(
        "2020-01-01", "2021-07-01", "2023-10-01", "2024-08-01"
    )),
    sentiment_score = rnorm(4, 2, 4)
)

## Basic plot with automatic range, limits, and breaks for the "sales"
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0)

## Let's try to specify breaks at the unique (rounded down) sales we do have
sort(floor(df$sales))
#> [1] 1 2 4 7
# [1] 1 2 4 7
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(breaks = c(1, 2, 4, 7))

## It didn't work. Let's also try to set the range.
range(df$sales)
#> [1] 1.10 7.21
# [1] 1.10 7.21
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(range  = range(df$sales), breaks = c(1, 2, 4, 7))

## Or use the limits instead of the range
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(limits = range(df$sales), breaks = c(1, 2, 4, 7))

## It doesn't work yet. But what if we round down the lowest value of "sales"
## and then round up the highest value of "sales"?
##
## Voilà! This worked!
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(
        limits = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7)
    )

## Using the above strategy with "range" instead of "limits" doesn't work
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(range = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7))

## Here are some other "solutions" though they include code we don't need.
##
## For example, here the line about "range" is not needed
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(
        range  = range(df$sales),
        limits = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7)
    )

## Similarly in this case, the code for "range" is also not needed
df |>
    ggplot(aes(x = released, y = sentiment_score, size = sales)) +
    geom_point() +
    ylab("Sentiment") +
    geom_hline(yintercept = 0) +
    scale_size_continuous(
        range  = c(floor(min(df$sales)), ceiling(max(df$sales))),
        limits = c(floor(min(df$sales)), ceiling(max(df$sales))),
        breaks = c(1, 2, 4, 7)
    )

## R Session info
options(width = 120)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       macOS Sonoma 14.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-10-09
#>  pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.0)
#>  colorspace    2.1-1   2024-07-26 [1] CRAN (R 4.4.0)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  dplyr         1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      1.0.0   2024-09-17 [1] CRAN (R 4.4.1)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>  farver        2.1.2   2024-05-13 [1] CRAN (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>  gtable        0.3.5   2024-04-22 [1] CRAN (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  knitr         1.48    2024-07-07 [1] CRAN (R 4.4.0)
#>  labeling      0.4.3   2023-08-29 [1] CRAN (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  munsell       0.5.1   2024-04-01 [1] CRAN (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.0)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
#>  rmarkdown     2.28    2024-08-17 [1] CRAN (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
#>  scales        1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.1   2024-07-31 [1] CRAN (R 4.4.0)
#>  xfun          0.47    2024-08-17 [1] CRAN (R 4.4.0)
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Created on 2024-10-09 with reprex v2.1.1
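
If you want to avoid typing the breaks by hand, the working combination of limits and breaks can be derived from the data. The sketch below is one possible way to do that; size_legend_args() is a hypothetical helper name (not part of ggplot2), and it assumes the values mapped to aes(size) are numeric with no missing values.

```r
## Hypothetical helper (not part of ggplot2): derive "limits" and "breaks"
## for scale_size_continuous() from the values mapped to aes(size)
size_legend_args <- function(x) {
    list(
        ## Round the limits outwards so the rounded-down breaks fit inside
        limits = c(floor(min(x)), ceiling(max(x))),
        ## One break per unique rounded-down value
        breaks = sort(unique(floor(x)))
    )
}

sales <- c(1.1, 2.5, 4.3, 7.21)
legend_args <- size_legend_args(sales)
legend_args$limits
#> [1] 1 8
legend_args$breaks
#> [1] 1 2 4 7

## Only build the scale when ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
    size_scale <- do.call(ggplot2::scale_size_continuous, legend_args)
}
```

The resulting size_scale can then be added to a plot with + in place of the scale_size_continuous() call used above.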
