03-Describe-visualize.Rmd

# Descriptives and visualizations

## Overview

In this section, I describe and visualize the variables.
We have variables at the person level and at the wave level.

**Person-level**

* Age and gender (`age`, `gender`)
* What region of the UK participants are from (`region`)
* Whether they've used a medium in the three months before the first wave (variables starting with `filter`)
* To what extent they identify with a medium (variables ending with `identity`)

**Wave-level**

* How much they used a medium per wave (variables ending with `time`)
* Their well-being per wave (`well_being`)

## Person-level

Let's have a look at the final sample.
Overall, our sample size is N = `r working_file %>% group_by(id) %>% slice(1) %>% ungroup() %>% nrow()`.
Participants, on average, were *M* = `r round(working_file %>% group_by(id) %>% slice(1) %>% ungroup() %>% pull(age) %>% mean(), digits = 1)` years old, *SD* = `r round(working_file %>% group_by(id) %>% slice(1) %>% ungroup() %>% pull(age) %>% sd(), digits = 1)`, see Figure \@ref(fig:visualize-age).
The gender distribution is pretty equal (`r working_file %>% group_by(id) %>% slice(1) %>% ungroup() %>% filter(gender == "Female") %>% nrow()` women (`r round(working_file %>% group_by(id) %>% slice(1) %>% filter(gender == "Female") %>% nrow() / length(unique(working_file$id)), digits = 4) * 100`%), `r working_file %>% group_by(id) %>% slice(1) %>% ungroup() %>% filter(gender == "Male") %>% nrow()` men, and `r working_file %>% group_by(id) %>% slice(1) %>% ungroup() %>% filter(gender == "Other") %>% nrow()` non-binary participants).
```{r visualize-age, echo=FALSE, fig.cap="Distribution of age"}
age <- 
  describe(
    working_file,
    "age",
    trait = TRUE
  )

single_cloud(
  working_file,
  age,
  "age",
  "#009E73",
  title = NULL,
  trait = TRUE
)
```
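
Because `working_file` is in long format (one row per participant and wave), person-level statistics need one row per participant first; otherwise people who completed more waves count more often.
Here is a minimal sketch with made-up data (the `toy` data frame and its values are purely illustrative, not study data):

```r
library(dplyr)

# Made-up long-format data: participant 1 answered three waves, participant 2 only one
toy <- tibble(
  id   = c(1, 1, 1, 2),
  wave = c(1, 2, 3, 1),
  age  = c(20, 20, 20, 40)
)

# Mean over all rows: participant 1 is counted three times
mean_over_rows <- mean(toy$age)

# Mean over participants: keep the first row per id, then summarise
mean_over_people <-
  toy %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup() %>%
  pull(age) %>%
  mean()

mean_over_rows   # 25
mean_over_people # 30
```

The two numbers only agree when everyone contributes the same number of waves, which is not the case once people drop out.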

Alright, now that we have the final sample, let's look again at how many people had used each medium in the three months before wave 1 (Table \@ref(tab:table-media-filters)).
```{r table-media-filters, echo=FALSE}
working_file %>% 
  filter(wave == 1) %>%
  select(id, wave, starts_with("filter")) %>% 
  pivot_longer(
    cols = c(-id, -wave),
    names_to = "Medium",
    values_to = "Used"
  ) %>% 
  mutate(
    Medium = str_remove(Medium, "filter_")
  ) %>% 
  group_by(Medium) %>% 
  summarise(
    Used = sum(Used),
    "Not Used" = n() - Used,
    Proportion = round(Used / n(), digits = 3) * 100
  ) %>% 
  knitr::kable(
    .,
    caption = "Number of participants who had used each medium in the three months before wave 1"
  )
```

Let's inspect the sample characteristics (age and gender) by how many surveys a person filled out (Table \@ref(tab:describe-per-wave)).
So we're just looking at whether there are pronounced differences in sample characteristics as people dropped out of the study.
Looks pretty stable to me, especially given the smaller sample sizes for those who dropped out early.
```{r describe-per-wave, echo=FALSE}
working_file %>% 
  group_by(id) %>%
  mutate(Waves = n()) %>%
  slice(1) %>% 
  ungroup() %>%
  group_by(Waves) %>% 
  summarise(
    N = n(),
    `Mean age` = mean(age, na.rm = TRUE),
    `SD age` = sd(age, na.rm = TRUE),
    `Women` = sum(gender=="Female"),
    Men = sum(gender=="Male"),
    Other = sum(gender=="Other"),
    `Proportion women (%)` = round(Women / n(), digits = 2) * 100
  ) %>% 
  knitr::kable(
    .,
    caption = "Demographics by number of filled out surveys"
  )

```

Now let's have a look at how much people identified with each medium.
Note that people weren't asked those identity questions if they indicated that they hadn't used a medium at wave 1.
```{r describe-identities, echo=FALSE}
# custom function, see setting up page
identity_descriptives <- 
  bind_rows(
    describe(
      working_file,
      "music_identity",
      trait = TRUE
    ),
    describe(
      working_file,
      "tv_identity",
      trait = TRUE
    ),
    describe(
      working_file,
      "films_identity",
      trait = TRUE
    ),
    describe(
      working_file,
      "games_identity",
      trait = TRUE
    ),
    describe(
      working_file,
      "e_publishing_identity",
      trait = TRUE
    )
  ) %>% 
  mutate(
    across(
      mean:cihigh,
      ~ round(.x, digits = 2)
    )
  )

knitr::kable(
  identity_descriptives,
  caption = "Descriptive information for identity scales"
)
```

Figure \@ref(fig:identity-distributions) shows the distributions of those identity variables.
Apparently, gamers identified more strongly with games than, for example, music listeners identified with music.
```{r identity-distributions, echo=FALSE, fig.cap="Distribution of identity variables"}
# plot_grid is from cowplot
plot_grid(
  single_cloud(
    working_file,
    identity_descriptives,
    "music_identity",
    "#999999",
    title = "Music",
    trait = TRUE
  ),
  single_cloud(
    working_file,
    identity_descriptives,
    "tv_identity",
    "#E69F00",
    title = "TV",
    trait = TRUE
  ),
  single_cloud(
    working_file,
    identity_descriptives,
    "films_identity",
    "#56B4E9",
    title = "Films",
    trait = TRUE
  ),
  single_cloud(
    working_file,
    identity_descriptives,
    "games_identity",
    "#009E73",
    title = "Games",
    trait = TRUE
  ),
  single_cloud(
    working_file,
    identity_descriptives,
    "e_publishing_identity",
    "#F0E442",
    title = "E-publishing/books",
    trait = TRUE
  ),
  ncol = 2
)
```

## Wave-level

Let's inspect the descriptive information for the different media.
Table \@ref(tab:describe-time-with-zeros) shows that people used TV, music, and films the most, but spent little time on magazines and audiobooks.
```{r describe-time-with-zeros, echo=FALSE}
times_descriptives_with_zeros <- 
  bind_rows(
    describe(
      working_file,
      "music_time",
      trait = FALSE
    ),
    describe(
      working_file,
      "tv_time",
      trait = FALSE
    ),
    describe(
      working_file,
      "films_time",
      trait = FALSE
    ),
    describe(
      working_file,
      "games_time",
      trait = FALSE
    ),
    describe(
      working_file,
      "books_time",
      trait = FALSE
    ),
    describe(
      working_file,
      "magazines_time",
      trait = FALSE
    ),
    describe(
      working_file,
      "audiobooks_time",
      trait = FALSE
    )
  ) %>% 
  mutate(
    across(
      mean:cihigh,
      ~ round(.x, digits = 2)
    )
  )

knitr::kable(
  times_descriptives_with_zeros,
  caption = "Descriptive information for use time (including zero estimates)"
)
```

Figure \@ref(fig:time-distributions-with-zeros) shows the distributions of those time variables.
Note that most distributions are zero-inflated with a heavy skew.
```{r time-distributions-with-zeros, echo=FALSE, fig.cap="Distribution of time variables (with zeros)"}
plot_grid(
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "music_time",
    "#999999",
    title = "Music",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "tv_time",
    "#E69F00",
    title = "TV",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "films_time",
    "#56B4E9",
    title = "Films",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "games_time",
    "#009E73",
    title = "Games",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "books_time",
    "#F0E442",
    title = "E-publishing/books",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "magazines_time",
    "#0072B2",
    title = "Magazines",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    times_descriptives_with_zeros,
    "audiobooks_time",
    "#D55E00",
    title = "Audiobooks",
    trait = FALSE
  ),
  ncol = 2
)
```

Let's also inspect those summary statistics and distributions without zeros.
Table \@ref(tab:describe-time-without-zeros) shows the same pattern as with zeros, but overall higher means.
Interestingly, those who listen to audiobooks report doing so for a long time per day.
```{r describe-time-without-zeros, echo=FALSE}
# temp file without zeros
temp <- 
  working_file %>% 
  mutate(
    across(
      ends_with("time"),
      ~ na_if(.x, 0)
    ) 
  )

times_descriptives_without_zeros <- 
  bind_rows(
    describe(
      temp,
      "music_time",
      trait = FALSE
    ),
    describe(
      temp,
      "tv_time",
      trait = FALSE
    ),
    describe(
      temp,
      "films_time",
      trait = FALSE
    ),
    describe(
      temp,
      "games_time",
      trait = FALSE
    ),
    describe(
      temp,
      "books_time",
      trait = FALSE
    ),
    describe(
      temp,
      "magazines_time",
      trait = FALSE
    ),
    describe(
      temp,
      "audiobooks_time",
      trait = FALSE
    )
  ) %>% 
  mutate(
    across(
      mean:cihigh,
      ~ round(.x, digits = 2)
    )
  )

knitr::kable(
  times_descriptives_without_zeros,
  caption = "Descriptive information for use time (excluding zero estimates)"
)
```
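
To see why the means go up once zeros are excluded, here is a small sketch of what `na_if(.x, 0)` does to a single variable.
The numbers are invented for illustration:

```r
library(dplyr)

# Invented hours per day for one medium; zeros mean "did not use"
hours <- c(0, 0, 1, 3)

# With zeros: average over everyone
mean_with_zeros <- mean(hours)

# Without zeros: na_if() turns 0 into NA, so only users enter the mean
hours_users_only <- na_if(hours, 0)
mean_without_zeros <- mean(hours_users_only, na.rm = TRUE)

mean_with_zeros    # 1
mean_without_zeros # 2
```

The second mean describes use among those who used the medium at all, so it answers a different question than the first.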

Figure \@ref(fig:time-distributions-without-zeros) shows a less skewed distribution once we remove zeros.
```{r time-distributions-without-zeros, echo=FALSE, fig.cap="Distribution of time variables (without zeros)"}
plot_grid(
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "music_time",
    "#999999",
    title = "Music",
    trait = FALSE
  ),
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "tv_time",
    "#E69F00",
    title = "TV",
    trait = FALSE
  ),
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "films_time",
    "#56B4E9",
    title = "Films",
    trait = FALSE
  ),
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "games_time",
    "#009E73",
    title = "Games",
    trait = FALSE
  ),
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "books_time",
    "#F0E442",
    title = "E-publishing/books",
    trait = FALSE
  ),
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "magazines_time",
    "#0072B2",
    title = "Magazines",
    trait = FALSE
  ),
  single_cloud(
    temp,
    times_descriptives_without_zeros,
    "audiobooks_time",
    "#D55E00",
    title = "Audiobooks",
    trait = FALSE
  ),
  ncol = 2
)

# remove temp file
rm(temp)
```

As a next step, let's see how use developed over time.
Figure \@ref(fig:use-over-time) shows that most media stay stable, except maybe an overall spike in wave 2.
```{r use-over-time, echo=FALSE, fig.cap="Average use over time per medium"}
# get summary stats with CI per wave in the long format
time_per_wave <-
  working_file %>%
  group_by(wave) %>% 
  summarise(
    across(
      c(ends_with("time")),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        cilow = ~Rmisc::CI(na.omit(.x))[[3]], # lower CI
        cihigh = ~Rmisc::CI(na.omit(.x))[[1]] # upper CI
      )
    )
  ) %>% 
  pivot_longer(
    -wave,
    names_to = "Medium",
    values_to = "value"
  ) %>% 
  extract(Medium, into = c("Medium", "metric"), "(.*)_([^_]+$)") %>% # separate by last occurrence of underscore
  pivot_wider(
    names_from = "metric",
    values_from = "value"
  ) %>% 
  mutate(
    Medium = str_remove(Medium, "_time"),
    Medium = tools::toTitleCase(Medium) # capitalize first letter
  )
  

ggplot(
  time_per_wave,
  aes(
    x = as.factor(wave),
    y = mean,
    color = Medium,
    group = Medium
  )
) +
  geom_line(size = 1) +
  geom_errorbar(
    aes(
      ymin = cilow,
      ymax = cihigh,
      width = 0
    )
  ) +
  scale_color_brewer(palette="Dark2") +
  xlab("Wave") +
  ylab("Average use per day (in h)")
```
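
The indices `[[1]]` and `[[3]]` in the chunk above work because `Rmisc::CI()` returns its result in the order upper, mean, lower.
As a sketch of what that interval is, the following base R function computes a t-based 95% confidence interval by hand; `ci_sketch` is a hypothetical name used only here, not part of the analysis code:

```r
# t-based confidence interval, returned in the same order as Rmisc::CI():
# upper bound, mean, lower bound
ci_sketch <- function(x, ci = 0.95) {
  x <- x[!is.na(x)]
  m <- mean(x)
  # half-width: t quantile times the standard error of the mean
  half_width <- qt(1 - (1 - ci) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(upper = m + half_width, mean = m, lower = m - half_width)
}

out <- ci_sketch(c(2, 4, 6, 8, NA))
out[[1]] > out[[2]] # TRUE: first element is the upper bound
out[[3]] < out[[2]] # TRUE: third element is the lower bound
```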

Next, let's look at the overall descriptives of the well-being variables.
Table \@ref(tab:describe-well-being) shows that both life satisfaction and affect were above the mid-point of the scale, although for affect it's a close call.
```{r describe-well-being, echo = FALSE}
well_being <- 
  bind_rows(
    describe(
      working_file,
      "affect",
      trait = FALSE
    ),
    describe(
      working_file,
      "life_satisfaction",
      trait = FALSE
    )
  )  


knitr::kable(
  well_being %>% 
    mutate(
      across(
        mean:cihigh,
        ~ round(.x, digits = 2)
      )
    ),
  caption = "Well-being descriptives"
)
```

Let's plot the distribution of both affect and life satisfaction in Figure \@ref(fig:well-being-distribution).
```{r well-being-distribution, echo=FALSE, fig.cap="Distribution of well-being across waves and participants"}
plot_grid(
  single_cloud(
    working_file,
    well_being,
    "affect",
    "#009E73",
    title = "Affect",
    trait = FALSE
  ),
  single_cloud(
    working_file,
    well_being,
    "life_satisfaction",
    "#E69F00",
    title = "Life satisfaction",
    trait = FALSE
  )
)
```

Last, let's look at how well-being develops over time.
Figure \@ref(fig:well-being-over-time) shows that life satisfaction appears stable: note that the y-axis only spans a few decimal points (original range: 0 to 10).
In contrast, affect has significantly decreased, mostly because anxiety has decreased (not shown on graph).
```{r well-being-over-time, echo=FALSE, fig.cap="Average well-being over time"}
# get summary stats with CI per wave in the long format
well_being_per_wave <-
  working_file %>%
  group_by(wave) %>% 
  summarise(
    across(
      c(affect, life_satisfaction),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        cilow = ~Rmisc::CI(na.omit(.x))[[3]], # lower CI
        cihigh = ~Rmisc::CI(na.omit(.x))[[1]] # upper CI
      )
    )
  ) %>% 
  pivot_longer(
    -wave,
    names_to = c("Variable", "Metric"),
    values_to = "Value",
    names_pattern = "(.*)_([^_]+$)" # life_satisfaction_mean has two underscores
  ) %>% 
  pivot_wider(
    names_from = "Metric",
    values_from = "Value"
  )

ggplot(
  well_being_per_wave,
  aes(
    x = as.factor(wave),
    y = mean,
    color = Variable,
    group = Variable
  )
) +
  geom_line(
    size = 1
  ) +
  geom_errorbar(
    aes(
      ymin = cilow,
      ymax = cihigh,
      width = 0
    )
  ) +
  scale_color_brewer(palette="Dark2") +
  xlab("Wave") +
  ylab("Average well-being")
```
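
The pattern `(.*)_([^_]+$)` used above splits each column name at its last underscore: the greedy `(.*)` takes everything up to the final `_`, and `([^_]+$)` takes the underscore-free remainder.
A quick base R check on column names like those that `across()` creates:

```r
# Column names as created by across() with a named list of functions
cols <- c("affect_mean", "life_satisfaction_mean", "life_satisfaction_cihigh")

pattern <- "(.*)_([^_]+$)"

# First capture group: the variable; second capture group: the metric
variable <- sub(pattern, "\\1", cols, perl = TRUE)
metric   <- sub(pattern, "\\2", cols, perl = TRUE)

variable # "affect" "life_satisfaction" "life_satisfaction"
metric   # "mean" "mean" "cihigh"
```

Splitting at the first underscore instead would break `life_satisfaction` into `life` and `satisfaction_mean`.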

## Correlation matrices

Let's have a look at the raw correlations between the media use variables and well-being.
Figure \@ref(fig:correlation-matrix-use-well-being) shows that a) most media use is correlated, and b) there are negative, but very small correlations between well-being and media use.
Statistical significance isn't informative here because the sample is so large.
That said, we haven't done any grouping yet, so the correlation matrix treats all observations as independent, which is clearly not the case.
```{r correlation-matrix-use-well-being, echo=FALSE, cache=TRUE, message=F, warning=F, fig.cap="Correlation matrix of hours spent with a medium and well-being",fig.dim=c(14,14)}
ggpairs(
  data = working_file %>%
    select(
      contains("time"),
      affect,
      life_satisfaction
    ) %>%
    # remove _time appendix to make variable names shorter
    rename_with(
      ~ str_remove(.x, "_time"),
      ends_with("time"),
    ),
  lower = list(
    continuous = lm_function # custom helper function
  ),
  diag = list(
    continuous = dens_function # custom helper function
  )
) +
  theme(
    axis.line=element_blank(),
    axis.text.x=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks=element_blank(),
    axis.title.y=element_blank(),
    axis.title.x=element_blank(),
    legend.position="none",
    panel.background=element_blank(),
    panel.border=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    plot.background=element_blank(),
    strip.background = element_blank()
  )
```
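
To illustrate why ignoring the grouping matters, here is a sketch with invented data for two participants.
Within each person, more use goes with higher well-being, yet the pooled correlation is negative because the two people differ in their overall levels.
All values are made up:

```r
# Invented repeated measures for two participants
toy <- data.frame(
  id         = rep(c("a", "b"), each = 3),
  use        = c(1, 2, 3, 6, 7, 8),
  well_being = c(7, 8, 9, 2, 3, 4)
)

# Pooled correlation: treats all six rows as independent observations
r_pooled <- cor(toy$use, toy$well_being)

# Within-person correlation: subtract each person's mean first
toy$use_c        <- toy$use - ave(toy$use, toy$id)
toy$well_being_c <- toy$well_being - ave(toy$well_being, toy$id)
r_within <- cor(toy$use_c, toy$well_being_c)

r_pooled < 0 # TRUE
r_within     # 1
```

This is an extreme toy case, but it shows that a raw correlation matrix mixes between-person and within-person information.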

At the request of a reviewer, we can also look at correlations per age group.
We leave a formal analysis up to readers, who can estimate models with an interaction term.
Here, I merely plot the slope between total media time and well-being by age groups.
In Figure \@ref(fig:relation-by-age), we see that the age groups differ in their level of well-being, but not in their slopes.
```{r relation-by-age, echo=FALSE, warning=FALSE, message=FALSE, fig.cap="Relation between total time and well-being by age"}
working_file %>%
  mutate(
    age_group = case_when(
      age < 30 ~ "16-29",
      age >= 30 ~ "30+"
    )
  ) %>% 
  select(id, age_group, total_time, affect, life_satisfaction) %>% 
  pivot_longer(
    .,
    cols = c("affect", "life_satisfaction"),
    names_to = "Measure",
    values_to = "score"
  ) %>% 
  ggplot(
    .,
    aes(
      x = total_time,
      y = score,
      color = age_group
    )
  ) +
  ylim(0, 11) +
  geom_smooth(method = "lm") +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  facet_wrap(~ Measure) +
  theme_cowplot()
  
```
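
For readers who want the formal test, a model with an interaction term might look like the sketch below.
The data are simulated, the variable names merely mirror those used above, and a plain `lm()` ignores the repeated measures per participant, so this is only a starting point (a multilevel model would be more appropriate for the real data):

```r
set.seed(1)

# Simulated stand-in for the real data
n <- 200
sim <- data.frame(
  total_time = runif(n, 0, 10),
  age_group  = rep(c("16-29", "30+"), each = n / 2)
)
# Same slope in both groups, different intercepts
sim$affect <- 5 + 1 * (sim$age_group == "30+") - 0.1 * sim$total_time + rnorm(n)

# total_time * age_group expands to both main effects plus their interaction
fit <- lm(affect ~ total_time * age_group, data = sim)

# The interaction coefficient estimates the difference in slopes between groups
names(coef(fit))
```

The coefficient named `total_time:age_group30+` is the one to inspect when asking whether slopes differ by age group.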

The reviewer was also interested in different lags.
However, showing each correlation at each wave with each lag for each medium would require 6 (lags medium) x 6 (lags well-being) x 2 (well-being measures) x 7 (media types) = `r 6*6*2*7` correlations.
Therefore, Figure \@ref(fig:correlations-per-lag) only shows the relation between total time across all media at each wave and affect at the last wave.
We see little difference in the slopes, meaning that a different lag for the analysis would probably not change the results much.
However, we invite readers to run the analysis themselves with a different lag.
```{r correlations-per-lag, echo=FALSE, warning=FALSE, message=FALSE, fig.cap="Correlation between affect at the last wave and total time at each wave"}
working_file %>% 
  select(id, wave, total_time, affect) %>% 
  pivot_wider(
    id_cols = id,
    names_from = wave,
    values_from = total_time,
    names_glue = "total_time_{wave}"
  ) %>% 
  # add affect at wave 6
  left_join(
    .,
    working_file %>%
      filter(wave==6) %>% 
      select(id, affect)
  ) %>% 
  pivot_longer(
    cols = c(-id, -affect),
    names_to = "lag",
    values_to = 'total_time'
  ) %>% 
  mutate(lag = parse_number(lag)) %>% 
  ggplot(
    .,
    aes(
      x = total_time,
      y = affect
    )
  ) +
  geom_smooth(method = "lm", color = "black") +
  facet_wrap(~ lag) +  
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme_cowplot() +
  xlab("Total time at each wave") +
  ylab("Affect at the last wave")
```


## Plots for Paper

For the paper, I want to avoid confronting readers with blocks of numbers.
Therefore, I'll try to put as much info as possible into figures rather than tables or in-text reports of numbers.

I think the following plots are necessary for the paper:

1. A plot of the response rate per wave, simply as a descriptive measure.
2. A plot showcasing the exclusions. I'll describe the exclusion criteria in the paper, but don't want to overwhelm readers with numbers for each.
3. A plot showing the distributions, M, and SD of media use for different categories.
4. Same plot, but for well-being.
5. A plot of the results.

First, a plot for the response rate for each wave.
We already showed response rates as a table in the processing section, so we can re-use the `completion_table` object for a plot.
Not sure about the colors yet, might go all black for this one.
```{r response-rate-plot}
ggplot(
  data = completion_table %>% 
    rename(
      Wave = Waves,
      `Participants per wave (%)` = `Participants per wave`
    ) %>% 
    mutate(
      Wave = as.factor(Wave)
    ),
  aes(
    x = Wave,
    y = `Participants per wave (%)`#,
    # color = Wave,
    # fill = Wave
  )
) +
  geom_segment(
    aes(
      xend = Wave,
      y = 0,
      yend = `Participants per wave (%)`
    ),
    size = 1
  ) +
  geom_point(
    size = 2
  ) +
  geom_text(
    aes(
      label = paste0(`Participants per wave (%)`, " (", `Frequency per wave`, "%)"),
      y = `Participants per wave (%)` + 160
    )
  ) +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    legend.position = "none"
  ) -> figure1

figure1

# create figure folder
dir.create(here("figures"), showWarnings = FALSE, recursive = TRUE)

ggsave(
  filename = here("figures", "figure1.png"),
  plot = figure1,
  width = 21 * 0.8,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)
```

Next, I plot the exclusions.
There are two ways to go about this.
The first is to show the absolute number of participants and observations that we exclude with each criterion in relation to the total sample size.
I do that below.
Note that getting those exclusions in a nice format requires some wrangling.
I already created the `exclusion_plot_data` object in the processing section.
Here, I turn it into the long format first.
```{r exclusions-plot1}
# to long format for plotting
exclusion_plot_data <-
  exclusion_plot_data %>%
  filter(!str_detect(Exclusion, "Total")) %>% # don't include the total numbers for now
  rename(Participants = PPs) %>% # nicer name
  pivot_longer(
    Participants:Observations,
    names_to = "Measure",
    values_to = "Value"
  ) %>% 
  mutate( # create new variable that's stable across exclusions
    Total = case_when(
      Measure == "Participants" ~ exclusion_plot_data %>% filter(Type == "Total") %>% pull(PPs),
      TRUE ~ exclusion_plot_data %>% filter(Type == "Total") %>% pull(Observations),
    ),
    `Total Before` = case_when(
      Measure == "Participants" ~ exclusion_plot_data %>% filter(Type == "Total Before") %>% pull(PPs),
      TRUE ~ exclusion_plot_data %>% filter(Type == "Total Before") %>% pull(Observations),
    ) 
  ) %>% 
  mutate(
    across(
      Type:Measure,
      as.factor
    )
  )

# https://stackoverflow.com/questions/11889625/annotating-text-on-individual-facet-in-ggplot2
# I want to show the total numbers once, not for each facet/criterion combination, which is why I create a data frame that only has a matching value for the positions in the plot that I want
label_positions <- 
  tibble(
    Type = rep("Wave-level", 2),
    Exclusion = c(3, 3), # on the right, where wave-level exclusions are
    Measure = c("Participants", "Observations")
  ) %>% 
  mutate(
    across(
      everything(),
      as.factor
    )
  )

# add the total values
label_positions <- 
  left_join(
    label_positions,
    exclusion_plot_data %>% select(-Value)
  )

# then plot
ggplot(
  exclusion_plot_data,
    aes(
      x = Exclusion,
      y = Value,
      color = Type
    )
  ) +
  geom_segment(
    aes(
      x = Exclusion,
      xend = Exclusion,
      y = 0,
      yend = Value
    ),
    alpha = 1,
    size = 1
  ) +
  facet_grid(Measure ~ Type, scales = "free") +
  geom_point(alpha = 1, size = 2) +
  geom_text(
    aes(
      label = paste0("-", `Total Before` - Value),
      x = as.numeric(Exclusion) + 0.35,
      y = Total * 0.1
    )
  ) +
  geom_hline(
    aes(
      yintercept = Total
    ),
    linetype = "dashed"
  ) +
  geom_hline(
    aes(
      yintercept = `Total Before`
    ),
    linetype = "solid"
  ) +
  geom_text(
    data = label_positions,
    aes(
      label = `Total`,
      x = 3.3,
      y = Total * 1.05
    ),
    color = "black"
  ) +
  geom_text(
    data = label_positions,
    aes(
      label = `Total Before`,
      x = 3.3,
      y = `Total Before` * 1.05
    ),
    color = "black"
  ) +
  xlab("Exclusion Criterion") +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.background=element_blank(),
    panel.border=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    plot.background=element_blank(),
    strip.background = element_blank()
  ) -> exclusion_figure1

exclusion_figure1
```
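
The labelling trick used above (showing a text label in one facet only) boils down to passing `geom_text()` its own data frame that contains a row only for the facet where the label should appear.
A stripped-down sketch with invented data; the column names `panel`, `x`, `y`, and `label` are placeholders:

```r
library(ggplot2)

# Invented data with two facets
toy <- data.frame(
  panel = c("A", "A", "B", "B"),
  x     = c(1, 2, 1, 2),
  y     = c(3, 4, 5, 6)
)

# Label data with a row only for panel "B", so the text appears in that facet alone
labels <- data.frame(panel = "B", x = 1.5, y = 6, label = "Total")

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_text(data = labels, aes(label = label)) +
  facet_wrap(~ panel)

p
```

Because the label data has no row for panel "A", ggplot2 draws nothing there.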

Alternatively, we could emphasize the contributions of each criterion in relation to the other criteria.
Not sure I like this because it doesn't show the total sample size and the reduction in total sample size.
Then again, I could simply add those to the figure caption.
```{r exclusions-plot2}
ggplot(
  exclusion_plot_data %>% 
    mutate(
      Value = `Total Before` - Value
    ),
    aes(
      x = Exclusion,
      y = Value,
      color = Type
    )
  ) +
  geom_segment(
    aes(
      x = Exclusion,
      xend = Exclusion,
      y = 0,
      yend = Value
    ),
    alpha = 1,
    size = 1
  ) +
  facet_grid(Measure ~ Type, scales = "free") +
  geom_point(alpha = 1, size = 2) +
  geom_text(
    aes(
      label = paste0(round(Value/`Total Before`, digits = 4) * 100, "%"),
      y = Value
    ),
    vjust = 1,
    hjust = -0.2,
    size = 3
  ) +
  xlab("Exclusion Criterion") +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.background=element_blank(),
    panel.border=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    plot.background=element_blank(),
    strip.background = element_blank()
  ) -> exclusion_figure2

exclusion_figure2
```

Next, I plot the distributions of the use variables.
I want to show distributions, but there are a couple of single large values, so violin or density plots will look strange with a long tail.
Raincloud plots could solve that to a degree, but they duplicate information.
I'll go with a dot plot, namely a beeswarm plot.

Like before, because I facet, I'll turn the data into the long format first.
I'm also not sure whether I like vertical or horizontal beeswarms, so I'll do both.
```{r distributions-time-plot1, warning=FALSE, message=FALSE}
# temporary data file in long format
dat <- 
  working_file %>% 
  pivot_longer(
    contains("_time"), # only the time variables
    names_to = "measure",
    values_to = "Hours per day"
  ) %>% 
  mutate(
    measure = as.factor(str_to_title(str_remove(measure, "_time"))), # prettier factor levels
    measure = fct_recode(
      measure,
      "TV" = "Tv"
    ),
    measure = fct_relevel(
      measure,
      c(
        "Music",
        "TV",
        "Films",
        "Games",
        "Books",
        "Magazines",
        "Audiobooks",
        "Total"
      )
    )
  )

# summary stats for plotting means over time
dat_summary <- 
  dat %>% 
  group_by(wave, measure) %>% 
  summarise(
    mean = mean(`Hours per day`, na.rm = T),
    sd = sd(`Hours per day`, na.rm = T)
  ) %>% 
  mutate(
    across(
      mean:sd,
      ~ round(.x, digits = 1)
    )
  )

# let's try horizontal first (the commented out section adds a vertical line where the mean is)
ggplot(
  dat,
  aes(
    x = `Hours per day`,
    y = 1,
    color = measure,
    fill = measure
  )
) +
  geom_quasirandom(groupOnX=FALSE, size = 0.1, alpha = 0.5) +
  # geom_vline(
  #   data = dat_summary,
  #   aes(
  #     xintercept = mean,
  #     color = measure
  #   ),
  #   linetype = "dashed"
  # ) +
  geom_point(
    data = dat_summary,
    aes(
      x = mean,
      y = 0.55
    ),
    shape = 25,
    size = 2
  ) +
  facet_grid(measure ~ wave) +
  geom_text(
    data = dat_summary,
    aes(
      x = 20,
      y = 1.4,
      label = paste0("M = ", mean)
    ),
    size = 3,
    color = "black"
  ) +
  geom_text(
    data = dat_summary,
    aes(
      x = 19.6,
      y = 1.3,
      label = paste0("SD = ", sd)
    ),
    size = 3,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.line.y = element_blank(),
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure2.1

figure2.1
```

I guess the trendline makes it easier to see developments (or lack thereof) over time.
```{r distributions-time-plot2, warning=FALSE, message=FALSE}
ggplot(
  dat,
  aes(
    x = as.factor(wave),
    y = `Hours per day`,
    color = measure,
    fill = measure,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2) +
  geom_line(
    data = dat_summary,
    aes(
      y = mean
    ),
    size = 1
  ) +
  facet_wrap(~ measure, ncol = 2) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -1.2,
      label = paste0("M = ", mean)
    ),
    size = 2.5,
    color = "black"
  ) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -2.9,
      label = paste0("SD = ", sd)
    ),
    size = 2.5,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  xlab("Wave") +
  theme(
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure2.2

figure2.2
```

Then again, for distributions with a small mean, the trendline blocks everything.
Plus, because total time extends the y-axis, there's lots of white space.
I'll give total time its separate plot to solve that issue.
I also won't have trendlines for the separate media, only for the total (too lazy to write the above into a function, so I'll copy-paste the code).
```{r distributions-time-plot3, warning=FALSE, message=FALSE}
ggplot(
  dat %>% 
    filter(measure != "Total"),
  aes(
    x = as.factor(wave),
    y = `Hours per day`,
    color = measure,
    fill = measure,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2) +
  # geom_line(
  #   data = dat_summary,
  #   aes(
  #     y = mean
  #   ),
  #   size = 1
  # ) +
  facet_wrap(~ measure, ncol = 2) +
  geom_text(
    data = dat_summary %>% 
      filter(measure != "Total"),
    aes(
      x = as.factor(wave),
      y = -1.2,
      label = paste0("M = ", mean)
    ),
    size = 2.5,
    color = "black"
  ) +
  geom_text(
    data = dat_summary %>% 
      filter(measure != "Total"),
    aes(
      x = as.factor(wave),
      y = -2.9,
      label = paste0("SD = ", sd)
    ),
    size = 2.5,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  xlab("Wave") +
  theme(
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure2.3

figure2.3

ggsave(
  here("figures", "figure2.png"),
  plot = figure2.3,
  width = 21 * 0.8,
  height = 29.7 * 0.8,
  units = "cm",
  dpi = 300
)
```

Then a separate plot for total time.
```{r distributions-total-time-plot, warning=FALSE, message=FALSE}
ggplot(
  dat %>% 
    filter(measure == "Total"),
  aes(
    x = as.factor(wave),
    y = `Hours per day`,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2, color = "#CC79A7") +
  geom_line(
    data = dat_summary %>% 
      filter(measure == "Total"),
    aes(
      y = mean
    ),
    size = 0.5,
    color = "#CC79A7"
  ) +
  geom_text(
    data = dat_summary %>% 
      filter(measure == "Total"),
    aes(
      x = as.factor(wave),
      y = -1.2,
      label = paste0("M = ", mean)
    ),
    size = 3,
    color = "black"
  ) +
  geom_text(
    data = dat_summary %>% 
      filter(measure == "Total"),
    aes(
      x = as.factor(wave),
      y = -2.4,
      label = paste0("SD = ", sd)
    ),
    size = 3,
    color = "black"
  ) +
  theme_cowplot() +
  xlab("Wave") +
  ylab("Total hours per day") +
  theme(
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure3

figure3

ggsave(
  here("figures", "figure3.png"),
  plot = figure3,
  width = 21 * 0.9,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)
```

Then, I need to think of a way to visualize use vs. nonuse over time.
Bar graphs could work, but that would mean a lot of bar graphs (one per medium) per wave, which is probably overwhelming.
Plus, if I show raw counts, it could be confusing as well, simply because height of the bar != number of participants.
If everyone at wave 1 used only one medium, but at wave two everyone used all media, it'll look like an increase in sample size.

So proportions make the most sense, simple and straightforward.
```{r proportions-use-plot, warning=FALSE, message=FALSE}
# get proportions of users per medium and wave
dat <- 
  working_file %>% 
  pivot_longer(
    contains("_time"), # only the time variables
    names_to = "measure",
    values_to = "Hours per day"
  ) %>% 
  mutate(
    # get use vs. no use in long format
    used = if_else(`Hours per day` == 0, 0, 1)
  ) %>% 
  # no need for total time
  filter(used == 1, measure != "total_time") %>%
  # remove "_time" suffix
  mutate(measure = as.factor(str_to_title(str_remove(measure, "_time")))) %>% 
  # count how often a medium was used
  count(wave, measure, used) %>% 
  # add total N per wave (of only those that will be included in analysis)
  left_join(
    .,
    working_file %>% 
      count(wave, name = "N")
  ) %>% 
  # get proportion
  mutate(
    proportion = round(n / N, digits = 2) * 100,
    measure = fct_recode(measure, "TV" = "Tv"),
    measure = fct_relevel(
      measure,
      c(
        "Music",
        "TV",
        "Films",
        "Games",
        "Books",
        "Magazines",
        "Audiobooks"
      )
    )
  )

ggplot(
  dat,
  aes(
    x = as.factor(wave),
    y = proportion,
    group = measure,
    color = measure
  )
) +
  geom_point(size = 3) +
  geom_line(size = 1) +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  ylab("% of people who used medium") +
  xlab("Wave") +
  geom_dl(aes(label = measure), method = list(dl.trans(x = x + 0.25), "last.points", cex = 0.6)) +
  theme(
    legend.position = "none"
  ) -> figure4

figure4

ggsave(
  here("figures", "figure4.png"),
  plot = figure4,
  width = 21 * 0.9,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)
```
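
The argument for proportions over raw counts can be sketched with invented data in which the same three participants answer both waves but use more media at wave 2:

```r
# Invented long-format data: 3 participants, 2 media, 2 waves
toy <- data.frame(
  id      = rep(1:3, times = 4),
  wave    = rep(c(1, 1, 2, 2), each = 3),
  measure = rep(c("music", "tv", "music", "tv"), each = 3),
  hours   = c(1, 2, 1,  0, 0, 0,   # wave 1: everyone used music, nobody used TV
              1, 1, 2,  2, 1, 3)   # wave 2: everyone used both
)
toy$used <- toy$hours > 0

# Raw counts of "used" rows per wave double from wave 1 to wave 2,
# although the number of participants stays at 3
counts <- tapply(toy$used, toy$wave, sum)

# Percentage of users per wave and medium stays interpretable
proportions <- tapply(toy$used, list(toy$wave, toy$measure), mean) * 100

counts      # wave 1: 3, wave 2: 6
proportions # music: 100 and 100; tv: 0 and 100
```

A stacked bar of the counts would suggest the sample doubled, whereas the percentages show that only TV use changed.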


Next, I'll do a beeswarm plot for well-being.
```{r distributions-well-being-plot, warning=FALSE, message=FALSE}
# temporary data file in long format
dat <- 
  working_file %>% 
  pivot_longer(
    c(affect, life_satisfaction), # only the aggregated well-being variables
    names_to = "measure",
    values_to = "Value"
  ) %>% 
  mutate(
    measure = fct_recode(
      as.factor(measure),
      "Affect" = "affect",
      "Life satisfaction" = "life_satisfaction"
    )
  )

# summary stats for plotting means over time
dat_summary <- 
  dat %>% 
  group_by(wave, measure) %>% 
  summarise(
    mean = mean(Value, na.rm = T),
    sd = sd(Value, na.rm = T)
  ) %>% 
  mutate(
    across(
      mean:sd,
      ~ round(.x, digits = 1)
    )
  )

ggplot(
  dat,
  aes(
    x = as.factor(wave),
    y = Value,
    color = measure,
    fill = measure,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2) +
  geom_line(
    data = dat_summary,
    aes(
      y = mean
    ),
    size = 0.5
  ) +
  facet_wrap(~ measure, ncol = 2) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -0.5,
      label = paste0("M = ", mean)
    ),
    size = 2.5,
    color = "black"
  ) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -0.9,
      label = paste0("SD = ", sd)
    ),
    size = 2.5,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=c("#E69F00", "#000000")) +
  scale_fill_manual(values = c("#E69F00", "#000000")) +
  xlab("Wave") +
  theme(
    axis.title.y = element_blank(),
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure5

figure5

ggsave(
  here("figures", "figure5.png"),
  plot = figure5,
  width = 21 * 0.8,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)
```