lab06.qmd

# Lab 6: Applying Functional Programming with Purrr to Models {.unnumbered}

## Package(s)

- [broom](https://broom.tidymodels.org)
- [purrr](https://purrr.tidyverse.org)

## Schedule

- 08.00 - 08.45: [Recap of Lab 5](https://raw.githack.com/r4bds/r4bds.github.io/main/recap_lab06.html)
- 08.45 - 09.00: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab06.html)
- 09.00 - 09.15: Break
- 09.00 - 12.00: [Exercises](#sec-exercises)

## Learning Materials

_Please prepare the following materials:_

- **Important:** Questionnaire (brief, 5-10 min): [Course Midway Evaluation](https://forms.gle/mTRgyYbPdANUHK4r9)
-  Book (Note, this is intentionally 1.ed.): [R4DS: Chapter 22: Introduction](https://r4ds.had.co.nz/model-intro.html)
-  Book (Note, this is intentionally 1.ed.): [Chapter 23: Model Basics](https://r4ds.had.co.nz/model-basics.html)
-  Book (Note, this is intentionally 1.ed.): [Chapter 24: Model Building](https://r4ds.had.co.nz/model-building.html)
-  Book (Note, this is intentionally 1.ed.): [Chapter 25: Many models](https://r4ds.had.co.nz/many-models.html)
- Video: [Broom: Converting Statistical Models to Tidy Data Frames](https://www.youtube.com/watch?v=7VGPUBWGv6g)
- Video: [Alex Hayes | Solving the model representation problem with broom | RStudio (2019)](https://www.youtube.com/watch?v=gLZyvY5_kWQ)
- Video: ["The Joy of Functional Programming (for Data Science)" with Hadley Wickham](https://www.youtube.com/watch?v=bzUmK0Y07ck)
- Optional: If you are completely new to statistical modelling, then [click here for a primer](primer_on_linear_models_in_r.qmd)

_Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book_

## Learning Objectives

*A student who has met the objectives of the session will be able to:*

- Fit a simple linear model and interpret model parameters
- Understand and apply simple purrr-functions for element-wise function application
- Understand and apply grouped supervised models to form nested model objects
- Understand and apply the broom-functions for tidying various model objects
- Optional LO: Perform a basic principal component analysis for dimension reduction of high dimensional data
- Optional LO: Perform a basic unsupervised k-means clustering of high dimensional data

## Exercises {#sec-exercises}

## Throwback...

Using the `tibble()` function, re-create the data from this visualisation and then create your version of how this data could be visualised in a more informative manner:

![](images/bad_viz_15.png){fig-align="center" width=90%}

Also... If you still need to be convinced of the flexibility of ggplot, try running this code:

```{r}
#| echo: true
#| eval: false
xy <- seq(from = -3,
          to = 3, 
          by = 0.01)
expand.grid(x = xy,
            y = xy) |>
  ggplot(
    mapping = aes(
      x = (1 - x - sin(y^2)),
      y = (1 + y - cos(x^2)))) +
  geom_point(alpha = 0.05,
             shape = 20,
             size = 0) +
  theme_void() +
  coord_polar()
```

If you are curious about what is going on here, try googling "Generative art"... Anyhoo... Let us move on...

## Prologue

```{r}
#| echo: false
#| eval: true
#| message: false

library("tidyverse")
library("broom")
```

So far, we have worked on

- Lab 2: Genomics, the `gravier` data from @Gravier2010.
- Lab 3: Metagenomics, the `pitlatrine` data from @Torondel2016.
- Lab 4: Clinical data, the `diabetes` data from @Willems1997 and @Schorling1997.
- Lab 5: High throughput immunoinformatics data, the `SARS-CoV-2` data from @Nolan2020.

### Getting Started

_First, make sure to read and discuss the feedback you got from last week's assignment!_

Then, once again go to the [R for Bio Data Science RStudio Cloud Server](https://teaching.healthtech.dtu.dk/22100/rstudio.php)

Create a **new** Quarto document and remember to save it

Today, we will re-examine the data acquired in Lab 2 to further explore the research question:

- **What genes are significantly up- or down-regulated between the patients with/without early metastasis?"**

## Load Data

**Before we do this, go to the console and enter the command: `rm(list = ls())`**

_This will clear your environment from old object from previous labs_

Good now that we have a clean slate - Recall, so far we have adjusted the chunk setting to `#| eval: true` first time we get some data from an online repository and then subsequently set `#| eval: false`.

Let's do that in a bit nicer way.

- **T1: Create a new "Load Data" header in your document and add the below chunk:**

```{r}
#| eval: true
#| echo: true

raw_dir <- "data/_raw/"
data_file <- "gravier.RData"
data_loc <- "https://github.com/ramhiser/datamicroarray/raw/master/data/"

if( !dir.exists(raw_dir) ){
  dir.create(path = raw_dir)
}
if( !file.exists(str_c(raw_dir, data_file)) ){
  download.file(
    url = str_c(data_loc, data_file),
    destfile = str_c(raw_dir, data_file))
}
load(file = str_c(raw_dir, data_file))
```

- **Q1: In your group, discuss, what is going on here? Make sure you follow the difference between the first time you run this and the second!**

<details><p><summary>Click here for hint</summary></p>
We have used the `str_c()` function before, try running e.g.:

```{r}
#| echo: true
#| eval: false

x <- "a"
y <- "b"
str_c(x, y)
```

For the functions `dir.exists()` and `file.exists()`, the hint is in the title, they return logicals. To get a better understanding of this, try running e.g.:

```{r}
#| echo: true
#| eval: false

x <- 2
x == 2
if( x == 2 ){
  print("Yes!")
}
```

Then change `x == 2` to `x != 2`, re-run the code and see what happens

</details>


## Clean Data

The next step is to clean up the data:

1. Use the `ls()` function to see what objects you have in your environment
1. Use the `str()` function on the gravier data you retrieved to answer:

- **Q2: Discuss in your group if this is tidy data?**

- **T2: Create a new "Clean Data" header in your document and add the below chunk:**

```{r}
#| echo: true
#| eval: true

gravier_clean <- gravier |>
  bind_cols() |>
  as_tibble()
```

- **Q3: Discuss in your group if this is tidy data?**

- **Q4: In fact, specifically state what are the "rules" for tidy data?**

- **Q5: In your group, discuss why `bind_cols` can by _very very_ dangerous to use?**

Now, moving on, let's write the clean data to disk:

- **T3: In your "Clean Data"-section, add a new chunk, where you write a tab-separated-values gzipped (i.e. compressed) file called "02_gravier_clean" (with the correct file type specification) into your "data"-folder**

<details><p><summary>Click here for hint</summary></p>
*Just as you can `read` a file, you can of course also `write` a file. Note the file type we want to write here is tab-separated-values. If you in the console type e.g. `readr::wr` and then hit the Tab key, you will see the different functions for writing different file types. We previously did a trick to automatically gzip (compress) files?*
</details>

## Augment Data

- **T4: Create a new "Augment Data" header in your document and add the below chunk:**

```{r}
gravier_clean_aug <- gravier_clean |>
  mutate(y = case_when(y == "poor" ~ 1,
                       y == "good" ~ 0)) |> 
  relocate(early_metastasis = y)
```

- **Q6: In your group, discuss, what each step of the above work flow does, i.e. what are the specifics of the `dplyr` pipeline?**

- **T5: In your "Augment Data"-section, add a new chunk, where you write a tab-separated-values gzipped (i.e. compressed) file called "03_gravier_clean_aug" (with the correct file type specification) into your "data" folder**

## Analysis

### One Gene, one model

- **T6: Create a new "Analysis" header in your document**

Recall, in the second lab, we were looking at "our favourite gene". In the following either look back to what was your favourite gene or choose a new 🤷️

Let's fit our first model! If the concept of models and linear regression is unfamiliar, consider checking out the [primer on linear models in R](primer_on_linear_models_in_r.qmd) before proceeding.

- **T7: Use the `lm()` function to create your first model and save it to a new variable e.g. "my_first_model"**

<details><p><summary>Click here for hint</summary></p>
*Use the formula `my_favourite_gene ~ early_metastasis` and remember when you pipe into the `lm()` function, you have to specify `data = _`*
</details>

- **Q7: What are your coefficients? Mine are:**

```{r}
#| eval: true
#| echo: false
my_first_model <- gravier_clean_aug |> 
  lm(formula = g2E09 ~ early_metastasis,
     data = _)
my_first_model |> pluck("coefficients")
```

- **T8: Use the `group_by()` $\rightarrow$ `summarise()` workflow to calculate the mean values of the gene expression for your favourite gene stratified on `early_metastasis`**

- **Q8: What are your mean values? Mine are:**

```{r}
#| eval: true
#| echo: false
gravier_clean_aug |>
  group_by(early_metastasis) |>
  summarise(mu = mean(g2E09))
```

- **Q9: Discuss in your group: How are your coefficients related to your mean expression values?**

<details><p><summary>Click here for hint</summary></p>
*Recall, we have two terms here, `intercept` and `slope`. The intercept is the `y` value at `x = 0` and the `slope` is the change in `y` value, for one unit change in `x`*
</details>

- **Q10: Discuss in your group: Is your gene up- or down-regulated from `early_metastasis = 0` to `early_metastasis = 1` and use the `summary()` function to see if is it statistically significant at a level of significance of $\alpha = 0.05$?**

<details><p><summary>Click here for hint</summary></p>
*Try to run `my_first_model |> summary()` and look at the estimate for `early_metastasis`, i.e. the slope and also see if you can find the `p-value` somewhere in this summary...*
</details>

Excellent! Now we have a feeling for working with the `lm()` functions and a basic understanding of the meaning of the coefficients, when we are using linear regression to model a binary outcome. The reason that we use a linear regression model in this case is, that for these exercises, we want to investigate the relationship between variables rather than obtaining probability predictions, i.e.

- **What genes are significantly up-/down-regulated between the patients with- and without early metastasis?**

*So, without further ado, let's dive in!*

### All the Genes, all the models

First, the recent couple of years have seen an immense development in unifying the modelling interface in `R`, which is notoriously inconsistent. You may be familiar with the `caret` package, the developer of which has created [tidymodels](https://www.tidymodels.org), (which I really wish we had time to explore in details). In the following we will work with some of the principles for tidying model object using `broom`, having object nested in tibbles and working with these using `purrr`.

Now, you saw above how we could fit *one* model for *one* gene. So, we could repeat the procedure you worked through for each gene, but first consider:

- **Q11: How many genes are there in the gravier data set?**

Now, we just have to make one model for each gene!

![](images/statistical_models_everywhere.png){fig-align="center" width=90%}

Honestly, **let's not!** Also, recall: *"We don't loop, we func!"*, so let's see how that would work.

### Models, models everywhere...

In principle, you could overflow your environment with individual model objects, but that would require **a lot** of code-lines and **a lot** of hard-coding. But let us instead see if we can come up with something just a tad more clever.

#### Preparing the Data

First:

- **Q12: Discuss in your group, if the `gravier_clean_aug` is a "wide" or a "long" dataset?**

Once you have agreed upon and understood why, this is a *wide* data set, proceed and:

- **T9: Create this long version of your `gravier_clean_aug` data and save it in `gravier_clean_aug_long`**

```{r}
#| eval: true
#| echo: false

gravier_clean_aug_long <- gravier_clean_aug |>
  pivot_longer(cols = starts_with("g"),
               names_to = "gene",
               values_to = "log2_expr_level")
gravier_clean_aug_long
```

<details><p><summary>Click here for hint</summary></p>
*Recall which function took a dataset from "wide" to "long" format and also the three parameters of interest for that function is `cols`, `names_to` and `values_to`. If you look at my example, you will see that the columns we have pivoted have types `<chr>` and `<dbl>`, these will map to the `names_to` and `values_to` parameters respectively. The `cols` can be used with the so-called "helper-functions", that we have previously talked about. If you look at the gene-columns in the data, all the genes starts with a "g", so we can use the helper functions `starts_with()` and then give the argument "g" to the `match` parameter*
</details>

The reason we want the "long" version of the data set is, that now we have ALL the genes defined in ONE `gene` column. This means that we can use `group_by()` to work *per* gene without looping (*Recall: Don't loop, only func!*).

Now:

- **T10: Create a `dplyr` pipeline, use the `group_by()` function to group your `gravier_clean_aug_long` dataset by `gene` and then add the `nest()` and `ungroup()` functions to your pipeline**

```{r}
#| eval: true
#| echo: false
gravier_clean_aug_long_nested <- gravier_clean_aug_long |>
  group_by(gene) |>
  nest() |> 
  ungroup()
gravier_clean_aug_long_nested
```

Note, this is conceptually a **super-tricky data structure!**

- **Q13: Discuss in your group, what happened to the data?**

<details><p><summary>Click here for hint</summary></p>
*Try to run each of the following code chunks and examine what happens at each step:*

```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested
```

```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
  filter(gene == "g2E09") # Replace "g2E09" with whatever was YOUR favourite gene!
```

```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
  filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
  pull(data)
```

</details>

- **Q14: Moreover, discuss in your group, what does `<tibble [168 × 2]>` mean?**

Now, **if** you're experiencing, that the server seems slow, consider if you want to proceed now with ALL the genes or just a subset. If you just want a subset, then use `sample_n()` to randomly select e.g. 100 genes for further analysis. Remember you may want to use the `set.seed()` function to create a reproducible work flow even when sampling.

```{r}
#| eval: false
#| echo: false
set.seed(934485)
gravier_clean_aug_long_nested = gravier_clean_aug_long_nested |>
  sample_n(100)
gravier_clean_aug_long_nested
```

#### Fitting Models

Now, recall our research question:

- **What genes are significantly up-/down-regulated between the patients with- and without early metastasis?**

To investigate this, we want to fit a linear model to each gene, i.e. as you did initially for your favourite gene, we want to do in a clever way *per* gene for ALL genes.

- **T11: Use the `group_by()` function to let `R` know, that we want to work *per* gene**

```{r}
#| echo: false
#| eval: true
gravier_clean_aug_long_nested |> 
  group_by(gene)
```

- **T12: Then using the `map()`-function, add a new line to your pipeline, where you add a new variable `model_object` to your `gravier_clean_aug_long_nested` dataset, which `R` will compute *per* gene**

<details><p><summary>Click here for hint</summary></p>

*To do this you will need the following syntax and then wrap that inside the relevant tidyverse-function for creating a new variable:*

```{r}
#| echo: true
#| eval: false
model_object = map(.x = data,
                   .f = ~lm(formula = log2_expr_level ~ early_metastasis,
                            data = .x))
```

</details>

```{r}
#| echo: false
#| eval: true
gravier_clean_aug_long_nested <- gravier_clean_aug_long_nested |> 
  group_by(gene) |> 
  mutate(model_object = map(.x = data,
                            .f = ~lm(formula = log2_expr_level ~ early_metastasis,
                                     data = .x)))
```

Make sure to **understand** the `map()` function here, it is **completely central** to functional programming with `purrr`:

- We need the `group_by()` to define which variable holds the elements to each of which we want to map
- `model_object` is a **new** variable, we are creating, which will contain the result of our call to the `map()` function
- `.x` to what **existing** (nested) variable are we mapping?
- `.f` which function do we want to map to each element in the **existing** (nested) variable?
- Note that `log2_expr_level` and `early_metastasis` are variables "inside" the nested `data` variable

Note, once again, this is conceptually a _super-tricky data structure_, not only do we have a *per gene* nested tibble, but now we also have a *per gene* nested model object - So please do make sure to discuss in your group, what is going on here, e.g. try running this and discuss what you see:

```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
  filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
  pull(model_object)
```

#### Tidying Models

Excellent! Now we have a *per gene* model. Let us use the `broom` package to extract some information from each of the models. First, to get a better understanding of what is going on when calling the `tidy()` function, try running this:

```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
  
  # Here, you should replace "g2E09" with whatever was YOUR favourite gene!
  filter(gene == "g2E09") |> 
  
  # Pull() on tibbles: This pulls out the model_object variable.
  #   Note! This is a list, because we nested!
  pull(model_object) |> 
  
  # Pluck() on lists: From the list we got from the last step,
  #   we "pluck" the first element
  pluck(1) |>
  
  # The result of pluck, is a model object,
  #   upon which we can call the tidy function
  tidy(conf.int = TRUE,
       conf.level = 0.95)
```

Now, we want to apply this `tidy()` function *per `model_object`*:

- **T13: Scroll a bit back to where we created the `model_object` and see if you can translate that into mapping the `tidy()` function to the `model_object` variable, thereby creating a new variable `model_object_tidy` - This is tricky, so do make sure to discuss in your group how this can be done!**

<details><p><summary>Click here for hint</summary></p>
*Remember the parameters to the `tidy()` functions, which gives us the confidence intervals and defines corresponding level. Also, everything you need is in the previous layout of the `map()` function. Note, that the calls to the `pull()` and `pluck()` functions above, pertains to **extracting** from a tibble, which we do **not** want, on the contrary, we want **all** objects to be contained in our `gravier_clean_aug_long_nested` data*
</details>


```{r}
#| echo: false
#| eval: true
gravier_clean_aug_long_nested <- gravier_clean_aug_long_nested |> 
  mutate(model_object_tidy = map(.x = model_object,
                                 .f = ~tidy(x = .x,
                                            conf.int = TRUE,
                                            conf.level = 0.95)))
gravier_clean_aug_long_nested
```

Note, once again, this is conceptually a _super-tricky data structure_, not only do we have a *per gene* nested tibble, but now we also have a *per gene* nested model object and now also a nested tibble of `tidy`ed objects - So please again do make sure to discuss in your group, what is going on here, e.g. try running this and discuss what you see:

```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
  filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
  pull(model_object_tidy)
```

#### Wrangling

We're almost there - just a bit of wrangling to go!

Just as you saw that we could `nest()` on a variable (recall we did that for the `gene` variable), you can do the opposite and lo and behold, that is the `unnest()` function. Before we continue:

- **T14: Create a `dplyr` pipeline and save the result in a new variable called `gravier_estimates`: Use the `unnest()` function to unpack the `model_object_tidy`**

```{r}
#| echo: false
#| eval: true

gravier_estimates <- gravier_clean_aug_long_nested |>
  unnest(model_object_tidy)
gravier_estimates
```

- **T15: The again, create a `dplyr` pipeline and save the result in a the same `gravier_estimates` variable: Subset the rows to only get the slope term and then choose variables as displayed below, finally end with un-grouping your data, as we no longer need the groups**

```{r}
#| echo: false
#| eval: true

gravier_estimates <- gravier_estimates |> 
  filter(term == "early_metastasis") |> 
  select(gene, p.value, estimate, conf.low, conf.high) |> 
  ungroup()
gravier_estimates
```

- **T16: To your `gravier_estimates` dataset, add a variable `q.value`, which is the result of calling the `p.adjust()` function on your `p.value` variable and also add an indicator variable denoting if a given gene is significant or not**

```{r}
#| echo: false
#| eval: true

gravier_estimates <- gravier_estimates |>
  mutate(q.value = p.adjust(p.value),
         is_significant = case_when(q.value <= 0.05 ~ "yes",
                                    q.value > 0.05 ~ "no"))
gravier_estimates
```

*But... `q.value`??? What are you on about??? For a nice primer on how multiple testing correction works, please read this article by @Noble2009!*

#### Recap

If you (understandably by now) have lost a bit of overview of what is going on, let's just re-iterate.

1. We have the `gravier` dataset, with the log2-expression levels for 2,905 genes of 168 patients of whom 111 did *not* have early metastasis and 57 who did
1. We are interested in investigating what genes are significantly up-/down-regulated between the patients with- and without early metastasis
1. First, we retrieved the data from the data repository, cleaned and augmented it and saved it to disk
1. Then pivoted the data, so we could work *per gene* (The `gene` variable)
1. Next, we grouped *per gene* and nested the data (The `data` variable)
1. Then, we fitted a linear model to each gene (The `model_object` variable)
1. Next, we used the `broom` package to tidy the fitted model incl. getting confidence intervals (The `model_object_tidy` variable)
1. Lastly, we extracted the model parameters, corrected for multiple testing and added and indicator for significant findings

Now, we actually have everything we need to answer:

- **What genes are significantly up- or down-regulated between the patients with/without early metastasis?**

In the following, we will use a level of significance of $\alpha = 0.05$ to provide this answer.

#### Visualise

Right, let's get to it!

- **T17: Re-create this forest plot to finally reveal the results of your analysis** <span style="color: red;">GROUP ASSIGNMENT</span> (Important, see: [how to](assignments.qmd))

```{r}
#| echo: false
#| eval: true

gravier_estimates |> 
  filter(is_significant == "yes") |> 
  ggplot(aes(x = estimate,
             y = fct_reorder(gene, estimate),
             xmin = conf.low,
             xmax = conf.high)) +
  geom_vline(xintercept = 0) +
  geom_errorbarh() +
  geom_point() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 1)) +
  labs(x = "Estimates (95%CIs)",
       y = "",
       title = "Genes Associated with Early Metastasis in Small Node-Negative Breast Carcinoma",
       caption = "Data from DOI: 10.1002/gcc.20820")
```

<details><p><summary>Click here for hint</summary></p>
*Here, you will have to use your indicator variable to identify significant genes **before** plotting and then it would probably be prudent to take a look at the `fct_reorder()` function and `geom_errorbarh()` representation*
</details>

- **T18: Re-create this volcano plot to finally reveal the results of your analysis** <span style="color: red;">GROUP ASSIGNMENT part II</span>

```{r}
#| echo: false
#| eval: true

library("ggrepel")
gravier_estimates |>
  mutate(lbl = case_when(is_significant == "yes" ~ gene,
                         is_significant == "no" ~ "")) |> 
  ggplot(aes(x = estimate,
             y = -log10(p.value),
             colour = is_significant,
             label = lbl)) +
  geom_point(size = 1,
             alpha = 0.5) +
  geom_text_repel(size = 2,
                  max.overlaps = 20) +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 1),
        plot.subtitle = element_text(hjust = 1),
        legend.position = "none") +
  labs(x = "Estimates",
       y = "-log10(p)",
       title = "Genes Associated with Early Metastasis in Small Node-Negative Breast Carcinoma",
       subtitle = "Genes highlighted in turquoise were significant after multiple testing correction",
       caption = "Data from DOI: 10.1002/gcc.20820")
```

<details><p><summary>Click here for hint</summary></p>
*Here,**before** plotting you will have to create a new label-variable which takes value `gene` for significant genes and otherwise is simply an empty string, which we denote by `""`. Also, perhaps the `ggrepel` package would be relevant for somehow adding text/labels*
</details>

## <span style="color: red;">GROUP ASSIGNMENT</span>

For the group assignment this time, you will use **T17** and **T18** to again create a reproducible micro-report and make sure to:

- [Read the assignment instructions](https://r4bds.github.io/assignments.html)
- [Read and follow the code styling guidelines](https://r4bds.github.io/code_styling.html)

## Optional

*The following is only if you can find the time! But, perhaps this would be something interesting to revisit during the project period*

### PCA

- Go and visit this blog post by [Claus O. Wilke, Professor of Integrative Biology](https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/).
- Your task is to work through the blog post using the `gravier` dataset to create a PCA analysis

### K-means

- Go and visit this [K-means clustering with tidy data principles](https://www.tidymodels.org/learn/statistics/k-means/) post
- Your task is to work through the blog post using the `gravier` dataset to create a K-means analysis