-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathlab06.qmd
639 lines (455 loc) · 25.7 KB
/
lab06.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
# Lab 6: Applying Functional Programming with Purrr to Models {.unnumbered}
## Package(s)
- [broom](https://broom.tidymodels.org)
- [purrr](https://purrr.tidyverse.org)
## Schedule
- 08.00 - 08.45: [Recap of Lab 5](https://raw.githack.com/r4bds/r4bds.github.io/main/recap_lab06.html)
- 08.45 - 09.00: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab06.html)
- 09.00 - 09.15: Break
- 09.00 - 12.00: [Exercises](#sec-exercises)
## Learning Materials
_Please prepare the following materials:_
- **Important:** Questionnaire (brief, 5-10 min): [Course Midway Evaluation](https://forms.gle/mTRgyYbPdANUHK4r9)
- Book (Note, this is intentionally 1.ed.): [R4DS: Chapter 22: Introduction](https://r4ds.had.co.nz/model-intro.html)
- Book (Note, this is intentionally 1.ed.): [Chapter 23: Model Basics](https://r4ds.had.co.nz/model-basics.html)
- Book (Note, this is intentionally 1.ed.): [Chapter 24: Model Building](https://r4ds.had.co.nz/model-building.html)
- Book (Note, this is intentionally 1.ed.): [Chapter 25: Many models](https://r4ds.had.co.nz/many-models.html)
- Video: [Broom: Converting Statistical Models to Tidy Data Frames](https://www.youtube.com/watch?v=7VGPUBWGv6g)
- Video: [Alex Hayes | Solving the model representation problem with broom | RStudio (2019)](https://www.youtube.com/watch?v=gLZyvY5_kWQ)
- Video: ["The Joy of Functional Programming (for Data Science)" with Hadley Wickham](https://www.youtube.com/watch?v=bzUmK0Y07ck)
- Optional: If you are completely new to statistical modelling, then [click here for a primer](primer_on_linear_models_in_r.qmd)
_Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book_
## Learning Objectives
*A student who has met the objectives of the session will be able to:*
- Fit a simple linear model and interpret model parameters
- Understand and apply simple purrr-functions for element-wise function application
- Understand and apply grouped supervised models to form nested model objects
- Understand and apply the broom-functions for tidying various model objects
- Optional LO: Perform a basic principal component analysis for dimension reduction of high dimensional data
- Optional LO: Perform a basic unsupervised k-means clustering of high dimensional data
## Exercises {#sec-exercises}
## Throwback...
Using the `tibble()` function, re-create the data from this visualisation and then create your version of how this data could be visualised in a more informative manner:
![](images/bad_viz_15.png){fig-align="center" width=90%}
Also... If you still need to be convinced of the flexibility of ggplot, try running this code:
```{r}
#| echo: true
#| eval: false
xy <- seq(from = -3,
to = 3,
by = 0.01)
expand.grid(x = xy,
y = xy) |>
ggplot(
mapping = aes(
x = (1 - x - sin(y^2)),
y = (1 + y - cos(x^2)))) +
geom_point(alpha = 0.05,
shape = 20,
size = 0) +
theme_void() +
coord_polar()
```
If you are curious about what is going on here, try googling "Generative art"... Anyhoo... Let us move on...
## Prologue
```{r}
#| echo: false
#| eval: true
#| message: false
library("tidyverse")
library("broom")
```
So far, we have worked on
- Lab 2: Genomics, the `gravier` data from @Gravier2010.
- Lab 3: Metagenomics, the `pitlatrine` data from @Torondel2016.
- Lab 4: Clinical data, the `diabetes` data from @Willems1997 and @Schorling1997.
- Lab 5: High throughput immunoinformatics data, the `SARS-CoV-2` data from @Nolan2020.
### Getting Started
_First, make sure to read and discuss the feedback you got from last week's assignment!_
Then, once again go to the [R for Bio Data Science RStudio Cloud Server](https://teaching.healthtech.dtu.dk/22100/rstudio.php)
Create a **new** Quarto document and remember to save it
Today, we will re-examine the data acquired in Lab 2 to further explore the research question:
- **What genes are significantly up- or down-regulated between the patients with/without early metastasis?"**
## Load Data
**Before we do this, go to the console and enter the command: `rm(list = ls())`**
_This will clear your environment from old object from previous labs_
Good now that we have a clean slate - Recall, so far we have adjusted the chunk setting to `#| eval: true` first time we get some data from an online repository and then subsequently set `#| eval: false`.
Let's do that in a bit nicer way.
- **T1: Create a new "Load Data" header in your document and add the below chunk:**
```{r}
#| eval: true
#| echo: true
raw_dir <- "data/_raw/"
data_file <- "gravier.RData"
data_loc <- "https://github.com/ramhiser/datamicroarray/raw/master/data/"
if( !dir.exists(raw_dir) ){
dir.create(path = raw_dir)
}
if( !file.exists(str_c(raw_dir, data_file)) ){
download.file(
url = str_c(data_loc, data_file),
destfile = str_c(raw_dir, data_file))
}
load(file = str_c(raw_dir, data_file))
```
- **Q1: In your group, discuss, what is going on here? Make sure you follow the difference between the first time you run this and the second!**
<details><p><summary>Click here for hint</summary></p>
We have used the `str_c()` function before, try running e.g.:
```{r}
#| echo: true
#| eval: false
x <- "a"
y <- "b"
str_c(x, y)
```
For the functions `dir.exists()` and `file.exists()`, the hint is in the title, they return logicals. To get a better understanding of this, try running e.g.:
```{r}
#| echo: true
#| eval: false
x <- 2
x == 2
if( x == 2 ){
print("Yes!")
}
```
Then change `x == 2` to `x != 2`, re-run the code and see what happens
</details>
## Clean Data
The next step is to clean up the data:
1. Use the `ls()` function to see what objects you have in your environment
1. Use the `str()` function on the gravier data you retrieved to answer:
- **Q2: Discuss in your group if this is tidy data?**
- **T2: Create a new "Clean Data" header in your document and add the below chunk:**
```{r}
#| echo: true
#| eval: true
gravier_clean <- gravier |>
bind_cols() |>
as_tibble()
```
- **Q3: Discuss in your group if this is tidy data?**
- **Q4: In fact, specifically state what are the "rules" for tidy data?**
- **Q5: In your group, discuss why `bind_cols` can by _very very_ dangerous to use?**
Now, moving on, let's write the clean data to disk:
- **T3: In your "Clean Data"-section, add a new chunk, where you write a tab-separated-values gzipped (i.e. compressed) file called "02_gravier_clean" (with the correct file type specification) into your "data"-folder**
<details><p><summary>Click here for hint</summary></p>
*Just as you can `read` a file, you can of course also `write` a file. Note the file type we want to write here is tab-separated-values. If you in the console type e.g. `readr::wr` and then hit the Tab key, you will see the different functions for writing different file types. We previously did a trick to automatically gzip (compress) files?*
</details>
## Augment Data
- **T4: Create a new "Augment Data" header in your document and add the below chunk:**
```{r}
gravier_clean_aug <- gravier_clean |>
mutate(y = case_when(y == "poor" ~ 1,
y == "good" ~ 0)) |>
relocate(early_metastasis = y)
```
- **Q6: In your group, discuss, what each step of the above work flow does, i.e. what are the specifics of the `dplyr` pipeline?**
- **T5: In your "Augment Data"-section, add a new chunk, where you write a tab-separated-values gzipped (i.e. compressed) file called "03_gravier_clean_aug" (with the correct file type specification) into your "data" folder**
## Analysis
### One Gene, one model
- **T6: Create a new "Analysis" header in your document**
Recall, in the second lab, we were looking at "our favourite gene". In the following either look back to what was your favourite gene or choose a new 🤷️
Let's fit our first model! If the concept of models and linear regression is unfamiliar, consider checking out the [primer on linear models in R](primer_on_linear_models_in_r.qmd) before proceeding.
- **T7: Use the `lm()` function to create your first model and save it to a new variable e.g. "my_first_model"**
<details><p><summary>Click here for hint</summary></p>
*Use the formula `my_favourite_gene ~ early_metastasis` and remember when you pipe into the `lm()` function, you have to specify `data = _`*
</details>
- **Q7: What are your coefficients? Mine are:**
```{r}
#| eval: true
#| echo: false
my_first_model <- gravier_clean_aug |>
lm(formula = g2E09 ~ early_metastasis,
data = _)
my_first_model |> pluck("coefficients")
```
- **T8: Use the `group_by()` $\rightarrow$ `summarise()` workflow to calculate the mean values of the gene expression for your favourite gene stratified on `early_metastasis`**
- **Q8: What are your mean values? Mine are:**
```{r}
#| eval: true
#| echo: false
gravier_clean_aug |>
group_by(early_metastasis) |>
summarise(mu = mean(g2E09))
```
- **Q9: Discuss in your group: How are your coefficients related to your mean expression values?**
<details><p><summary>Click here for hint</summary></p>
*Recall, we have two terms here, `intercept` and `slope`. The intercept is the `y` value at `x = 0` and the `slope` is the change in `y` value, for one unit change in `x`*
</details>
- **Q10: Discuss in your group: Is your gene up- or down-regulated from `early_metastasis = 0` to `early_metastasis = 1` and use the `summary()` function to see if is it statistically significant at a level of significance of $\alpha = 0.05$?**
<details><p><summary>Click here for hint</summary></p>
*Try to run `my_first_model |> summary()` and look at the estimate for `early_metastasis`, i.e. the slope and also see if you can find the `p-value` somewhere in this summary...*
</details>
Excellent! Now we have a feeling for working with the `lm()` functions and a basic understanding of the meaning of the coefficients, when we are using linear regression to model a binary outcome. The reason that we use a linear regression model in this case is, that for these exercises, we want to investigate the relationship between variables rather than obtaining probability predictions, i.e.
- **What genes are significantly up-/down-regulated between the patients with- and without early metastasis?**
*So, without further ado, let's dive in!*
### All the Genes, all the models
First, the recent couple of years have seen an immense development in unifying the modelling interface in `R`, which is notoriously inconsistent. You may be familiar with the `caret` package, the developer of which has created [tidymodels](https://www.tidymodels.org), (which I really wish we had time to explore in details). In the following we will work with some of the principles for tidying model object using `broom`, having object nested in tibbles and working with these using `purrr`.
Now, you saw above how we could fit *one* model for *one* gene. So, we could repeat the procedure you worked through for each gene, but first consider:
- **Q11: How many genes are there in the gravier data set?**
Now, we just have to make one model for each gene!
![](images/statistical_models_everywhere.png){fig-align="center" width=90%}
Honestly, **let's not!** Also, recall: *"We don't loop, we func!"*, so let's see how that would work.
### Models, models everywhere...
In principle, you could overflow your environment with individual model objects, but that would require **a lot** of code-lines and **a lot** of hard-coding. But let us instead see if we can come up with something just a tad more clever.
#### Preparing the Data
First:
- **Q12: Discuss in your group, if the `gravier_clean_aug` is a "wide" or a "long" dataset?**
Once you have agreed upon and understood why, this is a *wide* data set, proceed and:
- **T9: Create this long version of your `gravier_clean_aug` data and save it in `gravier_clean_aug_long`**
```{r}
#| eval: true
#| echo: false
gravier_clean_aug_long <- gravier_clean_aug |>
pivot_longer(cols = starts_with("g"),
names_to = "gene",
values_to = "log2_expr_level")
gravier_clean_aug_long
```
<details><p><summary>Click here for hint</summary></p>
*Recall which function took a dataset from "wide" to "long" format and also the three parameters of interest for that function is `cols`, `names_to` and `values_to`. If you look at my example, you will see that the columns we have pivoted have types `<chr>` and `<dbl>`, these will map to the `names_to` and `values_to` parameters respectively. The `cols` can be used with the so-called "helper-functions", that we have previously talked about. If you look at the gene-columns in the data, all the genes starts with a "g", so we can use the helper functions `starts_with()` and then give the argument "g" to the `match` parameter*
</details>
The reason we want the "long" version of the data set is, that now we have ALL the genes defined in ONE `gene` column. This means that we can use `group_by()` to work *per* gene without looping (*Recall: Don't loop, only func!*).
Now:
- **T10: Create a `dplyr` pipeline, use the `group_by()` function to group your `gravier_clean_aug_long` dataset by `gene` and then add the `nest()` and `ungroup()` functions to your pipeline**
```{r}
#| eval: true
#| echo: false
gravier_clean_aug_long_nested <- gravier_clean_aug_long |>
group_by(gene) |>
nest() |>
ungroup()
gravier_clean_aug_long_nested
```
Note, this is conceptually a **super-tricky data structure!**
- **Q13: Discuss in your group, what happened to the data?**
<details><p><summary>Click here for hint</summary></p>
*Try to run each of the following code chunks and examine what happens at each step:*
```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested
```
```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
filter(gene == "g2E09") # Replace "g2E09" with whatever was YOUR favourite gene!
```
```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
pull(data)
```
</details>
- **Q14: Moreover, discuss in your group, what does `<tibble [168 × 2]>` mean?**
Now, **if** you're experiencing, that the server seems slow, consider if you want to proceed now with ALL the genes or just a subset. If you just want a subset, then use `sample_n()` to randomly select e.g. 100 genes for further analysis. Remember you may want to use the `set.seed()` function to create a reproducible work flow even when sampling.
```{r}
#| eval: false
#| echo: false
set.seed(934485)
gravier_clean_aug_long_nested = gravier_clean_aug_long_nested |>
sample_n(100)
gravier_clean_aug_long_nested
```
#### Fitting Models
Now, recall our research question:
- **What genes are significantly up-/down-regulated between the patients with- and without early metastasis?**
To investigate this, we want to fit a linear model to each gene, i.e. as you did initially for your favourite gene, we want to do in a clever way *per* gene for ALL genes.
- **T11: Use the `group_by()` function to let `R` know, that we want to work *per* gene**
```{r}
#| echo: false
#| eval: true
gravier_clean_aug_long_nested |>
group_by(gene)
```
- **T12: Then using the `map()`-function, add a new line to your pipeline, where you add a new variable `model_object` to your `gravier_clean_aug_long_nested` dataset, which `R` will compute *per* gene**
<details><p><summary>Click here for hint</summary></p>
*To do this you will need the following syntax and then wrap that inside the relevant tidyverse-function for creating a new variable:*
```{r}
#| echo: true
#| eval: false
model_object = map(.x = data,
.f = ~lm(formula = log2_expr_level ~ early_metastasis,
data = .x))
```
</details>
```{r}
#| echo: false
#| eval: true
gravier_clean_aug_long_nested <- gravier_clean_aug_long_nested |>
group_by(gene) |>
mutate(model_object = map(.x = data,
.f = ~lm(formula = log2_expr_level ~ early_metastasis,
data = .x)))
```
Make sure to **understand** the `map()` function here, it is **completely central** to functional programming with `purrr`:
- We need the `group_by()` to define which variable holds the elements to each of which we want to map
- `model_object` is a **new** variable, we are creating, which will contain the result of our call to the `map()` function
- `.x` to what **existing** (nested) variable are we mapping?
- `.f` which function do we want to map to each element in the **existing** (nested) variable?
- Note that `log2_expr_level` and `early_metastasis` are variables "inside" the nested `data` variable
Note, once again, this is conceptually a _super-tricky data structure_, not only do we have a *per gene* nested tibble, but now we also have a *per gene* nested model object - So please do make sure to discuss in your group, what is going on here, e.g. try running this and discuss what you see:
```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
pull(model_object)
```
#### Tidying Models
Excellent! Now we have a *per gene* model. Let us use the `broom` package to extract some information from each of the models. First, to get a better understanding of what is going on when calling the `tidy()` function, try running this:
```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
# Here, you should replace "g2E09" with whatever was YOUR favourite gene!
filter(gene == "g2E09") |>
# Pull() on tibbles: This pulls out the model_object variable.
# Note! This is a list, because we nested!
pull(model_object) |>
# Pluck() on lists: From the list we got from the last step,
# we "pluck" the first element
pluck(1) |>
# The result of pluck, is a model object,
# upon which we can call the tidy function
tidy(conf.int = TRUE,
conf.level = 0.95)
```
Now, we want to apply this `tidy()` function *per `model_object`*:
- **T13: Scroll a bit back to where we created the `model_object` and see if you can translate that into mapping the `tidy()` function to the `model_object` variable, thereby creating a new variable `model_object_tidy` - This is tricky, so do make sure to discuss in your group how this can be done!**
<details><p><summary>Click here for hint</summary></p>
*Remember the parameters to the `tidy()` functions, which gives us the confidence intervals and defines corresponding level. Also, everything you need is in the previous layout of the `map()` function. Note, that the calls to the `pull()` and `pluck()` functions above, pertains to **extracting** from a tibble, which we do **not** want, on the contrary, we want **all** objects to be contained in our `gravier_clean_aug_long_nested` data*
</details>
```{r}
#| echo: false
#| eval: true
gravier_clean_aug_long_nested <- gravier_clean_aug_long_nested |>
mutate(model_object_tidy = map(.x = model_object,
.f = ~tidy(x = .x,
conf.int = TRUE,
conf.level = 0.95)))
gravier_clean_aug_long_nested
```
Note, once again, this is conceptually a _super-tricky data structure_, not only do we have a *per gene* nested tibble, but now we also have a *per gene* nested model object and now also a nested tibble of `tidy`ed objects - So please again do make sure to discuss in your group, what is going on here, e.g. try running this and discuss what you see:
```{r}
#| echo: true
#| eval: false
gravier_clean_aug_long_nested |>
filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
pull(model_object_tidy)
```
#### Wrangling
We're almost there - just a bit of wrangling to go!
Just as you saw that we could `nest()` on a variable (recall we did that for the `gene` variable), you can do the opposite and lo and behold, that is the `unnest()` function. Before we continue:
- **T14: Create a `dplyr` pipeline and save the result in a new variable called `gravier_estimates`: Use the `unnest()` function to unpack the `model_object_tidy`**
```{r}
#| echo: false
#| eval: true
gravier_estimates <- gravier_clean_aug_long_nested |>
unnest(model_object_tidy)
gravier_estimates
```
- **T15: The again, create a `dplyr` pipeline and save the result in a the same `gravier_estimates` variable: Subset the rows to only get the slope term and then choose variables as displayed below, finally end with un-grouping your data, as we no longer need the groups**
```{r}
#| echo: false
#| eval: true
gravier_estimates <- gravier_estimates |>
filter(term == "early_metastasis") |>
select(gene, p.value, estimate, conf.low, conf.high) |>
ungroup()
gravier_estimates
```
- **T16: To your `gravier_estimates` dataset, add a variable `q.value`, which is the result of calling the `p.adjust()` function on your `p.value` variable and also add an indicator variable denoting if a given gene is significant or not**
```{r}
#| echo: false
#| eval: true
gravier_estimates <- gravier_estimates |>
mutate(q.value = p.adjust(p.value),
is_significant = case_when(q.value <= 0.05 ~ "yes",
q.value > 0.05 ~ "no"))
gravier_estimates
```
*But... `q.value`??? What are you on about??? For a nice primer on how multiple testing correction works, please read this article by @Noble2009!*
#### Recap
If you (understandably by now) have lost a bit of overview of what is going on, let's just re-iterate.
1. We have the `gravier` dataset, with the log2-expression levels for 2,905 genes of 168 patients of whom 111 did *not* have early metastasis and 57 who did
1. We are interested in investigating what genes are significantly up-/down-regulated between the patients with- and without early metastasis
1. First, we retrieved the data from the data repository, cleaned and augmented it and saved it to disk
1. Then pivoted the data, so we could work *per gene* (The `gene` variable)
1. Next, we grouped *per gene* and nested the data (The `data` variable)
1. Then, we fitted a linear model to each gene (The `model_object` variable)
1. Next, we used the `broom` package to tidy the fitted model incl. getting confidence intervals (The `model_object_tidy` variable)
1. Lastly, we extracted the model parameters, corrected for multiple testing and added and indicator for significant findings
Now, we actually have everything we need to answer:
- **What genes are significantly up- or down-regulated between the patients with/without early metastasis?**
In the following, we will use a level of significance of $\alpha = 0.05$ to provide this answer.
#### Visualise
Right, let's get to it!
- **T17: Re-create this forest plot to finally reveal the results of your analysis** <span style="color: red;">GROUP ASSIGNMENT</span> (Important, see: [how to](assignments.qmd))
```{r}
#| echo: false
#| eval: true
gravier_estimates |>
filter(is_significant == "yes") |>
ggplot(aes(x = estimate,
y = fct_reorder(gene, estimate),
xmin = conf.low,
xmax = conf.high)) +
geom_vline(xintercept = 0) +
geom_errorbarh() +
geom_point() +
theme_minimal() +
theme(plot.title = element_text(hjust = 1)) +
labs(x = "Estimates (95%CIs)",
y = "",
title = "Genes Associated with Early Metastasis in Small Node-Negative Breast Carcinoma",
caption = "Data from DOI: 10.1002/gcc.20820")
```
<details><p><summary>Click here for hint</summary></p>
*Here, you will have to use your indicator variable to identify significant genes **before** plotting and then it would probably be prudent to take a look at the `fct_reorder()` function and `geom_errorbarh()` representation*
</details>
- **T18: Re-create this volcano plot to finally reveal the results of your analysis** <span style="color: red;">GROUP ASSIGNMENT part II</span>
```{r}
#| echo: false
#| eval: true
library("ggrepel")
gravier_estimates |>
mutate(lbl = case_when(is_significant == "yes" ~ gene,
is_significant == "no" ~ "")) |>
ggplot(aes(x = estimate,
y = -log10(p.value),
colour = is_significant,
label = lbl)) +
geom_point(size = 1,
alpha = 0.5) +
geom_text_repel(size = 2,
max.overlaps = 20) +
geom_hline(yintercept = 0) +
theme_minimal() +
theme(plot.title = element_text(hjust = 1),
plot.subtitle = element_text(hjust = 1),
legend.position = "none") +
labs(x = "Estimates",
y = "-log10(p)",
title = "Genes Associated with Early Metastasis in Small Node-Negative Breast Carcinoma",
subtitle = "Genes highlighted in turquoise were significant after multiple testing correction",
caption = "Data from DOI: 10.1002/gcc.20820")
```
<details><p><summary>Click here for hint</summary></p>
*Here,**before** plotting you will have to create a new label-variable which takes value `gene` for significant genes and otherwise is simply an empty string, which we denote by `""`. Also, perhaps the `ggrepel` package would be relevant for somehow adding text/labels*
</details>
## <span style="color: red;">GROUP ASSIGNMENT</span>
For the group assignment this time, you will use **T17** and **T18** to again create a reproducible micro-report and make sure to:
- [Read the assignment instructions](https://r4bds.github.io/assignments.html)
- [Read and follow the code styling guidelines](https://r4bds.github.io/code_styling.html)
## Optional
*The following is only if you can find the time! But, perhaps this would be something interesting to revisit during the project period*
### PCA
- Go and visit this blog post by [Claus O. Wilke, Professor of Integrative Biology](https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/).
- Your task is to work through the blog post using the `gravier` dataset to create a PCA analysis
### K-means
- Go and visit this [K-means clustering with tidy data principles](https://www.tidymodels.org/learn/statistics/k-means/) post
- Your task is to work through the blog post using the `gravier` dataset to create a K-means analysis