---
title: "Lecture Lab 1"
author: "Leon Eyrich Jessen"
format:
  revealjs:
    embed-resources: true
    theme: moon
    slide-number: c/t
    width: 1600
    height: 900
    mainfont: avenir
    logo: images/r4bds_logo_small.png
    footer: "R for Bio Data Science"
---
# Course Introduction
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## DATADATADATA: Data Hoarding
![](images/data_hoarding.png){fig-align="center" width=80%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Increasing the Value of Data Requires Activation!
![](images/increasing_the_value_of_data.png){fig-align="center" width=80%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Data Driven Decision Making
_Because we've always done it this way!_
:::: {.columns}
::: {.column width="40%"}
Your job as a Bio Data Scientist:
- Activate data
- Extract insights
- Communicate to non-data stakeholders
- Facilitate data driven decision making
:::
::: {.column width="60%"}
![](images/data_driven_decision_making.jpg){fig-align="center" width=80%}
:::
::::
_Leveraging data-driven decision making allows the company to gain a competitive edge, and this is where your Bio Data Science skills are indispensable!_
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Your value as a Bio Data Scientist / Bioinformatician
:::: {.columns}
::: {.column width="50%"}
- In your career, your task will be to create value
- This is regardless of whether you plan to work in industry or pursue a research career
- What you do has to create value
- Creating value requires skills
- Skills need to be learned
- So, why are you here?
:::
::: {.column width="50%"}
![](images/question_mark.png){fig-align="center" width=100%}
:::
::::
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
# You're here to gain skills, which will allow you to generate value!
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## R for Bio Data Science - What is it?
![](images/bio_data_science.png){fig-align="center" width=80%}
- In essence: The art of converting numbers to value
- Ingest data
- Transform, wrangle, visualise, model
- Output insights
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## R for Bio Data Science - Intrinsically interdisciplinary
![](images/r_for_bio_data_science_hex_logo_quadratic.png){fig-align="center" width=100%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Why "Bio" in R for Bio Data Science?
![](images/domain_knowledge.png){fig-align="center" width=80%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## What is the motivation for this Course?
![](images/data_tutorial_vs_reality.jpg){fig-align="center" width=80%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## What is the motivation for this Course?
![](images/code_now_only_god.png){fig-align="center" width=70%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## What will you learn?
- The craft of going from raw extracted data to insights
- Advanced data visualisation
- Collaborative project oriented coding
- All with an emphasis on reproducibility and communication
![](images/data_science_cycle.png){fig-align="center" width=80%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
# R
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Introducing R: A Journey into Bio Data Science
:::: {.columns}
::: {.column width="40%"}
- Open-source programming language
- Essential tool for statistics & data visualization
- Widely used in bioinformatics and data science
- Dynamic community & vast library of packages
:::
::: {.column width="60%"}
![](images/R_logo.png){fig-align="center" width=80%}
:::
::::
_"To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call." – John Chambers (creator of the S language, of which R is an implementation)._
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## The Roots and Rise of R
:::: {.columns}
::: {.column width="40%"}
- Originated from the 'S' language at Bell Laboratories in the 1970s
- S was proprietary, so R is essentially an open-source implementation of S, officially released in 1995
- This is similar to Linux vs. Unix
- A leader in statistical computing. Powers many academic research & industry projects
- E.g. Crucial in genomics, where R aids in decoding biological data
- R comes with a large, well-proven set of built-in tools for data analysis
:::
::: {.column width="60%"}
![](images/R_logo.png){fig-align="center" width=80%}
:::
::::
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## A Few Examples of Functional Programming
:::: {.columns}
::: {.column width="50%"}
You can approach `R` as
- an object-oriented programming language
:::
::: {.column width="50%"}
Let's say we have this vector
```{r}
#| echo: true
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)
```
Now, we want to compute the mean, we can do:
Object Oriented Approach:
```{r}
#| echo: true
Vector <- R6::R6Class("Vector",
  public = list(
    data = NULL,
    initialize = function(data) {
      if (!is.numeric(data)) {
        stop("Data should be numeric.")
      }
      self$data <- data
    },
    mean = function() {
      return(sum(self$data) / length(self$data))
    }
  )
)
numbers <- Vector$new(my_vector)
print(numbers$mean())
```
:::
::::
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## A Few Examples of Functional Programming
:::: {.columns}
::: {.column width="50%"}
You can approach `R` as
- an object-oriented programming language
- an imperative programming language
:::
::: {.column width="50%"}
Let's say we have this vector
```{r}
#| echo: true
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)
```
Now, we want to compute the mean, we can do:
Imperative Approach:
```{r}
#| echo: true
my_sum <- 0
for (i in seq_along(my_vector)) {
  my_sum <- my_sum + my_vector[i]
}
my_mean <- my_sum / length(my_vector)
print(my_mean)
```
:::
::::
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## A Few Examples of Functional Programming
:::: {.columns}
::: {.column width="50%"}
You can approach `R` as
- an object-oriented programming language
- an imperative programming language
- a functional programming language

The three code examples all perform the same task, but which do you think is:

- simpler to read and understand?
- faster to execute?

In this course we will work with `R` in its native form - a fully fledged functional programming language
:::
::: {.column width="50%"}
Let's say we have this vector
```{r}
#| echo: true
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)
```
Now, we want to compute the mean, we can do:
Functional approach:
```{r}
#| echo: true
my_mean <- mean(my_vector)
print(my_mean)
```
:::
::::
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## You simply call functions on objects
Standard Deviation
```{r}
#| echo: true
sd(my_vector)
```
Median
```{r}
#| echo: true
median(my_vector)
```
Permute
```{r}
#| echo: true
sample(my_vector)
```
Bootstrap
```{r}
#| echo: true
sample(my_vector, replace = TRUE)
```
...and tons more!
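
Since functions are themselves objects, they can also be stored in a list and passed to other functions. The following is only a minimal sketch of that idea in base `R`; the names `summary_funs` and `my_summaries` are made up for illustration, and `my_vector` is repeated so the chunk runs on its own.

```{r}
#| echo: true
# Same vector as on the previous slides, repeated to keep the chunk self-contained
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

# A named list of functions (illustrative name)
summary_funs <- list(mean = mean, median = median, sd = sd)

# Call each function in the list on the same object
my_summaries <- sapply(summary_funs, function(f) f(my_vector))
print(my_summaries)
```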
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## "R is not a real programming language": Debunking Myths I
1. `R` is Turing-complete:
    - `R` can, in principle, solve any computable problem. This foundational property is shared with e.g. Python, C++ and Java.
2. `R` is fully capable in production:
    - E.g. Shiny apps are used in industry, and `R` comes with an ecosystem supporting reproducibility in production settings.
3. Comprehensive Ecosystem:
    - CRAN contains ~20,000 packages, and Bioconductor is a gold standard for bioinformatics software.
4. Interoperability:
    - Seamless integration with other languages (C, C++, Fortran, and Python) using packages like Rcpp and reticulate.
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## "R is not a real programming language": Debunking Myths II
5. Advanced Programming Features:
    - Supports object-oriented, functional, and imperative programming paradigms. Flexible metaprogramming with capabilities like non-standard evaluation.
6. R's Active & Growing Community:
    - Annual global R conferences, numerous local user groups, and also the `tidyverse`.
7. Performance:
    - `R` is interpreted and can be slower than compiled languages, but packages like data.table and Rcpp offer dramatic performance enhancements. Also, parallel computing is straightforward.
8. Not Just for Statisticians:
    - R's applications range from web development to machine learning (tidymodels, caret, mlr3) to reporting (Quarto, bookdown).

_Closing Thought: Every tool has its strengths. The key is to understand and leverage them effectively._
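
To give a feel for what "non-standard evaluation" means, here is a small base `R` sketch, not a full treatment. The helper name `show_expression` and the object `my_call` are hypothetical and only serve the illustration; `substitute()`, `deparse()`, `quote()` and `eval()` are base `R` functions.

```{r}
#| echo: true
# Hypothetical helper: capture the code the caller typed, without evaluating it
show_expression <- function(x) {
  deparse(substitute(x))
}
show_expression(a + b)

# Code can also be stored as data and evaluated later
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)
my_call <- quote(mean(my_vector))
eval(my_call)
```

Note that `a` and `b` never need to exist: the expression is captured, not computed.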
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
# Tidyverse
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Tidyverse
:::: {.columns}
::: {.column width="50%"}
- With SO many packages, there will inevitably be SO many opinions
- The tidyverse is a unified opinionated collection of R packages designed for data science
- All packages share an underlying design philosophy, grammar, and data structures
- Today R has in essence become two dialects: `base` and `tidyverse`
- Note: This course will focus solely on the `tidyverse` dialect
:::
::: {.column width="50%"}
![](images/tidyverse_packages_hex_logos.png){fig-align="center" width=80%}
:::
::::
_We'll spend a lot more time on going over the details of the Tidyverse!_
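
To make the two dialects a bit more concrete, here is a minimal sketch computing a per-group mean once in `base` and once with `dplyr`, a core tidyverse package. The data in `my_data` are made up for illustration, and this is just one of several ways to write either version.

```{r}
#| echo: true
#| message: false
library("dplyr")

# Made-up example data
my_data <- tibble(
  group = rep(c("a", "b"), each = 3),
  value = c(49, 31, 24, 35, 71, 7)
)

# base dialect
base_result <- aggregate(value ~ group, data = my_data, FUN = mean)

# tidyverse dialect
tidy_result <- my_data |>
  group_by(group) |>
  summarise(value = mean(value))

print(base_result)
print(tidy_result)
```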
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Intermezzo: A brief course History
:::: {.columns}
::: {.column width="40%"}
- From ~20 to ~150 students
- This year, the materials have been revised to suit large-classroom teaching
- The teaching team will do our best to support your learning, but it is important to emphasise that you will have to take responsibility for following the course curriculum!
:::
::: {.column width="60%"}
```{r}
#| echo: false
#| message: false
#| warning: false
#| fig-align: center
#| fig-width: 10
#| fig-height: 8
library("tidyverse")
library("broom")
d <- tibble(
  Year = c(2018, 2020, 2021, 2022, 2023, 2024),
  Students = c(8, 38, 89, 91, 144, 130)
)
d |>
  ggplot(aes(x = Year,
             y = Students)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red", linetype = "dashed") +
  scale_x_continuous(breaks = 2018:2024) +
  theme_classic(base_size = 20) +
  theme(panel.grid.major = element_line())
```
:::
::::
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## General Course Outline
Tuesdays 08.00 - 12.00
- 08.00 - 08.30 Recap of key points from last week's exercises
- 08.30 - 08.45 Introduction to theme of the day
- 08.45 - 09.00 Break
- 09.00 - 12.00 Exercises
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## About
### Course Description
- Basically, what you can expect to learn and what I expect you to learn: [DTU Course Base](https://kurser.dtu.dk/course/22100)
### Course Resources
- Text Book: ["R for Data Science 2e"](https://r4ds.hadley.nz) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund
- Course site: [https://r4bds.github.io](https://r4bds.github.io)
### Course format
- Active Learning: _Very strong emphasis on students working in groups_, rather than me talking
- The focus is on you working actively, _not_ me talking
- I will _not_ go through all preparation materials in class
- Proper preparation is a prerequisite for completing lab exercises and maximising course yield
- I focus on supporting your independence, hence for some exercises you will have to seek out information (_I'm not a good data scientist, I'm just slightly better at googling than others_)
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Exercise feedback
### Weekly
- An exercise question will be highlighted
- Each group is responsible for crafting an answer to this highlighted question
- These answers will be hand-ins
- The following week, we will choose a random answer to be discussed in plenum
- Note: This starts from lab 2
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Group Formation
- Modern Bio Data Science is a team sport!
- You have to form a group of 4-5 students with a Shared Bio Data Science / Bioinformatics Area of Interest
- You will work in these groups throughout the course
- You will do the final project in these groups
- You will attend the exam in these groups
- Group work is a _very important_ meta skill for an engineer!
- Please fill in groups, see schedule for lab 1
- If you do not have a group, fill in your id and interest at an available group and someone might join you
- I aim to let you decide on the groups, but I will of course be happy to help if needed
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## How to succeed in this course
- Prepare materials as instructed!
- Show up for class!
- Do the exercises!
- Do the project work!
_Basically, show up, follow the curriculum and you will do fine!_
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
# Questions?