-
Notifications
You must be signed in to change notification settings - Fork 4
/
Lecture09.Rmd
597 lines (453 loc) · 18.4 KB
/
Lecture09.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
---
title: "MBB 243 - Lecture 09"
author: "Ryan Morin"
date: 2022/03/23
output:
powerpoint_presentation: default
---
```{r setup, include=FALSE}
library(tidyverse)
library(ggbeeswarm)
library(ggthemes)
library(cowplot)
Sys.setenv(RETICULATE_PYTHON = "~/miniconda3/bin/python")
covid <- read_csv("data/covid19_cases_worldwide_monthly.csv")
require(reticulate)
#knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(
fig.width = 6,
fig.asp = 0.8,
out.width = "80%"
)
# if necessary, use this command to install required non-standard command-line tools:
# conda install -c bioconda samtools seqtk
```
# Learning goals for the week
- Basic plotting types and aesthetics with ggplot
- Renounce pie charts for life
- Some common techniques for formatting and manipulating data for making usful plots
- Putting ggplot to work and learning to tame some of it's quirks
# Data viz: a key part of EDA
- As we discussed earlier, exploratory data analysis reqires visualization of your variables
- ggplot2 is an integral part of tidyverse and plays nicely with tidy data so is an excellent tool for interactive data visualization
- Mastering data visualization in R means you can generate publication-quality figures and finally abandon Excel, at least for data presentation purposes
# ggplot2 procedure
- We build a plot in layers in a series of steps, first by calling `ggplot()` and providing some data (directly as a data frame or via a pipe)
- In the call to `ggplot()` or in subsequent `geom` calls, we map some aesthetics to the variables
- Each geometry `geom` call adds a plot layer to represent your data
- New layers will share the coordinate system
- Legends and axes are set up for you but you can provide a theme to the plot and/or adjust the layout with calls to `theme` functions
# Deconsructing a toy car example
```{r,eval=F,echo=T}
# tell ggplot we will be using the mpg data set that is bundled with this package
# this creates an empty graph that is useless without a plotting function
ggplot(data = mpg) +
# The + allows us to add on or more layers to our plot
geom_point(mapping = aes(x = displ, y = hwy))
# geom_point will create an x/y scatter plot
# the mapping argument is always coupled with aesthetics aes()
# and is used to set variables for the axes and additional mappings (e.g. colour)
```
# Data
- Fuel economy (mpg) of a large variety of consumer vehicles
```{r,eval=T,echo=T}
select(mpg,displ,hwy,class,manufacturer) %>% slice_sample(n = 6)
```
# Viz
```{r,eval=T,echo=F}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
# Tweaking the basic plot
- Map a colour to the points based on another variable
```{r,eval=T,echo=T,message=F,warning=F}
# note: we can drop mapping= as long as aes() is the first/only argument
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = manufacturer))
```
# Common mistake
- You can globally control plot features instead of mapping them to variables
- Only works in the expected way if you specify _outside_ `aes()`
```{r,eval=T,echo=T,message=F,warning=F}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy,
colour = "blue"))
```
# The correct approach
- You can globally control plot features instead of mapping them to variables
- Only works in the expected way if you specify _outside_ `aes()`
```{r,eval=T,echo=T,message=F,warning=F}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy),
colour = "blue")
```
# Tweaking the basic plot
- Map a colour to the points based on another variable
- Shape your points based on yet ANOTHER variable
```{r,eval=T,echo=T,message=F,warning=F}
# This code throws this warning at you:
# The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.
# Removed 62 rows containing missing values*
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, colour = manufacturer, shape = class)) +
theme(legend.position="bottom")
```
# Tweaking the basic plot
- Map a colour to the points based on another variable
- Make the grey background go away by changing the theme
```{r,eval=T,echo=T,message=F}
# other nice options to try out:
# theme_classic()
# theme_cowplot() #requires the cowplot library
ggplot(data = mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = manufacturer)) +
theme_minimal()
```
# Controlling how colours are mapped
- Map your colour to a continuous numeric value (rather than categorical)
- Automatic behaviour if your variable is numeric, which may not actually be what you want
- Controlling the colour scale will be covered in the lab
```{r,eval=T,echo=T}
ggplot(data = mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = cyl)) +
theme_cowplot()
```
# Overriding default colouring
- You can prevent a continuous colour scale from being used by converting the type of your variable
- Categorical variables should be factors
- Variable must have a sensible number of discrete values for this to work
```{r,eval=F,echo=T}
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, colour = factor(cyl))) +
theme_cowplot()
```
# Does this imply another plot type?
```{r,eval=T,echo=F}
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, colour = factor(cyl))) +
theme_cowplot()
```
# Notes about aesthetics
- Shape, size and colour and alpha (transparency) can all be associated with variables
- Colour/shape are more useful for categorical variables
- Size and colour gradiants are more useful for continuous variables
- Scale range for size/color gradients are automatically determined
- Getting too ambitious with mapping more than one at a time usually results in a mess
```{r,eval=F,echo=T}
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, colour = factor(cyl),size=cty,shape=drv,alpha=year))
```
# Too much?
```{r,eval=T,echo=F}
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, colour = factor(cyl),size=cty,shape=drv,alpha=year)) +
theme_cowplot()
```
# Fitting trend lines to your data
- In this example, we've shifted the aesthetic mapping to the call to ggplot, which causes them to be considered 'global' until they are changed
- `geom_smooth` is used here to fit a linear model to the data and add the linear representation for each group
- groups are implicitly assigned based on your colour assignment
```{r,echo=T,eval=T,message=F,warning=F}
ggplot(data = mpg,mapping = aes(x = displ, y = hwy, colour = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```
# Altering aesthetic mapping by layer
- Sometimes the aesthetics mapped in earlier layers cause a subsequent `geom` to misbehave
- Explicitly defining `aes()` with each geom is one option
- You can also use `inherit.aes=FALSE` if you only want to disable the global aesthetic in one layer
```{r,echo=T,eval=T,message=F,warning=F}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ,
y = hwy,
colour = factor(cyl))) +
geom_smooth(method = "lm", se = FALSE,
mapping = aes(x = displ,
y = hwy))
```
# Choosing the right plot type
- Scatter plots are a good place to start but may not convey the information in the way you want
- Coloured scatterplots may help you recognize if a categorical variable is associated with one or more continuous variables
- With this information, you can switch to plots that group data and show the distribution such as box plots or histograms*
# Box plot of car fuel economy
```{r,eval=T,echo=T}
# Note: we're converting cyl to a factor representation on-the-fly
ggplot(data = mpg) +
geom_boxplot(aes(x = factor(cyl),
y = hwy,
colour = factor(cyl)))
```
# More layer ridiculousness
- There are a few things wrong with this layering that highlight some common errors
- Thoughts on what causes the x scale to behave this way?
- How do we fix it?
```{r,eval=T,echo=T}
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = factor(cyl), y = hwy, colour = factor(cyl))) +
geom_point(mapping = aes(x = cyl, y = hwy, colour = factor(cyl))) +
theme_cowplot()
```
# Consistency is key
- X axis scale will be numeric if the data are numeric, which allows the order to be determined "naturally" but will create a gap where a X value is missing whereas factors are not treated as numeric for axes
- Since the aesthetics are shared, we can move them to the first `ggplot` call so they are inherited
```{r,eval=T,echo=T}
ggplot(data = mpg,
aes(x = cyl, y = hwy, colour = factor(cyl))) +
geom_boxplot() +
geom_point() +
theme_cowplot()
```
# Quasirandom instead of just points
- Overlaying points on a boxplot is more informative when you add some randomness to the x coordinate to spread them out
```{r,eval=T,echo=T}
ggplot(data = mpg,
aes(x = factor(cyl), y = hwy, colour = factor(cyl))) +
geom_boxplot(outlier.shape = NA) +
ggbeeswarm::geom_quasirandom() +
theme_cowplot()
```
# Tidying some biological data
- Here we'll combine the gene expression for two samples to make a tidy table
```{r,eval=T,echo=T,message=F,warning=F}
gexp1=read_csv(file="data/GSE125966_part1.csv") %>% rename("expression"=2) %>% mutate(sample_id="sample12")
gexp2=read_csv(file="data/GSE125966_part2.csv") %>% rename("expression"=2) %>% mutate(sample_id="sample23")
two_gexp = bind_rows(gexp1,gexp2)
head(two_gexp,6)
```
# Tweaking axes
- You will sometimes need to limit/adjust the range of data points used in your plot
- Axes can be truncated manually if desired
```{r, eval=T,echo=T,message=F,warning=F}
ggplot(two_gexp,aes(x=expression)) +
geom_histogram() + xlim(c(-4,15)) +
facet_wrap(~sample_id,nrow=2)
```
# Tweaking axes
- You will sometimes need to limit/adjust the range of data points used in your plot
- Axes can be truncated manually if desired
- Usually the better option is to filter your data up-front
```{r, eval=T,echo=T,message=F,warning=F}
dplyr::filter(two_gexp,expression > -4 ) %>%
ggplot(aes(x=expression)) +
geom_histogram(aes(colour=sample_id)) +
facet_wrap(~sample_id,nrow=2)
```
# Colour vs fill
- The `colour` aesthetic usually means the outline of a shape
- For shapes that are hollow, you must specify `fill`
```{r, eval=T,echo=T,message=F,warning=F}
dplyr::filter(two_gexp,expression > -4 ) %>%
ggplot(aes(x=expression)) +
geom_histogram(aes(fill=sample_id)) +
facet_wrap(~sample_id,nrow=2)
```
# Boxplot with many points
- Plotting thousands of data points is easy (even if not often useful in this form)
- Crowding can somewhat be mitigated by setting a low alpha to make points semi-transparent
```{r,echo=T,eval=T}
dplyr::filter(two_gexp,expression > -4) %>%
group_by(sample_id) %>%
slice_sample(n=2000) %>%
ggplot(aes(x=sample_id,y=expression)) +
geom_boxplot(outlier.shape = NA) +
geom_quasirandom(alpha=0.2)
```
# Popular alternative: violin plots
- I'm less of a fan but these are used a lot as an alternative to box plots
```{r,echo=T,eval=T}
dplyr::filter(two_gexp,expression > -4) %>%
mutate(logexp=log(expression)) %>%
group_by(sample_id) %>%
slice_sample(n=2000) %>%
ggplot(aes(x=sample_id,y=expression)) +
geom_violin(aes(fill=sample_id))
```
# When tidy is too tidy
- There are some scenarios in which you want to compare related variables between samples
- Requires mapping aesthetics to subsets of the same column
- Difficult to accomplish in a tidy format but can be dealt with with pivots or combinations of filter/join
```{r,echo=T,eval=T}
paired_exp =
pivot_wider(two_gexp,
id_cols = "gene",
names_from = "sample_id",
values_from="expression")
```
# Comparing two samples
```{r,echo=T,eval=T}
ggplot(paired_exp,aes(x=sample12,y=sample23)) +
geom_point(alpha=0.2) +
xlim(c(-4,15)) +
ylim(c(-4,15)) + theme_cowplot()
```
# Adding useful colouring
- We can derive a new column that will allow the points to be coloured based on the relationship between the two variables using mutate
```{r,echo=T,eval=T}
paired_exp = paired_exp %>%
mutate(diff=sample23-sample12)
# calculate the difference between the two values
ggplot(paired_exp,aes(x=sample12,y=sample23)) +
geom_point(aes(colour=diff)) +
scale_colour_continuous(type="viridis")
```
# What can we learn from this?
- Which genes are likely more highly expressed in sample23?
- Which genes are likely more highly expressed in sample12?
# Okay... what about normal charts?
- Bar plots?
- Pie charts?
- Line charts?
# No pie charts!
:::::::::::::: {.columns}
::: {.column}
- Pie charts are a terrible way to represent any type of data
- Human brains cannot convert wedges to area accurately, especially when differences are subtle
- [Bar charts can always replace pie charts and are always better](https://www.data-to-viz.com/caveat/pie.html)
- 3D exploding pie charts are nonsense
:::
::: {.column}
![](images/self_description.png)
:::
::::::::::::::
# Basic line chart with title
```{r,echo=T,eval=T,message=F,warning=F}
cd <- "data/covid19_cases_provinces_weekly.csv"
canada <- read_csv(cd)
# adding a title is as simple as layering a call to ggtitle()
ggplot(canada, aes(x = week,
y = cases,
colour = province)) +
geom_line() +
ggtitle("COVID19 cases in Canada by province")
```
# Adding points as a new layer
```{r,echo=T,eval=T,message=F,warning=F}
ggplot(canada, aes(x = week,
y = cases,
colour = province)) +
geom_line() +
geom_point()
```
# When are line charts (un)acceptable?
- Is your X axis a continuous numeric value?
- if no, then don't use a line chart
- Consider bar chart or boxplot instead.
- Do you want to show the actual change between adjacent points instead of the trend?
- if no, then consider a scatter plot with some smoothing
# Smoothing instead of a line plot
```{r,echo=T,eval=T,message=F,warning=F}
ggplot(canada, aes(x = week,
y = cases,
colour = province)) +
geom_point() +
geom_smooth(stat="smooth",span=0.3)
#change span to affect how much the line is affected by individual points
```
# Preparing a basic bar chart
- Let's make a bar chart that shows percentages (i.e. the heights sum to 100)
- This is the equivalent to a pie chart but representing the wedges as bars
- We will represent the COVID19 deaths per province so first we need per-province totals and a denominator
```{r,echo=T,eval=T,message=F,warning=F}
canada_deaths = canada %>%
group_by(province) %>%
summarize(province_deaths=sum(deaths)) %>%
ungroup() %>%
mutate(country_deaths=sum(province_deaths))
```
# Plotting the data
- In ggplot, `geom_bar` naturally wants to plot the total number of occurrences in each group, which is not usually what we actually want
- Instead, we can use `geom_col` (column chart?)
- Note what our X axis labels look like when the group names are longer
```{r,echo=T,eval=T,message=F,warning=F}
# note that we can use mutate-like syntax within `aes` to calculate a new value for plotting but notice how it affects your axis labels
ggplot(canada_deaths,
aes(x=province,
y=100*province_deaths/country_deaths)) +
geom_col()
```
# A prettier version
```{r,echo=T,eval=T,message=F,warning=F}
# note that we can use mutate-like syntax within `aes` to calculate
# a new value for plotting but notice how it affects your axis labels
ggplot(canada_deaths,aes(x=province,y=100*province_deaths/country_deaths,
colour=province,
fill=province)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
# What if we adjust for population?
```{r,echo=T,eval=T,message=F,warning=F}
canada_deaths_pop = canada %>%
group_by(province) %>%
mutate(province_deaths=sum(deaths)) %>%
slice_head() %>%
ungroup() %>%
mutate(death_per_million = 1000000 * province_deaths / population)
```
# A more fair representation
```{r,echo=T,eval=T,message=F,warning=F}
# note that we can use mutate-like syntax within `aes` to calculate
# a new value for plotting but notice how it affects your axis labels
ggplot(canada_deaths_pop,
aes(x=province,
y=death_per_million,
colour=province,fill=province)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position = 'none')
```
# Altering the X axis order
- Default order for axis is numeric or alphabetical (for character) ordering whereas factors are ordered based on the order of their "levels"
- Factor levels are set when you define a factor vector and default to numeric/alphabetic ordering
- If we combine `factor()` with `unique()` it will instead set the order to be based on the order of the first occurrence of each value in the vector
# Altering the X axis order
```{r,echo=T,eval=T}
my_char = c("C","Z","C","A","D","E","Z","C")
my_fac = factor(my_char)
my_fac
my_fac = factor(my_char,levels=unique(my_char))
my_fac
```
# Order X using the value of Y
```{r,echo=T,eval=T}
#use arrange to order data frame
canada_deaths_pop = arrange(canada_deaths_pop,death_per_million) %>%
mutate(
province_ordered=
factor(province,levels=unique(province)))
# ensure you change the y for all your aesthetics
ggplot(canada_deaths_pop,
aes(x=province_ordered,
y=death_per_million,
colour=province_ordered,
fill=province_ordered)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
# What if I like retro excel plots?
```{r,echo=T,eval=T}
#library(ggthemes)
filter(canada_deaths_pop,
death_per_million>10) %>%
ggplot(aes(x=province_ordered,
y=death_per_million,
fill=province_ordered)) +
geom_col() +
theme_excel(base_size = 12,
base_family = "",
horizontal = TRUE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_fill_excel()
```
# Recap
- Each geometry has its own nuances that you will need to get comfortable with
- Layering helps you build up complex plots, which makes each component easier to debug
- Themes and scales both give you control over visuals such as labels and visual appearance/colours
- The type of your variable affects how axes behave.
- Try to consider whether something is a categorical variable and use factors when the answer is "yes".