---
layout: topic
title: Data visualisation with ggplot2
subtitle: Visualising data in R with the ggplot2 package
minutes: 60
---
<!---
show hide magic
<style> div.hidecode + pre {display: none} div.hidecode {color: #337ab7}</style><script> doclick=function(e){ e.nextSibling.nextSibling.style.display = e.nextSibling.nextSibling.style.display === "block" ? "none" : "block"; }</script>
-->
```{r}
knitr::opts_chunk$set(fig.keep='last')
```
```{r setup, echo=FALSE, purl=FALSE}
source("setup.R")
```
Authors: **Mateusz Kuzak**, **Diana Marek**, **Hedi Peterson**
#### Disclaimer
We will be using functions from the ggplot2 package here. Base R has basic
plotting capabilities, but ggplot2 adds more powerful ones.
> ### Learning Objectives
>
> - Visualise some of the
>[mammals data](http://figshare.com/articles/Portal_Project_Teaching_Database/1314459)
>from Figshare [surveys.csv](http://files.figshare.com/1919744/surveys.csv)
> - Understand how to plot these data using the R ggplot2 package. For more details
>on using ggplot2 see the
>[official documentation](http://docs.ggplot2.org/current/).
> - Build complex plots step by step with the ggplot2 package

Load the required packages:
```{r}
# plotting package
library(ggplot2)
# piping / chaining
library(magrittr)
# modern dataframe manipulations
library(dplyr)
```
Load the data directly from Figshare:
```{r}
surveys_raw <- read.csv("http://files.figshare.com/1919744/surveys.csv")
```
The `surveys.csv` data contain measurements of the animals caught in the study plots.
## Data cleaning and preparing for plotting
Let's look at the summary:
```{r}
summary(surveys_raw)
```
There are a few things we need to clean in the dataset.
Some records have no `species_id`. Let's remove those.
```{r}
surveys <- surveys_raw %>%
  filter(species_id != "")
```
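One detail worth knowing: `filter()` keeps only the rows where the condition is `TRUE`, so rows whose `species_id` is `NA` are dropped together with the empty strings. A minimal sketch on a made-up data frame (`toy_records` and its values are invented for illustration and are not taken from `surveys.csv`):

```{r}
library(dplyr)

# hypothetical toy data: one empty and one missing species_id
toy_records <- data.frame(
  record_id  = 1:4,
  species_id = c("DM", "", NA, "PF"),
  stringsAsFactors = FALSE
)

# NA != "" evaluates to NA, and filter() drops rows where the condition is NA
toy_clean <- toy_records %>%
  filter(species_id != "")

# record ids of the rows that survived the filter
toy_clean$record_id
```

Only the rows with a real species code remain, so no separate `!is.na(species_id)` step is needed here.
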
Many species have low counts. Let's remove the species with fewer than 10 records.
```{r}
# count records per species
species_counts <- surveys %>%
  group_by(species_id) %>%
  summarise(n=n())

# get names of those frequent species
frequent_species <- species_counts %>%
  filter(n >= 10) %>%
  select(species_id)

# keep only records of the frequent species
surveys <- surveys %>%
  filter(species_id %in% frequent_species$species_id)
```
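The same selection can be written more compactly with a grouped `filter()`. This is a sketch of an alternative, not the approach used in this lesson; `toy_catches` is made-up data and the threshold of 2 is chosen only to keep the example small:

```{r}
library(dplyr)

# hypothetical toy data: three DM records, one PF record, two OT records
toy_catches <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "OT", "OT"),
  weight     = c(40, 42, 44, 8, 24, 26),
  stringsAsFactors = FALSE
)

# n() is the number of rows in the current group,
# so this keeps only species with at least 2 records
toy_frequent <- toy_catches %>%
  group_by(species_id) %>%
  filter(n() >= 2) %>%
  ungroup()

unique(toy_frequent$species_id)
```

The explicit `ungroup()` at the end returns a plain, ungrouped table, so later steps are not affected by the grouping.
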
The summary showed `NA`s in `weight` and `hindfoot_length`. Let's remove the
rows with missing weights.
```{r}
surveys_weight_present <- surveys %>%
  filter(!is.na(weight))
```
> ### Challenge
>
> - Do the same to remove rows without `hindfoot_length`. Save the results in a new dataframe.
```{r}
surveys_length_present <- surveys %>%
  filter(!is.na(hindfoot_length))
```
> - How would you get a dataframe with no missing `weight` or `hindfoot_length` values?
```{r}
surveys_complete <- surveys_weight_present %>%
  filter(!is.na(hindfoot_length))
```
> We can chain the filters together using the pipe operator (`%>%`) introduced earlier.
```{r}
surveys_complete <- surveys %>%
  filter(!is.na(weight)) %>%
  filter(!is.na(hindfoot_length))
```
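`filter()` also accepts several conditions separated by commas. They are combined with a logical AND, so two chained calls can be merged into one. A small sketch with made-up data (`toy_measurements` is hypothetical):

```{r}
library(dplyr)

# hypothetical toy data: only the first row is complete
toy_measurements <- data.frame(
  weight          = c(40, NA, 35, NA),
  hindfoot_length = c(32, 30, NA, NA)
)

# two chained filters
toy_chained <- toy_measurements %>%
  filter(!is.na(weight)) %>%
  filter(!is.na(hindfoot_length))

# one filter with two conditions
toy_combined <- toy_measurements %>%
  filter(!is.na(weight), !is.na(hindfoot_length))

# both approaches keep the same rows
all.equal(toy_chained, toy_combined)
```

Which form to use is mostly a matter of taste; the chained version makes it easy to comment out one condition at a time.
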
> Make a simple scatter plot of `hindfoot_length` (in millimeters) as a function of
> `weight` (in grams), using base R plotting capabilities.
```{r}
plot(x=surveys_complete$weight, y=surveys_complete$hindfoot_length)
```
## Plotting with ggplot2
We will make the same plot using the `ggplot2` package.
`ggplot2` is a plotting package that makes it simple to create complex plots
from data in a dataframe. Its default settings help create
publication-quality plots with a minimal amount of tweaking.
With ggplot, graphics are built step by step by adding new elements.
To build a ggplot we need to:
- bind the plot to a specific data frame
```{r, eval=FALSE}
ggplot(surveys_complete)
```
- define aesthetics (`aes`), which map variables in the data to axes on the plot
or to plotting size, shape, color, etc.
```{r}
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length))
```
- add `geoms` -- graphical representations of the data in the plot (points,
lines, bars). To add a geom to the plot use the `+` operator:
```{r}
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point()
```
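Because layers are added with `+`, a plot can be stored in a variable and extended later. The sketch below follows the three steps above on a small made-up data frame (`toy_animals` is hypothetical), so it runs without downloading anything:

```{r}
library(ggplot2)

# hypothetical toy data
toy_animals <- data.frame(
  weight          = c(40, 45, 8, 9, 120),
  hindfoot_length = c(35, 36, 16, 15, 32)
)

# steps 1 and 2: bind the data and define the aesthetics
toy_base <- ggplot(toy_animals, aes(x = weight, y = hindfoot_length))

# step 3: add a geom; toy_base itself is left unchanged
toy_scatter <- toy_base + geom_point()

# number of layers stored in each plot object
length(toy_base$layers)
length(toy_scatter$layers)
```

Typing `toy_scatter` at the console draws the plot. Keeping a base object like `toy_base` around is handy when several variants of the same plot are needed.
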
## Modifying plots
- adding transparency (alpha)
```{r}
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha=0.1)
```
- adding colors
```{r}
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha=0.1, color="blue")
```
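Note the difference between setting a colour and mapping one. `color = "blue"` outside `aes()` paints every point the same colour, while `color` inside `aes()` picks one colour per value of a variable and adds a legend. A sketch with made-up data (`toy_species` is hypothetical):

```{r}
library(ggplot2)

# hypothetical toy data with two species
toy_species <- data.frame(
  weight          = c(40, 45, 8, 9),
  hindfoot_length = c(35, 36, 16, 15),
  species_id      = c("DM", "DM", "PF", "PF")
)

# constant colour: set outside aes()
toy_constant <- ggplot(toy_species, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.5, color = "blue")

# colour mapped to a variable: set inside aes()
toy_mapped <- ggplot(toy_species, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = species_id), alpha = 0.5)

# ggplot_build() shows the colour that each point will be drawn with
unique(ggplot_build(toy_constant)$data[[1]]$colour)
unique(ggplot_build(toy_mapped)$data[[1]]$colour)
```

Typing `toy_mapped` at the console draws the second plot, with one colour per species.
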
An example of a more complex visualisation: the plot area is divided into hexagonal
sections and the points are counted within each hexagon. The number of points per hexagon
is encoded by color.
```{r}
# stat_binhex() needs the hexbin package to be installed
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  stat_binhex(bins=50) +
  scale_fill_gradientn(trans="log10", colours = heat.colors(10, alpha=0.5))
```
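To see what the binning does, here is a rough sketch of the same idea with square instead of hexagonal bins, written in base R. The data are simulated and the bin edges are arbitrary choices for illustration; this is not how `stat_binhex()` is implemented:

```{r}
# simulated toy coordinates, 200 points in a 10 x 10 square
set.seed(1)
toy_x <- runif(200, min = 0, max = 10)
toy_y <- runif(200, min = 0, max = 10)

# cut each axis into 5 equal-width intervals
x_bin <- cut(toy_x, breaks = seq(0, 10, by = 2), include.lowest = TRUE)
y_bin <- cut(toy_y, breaks = seq(0, 10, by = 2), include.lowest = TRUE)

# count the points that fall into each of the 5 x 5 cells
bin_counts <- table(x_bin, y_bin)
bin_counts
```

Each cell of `bin_counts` corresponds to one coloured tile in a binned plot. ggplot2 offers this rectangular variant directly as `geom_bin2d()`.
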
## Boxplot
Visualising the distribution of weight within each species.
```{r}
ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
  geom_boxplot()
```
By adding points to the boxplot, we can see the individual measurements and how
many there are for each species.
```{r}
ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
  geom_jitter(alpha=0.3, color="tomato") +
  geom_boxplot(alpha=0)
```
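In `geom_boxplot()` the box spans the first to the third quartile and the middle line marks the median. These numbers can be computed directly to check a plot. A sketch with made-up weights (`toy_weights` is hypothetical):

```{r}
library(dplyr)

# hypothetical toy data: five DM weights and three PF weights
toy_weights <- data.frame(
  species_id = c("DM", "DM", "DM", "DM", "DM", "PF", "PF", "PF"),
  weight     = c(38, 40, 42, 44, 46, 7, 8, 9)
)

# quartiles and median per species, i.e. the edges and middle line of each box
toy_box_stats <- toy_weights %>%
  group_by(species_id) %>%
  summarise(
    lower_quartile = quantile(weight, 0.25),
    median         = median(weight),
    upper_quartile = quantile(weight, 0.75)
  )

toy_box_stats
```

Comparing such a table with the plot is a quick sanity check that the right column ended up on the y axis.
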
> ### Challenge
>
> Create a boxplot for `hindfoot_length`.
## Plotting time series data
Let's calculate the number of records per year for each species. To do that we need
to group the data first and count the records within each group.
```{r}
yearly_counts <- surveys %>%
  group_by(year, species_id) %>%
  summarise(count=n())
```
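dplyr also provides `count()`, a shorthand that groups and counts in one step. A sketch on made-up data (`toy_captures` is hypothetical); the `name` argument sets the name of the count column:

```{r}
library(dplyr)

# hypothetical toy data
toy_captures <- data.frame(
  year       = c(1990, 1990, 1990, 1991, 1991),
  species_id = c("DM", "DM", "PF", "DM", "PF")
)

# long form: group, then count the rows in each group
toy_long <- toy_captures %>%
  group_by(year, species_id) %>%
  summarise(count = n())

# shorthand with count()
toy_short <- toy_captures %>%
  count(year, species_id, name = "count")

toy_short
```

Both tables hold the same counts. One difference: the long form returns a table that is still grouped by `year`, while the result of `count()` here is not grouped.
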
Time series data can be visualised as a line plot with years on the x axis and counts
on the y axis.
```{r}
ggplot(yearly_counts, aes(x=year, y=count)) +
  geom_line()
```
Unfortunately this does not work, because we plot the data for all the species
together. We need to tell ggplot to split the graphed data by `species_id`:
```{r}
ggplot(yearly_counts, aes(x=year, y=count, group=species_id)) +
  geom_line()
```
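`geom_line()` connects the observations of each group in order of `x`. Without a `group` aesthetic all rows belong to one group, so records of different species in the same year are joined into a single zigzag line. A sketch with made-up counts (`toy_counts` is hypothetical) that inspects the groups ggplot2 computes:

```{r}
library(ggplot2)

# hypothetical toy data: two species, three years each
toy_counts <- data.frame(
  year       = c(1990, 1991, 1992, 1990, 1991, 1992),
  species_id = c("DM", "DM", "DM", "PF", "PF", "PF"),
  count      = c(10, 12, 15, 3, 2, 4)
)

toy_ungrouped <- ggplot(toy_counts, aes(x = year, y = count)) +
  geom_line()
toy_grouped <- ggplot(toy_counts, aes(x = year, y = count, group = species_id)) +
  geom_line()

# ggplot_build() exposes the data as drawn, including the group of each row
unique(ggplot_build(toy_ungrouped)$data[[1]]$group)
unique(ggplot_build(toy_grouped)$data[[1]]$group)
```

The first call reports a single group and the second reports one group per species, which is why the grouped plot shows one line per species.
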
We will be able to distinguish the species in the plot if we add colors.
```{r}
ggplot(yearly_counts, aes(x=year, y=count, group=species_id, color=species_id)) +
  geom_line()
```
## Faceting
ggplot has a special technique called *faceting* that allows us to split one plot
into multiple plots based on some factor. We will use it to plot one time series
for each species separately.
```{r}
ggplot(yearly_counts, aes(x=year, y=count, color=species_id)) +
  geom_line() +
  facet_wrap(~species_id)
```
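`facet_wrap()` takes optional arguments that control the layout, such as `ncol` for the number of columns and `scales = "free_y"` to give each panel its own y axis. A sketch with made-up counts (`toy_yearly` is hypothetical):

```{r}
library(ggplot2)

# hypothetical toy data: one abundant and one rare species
toy_yearly <- data.frame(
  year       = c(1990, 1991, 1992, 1990, 1991, 1992),
  species_id = c("DM", "DM", "DM", "PF", "PF", "PF"),
  count      = c(100, 120, 150, 3, 2, 4)
)

# one panel per species, stacked in a single column,
# each with a y axis fitted to its own range
toy_facets <- ggplot(toy_yearly, aes(x = year, y = count, color = species_id)) +
  geom_line() +
  facet_wrap(~ species_id, ncol = 1, scales = "free_y")

# number of panels the plot will contain
length(unique(ggplot_build(toy_facets)$data[[1]]$PANEL))
```

Typing `toy_facets` at the console draws the plot. Free scales make trends in rare species visible, at the cost of panels that can no longer be compared by eye against a common axis.
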
Now we would like to split the line in each plot by the sex of each individual
measured. To do that we need to count the records in the dataframe grouped by sex.
> ### Challenges:
>
> - filter the dataframe so that we only keep records with sex "F" or "M"
>
```{r}
sex_values <- c("F", "M")
surveys <- surveys %>%
  filter(sex %in% sex_values)
```
> - group by year, species_id, sex
```{r}
yearly_sex_counts <- surveys %>%
  group_by(year, species_id, sex) %>%
  summarise(count=n())
```
> - make the faceted plot, splitting further by sex (within a single plot)
```{r}
ggplot(yearly_sex_counts, aes(x=year, y=count, color=species_id, group=sex)) +
  geom_line() +
  facet_wrap(~ species_id)
```
> We can improve the plot by coloring by sex instead of species (species are
> already in separate plots, so we don't need to distinguish them any further)
```{r}
ggplot(yearly_sex_counts, aes(x=year, y=count, color=sex, group=sex)) +
  geom_line() +
  facet_wrap(~ species_id)
```