---
title: "TidyrCoding"
author: "Aud Halbritter & Richard Telford"
date: "10 7 2018"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r setup_stg, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
## CONTENT
- Workflow
- Data handling: tidyverse et al.
- Adopt a style guide
- Draw a plot
<br>
## WORKFLOW
#### Why do we care about workflow?
- It makes returning to the code much easier a few months down the line; whether revisiting an old project, or making revisions following peer review.
- The results of your analysis are more easily scrutinised by the readers of your paper, meaning it is easier to show their validity.
- Having clean and reproducible code available can encourage greater uptake of new methods that you have developed.
<br>
#### Clean, repeatable and script-based workflow
- Start your analysis from your raw data.
- Any cleaning, merging, transforming, etc. of data should be done in scripts, not manually.
- Long scripts become difficult to navigate. Split your scripts into logical thematic units:
```{r, eval = FALSE}
"ImportData.R" # load, merge and clean data
"MyFunctions.R" # put functions in separate files
"AnalyseData.R" # analyse data
"PlotFigures.R" # produce outputs like figures and tables
```
- Eliminate code duplication by packaging up useful code into custom functions.
- Comment your code and functions thoroughly: explain WHAT the code is doing and WHY, and describe the expected inputs and outputs of functions.
- Document your code and data as comments in your scripts or by producing separate documentation.
- Any intermediary outputs generated by your workflow should be kept separate from raw data.
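As a rough illustration of the points about functions and comments, a small helper kept in "MyFunctions.R" might look like the sketch below. The function name `standard_error` and the example values are hypothetical and not part of the PFTC4 repo; the comments state what the function does, why it exists, and what it expects and returns.

```{r, eval = FALSE}
# standard_error: standard error of the mean
# WHY:    used in several scripts, so it is defined in one place
# INPUT:  x      numeric vector
#         na.rm  logical, drop missing values before calculating?
# OUTPUT: a single numeric value
standard_error <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  sd(x) / sqrt(length(x))
}

# hypothetical leaf areas in cm2
leaf_area <- c(2.1, 3.4, NA, 1.8)
se_area <- standard_error(leaf_area)
```

Other scripts can then load the helper with `source("MyFunctions.R")` instead of copying the code.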
<br>
#### What is an optimal workflow?
1. Write code and functions
2. Program defensively
3. Comment thoroughly
4. Check and test your code
5. Document
In this tutorial we will focus on the (tidyr) coding part.
<br>
## Workflow - Task 1
a) Go to: https://github.com/EnquistLab/PFTC4_Svalbard
"Download" the PFTC4 repo to your computer (green button at the right) or, if you have a GitHub account, "fork" the repo.
b) Explore the structure of the repo.
c) Open "Svalbard Analysis.Rproj" in RStudio. Load the data from the Google Sheet using the following code:
```{r, eval = FALSE}
# load libraries
# you might have to install the packages if this is the first time you are using them:
install.packages("tidyverse") # repeat this line for each package you need to install
library("tidyverse")
library("lubridate")
library("tpl")
library("googlesheets")
# little magic trick: pn() prints all rows of a tibble
pn <- . %>% print(n = Inf)
# list the Google Sheets you have access to
gs_ls()
# register the sheet that holds the trait data
trait <- gs_title("LeafTrait_Svalbard")
# list worksheets
gs_ws_ls(trait)
# download data
traits <- gs_read(ss = trait, ws = "Tabellenblatt1") %>% as_tibble()
```
d) Look at the data and get familiar with the structure: what data does each column contain, etc.
<br>
## DATA HANDLING - DPLYR AND TIDYR
#### Pipe notation %>%
To avoid saving each step of your data handling, plotting or analysis as a separate object, or wrapping everything in nested function calls, dplyr offers a smart solution: the pipe operator %>%, which is imported from another package (magrittr).
This operator pipes the output of one function into the input of the next. x %>% f(y) turns into f(x, y), so you can write multiple operations that read left-to-right, top-to-bottom:
```{r, eval = FALSE}
traits %>%
  filter(Project == "T", Site == "B", Genus == "Bistorta") %>%
  select(Wet_mass_g)
```
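To see that the pipe only changes how the code reads, not what it does, the sketch below runs the same two steps as a nested call and as a pipe. The `toy` tibble is made-up data standing in for `traits`.

```{r, eval = FALSE}
library(dplyr)

# made-up stand-in for the traits data
toy <- tibble(
  Site = c("A", "A", "B", "B"),
  Wet_mass_g = c(0.12, 0.30, 0.08, 0.25)
)

# nested call: has to be read from the inside out
nested <- select(filter(toy, Site == "B"), Wet_mass_g)

# piped call: reads top to bottom
piped <- toy %>%
  filter(Site == "B") %>%
  select(Wet_mass_g)

# both versions return the same rows and columns
all.equal(nested, piped)
```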
<br>
#### The most important functions
dplyr and tidyr use simple one word verbs as functions:
```select``` - select specific columns
```filter``` - filter specific content in rows
```arrange``` - sort rows
```mutate``` - change or add columns
```group_by``` - describe groups in the data for processing (ungroup to remove)
```summarise``` - summarise data (e.g. for certain groups)
```spread``` - transform table from thin to fat format
```gather``` - transform table from fat to thin format (data analysis usually requires thin format)
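A minimal sketch of a few of these verbs on made-up data (the `toy` tibble and its values are invented for illustration). Note that newer tidyr versions also provide `pivot_wider()` and `pivot_longer()` as successors to `spread()` and `gather()`; the older pair is used here to match the list above.

```{r, eval = FALSE}
library(dplyr)
library(tidyr)

# made-up data in thin format: one row per observation
toy <- tibble(
  Site = c("A", "A", "B", "B"),
  Species = c("sp1", "sp2", "sp1", "sp2"),
  Area_cm2 = c(1, 2, 3, 5)
)

# group_by + summarise: mean area per site
site_means <- toy %>%
  group_by(Site) %>%
  summarise(mean_area = mean(Area_cm2))

# spread: thin to fat, one column per site
fat <- toy %>%
  spread(key = Site, value = Area_cm2)

# gather: fat back to thin
thin <- fat %>%
  gather(key = "Site", value = "Area_cm2", -Species)
```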
<br>
## Task 2 - get familiar with dplyr and tidyr
Use tidyverse notation!
a) Reduce the data set to all observations from the bird cliff and elevational gradient, Elevation B and Bistorta vivipara.
b) Create a data frame with Site, Elevation, Plot, Genus, Species, Wet_Mass_g, Leaf_thickness 1 to 3. Add a new column called Mean_leaf_thickness_cm2 to the data set. Then sort the data by Genus within Site.
c) Calculate the mean leaf area (Area_cm2) across all species separately for each site, and then for each species across all sites.
d) Select the columns Site and Wet_mass_g and make a fat table with the different Sites in different columns. Then reverse it to a thin table again.
<br>
## ADOPT A STYLE GUIDE
#### Why do we care about coding style?
- Makes code easier to read
- Makes code easier to debug (find mistakes)
Make your own style - but be consistent!
<br>
#### Use concise, descriptive and meaningful names
- Names can contain letters, numbers, "_" and "."
- Names must begin with a letter or "."
- Avoid using names of existing functions, as this is confusing
- Make names concise yet meaningful
- Do not use reserved words as names (e.g. TRUE, for, if)
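One way to check these rules in R itself is base R's `make.names()`, which rewrites any string into a syntactically valid name. A name that comes back unchanged was already valid. The candidate names below are made up for illustration.

```{r, eval = FALSE}
# made-up candidate names
candidate_names <- c("leaf_area", "leaf area", "3rd_plot", "if")

# make.names() repairs invalid names; valid names are returned unchanged
repaired <- make.names(candidate_names)

# TRUE where the candidate was already a valid name
is_valid <- repaired == candidate_names
data.frame(candidate_names, repaired, is_valid)
```

This only checks syntax. Whether a name is concise and meaningful is still up to you.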
<br>
## Task 3 - Which names are valid? How would you improve the bad names?
- Maximum Temp (°C)
- 1st Obs.
- min_height
- max.height
- _age
- .mass
- MaxLength
- min length
- FALSE
- 2widths
- celsius2kelvin
- plot
<br>
#### Spacing
White-space is free (!) and makes your code more readable.
Place spaces around all infix operators (=, +, -, <-, etc.) and around = in function calls.
Always put a space after a comma, and never before.
Exception: :, :: and ::: don’t need spaces around them.
:: notation tells R which package to use
##### Good
```{r, eval = FALSE}
average <- mean(feet / 12 + inches, na.rm = TRUE)
ChickWeight[1, ]
```
##### Bad
```{r, eval = FALSE}
average<-mean(feet/12+inches,na.rm=TRUE)
ChickWeight[1,]
```
##### Good
```{r, eval = FALSE}
x <- 1:10
base::get
```
##### Bad
```{r, eval = FALSE}
x <- 1 : 10
base :: get
```
<br>
#### Split long commands over multiple lines
##### Good
```{r, eval = FALSE}
traits %>%
  mutate(sum = Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm,
         mean = (Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm) / 3)
```
##### Bad
```{r, eval = FALSE}
traits %>% mutate(sum = Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm, mean = (Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm) / 3)
```
<br>
#### Indentation and comments make code more readable
Use # to start comments.
##### Good
```{r, eval = FALSE}
traits %>%
  filter(Site == "X") %>%
  # replace wrong species name
  mutate(Species = ifelse(Species == "Oxyra", "Oxyria", Species)) %>%
  # calculate mean leaf area for each site, elevation and taxon
  group_by(Site, Elevation, Taxon) %>%
  summarise(mean = mean(Area_cm2))
```
##### Bad
```{r, eval = FALSE}
traits %>%
filter(Site == "X") %>%
mutate(Species = ifelse(Species == "Oxyra", "Oxyria", Species)) %>%
group_by(Site, Elevation, Taxon) %>%
summarise(mean = mean(Area_cm2))
```
Comments should help you and others to understand what you did. Comments can also be used to break up a file into readable chunks for navigation.
```{r, eval = FALSE}
#### Load data ####
####################
#### Plot data ####
####################
#****************************************************************
```
<br>
#### Assignment
Use <-, not =, for assignment.
##### Good
```{r, eval = FALSE}
x <- 5
```
##### Bad
```{r, eval = FALSE}
x = 5
```
<br>
#### Don't repeat yourself
Repeated code is hard to maintain. If you change the code, you need to change it in several places and it is hard to keep track. Use functions or smart code to avoid repetition (e.g. dplyr or tidyr).
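A minimal sketch of the idea, using made-up measurements: instead of typing the same conversion once per column, the conversion is written once as a function and applied to all columns. The names `leaves` and `mm_to_cm` are hypothetical.

```{r, eval = FALSE}
# made-up measurements in millimetres
leaves <- data.frame(
  length_mm = c(12, 30, 25),
  width_mm = c(5, 11, 9)
)

# repetitive version: the conversion is typed once per column
# leaves$length_cm <- leaves$length_mm / 10
# leaves$width_cm  <- leaves$width_mm / 10

# the conversion is defined once ...
mm_to_cm <- function(x) {
  x / 10
}

# ... and applied to every column
leaves_cm <- as.data.frame(lapply(leaves, mm_to_cm))
names(leaves_cm) <- sub("_mm$", "_cm", names(leaves))
```

If the conversion ever changes, only `mm_to_cm()` needs to be edited.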
## Task 4 - write code that calculates the mean wet weight for all species in each site
Hint: group_by and summarize
<br>
#### Avoid `attach()`
Unless you like strange bugs. `attach()` is very rarely useful, and there are many better options:
[https://coderclub.b.uib.no/2016/05/03/dont-get-attached-to-attach/](https://coderclub.b.uib.no/2016/05/03/dont-get-attached-to-attach/)
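Two of the base R alternatives are sketched below on a made-up data frame (`leaf_dat` is hypothetical): `with()` evaluates an expression inside a data frame, and many functions take a `data` argument.

```{r, eval = FALSE}
# made-up data
leaf_dat <- data.frame(
  area = c(2.1, 3.4, 1.8),
  mass = c(0.10, 0.21, 0.09)
)

# with() looks up the columns inside leaf_dat, no attach() needed
mean_area <- with(leaf_dat, mean(area))

# many modelling and plotting functions accept a data argument
fit <- lm(mass ~ area, data = leaf_dat)
```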
<br>
#### Portable code: relative vs. absolute path
```{r, eval = FALSE}
# Absolute path -> needs to be changed on a different computer/user
"C:/project_root_folder/data/species_dat.csv"
# Relative path -> works for everybody
"data/species_dat.csv"
```
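A relative path can also be built with base R's `file.path()`, which joins the folder and file names for you. The sketch assumes the working directory is the project root, as it is when the project is opened through its .Rproj file; the commented-out `read.csv()` line assumes the file exists.

```{r, eval = FALSE}
# build the relative path from its parts
species_path <- file.path("data", "species_dat.csv")

# read the file relative to the project root (assumes the file exists)
# species_dat <- read.csv(species_path)
```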
<br>
#### Defensive programming
Use code that works today and in a year. The code should work with the data set you have today, but also next year if you add another year of data.
##### Good
```{r, eval = FALSE}
# remove observation
traits %>%
  filter(ID != "AGV3567")
# flag a wrong observation
traits %>%
  mutate(Flag = ifelse(ID == "AGV3567", "wrong LeafArea", NA))
```
##### Bad
```{r, eval = FALSE}
# remove first row (will not work if the datasheet is changed)
traits %>%
  slice(-1)
```
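Another defensive habit is to check your assumptions about the data before analysing it, so that the script stops with an error when next year's data break them. The sketch below uses base R's `stopifnot()`; the function `check_traits` and the specific checks are hypothetical examples, not part of the PFTC4 code.

```{r, eval = FALSE}
# stop early if the data do not look as expected
check_traits <- function(dat) {
  stopifnot(
    is.data.frame(dat),
    all(c("ID", "Area_cm2") %in% names(dat)), # required columns exist
    anyDuplicated(dat$ID) == 0,               # every leaf ID is unique
    all(dat$Area_cm2 > 0, na.rm = TRUE)       # areas are positive
  )
  invisible(dat)
}

# made-up data: passes the checks
ok <- data.frame(ID = c("AGV3567", "BXK1021"), Area_cm2 = c(2.1, 3.4))
check_traits(ok)

# made-up data with a duplicated ID: check_traits(bad) stops with an error
bad <- data.frame(ID = c("AGV3567", "AGV3567"), Area_cm2 = c(2.1, 3.4))
```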
<br>
## MAKE A PLOT
A very quick intro to ggplot:
- Components are added together with a `+`.

Structure:

`ggplot(DATA, aes(x = X-AXIS, y = Y-AXIS, OTHER ARGUMENTS LIKE COLOR, SHAPE, LINETYPE)) + geom_point()`, where `geom_point()` draws points.
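A concrete instance of this structure, sketched with the built-in `mtcars` data set rather than the trait data (the choice of variables is only for illustration):

```{r, eval = FALSE}
library(ggplot2)

# DATA = mtcars, x = car weight, y = miles per gallon,
# colour = number of cylinders (converted to a factor for discrete colours)
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() # drawing points

# printing the object draws the plot
print(p)
```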
## Task 5 - Check the trait data with plotting
a) Draw a plot for Wet_Mass_g against Area_cm2 on a log scale.
b) Add a 1:1 line to the plot.
c) Color all the points where Wet_Mass_g is smaller than 0.1g.
d) Draw a plot only for the ITEX data, and make a separate plot for control and warming plots. Give each species a different color. Do not draw the legend.
```{r, eval = FALSE}
ggplot(traits, aes(x = , y = )) +
  geom_point() +
  geom_abline()
```
<br>
## Further Reading
British Ecological Society:
- A Guide to Data Management in Ecology and Evolution
- A Guide to Reproducible Code in Ecology and Evolution
Google's R Style Guide [https://google.github.io/styleguide/Rguide.xml](https://google.github.io/styleguide/Rguide.xml)
Wickham, H. Style Guide, _Advanced R_
[http://adv-r.had.co.nz/Style.html](http://adv-r.had.co.nz/Style.html)
RStudio Cheat Sheet
https://www.rstudio.com/resources/cheatsheets/