---
title: "TidyrCoding"
author: "Aud Halbritter & Richard Telford"
date: "10 7 2018"
output: word_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r setup_stg, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```


## CONTENT

- Workflow
- Data handling: tidyverse et al.
- Adopt a style guide
- Draw a plot

<br>

## WORKFLOW

#### Why do we care about workflow?
- It makes returning to the code much easier a few months down the line, whether you are revisiting an old project or making revisions following peer review.
- The results of your analysis are more easily scrutinised by the readers of your paper, meaning it is easier to show their validity. 
- Having clean and reproducible code available can encourage greater uptake of new methods that you have developed.

<br>

#### Clean, repeatable and script-based workflow
- Start your analysis from your raw data.
- Any cleaning, merging, transforming, etc. of data should be done in scripts, not manually.
- Long scripts become difficult to navigate. Split your scripts into logical thematic units:

```{r, eval = FALSE}
"ImportData.R"   # load, merge and clean data
"MyFunctions.R"  # put functions in separate files
"AnalyseData.R"  # analyse data
"PlotFigures.R"  # produce outputs like figures and tables
```

- Eliminate code duplication by packaging up useful code into custom functions.
- Comment your code and functions thoroughly: explain WHAT the code is doing and WHY, and describe the expected inputs and outputs of functions.
- Document your code and data as comments in your scripts or by producing separate documentation.
- Any intermediary outputs generated by your workflow should be kept separate from raw data.
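
As a minimal sketch of the points on custom functions and comments, the chunk below shows a small documented helper. The function name `standard_error`, the file it might live in and the measurements are made up for illustration; they are not part of the PFTC4 repo.

```{r, eval = FALSE}
# Hypothetical helper, e.g. stored in "MyFunctions.R".
# WHAT:   computes the standard error of the mean.
# WHY:    needed for several variables, so it is written once and reused.
# Input:  x     - numeric vector
#         na.rm - logical; drop missing values before calculating?
# Output: a single numeric value
standard_error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  sd(x) / sqrt(length(x))
}

# made-up measurements, for illustration only
leaf_area <- c(2.1, 3.4, NA, 2.8)
wet_mass <- c(0.12, 0.20, 0.15, 0.18)

# the same function is reused instead of copying the formula
se_leaf_area <- standard_error(leaf_area, na.rm = TRUE)
se_wet_mass <- standard_error(wet_mass)
```

If the formula ever needs to change, it only has to be changed in one place.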

<br>

#### What is an optimal workflow?
1. Write code and functions
2. Program defensively
3. Comment thoroughly
4. Check and test your code
5. Document

In this tutorial we will focus on the (tidyr) coding part.

<br>

## Workflow - Task 1
a) Go to https://github.com/EnquistLab/PFTC4_Svalbard and download the PFTC4 repo to your computer (green button on the right) or, if you have a GitHub account, fork the repo.

b) Explore the structure of the repo.

c) Open "Svalbard Analysis.Rproj" in RStudio. Load the data from the Google Sheet using the following code:


```{r, eval = FALSE}
# load libraries
# you might have to install a package the first time you use it;
# run this line once for each package that is missing:
# install.packages("tidyverse")

library("tidyverse")
library("lubridate")
library("tpl")
library("googlesheets")

# little magic trick: pn prints all rows of a tibble, e.g. traits %>% pn
pn <- . %>% print(n = Inf)

# which Google Sheets do you have access to?
gs_ls()
# register the sheet with the trait data
trait <- gs_title("LeafTrait_Svalbard")
# list worksheets
gs_ws_ls(trait)
# download data
traits <- gs_read(ss = trait, ws = "Tabellenblatt1") %>% as_tibble()
```

d) Look at the data and get familiar with the structure, e.g. what data each column contains.


<br>

## DATA HANDLING - DPLYR AND TIDYR

#### Pipe notation %>%
To avoid saving each step of your data handling, plotting or analysis, or wrapping function calls inside each other, dplyr offers a smart solution: the pipe operator %>%, which is imported from another package (magrittr).
This operator pipes the output of one function into the input of the next function: x %>% f(y) turns into f(x, y). You can use it to rewrite multiple operations so that they read left-to-right, top-to-bottom:

```{r, eval = FALSE}
traits %>%
  filter(Project == "T", Site == "B", Genus == "Bistorta") %>% 
  select(Wet_mass_g)
```
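
To see what the pipe does, here is a small sketch that compares nested function calls with the piped version. The data set `toy_traits` is made up for illustration; only the column names follow the example above.

```{r, eval = FALSE}
library("dplyr")

# made-up data; column names follow the example above
toy_traits <- tibble(
  Project = c("T", "T", "X"),
  Site = c("B", "B", "B"),
  Genus = c("Bistorta", "Salix", "Bistorta"),
  Wet_mass_g = c(0.05, 0.20, 0.10)
)

# nested calls: have to be read from the inside out
nested <- select(filter(toy_traits, Project == "T", Genus == "Bistorta"), Wet_mass_g)

# the same steps with the pipe: read from top to bottom
piped <- toy_traits %>%
  filter(Project == "T", Genus == "Bistorta") %>%
  select(Wet_mass_g)

# both objects hold the same table
all.equal(nested, piped)
```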

<br>

#### The most important functions
dplyr and tidyr use simple one-word verbs as function names:

```select``` - select specific columns

```filter``` - filter specific content in rows

```arrange``` - sort rows

```mutate``` - change or add columns

```group_by``` - describe groups in the data for processing (ungroup to remove)

```summarise``` - summarise data (e.g. for certain groups)

```spread``` - transform table from thin to fat format

```gather``` - transform table from fat to thin format (data analysis usually requires thin format)
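
The verbs can be tried out on a tiny table first. The sketch below uses made-up data and hypothetical column names (`site`, `species`, `height_cm`), not the trait data. Note that newer versions of tidyr also offer `pivot_wider()` and `pivot_longer()` as successors to `spread()` and `gather()`.

```{r, eval = FALSE}
library("dplyr")
library("tidyr")

# made-up data, for illustration only
toy <- tibble(
  site = c("A", "A", "B", "B"),
  species = c("sp1", "sp2", "sp1", "sp2"),
  height_cm = c(10, 20, 30, 50)
)

# filter rows, add a column, sort, and keep only some columns
tall_first <- toy %>%
  filter(site == "A") %>%
  mutate(height_m = height_cm / 100) %>%
  arrange(desc(height_m)) %>%
  select(species, height_m)

# mean height per site: group_by() + summarise()
site_means <- toy %>%
  group_by(site) %>%
  summarise(mean_height_cm = mean(height_cm))

# thin -> fat: one column per site
fat <- toy %>%
  spread(key = site, value = height_cm)

# fat -> thin again
thin <- fat %>%
  gather(key = "site", value = "height_cm", -species)
```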

<br>

## Task 2 - Get familiar with dplyr and tidyr

Use tidyverse notation!

a) Reduce the data set to all observations from the bird cliff and elevational gradient, Elevation B and Bistorta vivipara.

b) Create a data frame with Site, Elevation, Plot, Genus, Species, Wet_Mass_g and Leaf_thickness 1 to 3. Add a new column called Mean_leaf_thickness_cm2 to the data set. Then sort the data by Genus within Site.

c) Calculate the mean leaf area (Area_cm2) across all species separately for each site, and then for each species across all sites.

d) Select the columns Site and Wet_mass_g and make a fat table with the different sites in different columns. Then reverse it to a thin table again.

<br>


## ADOPT A STYLE GUIDE

#### Why do we care about coding style?
 - Makes code easier to read
 - Makes code easier to debug (find mistakes)

Make your own style - but be consistent!
 
<br>

#### Use concise, descriptive and meaningful names
 - Names can contain letters, numbers, "_" and "."
 - Names must begin with a letter or "." (and a leading "." must not be followed by a number)
 - Avoid using names of existing functions -> confusing
 - Make names concise yet meaningful
 - Do not use reserved words as names, e.g. TRUE, for, if
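
One way to check the syntax rules above is base R's `make.names()`, which turns any string into a syntactically valid name. The sketch below treats a name as valid if `make.names()` leaves it unchanged. The candidate names are made up, and the check only covers syntax: it does not tell you whether a name is meaningful or hides an existing function.

```{r, eval = FALSE}
# hypothetical candidate names, for illustration only
candidates <- c("leaf_area", "2nd_value", "site name", "if", ".temp")

# a name is syntactically valid if make.names() does not need to change it
is_valid <- make.names(candidates) == candidates

# overview: original name, validity, and the name R would use instead
data.frame(candidates, is_valid, suggestion = make.names(candidates))
```

The suggested replacements are valid but rarely pretty, so it is usually better to pick a new name by hand.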
 
<br>

## Task 3 - Which names are valid? Improve the bad names.

 - Maximum Temp (°C)
 - 1st Obs.
 - min_height
 - max.height 
 - _age 
 - .mass 
 - MaxLength 
 - min length 
 - FALSE
 - 2widths 
 - celsius2kelvin 
 - plot
 
<br>

 
#### Spacing
White-space is free (!) and makes your code more readable. 
Place spaces around all infix operators (=, +, -, <-, etc.) and around = in function calls. 
Always put a space after a comma, and never before.

Exception: :, :: and ::: don’t need spaces around them.
The :: notation tells R which package a function comes from, e.g. base::get.

##### Good
```{r, eval = FALSE}
average <- mean(feet / 12 + inches, na.rm = TRUE)
ChickWeight[1, ]
```
##### Bad
```{r, eval = FALSE}
average<-mean(feet/12+inches,na.rm=TRUE)
ChickWeight[1,]
```

##### Good
```{r, eval = FALSE}
x <- 1:10
base::get
```
##### Bad
```{r, eval = FALSE}
x <- 1 : 10
base :: get
```
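
The :: notation is most useful when two packages contain a function with the same name. As a small sketch with made-up data: both stats and dplyr have a function called `filter`, and the prefix makes explicit which one is meant.

```{r, eval = FALSE}
# made-up data, for illustration only
toy <- data.frame(site = c("A", "B", "A"), mass_g = c(0.1, 0.3, 0.2))

# dplyr::filter() keeps the rows that match a condition
site_a <- dplyr::filter(toy, site == "A")

# stats::filter() applies a linear filter to a series
# (here a moving average over two values)
moving_avg <- stats::filter(toy$mass_g, rep(1 / 2, 2))
```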

 
<br>

#### Split long commands over multiple lines

##### Good
```{r, eval = FALSE}
traits %>%
  mutate(sum = Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm,
         mean = (Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm) / 3)
```

##### Bad
```{r, eval = FALSE}
traits %>% mutate(sum = Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm, mean  = (Leaf_Thickness_1_mm + Leaf_Thickness_2_mm + Leaf_Thickness_3_mm) / 3)

```

 
<br>

#### Indentation and comments make code more readable

Use # to start comments.

##### Good
```{r, eval = FALSE}

traits %>%
  filter(Site == "X") %>% 
  
  # replace wrong species name
  mutate(Species = ifelse(Species == "Oxyra", "Oxyria", Species)) %>% 
  
  # calculate mean leaf area for each site, elevation and taxon
  group_by(Site, Elevation, Taxon) %>%
  summarise(mean = mean(Area_cm2))

```
##### Bad
```{r, eval = FALSE}

traits %>%
filter(Site == "X") %>% 
mutate(Species = ifelse(Species == "Oxyra", "Oxyria", Species)) %>% 
group_by(Site, Elevation, Taxon) %>%
summarise(mean = mean(Area_cm2))

```

Comments should help you and others to understand what you did. Comments can also be used to break up a file into readable chunks for navigation.

```{r, eval = FALSE}
#### Load data ####


####################
#### Plot data ####
####################


#****************************************************************

```

 
<br>

#### Assignment
Use <-, not =, for assignment.

##### Good
```{r, eval = FALSE}
x <- 5
```
##### Bad
```{r, eval = FALSE}
x = 5
```

 
<br>

#### Don't repeat yourself
Repeated code is hard to maintain. If you change the code, you need to change it in several places and it is hard to keep track. Use functions or smart code to avoid repetition (e.g. dplyr or tidyr).
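
A minimal sketch of the idea, using made-up data and hypothetical names (`missing_to_na`, the code -999 for missing values). It assumes a dplyr version that provides `across()` (1.0.0 or newer).

```{r, eval = FALSE}
library("dplyr")

# made-up data where -999 stands for a missing value
toy <- tibble(
  plot = c("p1", "p2", "p3"),
  height_cm = c(12, -999, 15),
  width_cm = c(-999, 4, 5)
)

# repetitive version: the same logic is typed once per column
repetitive <- toy %>%
  mutate(
    height_cm = ifelse(height_cm == -999, NA, height_cm),
    width_cm = ifelse(width_cm == -999, NA, width_cm)
  )

# reusable version: the logic is written once ...
missing_to_na <- function(x, code = -999) {
  ifelse(x == code, NA, x)
}

# ... and applied to every column that ends with "_cm"
reusable <- toy %>%
  mutate(across(ends_with("_cm"), missing_to_na))
```

If a third measurement column is added later, the reusable version picks it up without any change to the code.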

## Task 4 - Write code that calculates the mean wet weight for all species in each site

Hint: group_by and summarise


<br>

#### Avoid `attach()`
`attach()` invites strange bugs. It is very rarely useful to attach a data frame, and there are many better options:

[https://coderclub.b.uib.no/2016/05/03/dont-get-attached-to-attach/](https://coderclub.b.uib.no/2016/05/03/dont-get-attached-to-attach/)
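
Some of these options, sketched with the ChickWeight data that ships with R (used here only as stand-in data):

```{r, eval = FALSE}
# option 1: refer to the column explicitly
mean_1 <- mean(ChickWeight$weight)

# option 2: with() evaluates an expression inside the data frame
# without changing the search path
mean_2 <- with(ChickWeight, mean(weight))

# option 3: many modelling functions take a data argument
fit <- lm(weight ~ Time, data = ChickWeight)
```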

 
<br>
 
#### Portable code: relative vs. absolute path
```{r, eval = FALSE}
# Absolute path -> needs to be changed on a different computer/user
"C:/project_root_folder/data/species_dat.csv"

# Relative path -> works for everybody
"data/species_dat.csv"
```
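
A short sketch of how a relative path can be built and checked. The file `data/species_dat.csv` is the hypothetical example from above and will not exist on your computer unless you create it.

```{r, eval = FALSE}
# file.path() joins folder and file names with "/" on every platform
species_file <- file.path("data", "species_dat.csv")

# a relative path is resolved against the working directory;
# in an RStudio project this is normally the project root folder
getwd()

# check that the file is where the code expects it before reading it
file.exists(species_file)

# species_dat <- read.csv(species_file)  # uncomment once the file exists
```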

 
<br>

#### Defensive programming
Use code that works today and in a year. The code should work with the data set you have today, but also next year if you add another year of data.

##### Good
```{r, eval = FALSE}
# remove observation
traits %>% 
  filter(ID != "AGV3567")

# flag a wrong observation
traits %>% 
  mutate(Flag = ifelse(ID == "AGV3567", "wrong LeafArea", NA))
```

##### Bad
```{r, eval = FALSE}
# remove first row (will not work if the datasheet is changed)
traits %>% 
  slice(-1)
```
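
Defensive code can also check its own assumptions. The sketch below uses made-up data (only the column names `ID` and `Area_cm2` follow the examples above) and base R's `stopifnot()`, which stops with an error as soon as one of the conditions is not TRUE.

```{r, eval = FALSE}
library("dplyr")

# made-up data, for illustration only
toy_traits <- tibble(
  ID = c("AGV3567", "BHT1234", "CKL9876"),
  Area_cm2 = c(250, 2.3, 1.8)
)

bad_id <- "AGV3567"

# stop early if an assumption about the data no longer holds
stopifnot(
  anyDuplicated(toy_traits$ID) == 0,   # every leaf ID is unique
  sum(toy_traits$ID == bad_id) == 1    # the ID to remove exists exactly once
)

cleaned <- toy_traits %>%
  filter(ID != bad_id)

# check the result as well: exactly one row should be gone
stopifnot(nrow(cleaned) == nrow(toy_traits) - 1)
```

If next year's data contain a duplicated ID or the wrong observation has already been removed, the code fails loudly instead of silently producing a different result.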

 
<br>

## MAKE A PLOT

A very quick intro to ggplot:

- Components are added together with a +

Structure:
ggplot(DATA, aes(x = X-AXIS, y = Y-AXIS, OTHER ARGUMENTS LIKE COLOR, SHAPE, LINETYPE)) +
  geom_point() # drawing points
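
Filled in, the structure above could look like the following sketch. It uses the ChickWeight data that ships with R as stand-in data (not the trait data), with `colour = Diet` as an example of the other arguments.

```{r, eval = FALSE}
library("ggplot2")

# ChickWeight ships with R: weight (g), Time (days), Chick, Diet
p <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet)) +
  geom_point() +                              # drawing points
  labs(x = "Time (days)", y = "Weight (g)")   # axis labels

p  # printing the object draws the plot
```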


## Task 5 - Check the trait data with plotting

a) Draw a plot for Wet_Mass_g against Area_cm2 on a log scale.

b) Add a 1:1 line to the plot.

c) Color all the points where Wet_Mass_g is smaller than 0.1 g.

d) Draw a plot only for the ITEX data, and make a separate plot for control and warming plots. Give each species a different color. Do not draw the legend.

```{r, eval = FALSE}

# template: fill in x and y, then continue adding layers after the last +
ggplot(traits, aes(x = , y = )) +
  geom_point() +
  geom_abline() +

```

 
<br>

## Further Reading

British Ecological Society:

- A Guide to Data Management in Ecology and Evolution
- A Guide to Reproducible Code in Ecology and Evolution

Google's R Style Guide [https://google.github.io/styleguide/Rguide.xml](https://google.github.io/styleguide/Rguide.xml)

Wickham, H. Style Guide, _Advanced R_
[http://adv-r.had.co.nz/Style.html](http://adv-r.had.co.nz/Style.html)

RStudio Cheat Sheets
[https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/)