dplyr.Rmd

# Data manipulation

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE)
```

Package `dplyr` introduces a grammar of data manipulation. See the nice [cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
<br>

We will first introduce the **5 intuitively-named key functions** from **{dplyr}**:
<br>

```{r, echo=FALSE, eval=TRUE}
knitr::kable(
  data.frame(name=c("**`mutate`**", "**`select`**", "**`filter`**", "**`summarise`**", "**`arrange`**"), 
             `what it does`=c("adds new variables (columns) that are functions of existing variables", "picks variables (columns) based on their names.", "picks observations (rows) based on their values.", "collapses multiple values down to a single summary.", "changes the ordering of the rows.")), caption = 'the 5 core `dplyr` functions',
  format = "html", table.attr = "style='width:90%;'"
)
```

<br>

All 5 functions work in a similar and consistent way:

* The first argument is the input: a `data frame` or a `tibble`.
* The output is a new `tibble`.
 
> *Note that {dplyr} never modifies the input: you need to* ***redirect the output*** *and save in a new - or the same - object.*

<br><br> 
We will use the `presidential` data set from the `ggplot2` package.
*It contains data of the terms of* ***presidents of the USA***, *from Eisenhower to Obama:*

* Name
* Term starting date
* Term ending date of mandate
* Political party
 
```{r, echo=F, warning=F, message=F}
library(kableExtra)
knitr::kable(ggplot2::presidential, caption="`presidential` data set")
```


## mutate & transmute

`mutate` allows to create new columns that are functions of the existing ones.

* Create a new column with the duration of each term:

```{r, eval=FALSE}
# Subtracting column start to column end
mutate(presidential, 
    duration_days=end-start)
```

> *Notes:*

> * Use **unquoted** column names.
> * Columns are added at the end of the data frame.
> * `mutate` keeps all columns. 

You can change **where the column is added** (if you don't want it to be added at the last position):

```{r, eval=FALSE}
# add it before column "start"
mutate(presidential, 
    duration_days=end-start,
    .before=start)

# add it after column "end"
mutate(presidential, 
    duration_days=end-start,
    .after=end)
```

<br>
If you want to keep **only the newly created column(s)** (drop the remaining ones): use `transmute()` instead of `mutate()`:

```{r, eval=F}
transmute(presidential, 
    duration_days=end - start)
```

Re-assign to a new - or the same - data frame/tibble using the **usual R assignment operator: <-**

```{r, eval=T}
presidential2 <- mutate(presidential, 
                duration_days=end - start)
```

* If you want to change the name of a column, you can use the function `rename`:

```{r, eval=F}
# using the column name:
rename(presidential2, 
       President=name)

# or the column index:
rename(presidential2, 
       President=1)

# you can rename several columns using the same command:
rename(presidential2, 
       President=name, 
       Political_party=party)
```


## select

`select` will select (and optionally rename) columns/variables in a data frame / tibble.

Select column **"name"** only from the `presidential2` object:

```{r, highlight=c(1,3), echo=F, warning=F,message=F}
library(kableExtra)
knitr::kable(presidential2) %>% 
  kable_styling(bootstrap_options = "striped", font_size = 14) %>% 
  column_spec(1, background = "yellow")
```

```{r, eval=F}
select(presidential2, 
       name)
```

Select 2 columns: **party** and **name** (in that order):

```{r, eval=F}
select(presidential2, 
       party, name)
```

Rename a column as you select it:
```{r, eval=F}
select(presidential2, 
       party, President=name)
```

Select all columns **except** party with the **-** sign:
```{r, eval=F}
select(presidential2, 
       -party)
```

Select all columns between **start** and **party** (with both columns included)
```{r, eval=T}
 select(presidential2, 
        start:party)
```

### select_if

Select only columns containing characters with **select_if()**:

```{r, eval=T, echo =T}
# select columns containing characters:
select_if(presidential2, 
    is.character)
```

Select only columns containing a **date** with the `is.Date` function from the `lubridate` package:

```{r, eval=TRUE, echo=TRUE}
select_if(presidential2, 
      lubridate::is.Date)
```

<center><h4 style="background-color: #a4edff; display: inline-block;">**HANDS-ON**</h4></center>

For the next hands-on, we will use th `starwars` dataset: it contains information about the Star Wars movie characters:

```{r, echo=F, warning=F,message=F}
library(kableExtra)
knitr::kable(starwars[1:5,1:9]) %>% 
  kable_styling(bootstrap_options = "striped", font_size = 14)
```


* Create a new column **BMI** that contains the **BMI** of each character (Body Mass Index, calculated as `weight in kg / (height in m)^2`: we will assume that the `height` column is expressed as **cm** and the `mass` column is expressed as **kg**)
* Rename column `name` to `character_name`.
* Remove columns `vehicles` and `starships`.
* Save all the changes into the new tibble `starwarsBMI` (use the `%>%` !).

<details>
<summary>
<h5 style="background-color: #a4edff; display: inline-block;">*Answer*</h5>
</summary>

```{r, eval=FALSE}
# Create a new column BMI that contains the BMI of each character
mutate(starwars, BMI=mass/(height*0.01)^2)

# Rename column name to character_name.
rename(starwars, character_name=name)

# Remove columns `vehicles` and `starships`.
select(starwars, -c(vehicles, starships))

# Save all the changes into the new tibble starwarsBMI (use the %>% operator)
starwarsBMI <- starwars %>% 
  mutate(BMI=mass/(height*0.01)^2) %>% 
  select(-c(vehicles, starships)) %>%
  rename(character_name=name)
```

</details>

### select using patterns

Some [select helpers](https://tidyselect.r-lib.org/reference/starts_with.html) are available, that help you select columns given certain **patterns** in their names:

```{r, echo=FALSE, eval=TRUE}
knitr::kable(
  data.frame(name=c("starts_with", "ends_with", "contains", "matches", "num_range"), 
             description=c("starts with a prefix", "ends with a suffix", "contains a literal string", "matches a regular expression", "matches a numerical range like x01, x02, x03")), caption = 'select helpers',
  format = "html", table.attr = "style='width:60%;'"
)
```

<br>
For example, select only columns from the `starwars` dataset which name **end with** "color":

```{r}
select(starwars,
       ends_with("color"))
```

Or which start with the letter **h**:

```{r}
select(starwars,
       starts_with("h"))
```

If you are familiar with **regular expressions**, you can also use them within the `matches` function:

```{r}
select(starwars,
       matches("^h")) # same as starts_with("h")
```

Finally, you can select columns which name match a **numerical range** with `num_range`.
<br>
For example, let's take the `billboard` dataset that contains column names wk1, wk2, wk3 ... up to wk76, and select only columns from wk18 to wk22:

```{r}
select(billboard,
       num_range("wk", 18:22))
```

## filter

`filter()` is used to filter rows in a data frame / tibble.

Keep rows if **Democratic** is found in column `party`:

```{r, highlight=c(1,3), echo=F, warning=F,message=F}
knitr::kable(presidential) %>% 
  kable_styling(bootstrap_options = "striped", font_size = 14) %>% 
  row_spec(c(2:3, 6, 9, 11), background = "yellow")
```

```{r, eval=T}
 filter(presidential, 
        party=="Democratic")
```

You can filter using several variables/columns:
```{r, eval=F}
filter(presidential, 
       party=="Republican", name=="Bush")

# This implicity uses the "&", i.e. the fact that both conditions have to be TRUE
filter(presidential, 
       party=="Republican" & name=="Bush")

# Any logical operators can be used
filter(presidential, 
       name %in% c("Bush", "Kennedy"))
```

The same can be used for numerical values: let's select all rows from `table5` where `century > 19`:

```{r}
filter(table5,
       century > 19)
```

<center><h4 style="background-color: #a4edff; display: inline-block;">**HANDS-ON**</h4></center>

Going back to our previously create starwarsBMI data frame:

* How many characters have a **BMI** > 30?
* How many characters have a **BMI** > 30 <u>AND</u> are **Droids** ("Droid" in column `species`)?
* From the previous selection (BMI > 30 and Droid), select columns BMI, character_name, height and mass, and save in the new object `DroidBMI30`.

<details>
<summary>
<h5 style="background-color: #a4edff; display: inline-block;">*Answer*</h5>
</summary>

```{r, eval=F}
# How many characters have a BMI > 30?
filter(starwarsBMI, BMI > 30)

# How many characters have a BMI > 30 AND are Droids ("Droid" in column "species")?
filter(starwarsBMI, BMI > 30 & species=="Droid")
  
# From the previous selection (BMI > 30 and Droid), select columns BMI, character_name, height and mass, and save in the new object DroidBMI30.
DroidBMI30 <- starwarsBMI %>% 
            filter(BMI > 30 & species=="Droid") %>% 
            select(BMI, character_name, height, mass)
```

</details>

## summarise & group_by
 
`summarise` collapses a data frame to a **1-row tibble** (base R equivalent of **aggregate()**)
 
* Get average length of terms:

```{r, eval=T}
summarise(presidential2, 
          mean(duration_days))
```

* Get average length of terms + count of the total of entries:

```{r, eval=T}
summarise(presidential2,
          mean(duration_days), 
          n())
```

* You can also give a name to each of the calculations you produce with `summarise` (and add more calculations!):

```{r, eval=T}
summarise(presidential2,
          mean_term=mean(duration_days), 
          min_term=min(duration_days),
          max_term=max(duration_days),
          count_presidents=n())
```

* You can combine `summarise` with `group_by` to get, for example, the number of presidents per political party:

  * `group_by` defines a grouping based on existing variables.
  * `summarise` then processes the command based on the grouping

```{r}
# group based on the "party" column (that contains "Democratic" or "Republican")
groups <- group_by(presidential2,
                   party)

# count the number of presidents per party:
summarise(groups, 
          n())

# One line, using the %>% operator:
group_by(presidential2, 
         party) %>% 
      summarise(n())
```

> Note: the row above is equivalent to using `count` (a wrapper): `count(presidential, party)`

```{r}
count(presidential2, party)
```

* Use the same structure to calculate **the average length of terms per political party**:

```{r}
group_by(presidential2, 
         party) %>% 
  summarise(mean(duration_days))
```

* Note that you can **group using more than one variable**:

```{r, eval=F}
group_by(starwars, 
         species, hair_color, gender) %>%
      summarise(n())
```

* Grouping variables also **influences how other `dplyr` functions work**!
<br>
  
For example, let's group our `starwars` characters by both **species and gender** variables:

```{r, eval=TRUE}
sw_sg <- group_by(starwars, 
                  species, gender) 
```

We can then use `slice_max` (function that retrieves the row that contains the **maximum value** in a selected variable) to retrieve the character with the **maximum `height`**:
 
```{r, eval=TRUE}
sw_sg %>% 
  select(name, species, gender, height) %>% # columns selection just to make the output more readable
  slice_max(height)
```

We get one entry for **each unique combination of species and gender**.
<br>
If you query the maximum height on an **non-grouped tibble**, you will get only one row (maximum height overall):

```{r, eval=TRUE}
starwars %>% 
  select(name, species, gender, height) %>%
  slice_max(height)
```

* Another example is that you can create new columns (with `mutate`) **based on the grouping**.
<br>

  * Here we are grouping the data by `species` and we add a column `average_height_species` that describes the **average height**. 
  * As the data is grouped by `species`, we will get the **average height PER SPECIES**:

```{r}
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE))
```


<center><h4 style="background-color: #a4edff; display: inline-block;">**HANDS-ON**</h4></center>

Back to our `starwarsBMI` tibble:

* Count the number of occurrences of each `hair color` per `gender`.
* Count the average BMI per `species`. Add a count of the number of individuals per species.

<details>
<summary>
<h5 style="background-color: #a4edff; display: inline-block;">*Answer*</h5>
</summary>

```{r, eval=F}
# Count the number of occurrences of each `hair color` per `gender`
starwarsBMI %>% 
  group_by(gender, hair_color) %>% 
  summarise(mycounts=n())

# Count the average **BMI** per `species`.  Add a count of the number of individuals per species.
starwarsBMI %>% 
  group_by(species) %>% 
  summarise(average_bmi=mean(BMI, na.rm=TRUE), count_individuals=n())

# Also report the number of individuals for which BMI is NOT NA (i.e. the actual number of individuals for which the average was computed)
starwarsBMI %>%
  group_by(species) %>% 
  summarise(average_bmi=mean(BMI, na.rm=TRUE), count_all_individuals=n(), count_individuals_non_NA=sum(!is.na(BMI)))
```

</details>

### `ungroup`

When you are grouping variables with `group_by`, the tibble will **keep the grouping until you ungroup**!

<br>
While this is not an issue when you are summarizing (you get a summary table), it can be useful in case you are using the grouping - for example - to **create a new column**.
<br><br>
In the example stated before, we created a new column `average_height_species` that contains the **average height per species**:

```{r, eval=TRUE}
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE)) %>% head(n=2)
```

What if we also want to add a column that describes the **average mass of ALL individuals** (<u>regardless of the species</u>)? 
<br>

```{r, eval=TRUE}
# With the current grouping (species), you get the average mass calculated per species
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE)) %>%
  mutate(average_mass=mean(mass, na.rm=TRUE)) %>% head(n=2)
```

Column `average_mass` still contains the average per species!

<br>
* We need to **ungroup** the tibble before creating this new column!!

```{r, eval=TRUE}
# ungroup first, and you get the average mass calculated for the whole tibble
starwars %>% group_by(species) %>% 
  select(name, species, height, mass) %>% 
  mutate(average_height_species=mean(height, na.rm=TRUE)) %>%
  ungroup %>%
  mutate(average_mass=mean(mass, na.rm=TRUE)) %>% head(n=2)
```


## arrange

`arrange` orders the rows of a data frame **by the values of selected columns**.
<br>
Let's order rows by increasing mandate duration:

```{r, eval=T}
arrange(presidential2, duration_days)
# decreasing order with arrange(presidential2, desc(duration_days))
```

You can use several columns for the sorting
```{r, eval=T}
arrange(presidential2, 
        duration_days, name)
```

If a grouping was done before, you can arrange first by grouping and then by selected variable(s) setting the `.by_group=TRUE` parameter:

```{r}
presidential2 %>%
    group_by(party) %>% 
    arrange(duration_days, .by_group=TRUE)
```
    
<center><h4 style="background-color: #a4edff; display: inline-block;">**HANDS-ON**</h4></center>

Go back to the previous exercise: "count the average **BMI** per `species`.  Add a count of the number of individuals per species." (on the `starwarsBMI` data set):

```{r, eval=FALSE}
starwarsBMI %>% 
  group_by(species) %>% 
  summarise(average_bmi=mean(BMI, na.rm=TRUE), count_individuals=n())
```

* Keep only species that have **2 or more individuals**.
* Arrange by decreasing **average BMI**.

<details>
<summary>
<h5 style="background-color: #a4edff; display: inline-block;">*Answer*</h5>
</summary>

```{r, eval=F}
starwarsBMI %>% 
  group_by(species) %>% 
  summarise(average_bmi=mean(BMI, na.rm=TRUE), count_individuals=n()) %>%
  filter(count_individuals >= 2) %>%
  arrange(desc(average_bmi))
```

</details>

## A few more useful `dplyr` functions

### Mutating joins

The following functions allows one to **join / merge** 2 tibbles into 1 using columns that contain **common keys**.

```{r, echo=FALSE, eval=TRUE}
knitr::kable(
  data.frame(name=c("**`inner_join`**", "**`left_join`**", "**`right_join`**", "**`full_join`**"), 
             `what it does`=c("includes all rows in x and y (intersection)", "includes all rows in x", "includes all rows in y", "includes all rows in x or y (union)")), caption = 'mutating joins functions',
  format = "html", table.attr = "style='width:90%;'"
)
```

Let's create 2 small tibbles:

```{r, eval=TRUE, message=F}
tibX <- tibble(ID=LETTERS[1:4],
               year=c("2020", "2021", "2021", "2020")
)

tibY <- tibble(ID=LETTERS[3:5],
               month=c("January", "October", "July")
               )
```

We will join `tibX` and `tibY` using the **ID** column, and keep only rows that contain a **matching ID** with `inner_join`:

```{r}
inner_join(x=tibX,
           y=tibY,
           by="ID")
```

Keep all rows from `tibX` regardless on whether they have a match in `tibY` with `left_join`:

```{r}
left_join(x=tibX,
           y=tibY,
           by="ID")
```

Keep all rows from `tibY` regardless on whether they have a match in `tibX` with `right_join`:

```{r}
right_join(x=tibX,
           y=tibY,
           by="ID")
```

Keep all rows from both tibbles with `full_join`:

```{r}
full_join(x=tibX,
           y=tibY,
           by="ID")
```

Note that columns do **NOT** need to be named the same way!
<br>
Let's consider the new tibble `tibZ`:

```{r, eval=TRUE}
tibZ <- tibble(id=LETTERS[3:5],
               month=c("May", "June", "April")
               )
```

We can join it with `tibX` by giving the **"by"** parameter a **named** vector that contains **1 element**:

```{r}
full_join(x=tibX,
           y=tibZ,
           by=c("ID" = "id")
)
```

<center><h4 style="background-color: #a4edff; display: inline-block;">**HANDS-ON**</h4></center>

Join the 2 following tibbles (keep all rows from the `mynames` tibble):

```{r}
mynames <- tibble(name=c("Einstein", "Newton", "Curie", "Mendel", "Franklin"),
                  birth_year=c(1879, 1643, 1867, 1822, 1920))
  
myemails <- tibble(full_name=c("Albert Einstein", "Isaac Newton", "Marie Curie", "Rosalind Franklin"),
                               email_address=c("aeinstein283@coolmail.com", "isaac.newton.scientist@coolmail.com", "mariecurie007@coolmail.com", "rosalind2_franklin@coolmail.com"))
```


<details>
<summary>
<h5 style="background-color: #a4edff; display: inline-block;">*Answer*</h5>
</summary>

```{r, eval=F}
# need to separate a column first!
myemails %>% separate(col=full_name,
                      into=c("first_name", "last_name"),
                      sep=" ") %>%
            right_join(y=mynames,
                       by=c("last_name" = "name"))
```

</details>


### Extract or remove rows with `slice`:

* Extract rows:

```{r, eval=F}
# Fetch the first 2 rows (index 1 and 2)
slice(presidential2, 
      1:2)
```

* Remove rows:

```{r, eval=F}
# Remove 1rst and 4th rows
slice(presidential2, 
      -c(1,4))
```

The **`slice` helpers** can be useful:

```{r, echo=FALSE, eval=TRUE}
knitr::kable(
  data.frame(name=c("slice_min", "slice_max", "slice_head", "slice_tail", "slice_sample"), 
             usage=c("select rows with lowest values of a variable", "select rows with highest values of a variable", "select the first rows", "select the last rows", "randomly select rows")), caption = '`slice` helpers',
  format = "html", table.attr = "style='width:70%;'"
)
```

Extract the row that has the **maximum** `height` from in the `starwars` dataset with `slice_max`:

```{r}
# by default, only 1 row is extracted
slice_max(starwars, 
          order_by=height)

# set parameter "n" if you want to extract the "n" top rows
slice_max(starwars, 
          order_by=height,
          n=3)
```

Same for the minimum with `slice_min`:

```{r}
slice_min(starwars, 
          order_by=mass,
          n=2)
```

Extract the first or last row with `slice_head` and `slice_tail`, respectively

```{r}
# first row
slice_head(starwars)

# last row
slice_tail(starwars)
```

You can extract the **"n"** first or last rows, or you can extract a certain **proportions of rows to select** with **"prop"**:

```{r, eval=FALSE}
# first 5 rows
slice_head(starwars, 
           n=5)

# first 10% of the rows
slice_head(starwars, 
           prop=0.1)

# last 7 rows
slice_tail(starwars, 
           n=7)

# last 25% of the rows
slice_tail(starwars, 
           prop=0.25)
```

Select a random sample of rows with `slice_random`:

```{r}
# 1 random row
slice_sample(starwars)

# 4 random rows
slice_sample(starwars, 
             n=4)

# 3% of rows selected randomly
slice_sample(starwars, 
             prop=0.03)
```

Note that the `slice` functions can be combined with a `grouping`.

* With `slice_max` you would get the maximum per group (with `slice_min`, the minimum per group):

```{r}
presidential2 %>% 
  group_by(party) %>%
  slice_max(order_by = duration_days)

presidential2 %>% 
  group_by(party) %>%
  slice_min(order_by = duration_days)
```

### Extract a single column as a vector with `pull`:

```{r, eval=T}
# extract column "duration_days"
presidential2 %>% pull(duration_days)

# extract column "duration_days" as a vector, and name the vector using the "name" column
presidential2 %>% pull(duration_days, 
                       name=name)
```

### Change column order with `relocate`:

```{r, eval=F}
# move column "party" as the start
relocate(presidential, 
         party)

# move column "name" before column "party"
relocate(presidential, 
         party, 
         .before=party)

# move column "name" at the end (after last column)
relocate(presidential, 
         name, 
         .after=last_col())

# move around all columns
relocate(presidential, 
         party, start, name, end)

# rename a column as you relocate it
relocate(presidential, 
         President=name, 
         .after=last_col())

# reorganize columns alphabetically
relocate(presidential, 
         sort(tidyselect::peek_vars()))
```

## Exercise

We will work with a modified version of the **storms** data set: positions and attributes of **198 tropical storms**, measured every 6 hours.

1. Download and read in [this file](https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/modified_storms.csv) (using a `tidyverse` function!):
  * store the dataset into object `mystorms`, and then **tidy** it!
2. What storm has the **highest median wind speed**?
3. Calculate **how many storms happen each year**. You might need to `separate` a column... And check how the `distinct` function can help you!
  * What are the years with the **maximum** number of storms?

<details>
<summary>
<h5 style="background-color: #a4edff; display: inline-block;">*Answer*</h5>
</summary>

```{r, eval=F}
# 1. download, read in, tidy
mystorms <- read_csv("https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/modified_storms.csv") %>%
            separate(col=wind_and_pressure, into=c("wind", "pressure"), sep="-", convert=TRUE) 

# 2. What storm has the highest median wind speed?
mystorms %>% group_by(name) %>%
            summarise(median_wind = median(wind)) %>%
            slice_max(order_by=median_wind)
      
# 3. Calculate how many storms happen each year: 
mystorms %>%  separate(date, into=c("year", "month", "day"), sep="-") %>%
          distinct(name, year) %>%
          group_by(year) %>%
          summarise(storms_per_year=n())
  #  What are the years with the **maximum** number of storms?
mystorms %>%  separate(date, into=c("year", "month", "day"), sep="-") %>%
          distinct(name, year) %>%
          group_by(year) %>%
          summarise(storms_per_year=n()) %>%
          slice_max(order_by=storms_per_year, n=5)
```

</details>