---
title: 'Tidy Data'
author: "Jeremy Van Cleve"
date: 24 09 2024
format:
html:
self-contained: true
---
# Outline for today
- So many data formats, so little time
- One format to rule them all: tidy data
- Making tidy data
- Slicing tidy data with `dplyr`
# So many data formats, so little time
> Happy families are all alike;
> every unhappy family is unhappy in its own way.
>
> Leo Tolstoy (first line of "Anna Karenina")
Hadley Wickham points out the nice connection between the above quote and data formats [^1]. While data formatting might seem esoteric and boring, paying attention to it early on can pay off a great deal in the long term, which hopefully leads to a happy scientist. If one were to break data formatting into two phases, they might be these:
Phase 1. The literal production of the data from your experiment, since you must choose a format and medium in which to record them.
Phase 2. The input and formatting of the data into your analysis tool (e.g., R). This might involve minimal changes to the format or a great deal of reshaping.
Working backward, you do the least work on data formatting if you save your data initially in the format that is easiest to analyze with R. Since we will use Hadley Wickham's graphing package `ggplot2`, we will use the data format he advocates, **tidy data**. Lest you worry that this is too Hadley specific, or even R specific, this kind of format is now a popular way to "layer" or "compose" plots that use different graphical elements to represent different variables in the data (e.g., `plotly` and `seaborn` in Python and `AlgebraOfGraphics` and `TidierPlots.jl` in Julia). To get a sense for what this format is and why it might be useful, consider some different ways to organize data on tuberculosis (TB) cases where you have three variables, *country*, *year*, and *population*, for each measurement of *cases*. First, each variable including *cases* could have its own column:
```{r}
#| warning: false
library(tidyverse)
table1
```
Here, each row represents one time you measured the TB cases. Alternatively, you could have each row represent a country with columns for different years. This means you need one table for the *cases* and another for the *population*:
```{r}
table4a # cases
table4b # population
```
To see why this format might get cumbersome, just suppose you want to add more variables: each new variable requires a new table. If you are interested in just one variable and how it changes across country and year, you are fine. But if you want to look across variables, say correlate population and caseload over time, then you have to slice two different tables; to manipulate three variables, you have to manipulate three tables, and so on. With the first format, all the variables fit in a single table. The first format is the **tidy** format.
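For example (a quick sketch using `mutate`, a `dplyr` function we'll meet below), computing a per-capita rate from the tidy `table1` takes one line because *cases* and *population* sit side by side:
```{r}
# cases per 10,000 people, computed directly from two columns of the same table
mutate(table1, rate = cases / population * 10000)
```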
# One format to rule them all: tidy data
The tidy format has the following three rules[^1]:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
Visually, this looks like
![](assets/tidy.png)
The primary benefit of tidy data is that every variable is a vector (a column), which means that slicing data is just slicing columns. Though this may become complicated when the number of variables is large, helper functions from the `dplyr` package make slicing tidy data much easier, and the plotting package `ggplot2` assumes tidy data.
To give a flavor of the power of tidy data and `ggplot2`, we'll look at an example. The "GWAS Catalog" (<https://www.ebi.ac.uk/gwas/home>) keeps track of the SNPs associated with traits in genome-wide association studies (GWAS). We can download these data, take a reasonable subset to play with (the 100 diseases with the most SNPs), and plot some things. For example, here are SNPs associated with breast, colon, and prostate cancer:
```{r}
#| message: false
#| warning: false
# https://www.ebi.ac.uk/gwas/docs/file-downloads
# we'll load a subset of these data; namely, the 100 diseases with the most SNPs in the database once we remove those associated with some common body/weight and education phenotypes
# to clean things up a little, we filter out rows where the CHR_POS isn't a single number; this way, the x-axis label is nice automatically.
gwas = read_tsv("gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv", col_types = cols(CHR_POS = col_number()))
filter(gwas, `DISEASE/TRAIT` == 'Prostate cancer' | `DISEASE/TRAIT` == "Colorectal cancer" | `DISEASE/TRAIT` == "Breast cancer") |>
ggplot(aes(x=CHR_POS)) +
geom_point(aes(y=PVALUE_MLOG, color=`DISEASE/TRAIT`))
```
# Making tidy data
Often, data will not be in the tidy format by default, so it will be necessary to format it. The first step is to figure out what the "variables" and "observations" are. Sometimes this may require carefully thinking about the experimental design used to create the data. Two common problems with data that are *untidy* are:
1. One variable might be spread across multiple columns.
2. One observation might be scattered across multiple rows.
Typically, a dataset will suffer from only one of these problems. The first problem is dealt with using the `pivot_longer` function and the second with the `pivot_wider` function.
## Making data "longer"
Recall that this table is tidy
```{r}
table1
```
whereas these two are not
```{r}
table4a
table4b
```
The problem with `table4a` and `table4b` is that the variable *year* is spread across multiple columns. Thus, you need to gather it together into a single column, which makes the data frame longer. The column names `1999` and `2000` become values of the new variable *year*, whose name we pass to the argument `names_to`. The first table holds the number of cases, so our `values_to` argument is "cases":
```{r}
# note the backticks `` here; we need them since the column names start with a number
# (i.e., this lets us break normal R naming rules!).
tidy4a = pivot_longer(table4a, cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4a
```
and the second table contains the population size
```{r}
tidy4b = pivot_longer(table4b, cols = c(`1999`, `2000`), names_to = "year", values_to = "population")
tidy4b
```
To join these two results together, you use the function `full_join`, which by default matches rows on all the columns the two tables share:
```{r}
full_join(tidy4a, tidy4b)
```
which is the same as the first table
```{r}
table1
```
These "joins" are related to joins you do with databases if you know SQL. We'll talk more about this next week.
## How does pivoting longer work?
To get a closer look at how this works conceptually, we'll borrow an example from Hadley's R for Data Science book[^2]. Suppose we have the following blood pressure data from three individuals:
```{r}
bpdf = tribble(
~id, ~bp1, ~bp2,
"A", 100, 120,
"B", 140, 115,
"C", 120, 125
)
```
where `id` refers to the individual and `bp1` and `bp2` are measurements of blood pressure at two different times. By the way, the `tribble` function is a handy way of creating a `tibble` data frame by hand.
We want a tidy table with three columns: `id`, which we already have; `measurement`, which will be either `bp1` or `bp2`; and the `value` of the measurement. To do this, we use the code
```{r}
pivot_longer(bpdf, cols = c(bp1, bp2), names_to = "measurement", values_to = "value")
```
To see how this reshaping works, think first about the `id` column. We keep this column, but its values must be repeated in new rows, once for each column being pivoted (here, `bp1` and `bp2`).
![](assets/tidy_bp_variables.png){width=50%}
The column names become values of a new column, `measurement`, and each name needs to be repeated once for each row of the original dataset.
![](assets/tidy_bp_column-names.png){width=50%}
The second new column captures the cell values, which are unwound row by row from the original table into the new one.
![](assets/tidy_bp_cell-values.png){width=50%}
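As an aside, you don't have to list every pivoted column by hand. Here is a sketch using the tidyselect helper `starts_with`, which produces the same result as before:
```{r}
# select all columns whose names start with "bp" instead of listing them
pivot_longer(bpdf, cols = starts_with("bp"), names_to = "measurement", values_to = "value")
```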
## Making data "wider"
The following table is *untidy*
```{r}
table2
```
because each observation (a country in a given year) is spread across multiple rows. To tidy it, identify the column that names the variables, which in this case is `type`, and the column that holds the values, which is `count`:
```{r}
pivot_wider(table2, names_from = type, values_from = count)
```
## How does pivoting wider work?
Let's look at a longer, alternative version of our blood pressure example.
```{r}
long_bpdf = tribble(
~id, ~measurement, ~value,
"A", "bp1", 100,
"B", "bp1", 140,
"B", "bp2", 115,
"A", "bp2", 120,
"A", "bp3", 105
)
long_bpdf
```
To pivot this wider, we could do
```{r}
pivot_wider(long_bpdf, names_from = measurement, values_from = value)
```
The `pivot_wider` function accomplishes this by looking for the unique values in the `measurement` column, since these will become the new column names. These values are
```{r}
unique(long_bpdf$measurement)
```
By default, the rows in the output are determined by all the variables that aren't going into the new names or values. These are called the `id_cols`. Here, there is only one such column (`id`), but in general there can be any number. The new columns `bp1`, `bp2`, and `bp3` are then filled from the data in the original table, and `NA` is inserted when there isn't a corresponding row for that column and row combination.
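A sketch making those defaults explicit: the `id_cols` argument names the identifying columns, and `values_fill` replaces the default `NA` with a value of your choosing.
```{r}
# same pivot, but with the id columns spelled out and missing combinations filled with 0
pivot_wider(long_bpdf, id_cols = id, names_from = measurement,
            values_from = value, values_fill = 0)
```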
## More pivoting wider and longer
There are other ways your data might be *untidy*
1. A single column actually contains two or more variables (like a rate stored as the text "cases/population"; see the sketch below). In this case, the `separate` or `separate_wider_delim` functions may be used.
2. Multiple columns actually contain a single variable and need to be combined. In this case, the `unite` function is used.
To read more about these, see the "vignette" (or code tutorial) in R by typing `vignette("pivot")`.
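As a small illustration of the first case, the built-in `table3` from `tidyr` stores *cases* and *population* jammed into a single `rate` column, which `separate_wider_delim` can split apart (note that the new columns come out as character strings):
```{r}
table3
separate_wider_delim(table3, rate, delim = "/", names = c("cases", "population"))
```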
# Slicing tidy data with `dplyr`
Now that you have made your data tidy, you probably want to slice and dice it. The `dplyr` package has handy functions for doing just this, and they will make slicing **MUCH EASIER** than the base R approach we've used so far. Generally, `dplyr` is useful for the following five tasks:
1. Pick observations (i.e., rows) by their values for specific variables: **`filter()`**.
2. Reorder or sort the rows: **`arrange()`**.
3. Pick variables (i.e., columns) by their names: **`select()`**.
4. Create new variables with functions of existing variables (or modifying existing variables): **`mutate()`**.
5. Collapse many values down to a single summary: **`summarize()`**.
Each of the functions above works in a similar way.
- The first argument is the data frame.
- The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using `$`.
- The result is a new data frame.
We'll use the GWAS data you loaded above to demonstrate each of the five tasks.
## Filtering rows with `filter()`
Let's look at the GWAS data with `glimpse` (which is from `tidyverse`).
```{r}
glimpse(gwas)
```
Each observation (row) is a SNP from a GWAS study, and we have information on its location, the associated disease/trait, etc. Let's filter just the SNPs for colorectal cancer:
```{r}
filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer')
```
We could apply another filter by just adding an argument, such as requiring a -log10(p-value) greater than 10 (higher values mean more significant!):
```{r}
filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer', PVALUE_MLOG > 10)
```
Note that `filter` combines consecutive arguments with "&" by default. You could equivalently give a single argument with the "&" to get the same slice:
```{r}
filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer' & PVALUE_MLOG > 10)
```
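As an aside (a small sketch), the `%in%` operator is a compact alternative to chaining `==` comparisons with `|`, as in the cancer example earlier:
```{r}
filter(gwas, `DISEASE/TRAIT` %in% c("Breast cancer", "Colorectal cancer", "Prostate cancer"))
```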
## Arrange rows with `arrange()`
The function `arrange` just sorts the rows based on the columns you specify. For example, to sort the colorectal cancer SNPs by `DATE` of the study,
```{r}
colocancer = filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer')
arrange(colocancer, DATE)
```
and use `desc` to make the sort a descending one,
```{r}
arrange(colocancer, desc(DATE))
```
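You can also sort on several columns at once; ties in the first column are broken by the second, and so on. A quick sketch:
```{r}
# sort by study date, breaking ties with the most significant SNPs first
arrange(colocancer, DATE, desc(PVALUE_MLOG))
```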
## Select columns with `select()`
The `select` function simply selects specific columns (i.e., variables). To get only the disease, chromosome position, and -log10(p-value) from the full data,
```{r}
select(gwas, `DISEASE/TRAIT`, CHR_POS, PVALUE_MLOG)
```
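The tidyselect helpers (e.g., `contains` and `starts_with`) work inside `select` too. A sketch, assuming the p-value columns in this file all have "VALUE" in their names:
```{r}
select(gwas, `DISEASE/TRAIT`, contains("VALUE"))
```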
## Add new variables with `mutate()`
You may want to add new variables that are functions of other variables. For example, you could create a column for the p-value from the `PVALUE_MLOG` (there is already one in the table, but we'll make another)
```{r}
sm_gwas = select(gwas, `DISEASE/TRAIT`, CHR_POS, PVALUE_MLOG, `P-VALUE`)
mutate(sm_gwas, new_p_value = 10^(-PVALUE_MLOG))
```
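New variables can also be logical. As a sketch, here we flag SNPs that pass the conventional genome-wide significance threshold of p < 5e-8 (i.e., `PVALUE_MLOG` above -log10(5e-8), about 7.3):
```{r}
# TRUE for SNPs whose -log10(p) exceeds the genome-wide significance cutoff
mutate(sm_gwas, genome_wide = PVALUE_MLOG > -log10(5e-8))
```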
## Summaries with `summarize()`
The function `summarize` (Hadley prefers the British spelling, `summarise`, which I refuse to acknowledge as an American 😤) collapses the data frame to a single row:
```{r}
summarize(gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
```
This isn't terribly useful until you combine it with the `group_by` function, which makes `summarize` act separately on groups of rows defined by the values of the variables you specify. To see the average for each disease, group by `DISEASE/TRAIT`,
```{r}
grouped_gwas = group_by(gwas, `DISEASE/TRAIT`)
summarize(grouped_gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
```
or add the journal the study was published in
```{r}
grouped_gwas = group_by(gwas, `DISEASE/TRAIT`, JOURNAL)
meanp = summarize(grouped_gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
meanp
```
Finally, let's take the mean across all diseases for each journal and then sort by the mean -log10(p-value) to see the pattern across journals a little better:
```{r}
arrange(summarize(group_by(gwas, JOURNAL), mean_mlog_p_value = mean(PVALUE_MLOG)), desc(mean_mlog_p_value))
```
Another nice function to use with `summarize` is the "counting" function `n()`, which can give the number of rows in each group. Since each row is a SNP, we can get the number of SNPs for each disease/trait this way:
```{r}
arrange(summarize(group_by(gwas, `DISEASE/TRAIT`), n_SNPS = n()), desc(n_SNPS))
```
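You can also compute several summaries in one `summarize` call; a sketch combining `n()` with a mean:
```{r}
summarize(group_by(gwas, `DISEASE/TRAIT`),
          n_SNPS = n(),
          mean_mlog_p_value = mean(PVALUE_MLOG))
```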
# Using the `|>` operator to chain together functions
The slicing and summarizing operations we've been doing can quickly get messy as we add more operations. Since each operation is applied to the result of the previous one, we end up with deeply nested parentheses, which can be hard to read. The pipe operator `|>` allows you to chain operations together so they are easier to read from left to right. For example, the previous `summarize` and `arrange` operations could be put together like
```{r}
gwas |>
group_by(`DISEASE/TRAIT`) |>
summarize(n_SNPS = n()) |>
arrange(desc(n_SNPS))
```
Effectively, the pipe takes what is on its left side and uses it as the first argument of the function on its right side. In the example above, `gwas`, our initial data table, is on the left, and it gets used as the first argument of `group_by`. The `group_by` function returns a data frame, which is then used as the first argument of the function on the right of the next pipe, which is `summarize`. The rule of thumb, then, is: to convert a command without a pipe into one with a pipe, take the first argument out of the function and put it on the left of the function, separated by the pipe. Likewise, to get rid of a pipe, take the input on its left, make it the first argument of the function on its right, and delete the pipe.
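To make the conversion rule concrete, here is a sketch of the same slice written both ways:
```{r}
# without the pipe: read from the inside out
select(filter(gwas, PVALUE_MLOG > 10), `DISEASE/TRAIT`, CHR_POS)
# with the pipe: read from left to right
gwas |> filter(PVALUE_MLOG > 10) |> select(`DISEASE/TRAIT`, CHR_POS)
```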
**A very significant advantage** of structuring your slicing / wrangling commands this way when you use RStudio is that RStudio will help you auto-complete column names. As you've probably seen already, typing these names can get cumbersome, so this is a big help. RStudio doesn't do this when you put the `data.frame` object in the function directly (for some reason unknown to me).
[^1]: Wickham, Hadley. 2014. Tidy Data. J Stat Softw, 59(10):1--23. DOI: [10.18637/jss.v059.i10](http://dx.doi.org/10.18637/jss.v059.i10)
[^2]: Wickham, Hadley. 2023. R for Data Science (2e). O'Reilly Media. [website](https://r4ds.hadley.nz/)
# Lab ![](assets/beaker.png)
For these problems, use the functions introduced above from the `tidyverse` packages like `dplyr` as much as possible.
### Problems
1. Using the GWAS data,
```{r}
library(tidyverse)
gwas = read_tsv("gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv", col_types = cols(CHR_POS = col_number()))
```
produce a data table that shows the **mean**, **maximum**, **minimum**, and **variance** of `PVALUE_MLOG` for each combination of disease and journal. Sort the table by the mean value of `PVALUE_MLOG`.
2. Recall the gene expression data that were briefly introduced last week:
```{r}
library(readxl)
imprint = read_excel("babak-etal-2015_imprinted-mouse.xlsx", na = "NaN")
```
Each row is a gene, each column is a tissue type, and each cell contains a gene expression measurement.
**Make these data tidy (in one way) by collapsing the tissue columns and making the data table longer**
You will need the `pivot_longer` function. The first trick will be to identify the "observations" and then the `names_to` variable, which is the variable that changes across each observation.
The second trick is specifying the columns that you need to gather together into a single column. A hint for the second trick is that you can specify the columns you *don't* want with the `-` operator or the `!` (NOT) operator.
The answer will be deceptively simple! This is the elegance / frustration of R.
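As a generic illustration of this kind of column selection (a sketch on the lecture's `bpdf` table, *not* the imprinting data):
```{r}
# pivot every column except `id` by negating the selection
pivot_longer(bpdf, cols = !id, names_to = "measurement", values_to = "value")
```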
3. Using the tidy data from Problem 2, find the **number of genes** (across all tissue types) that have an **expression value <= 10 and > 2**. These genes are "paternally imprinted" in the Babak et al. dataset, which means they are only expressed from the maternal copy of the gene.
Hint: you will need to count each gene once even if it appears in multiple rows, which can be done with the `distinct` function.
4. Use `pivot_wider` to return the genomic imprinting tidy data from Problem 2 to their original wide format.
5. We'll use some of the CDC data on COVID-19 deaths for this problem:
```{r}
library(RSocrata)
covid19deaths = read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab") |> as_tibble() |> filter(group == "By Week")
```
These data are tidy. Make a **wider** version of this table by making columns for each state and using the `covid_19_deaths` column as the values. You can discard all the other columns except `week_ending_date`, `covid_19_deaths`, and the columns for each state. You should have only one row for each date.