week12.qmd

---
title: 'Visualizing lots of data'
author: "Jeremy Van Cleve"
date: 19 11 2024
format: 
  html:
    self-contained: true
---
  
# Outline for today

- Grids of plots
- Principal component analysis (PCA) plots

# Reminder

- Schedule for lightning talks

# Plotting data with many variables

In many of the datasets that we've seen so far, we've already been faced with the problem that the data could be visualized in many different ways because they have many different variables. Some datasets were time series and it was clear that scatter plots or line plots were useful with time on the x-axis and multiple variables on the y-axis. For example, these global COVID-19 data from Our World in Data (<https://github.com/owid/covid-19-data/tree/master/public/data>) for the year 2021
```{r}
#| message: false
library(tidyverse)
theme_set(theme_classic())
```
```{r}
owid = read_csv("https://catalog.ourworldindata.org/garden/covid/latest/compact/compact.csv")
owid
owid |> 
  filter(country %in% c("United States", "Brazil", "India", "United Kingdom", "Sweden")) |>
  filter(year(date) == 2021, !is.na(new_deaths_smoothed_per_million)) |>
  ggplot(aes(x=date, y=new_deaths_smoothed_per_million)) +
  geom_line(aes(color=country)) + 
  labs(x = "Date", y = "Deaths per million persons (smoothed)", color = "Country")
```
are presented nicely as a nice time series since the time course of the pandemic is a crucial feature that we might be interested in studying.

However, our plots may have many variables and no single one of them, like time, stands out as one against which we would like to compare the others. In this case, how the variables correlate with one another may be important. For example, let's load a dataset from Keith Tarvin of Oberlin College on morphological measurements of blue jays.

```{r}
load("blue_jays.rda")

glimpse(blue_jays)
```

There are 123 individuals in this dataset and the variables include bill depth, width, length, head size, mass, sex, etc. If we wanted to see how head size and body mass correlate, we could simply do a scatter plot.

```{r}
blue_jays |> ggplot(aes(x=Mass, y=Head)) + geom_point()
```

It looks like head size and mass correlate positively, which makes sense. We know there are males and females, so we can add another variable by coloring the points by sex.
```{r}
blue_jays |> ggplot(aes(x=Mass, y=Head)) + geom_point(aes(color=KnownSex))
```

## Grids of plots

Ok, so male birds are larger and heavier in this population. But there are a bunch of other variables too; how do they correlate with mass and head size and with each other? How do they differ with sex? In other words, we would like to make a scatter for each pair of variables in the dataset. To do this, we can use the `facet` function like we did earlier in the semester but now in a more sophisticated way. First, a little reminder about how the `facet` function works in `ggplot`. Let's look back at the time that we used `facet_wrap` in Week 8.

```{r}
library(gapminder)

usukjpcn = filter(gapminder, 
            country == "United States" 
            | country == "United Kingdom" 
            | country == "Japan" 
            | country == "China" )

ggplot(data = usukjpcn) + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
  facet_wrap(~ country)
```

We first wrote `ggplot` commands like we wanted to plot `year` vs `lifeExp`. Then, we used the `facet_wrap` function and told it we wanted a plot for each distinct value of the `country` variable. The output was then four plots, one for each country, with the same x- and y-axis. This kind of setup actually only plots three variables, `year`, `lifeExp`, and `country`, and does not do a scatter plot for each pair of variable. However, with a little data wrangling, we can create a new data table that will work with `facet_grid`. 

Looking at the example above, we can surmise that `facet_wrap` and `facet_grid` generate multiple plots with the same x- and y-axis. Thus, if we would a grid of plots where each plot has a different combination of variables for the x- and y-axis, then we need to collapse our data for the different morphological measurements into two sets of two columns, one set for the variables that will go on the x-axis and one set for the variables that will go on the y-axis. Recall from our lecture on tidying data that we can do this with `pivot_longer`.
```{r}
bgx = blue_jays |> select(BirdID, KnownSex, Head, Mass, Skull) |>
  pivot_longer(Head:Skull, names_to = "var_x", values_to = "val_x")

bgy = blue_jays |> select(BirdID, KnownSex, Head, Mass, Skull) |>
  pivot_longer(Head:Skull, names_to = "var_y", values_to = "val_y")

bgx
bgy
```
In the table above, we've selected only four variables, sex, head size, mass, and skull size. In order to create scatter plots for head size, mass, and skull size, we gathered them together where `var_x` tells us which variable we have and `val_x` the value of that variable. Likewise, we created an identical table with `var_y` and val_y` since we're going to plot each variable on the x- and y-axis. Next, we just join the two tables together.
```{r}
bg = inner_join(bgx, bgy) # we use `inner_join` but `full`, `left`, or `right` will do the same thing
bg
```
The join is deceptively simply so let's break it down a little bit. Its a "natural" join since we don't give it a `by = join_by` and it selects the common columns, which are `BirdID` and `KnownSex`. Since each `BirdID` is unique in the `blue_jays` table (you can check this with `count`), we can really just think the join is done on `BirdID`. In both `bgx` and `bgy`, each `BirdID` has has three rows, one for each of the three variables head, mass, and skull. The join function looks at the key variable, `BirdID` and finds rows in both tables that share values of `BirdID`. It then adds to the output table one row for each unique combination of the remaining variables in the two tables, which are the `var_x`, `val_x`, `var_y`, and `val_y` variables. Thus, each `BirdID` value now has 3 x 3 = 9 rows in the output table.

With this joined table, we now can plot the grid of plots with `facet_grid(vars(var_y), vars(var_x))`
```{r}
bg |> 
  ggplot(aes(val_x, val_y, color = KnownSex)) + 
  geom_point() + 
  # we put rows = rows = vars(var_y) and cols = vars(var_x) so that it calculates the scales correctly; 
  # the fixed variable in a column, var_x, should vary on the x-axis and likewise for the rows and the y-axis.
  facet_grid(rows = vars(var_y), cols = vars(var_x), scales = "free") +
  labs(x = NULL, y = NULL)
```

Viola! We can see now that males are generally larger across all these variables, and that they all positively correlate with each other. In ecology, these kind of correlations with different morphological measurements is studied in the field of "allometry".

Of course, we're not the first ones to want to do this kind of plot. There are a few packages which perform this kind of plot automatically, such as `GGally` whose function `ggpairs` adds some nice things.
```{r}
library(GGally)

blue_jays |> 
  select(KnownSex, Head, Mass, Skull) |> 
  ggpairs(aes(color=KnownSex))
```
However, even with standard `ggplot`, we can still get more than 1/2 to what this fancy package does!

## Principal component analysis (PCA) plots

Recall that the blue jay dataset has more than just the three variables above. What if it had 10 more variables? Or 100 more? We couldn't easily interpret a 100x100 grid of scatter plots to understand how the variables relate to one another or whether there are different patterns for males and females. A very useful statistical tools used in these circumstances is called a *principal component analysis*, which is a type of *dimensionality reduction*. Dimensionality reduction uses the fact that there are often lots of correlations in high-dimensional data and thus there are "effectively" many fewer important dimensions that capture the independent variation of the whole dataset. For example, in the blue jay morphology data, we can see that many body measures are positively correlated. Thus, if we know just body mass, we can make a good guess about what head size or skull size might be. We could even do a linear regression of head size against body mass and skull size against body mass and then use body mass to predict both of those variables. If we do this kind of linear analysis, adding together variables and weighting them with coefficient, and make sure the weight sum of variables explains as much variation in the data as possible, we end up with a PCA. A PCA not only gives you a new variable that explains a lot of variation, it also gives you as many new variables as you had old variables where each new one after the first explains less variation in the data than the new variable before it. 

Let's see an example of PCA. We can use the `prcomp` function in R to generate the principal components.

```{r}
bpca = blue_jays |> select(-BirdID, -KnownSex, -Sex) |> prcomp()

cbind(blue_jays, bpca$x) |> ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=KnownSex))
```

This plot shows all the bird samples plotted on the two first principal components. You can immediately see that males and females are roughly two different clouds of points, which accords with our prior knowledge that the sexes differ in body morphology. To see how much of the variance each PC explains, we look at the `summary`.

```{r}
summary(bpca)
```
The first component explains almost 90% of the variance and the second component only 8%. To see how each PC is composed of the underlying variables, we look at the `rotation` value:
```{r}
bpca$rotation
```
This shows that the first PC is mostly body mass and that other variables are also positively correlated with body mass (PCs can be all multiplied by a negative, so what matters is the relative sign). We can also plot what the variables are as a function of the first two PCs.

```{r}
as_tibble(bpca$rotation) |> 
  mutate(feature=rownames(bpca$rotation)) |> 
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=feature), show.legend = FALSE) + 
  geom_text(aes(label=feature), size=3, position=position_jitter(width=0.05,height=0.05))
```

This just shows that mass dominates PC1 whereas PC2 measures how the rest of the variables all negatively correlate with mass once you take into account PC1.

Finally, let's do a PCA for data that doesn't have such a clear group as sex. One example are the gene expression data from the genomic imprinting dataset. Here, we'll do a PCA using the `Genes` as variables. This entails using `pivot_wider` to put the `Genes` as variables or columns (which of course is how the original `.xlsx` file was, but we're just using the tidy data as practice)

```{r}
imprint = read_csv("babak-etal-2015_imprinted-mouse_tidy.csv")

imprint_t = imprint |> 
  pivot_wider(names_from = Genes, values_from = expression) |> 
  na.omit()
ipca = imprint_t |> 
  select(-tissue) |> 
  prcomp()

as_tibble(ipca$x) |> 
  add_column(tissue = imprint_t$tissue, .before = 1) |>
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point() + 
  geom_text(aes(label=tissue), size=2.5, position=position_jitter(width=2,height=5))
```

We can see from the PCA that some tissues group together naturally (brain tissues) and others not as much. This could be the beginning of an investigation about whether there are biological reasons these other tissues groups the way they do. We can use summary to see how strong the first few PCs are.

```{r}
summary(ipca)
```

The first PC only captures 21% of the variance and the second captures 12%. Thus, we need more the PCs in this dataset to describe a significant amount of the variance. Finally, we can plot how the genes make up the first two PCs. Then, we could use this information to see if the biological function of these genes explains how the tissues group along PC1 and PC2.

```{r}
as_tibble(ipca$rotation) |> 
  add_column(gene = rownames(ipca$rotation), .before = 1) |>
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point() + 
  geom_text(aes(label=gene), size=2.5, position=position_jitter(width=0.015,height=0.03))
```

## Non-linear dimensionality reduction

PCA is a "linear" dimensionality reduction since it is just a weighted sum of the original variables. Other dimensionality reduction methods combine the variables together in a non-linear way and have become popular in molecular biology as a way of condensing gene expression data. Two particularly common ones are UMAP and t-SNE. Here is an example of PCA vs t-SNE:

![Fig 2 from Kobak and Berens 2019[^1]](assets/fig2_kobak_berens_2019_NatComm.jpg)

These have some advantages in terms of their visual appeal, but they should be used carefully[^1] and can produce spurious results if used incorrectly[^2].

[^1]: Kobak, D., and P. Berens. 2019. The art of using t-SNE for single-cell transcriptomics. Nature Communications 10:5416.
[^2]: Chari, T., and L. Pachter. 2023. The specious art of single-cell genomics. PLOS Computational Biology 19:e1011288.

# Lab ![](assets/beaker.png)

### Problems

1.  Using the `gapminder` data:
    - Perform a principal components analysis using the variables `lifeExp`, `year`, `pop`, and `gdpPercap`. 
    - Plot all the points on a scatter plot with the the x- and y-axis corresponding to the first two principal components (PC1 and PC2). 
    - Color the points according to `year`. Set the shape of the point (square, circle, etc) according to `continent`.
    - What patterns do you observe if any?

2.  For this question, we will use genetic data from a few European populations gathered as part of the CEPH-Human Genome Diversity Panel. A little more information on the data can be found here: <https://github.com/NovembreLab/HGDP_PopStruct_Exercise/blob/master/PopGenWorkshop.pdf>. The file can be loaded using the code below. The `ID` column is an identifying string for the individual, the `Group` column gives the individual's geographic ancestry, and the remaining columns are SNPs.
    ```{r}
    hgdp = read_csv("H938_Euro.LDprune.csv")
    ```
    - Perform a PCA on the SNPs.
    - Plot all the individuals using the first two principle components.
    - Color the individuals according to `Group`.
    - Which PC corresponds to a more north-south axis and which one to a more east-west axis?
    
3.  The scatter plot grid we created for the blue jay data had many extraneous plots (the diagonal plots and two copies of each plot, x vs y and then y vs x). There were only three scatter plots that we really needed from that 3x3 grid. Using `facet_wrap`, we can plot only these three plots. 
    - Begin with the `bg` data table that we used that contains the `var_x`, `var_y`, `val_x`, and `val_y` columns.
    - Create one new column that contains the x,y variable combo for each scatter plot.
    - Use `filter` to get only the rows that correspond to the three variable combinations you need.
    - Create scatter plots using `facet_wrap` on the new column you created.
    - +2 points: use `labeller` in `facet_wrap` to add the correct x and y axis labels for each plot.