# Principal Components Analysis {#pca}
As always, let's start by loading the packages we need.
```{r, results = 'hide', warning = FALSE, message= FALSE }
library(lipdR)
library(geoChronR)
library(magrittr)
library(dplyr)
library(purrr)
library(ggplot2)
```
This vignette showcases the ability to perform principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis. Data are from the PalMod compilation. We'll just load a subset of it here, but the rest is available on the LiPDverse.
```{r indianOceanData,cache=TRUE}
FD <- lipdR::readLipd("http://lipdverse.org/geoChronR-examples/PalMod-IndianOcean/PalMod-IndianOcean.zip")
```
## Let's explore these data!
### Make a map
First, let's take a quick look at where these records are located. geoChronR's `mapLipd` function can create quick maps:
```{r,results="hide",fig.keep="all"}
mapLipd(FD,map.type = "line",f = 0.1)
```
### Grab the age ensembles for each record.
Now we need to map the age ensembles to the paleoData for all of these datasets. We'll use `purrr::map()` for this, but you could also do it with `sapply()`. In this case we're going to specify that all of the age ensembles are named "ageEnsemble", along with the names of the chron and paleo depth variables. Fortunately, the naming is consistent across PalMod, making this straightforward.
```{r mapIOD,results="hide",message=FALSE}
FD2 <- purrr::map(FD,
                  mapAgeEnsembleToPaleoData,
                  strict.search = TRUE,
                  paleo.depth.var = "depth_merged",
                  chron.depth.var = "depth",
                  age.var = "ageEnsemble")
```
### Plot some summary statistics
In this analysis, we're interested in spatiotemporal patterns in the Indian Ocean during the Last Deglaciation. Not all of these records cover that time interval, so let's take a look at the temporal coverage.
```{r}
indTib <- FD2 %>% extractTs() %>% ts2tibble() #create a lipd-timeseries-tibble

#use purrr to extract the minimum and maximum ages for each record
minAge <- map_dbl(indTib$age, min, na.rm = TRUE)
maxAge <- map_dbl(indTib$age, max, na.rm = TRUE)

#plot the distributions of the ages
ggplot() +
  geom_histogram(aes(x = minAge, fill = "Min age")) +
  geom_histogram(aes(x = maxAge, fill = "Max age"), alpha = 0.5) +
  scale_fill_manual("", values = c("red", "black")) +
  xlab("Age")
```
OK, so it looks like the ages range from 0 to about 150. But what are the units on those?
:::: {.blackbox data-latex=""}
::: {.exercise #units}
Explore the LiPD data to determine the units on the ages, and update your histogram appropriately. One way to start is sketched after this exercise.
:::
::::
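One way to start is to inspect the metadata in `indTib`; here's a minimal sketch, assuming the age units were extracted into an `ageUnits` column of the tibble:
```{r, eval = FALSE}
#tabulate the reported age units across records (the ageUnits column is an assumption)
table(indTib$ageUnits)
```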
### Plot the data
Let's start by making a stack plot of the surface temperature data in this dataset, to see what we're working with.
:::: {.blackbox data-latex=""}
::: {.exercise #timeseriesPlot}
Create a "stack plot" of the Surface Temperature data, ignoring age uncertainty for now.
<details>
<summary>Hint #1</summary>
First you'll need to create a lipd-ts-tibble-long object
</details>
<details>
<summary>Hint #2</summary>
Then filter it to just the variable of interest
</details>
<details>
<summary>Hint #3</summary>
Then use plotTimeseriesStack() to plot it up
</details>
:::
::::
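If you get stuck, here is one possible solution, sketched under the assumption that the variable is named "surface.temp" (as it is later in this chapter):
```{r, eval = FALSE}
#one possible solution, following the hints above
FD2 %>%
  extractTs() %>%                                       #lipd-ts
  ts2tibble() %>%                                       #lipd-ts-tibble
  tidyTs(age.var = "age") %>%                           #lipd-ts-tibble-long
  filter(paleoData_variableName == "surface.temp") %>%  #just the variable of interest
  plotTimeseriesStack(time.var = "age")
```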
## Filter the data
OK, there's a lot of data spanning a lot of time here. Since we're interested in the deglaciation, we need to find which datasets cover the range from at least 10,000 to 30,000 years ago, and have enough data in that range to be useful.
There are a few ways to go about this. Here's one that uses `purrr`. Feel free to try a different solution here too!
```{r, warning=FALSE}
indTs <- extractTs(FD2) #create a lipd-timeseries

# create some variables for screening (ages are in ka)
startYear <- 10
endYear <- 30

#write a function to count how many values the dataset has within the target time frame
nGoodVals <- function(x, startYear, endYear){
  agesWithValues <- x$age[is.finite(x$paleoData_values)]
  gv <- which(agesWithValues >= startYear & agesWithValues <= endYear)
  return(length(gv))
}

#write a function to determine how much time coverage the data have within that target frame
span <- function(x, startYear, endYear){
  agesWithValues <- x$age[is.finite(x$paleoData_values)]
  agesWithValues <- agesWithValues[agesWithValues >= startYear & agesWithValues <= endYear]
  sout <- abs(diff(range(agesWithValues, na.rm = TRUE)))
  return(sout)
}

#use purrr to run those functions over each item in the list
nValsInRange <- map_dbl(indTs, nGoodVals, startYear, endYear)
span <- map_dbl(indTs, span, startYear, endYear) #note: this overwrites the span() helper with its numeric result
```
Now we can see, for each of these timeseries, how many values fall between 10 and 30 ka, and how much of that window they span. We can now filter our TS object down to something closer to what we're looking for, and then select the variable of interest, in this case "surface.temp".
```{r}
#use our indices from above to select only the good timeseries
TS.filtered <- indTs[nValsInRange > 20 & span > 15] %>%
filterTs("paleoData_variableName == surface.temp") #and then select our variable of interest
```
This is the first time we've seen the `filterTs()` function. It's part of the `lipdR` library, and enables *simple* filtering on lipd-ts objects. It's a weaker version of `dplyr::filter()`, and for complex queries you're much better off converting to a tibble and using dplyr. But for simple operations on lipd-ts objects, it works well enough.
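For comparison, here's a sketch of the same selection via the tibble route, using `as.lipdTs()` (which appears later in this chapter) to convert back to a lipd-ts object:
```{r, eval = FALSE}
#equivalent filtering with dplyr instead of filterTs()
TS.filtered.dplyr <- indTs[nValsInRange > 20 & span > 15] %>%
  ts2tibble() %>%
  filter(paleoData_variableName == "surface.temp") %>%
  as.lipdTs()
```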
### Plot the data again {#plotAgain}
At this point, we should get our eyes on the data again. Let's make a timeseries plot of the existing data.
```{r}
#convert the lipd-ts object to a lipd-ts-tibble
tsTib <- ts2tibble(TS.filtered)
#convert the lipd-ts-tibble object to a lipd-ts-tibble-long
tp <- tidyTs(tsTib,age.var = "age")
#filter it only to our time range
tp <- tp %>%
filter(between(age,10,30))
#and make a timeseries stack plot
plotTimeseriesStack(tp,time.var = "age")
```
OK, it looks like our filtering worked as expected. We now have a lipd-timeseries object that contains only the timeseries we want to include in our PCA, and the age ensembles are included. We're ready to move forward with ensemble PCA!
One thing you should notice is that several of these temperature reconstructions come from the same site. Having multiple, semi-independent records from a single site is relatively common, and something we should consider more down the road.
To conduct PCA, we need to put all of these data onto a common timescale. We will use binning to do this, although there are other approaches. We also want to repeat the binning across our different ensemble members, recognizing that age uncertainty affects the bins! `binTs()` will bin all the data in the TS from 10 to 30 ka into 500-year bins.
```{r binTS1,results="hide",cache=TRUE}
binned.TS <- binTs(TS.filtered,bin.vec = seq(10,30,by=.5),time.var = "ageEnsemble")
```
We're now ready to calculate the PCA!
## Calculate an ensemble PCA
Calculate PCA, using a covariance matrix, randomly selecting ensemble members in each iteration:
```{r pcaEns1,results="hide",warning=FALSE,cache = TRUE}
pcout <- pcaEns(binned.TS,pca.type = "cov")
```
That was easy (because of all the work we did beforehand). But before we look at the results, let's check a scree plot to get a sense of how many significant components we should expect.
```{r}
plotScreeEns(pcout)
```
It looks like the first two components, shown in black with gray uncertainty shading, stand out above the null model (in red), but the third and beyond look marginal to insignificant. Let's focus on the first two components.
### Plot the ensemble PCA results
Now let's visualize the results. The `plotPcaEns` function will create multiple diagnostic figures of the results, and stitch them together.
```{r}
plotPCA <- plotPcaEns(pcout,
TS = TS.filtered,
map.type = "line",
f=.1,
legend.position = c(0.5,.6),
which.pcs = 1:2,
which.leg = 2)
```
Nice! A summary plot that combines the major features is produced, and all of the components are included in the `plotPCA` list that was returned.
For comparison with other datasets, it can be useful to export the quantile timeseries shown in the figures. `plotTimeseriesEnsRibbons()` can optionally be used to export the data rather than plotting them. The following will export the PC1 timeseries:
```{r}
quantileData <- plotTimeseriesEnsRibbons(X = pcout$age,Y = pcout$PCs[,1,],export.quantiles = TRUE)
print(quantileData)
```
### Data weighting
As mentioned in \@ref(plotAgain), we might want to consider the fact that multiple records come from the same sites. This could make some sites more influential than others in the PCA. One way to account for this is by "weighting" the analysis. Here we'll rerun the analysis, weighting each record by $1/n$, where $n$ is the number of records at its site.
```{r}
#use pullTsVariable to pull a variable out of a lipd-ts object
sitenames <- pullTsVariable(TS.filtered,"geo_siteName")
#use purrr to weight by 1/n
weights <- purrr::map_dbl(sitenames,~ 1/sum(.x == sitenames))
names(weights) <- sitenames
weights
```
Now let's rerun the PCA, using the weights:
```{r,results="hide",warning=FALSE,cache = TRUE}
pcoutWeighted <- pcaEns(binned.TS,pca.type = "cov",weights = weights)
```
and check out the scree plot:
```{r}
plotScreeEns(pcoutWeighted)
```
and PCA results again:
```{r}
plotPCA <- plotPcaEns(pcoutWeighted,
TS = TS.filtered,
map.type = "line",
f=.1,
legend.position = c(0.5,.6),
which.pcs = 1:2,
which.leg = 2)
```
In this case, the results are pretty similar. But weighting your PCA input data is a good tool to have in your tool belt.
## Ensemble PCA using a correlation matrix
Let's repeat much of this analysis, but this time let's take a look at the data underlying the surface temperature reconstructions: the $\delta^{18}O$ and Mg/Ca data.
First, look at the variable names represented in our dataset. This is easiest with our tibble:
```{r}
tib.filtered <- indTs[nValsInRange > 20 & span > 15] %>% ts2tibble()
table(tib.filtered$paleoData_variableName)
```
OK. Let's filter the timeseries again, this time pulling all the $\delta^{18}O$ and Mg/Ca data.
:::: {.blackbox data-latex=""}
::: {.exercise #d18OMgCafilter}
Write code to filter tib.filtered to include only the planktonic $\delta^{18}O$ and Mg/Ca data.
:::
::::
```{r,results="hide",eval = TRUE,echo = FALSE}
newTS <- tib.filtered %>%
filter(paleoData_variableName == "planktonic.d18O" | paleoData_variableName == "planktonic.MgCa") %>%
as.lipdTs()
```
Great. It's pretty easy to filter the data if you're used to dplyr, and if you're not, it's not too hard to learn. Now let's take a look at the data.
:::: {.blackbox data-latex=""}
::: {.exercise #d18OMgCaPlotstack}
Create a stack plot visualizing these data, coloring the timeseries by variable name. One approach is sketched after this exercise.
:::
::::
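Here's a sketch of one approach. Note that `color.var` is an assumption about `plotTimeseriesStack()`'s interface, so check the function's documentation:
```{r, eval = FALSE}
#a possible sketch; color.var is assumed to map line colors to a column
newTS %>%
  ts2tibble() %>%
  tidyTs(age.var = "age") %>%
  filter(between(age, 10, 30)) %>%
  plotTimeseriesStack(time.var = "age",
                      color.var = "paleoData_variableName")
```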
Take a look at your plot, and note the units and the scale of the variability. Since we're now dealing with variables in different units, we cannot use a covariance matrix.
```{r binTs2,results="hide",eval = TRUE,echo = FALSE, cache = TRUE}
binned.TS2 <- binTs(newTS,bin.vec = seq(10,30,by=.5),time.var = "ageEnsemble")
```
Now calculate the ensemble PCA, this time using a correlation matrix. Because the correlation matrix standardizes each record, records with larger variability in $\delta^{18}O$ won't have an outsized influence on the PCA, and records with little variability will still contribute. Whether standardizing is a good idea depends on the question, but it's the right option when the variables are in different units.
```{r pcaEns2,results="hide",warning = FALSE}
pcout2 <- pcaEns(binned.TS2,pca.type = "corr")
```
Once again, let's take a look at the scree plot:
```{r,fig.width = 4,fig.height = 4}
plotScreeEns(pcout2)
```
Once again, the first PC dominates the variability. The subsequent PCs all look insignificant, but let's include the second and third PCs in our plot this time as well.
:::: {.blackbox data-latex=""}
::: {.exercise #d18OMgCaPlotPca}
Create a PCA plot showing the output of your analysis. This time, let's explore the options.
1. Show the first 3 PCs
2. Select a different map background
3. Change the shapes
4. Change the color scale.
:::
::::
```{r,fig.width = 8,fig.height = 6,warning = FALSE,echo=FALSE,eval=FALSE}
plotPCA2 <- plotPcaEns(pcout2,
TS = newTS,
which.pcs = 1:3,
map.type = "stamen",
dot.size = 7,
shape.by.archive = FALSE,
high.color = "purple",
low.color = "green",
f=.2)
```
Again, the scree plot tells us that only the first EOF pattern is worth interpreting, so let's interpret it!
:::: {.blackbox data-latex=""}
::: {.exercise #interpret}
1. Is the timeseries more or less what you expected? How does it compare with what we got from the surface temperature data?
2. The spatial pattern shows both high positive and high negative values. How does this compare with our previous analysis? What might be the origin of this pattern?
:::
::::
## Chapter project
:::: {.blackbox data-latex=""}
::: {.exercise #pcaProject}
For your chapter project for PCA, we're going to take a bigger-picture look at planktonic $\delta^{18}O$. Specifically, I'd like you to conduct time-uncertain PCA on global planktonic $\delta^{18}O$ data that span from 130 ka up to the present. I've prepared a subset of the compilation that has at least 100 observations between 130 ka and the present, and that spans at least 100 kyr during that interval, which is a great start. [That subset is here](https://lipdverse.org/geoChronR-examples/Palmod-Global-d18O/Palmod-Global-d18O.zip); a starter snippet for loading it follows this exercise. You'll have to make some decisions along the way. Go ahead and make your choices, just have a good reason for each choice you make!
:::
::::
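A minimal starting point, reusing the `readLipd()` pattern from the top of this chapter:
```{r, eval = FALSE}
#load the prepared global planktonic d18O subset
gFD <- lipdR::readLipd("https://lipdverse.org/geoChronR-examples/Palmod-Global-d18O/Palmod-Global-d18O.zip")
```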