# Principal Components Analysis {#pca}
As always, let's start by loading the packages we need.
```{r, results = 'hide', warning = FALSE, message= FALSE }
library(lipdR)
library(geoChronR)
library(magrittr)
library(dplyr)
library(purrr)
library(ggplot2)
```
This vignette showcases the ability to perform principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis. Data are from the PalMod compilation. We'll just load a subset of it here, but the rest is available on the LiPDverse.
```{r indianOceanData,cache=TRUE}
FD <- lipdR::readLipd("http://lipdverse.org/geoChronR-examples/PalMod-IndianOcean/PalMod-IndianOcean.zip")
```
## Let's explore these data!
### Make a map
First, let's take a quick look at where these records are located. geoChronR's `mapLipd` function can create quick maps:
```{r,results="hide",fig.keep="all"}
mapLipd(FD,map.type = "line",f = 0.1)
```
### Grab the age ensembles for each record.
Now we need to map the age ensembles to the paleoData for all of these datasets. We'll use `purrr::map()` for this, but you could also do it with `sapply()`. In this case we're going to specify that all of the age ensembles are named "ageEnsemble", along with the names of the chron and paleo depth variables. Fortunately, the naming is consistent across PalMod, making this straightforward.
```{r mapIOD,results="hide",message=FALSE}
FD2 <- purrr::map(FD,
                  mapAgeEnsembleToPaleoData,
                  strict.search = TRUE,
                  paleo.depth.var = "depth_merged",
                  chron.depth.var = "depth",
                  age.var = "ageEnsemble")
```
### Plot some summary statistics
In this analysis, we're interested in spatiotemporal patterns in the Indian Ocean during the Last Deglaciation. Not all of these records cover that time interval, so let's take a look at the temporal coverage.
```{r}
indTib <- FD2 %>% extractTs() %>% ts2tibble() #create a lipd-timeseries-tibble

#use purrr to extract the minimum and maximum ages for each record
minAge <- map_dbl(indTib$age, min, na.rm = TRUE)
maxAge <- map_dbl(indTib$age, max, na.rm = TRUE)

#plot the distributions of the ages
ggplot() +
  geom_histogram(aes(x = minAge, fill = "Min age")) +
  geom_histogram(aes(x = maxAge, fill = "Max age"), alpha = 0.5) +
  scale_fill_manual("", values = c("red", "black")) +
  xlab("Age")
```
OK, so it looks like the ages range from 0 to about 150. But what are the units on those?
:::: {.blackbox data-latex=""}
::: {.exercise #units}
Explore the LiPD data to determine the units on the ages, and update your histogram appropriately. One way to start is sketched after this exercise.
:::
::::
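One way to start is to inspect the metadata in `indTib`; here's a minimal sketch, assuming the age units were extracted into an `ageUnits` column of the tibble:
```{r, eval = FALSE}
#tabulate the reported age units across records (the ageUnits column is an assumption)
table(indTib$ageUnits)
```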
### Plot the data
Let's start by making a stack plot of the surface temperature data in this dataset, to see what we're working with.
:::: {.blackbox data-latex=""}
::: {.exercise #timeseriesPlot}
Create a "stack plot" of the Surface Temperature data, ignoring age uncertainty for now.
<details>
<summary>Hint #1</summary>
First you'll need to create a lipd-ts-tibble-long object
</details>
<details>
<summary>Hint #2</summary>
Then filter it to just the variable of interest
</details>
<details>
<summary>Hint #3</summary>
Then use plotTimeseriesStack() to plot it up
</details>
:::
::::
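If you get stuck, here is one possible solution, sketched under the assumption that the variable is named "surface.temp" (as it is later in this chapter):
```{r, eval = FALSE}
#one possible solution, following the hints above
FD2 %>%
  extractTs() %>%                                       #lipd-ts
  ts2tibble() %>%                                       #lipd-ts-tibble
  tidyTs(age.var = "age") %>%                           #lipd-ts-tibble-long
  filter(paleoData_variableName == "surface.temp") %>%  #just the variable of interest
  plotTimeseriesStack(time.var = "age")
```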
## Filter the data
OK, there's a lot of data spanning a lot of time here. Since we're interested in the deglaciation, we need to find which datasets cover the range from at least 10,000 to 30,000 years ago, and have enough data in that range to be useful.
There are a few ways to go about this. Here's one that uses `purrr`. Feel free to try a different solution here too!
```{r, warning=FALSE}
indTs <- extractTs(FD2) #create a lipd-timeseries

# create some variables for screening (ages are in ka)
startYear <- 10
endYear <- 30

#write a function to count how many values the dataset has within the target time frame
nGoodVals <- function(x, startYear, endYear){
  agesWithValues <- x$age[is.finite(x$paleoData_values)]
  gv <- which(agesWithValues >= startYear & agesWithValues <= endYear)
  return(length(gv))
}

#write a function to determine how much time coverage the data have within that target frame
span <- function(x, startYear, endYear){
  agesWithValues <- x$age[is.finite(x$paleoData_values)]
  agesWithValues <- agesWithValues[agesWithValues >= startYear & agesWithValues <= endYear]
  sout <- abs(diff(range(agesWithValues, na.rm = TRUE)))
  return(sout)
}

#use purrr to run those functions over each item in the list
nValsInRange <- map_dbl(indTs, nGoodVals, startYear, endYear)
span <- map_dbl(indTs, span, startYear, endYear) #note: this overwrites the span() helper with its numeric result
```
Now we can see, for each of these timeseries, how many values fall between 10 and 30 ka, and how much of that window they span. We can now filter our TS object down to something closer to what we're looking for, and then select the variable of interest, in this case "surface.temp".
```{r}
#use our indices from above to select only the good timeseries
TS.filtered <- indTs[nValsInRange > 20 & span > 15] %>%
filterTs("paleoData_variableName == surface.temp") #and then select our variable of interest
```
This is the first time we've seen the `filterTs()` function. It's part of the `lipdR` library, and enables *simple* filtering on lipd-ts objects. It's a weaker version of `dplyr::filter()`, and for complex queries you're much better off converting to a tibble and using dplyr. But for simple operations on lipd-ts objects, it works well enough.
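For comparison, here's a sketch of the same selection via the tibble route, using `as.lipdTs()` (which appears later in this chapter) to convert back to a lipd-ts object:
```{r, eval = FALSE}
#equivalent filtering with dplyr instead of filterTs()
TS.filtered.dplyr <- indTs[nValsInRange > 20 & span > 15] %>%
  ts2tibble() %>%
  filter(paleoData_variableName == "surface.temp") %>%
  as.lipdTs()
```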
### Plot the data again {#plotAgain}
At this point, we should get our eyes on the data again. Let's make a timeseries plot of the existing data.
```{r}
#convert the lipd-ts object to a lipd-ts-tibble
tsTib <- ts2tibble(TS.filtered)
#convert the lipd-ts-tibble object to a lipd-ts-tibble-long
tp <- tidyTs(tsTib,age.var = "age")
#filter it only to our time range
tp <- tp %>%
filter(between(age,10,30))
#and make a timeseries stack plot
plotTimeseriesStack(tp,time.var = "age")
```
OK, it looks like our filtering worked as expected. We now have a lipd-timeseries object that contains only the timeseries we want to include in our PCA, and the age ensembles are included. We're ready to move forward with ensemble PCA!
One thing you should notice is that several of these temperature reconstructions come from the same site. Having multiple, semi-independent records from a single site is relatively common, and something we should consider more down the road.
To conduct PCA, we need to put all of these data onto a common timescale. We will use binning to do this, although there are other approaches. We also want to repeat the binning across our different ensemble members, recognizing that age uncertainty affects the bins! `binTs()` will bin all the data in the TS from 10 to 30 ka into 500-year bins.
```{r binTS1,results="hide",cache=TRUE}
binned.TS <- binTs(TS.filtered,bin.vec = seq(10,30,by=.5),time.var = "ageEnsemble")
```
We're now ready to calculate the PCA!
## Calculate an ensemble PCA
Calculate PCA, using a covariance matrix, randomly selecting ensemble members in each iteration:
```{r pcaEns1,results="hide",warning=FALSE,cache = TRUE}
pcout <- pcaEns(binned.TS,pca.type = "cov")
```
That was easy (because of all the work we did beforehand). But before we look at the results, let's check a scree plot to get a sense of how many significant components we should expect.
```{r}
plotScreeEns(pcout)
```
It looks like the first two components, shown in black with gray uncertainty shading, stand out above the null model (in red), but the third and beyond look marginal to insignificant. Let's focus on the first two components.
### Plot the ensemble PCA results
Now let's visualize the results. The `plotPcaEns` function will create multiple diagnostic figures of the results, and stitch them together.
```{r}
plotPCA <- plotPcaEns(pcout,
TS = TS.filtered,
map.type = "line",
f=.1,
legend.position = c(0.5,.6),
which.pcs = 1:2,
which.leg = 2)
```
Nice! A summary plot that combines the major features is produced, and all of the components are included in the `plotPCA` list that was returned.
For comparison with other datasets, it can be useful to export the quantile timeseries shown in the figures. `plotTimeseriesEnsRibbons()` can optionally be used to export the data rather than plotting them. The following will export the PC1 timeseries:
```{r}
quantileData <- plotTimeseriesEnsRibbons(X = pcout$age,Y = pcout$PCs[,1,],export.quantiles = TRUE)
print(quantileData)
```
### Data weighting
As mentioned in \@ref(plotAgain), we might want to consider the fact that multiple records come from the same sites. This could make some sites more influential than others in the PCA. One way to account for this is by "weighting" the analysis. Here we'll rerun the analysis, weighting each record by $1/n$, where $n$ is the number of records at its site.
```{r}
#use pullTsVariable to pull a variable out of a lipd-ts object
sitenames <- pullTsVariable(TS.filtered,"geo_siteName")
#use purrr to weight by 1/n
weights <- purrr::map_dbl(sitenames,~ 1/sum(.x == sitenames))
names(weights) <- sitenames
weights
```
Now let's rerun the PCA, using the weights:
```{r,results="hide",warning=FALSE,cache = TRUE}
pcoutWeighted <- pcaEns(binned.TS,pca.type = "cov",weights = weights)
```
and check out the scree plot:
```{r}
plotScreeEns(pcoutWeighted)
```
and PCA results again:
```{r}
plotPCA <- plotPcaEns(pcoutWeighted,
TS = TS.filtered,
map.type = "line",
f=.1,
legend.position = c(0.5,.6),
which.pcs = 1:2,
which.leg = 2)
```
In this case, the results are pretty similar. But weighting your PCA input data is a good tool to have in your tool belt.
## Ensemble PCA using a correlation matrix
Let's repeat much of this analysis, but this time let's take a look at the data underlying the surface temperature reconstructions: the $\delta^{18}O$ and Mg/Ca data.
First, look at the variable names represented in our dataset. This is easiest with our tibble:
```{r}
tib.filtered <- indTs[nValsInRange > 20 & span > 15] %>% ts2tibble()
table(tib.filtered$paleoData_variableName)
```
OK. Let's filter the timeseries again, this time pulling all the $\delta^{18}O$ and Mg/Ca data.
:::: {.blackbox data-latex=""}
::: {.exercise #d18OMgCafilter}
Write code to filter tib.filtered to include only the planktonic $\delta^{18}O$ and Mg/Ca data.
:::
::::
```{r,results="hide",eval = TRUE,echo = FALSE}
newTS <- tib.filtered %>%
filter(paleoData_variableName == "planktonic.d18O" | paleoData_variableName == "planktonic.MgCa") %>%
as.lipdTs()
```
Great. It's pretty easy to filter the data if you're used to dplyr, and if you're not, it's not too hard to learn. Now let's take a look at the data.
:::: {.blackbox data-latex=""}
::: {.exercise #d18OMgCaPlotstack}
Create a stack plot visualizing these data, coloring the timeseries by variable name. One approach is sketched after this exercise.
:::
::::
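Here's a sketch of one approach. Note that `color.var` is an assumption about `plotTimeseriesStack()`'s interface, so check the function's documentation:
```{r, eval = FALSE}
#a possible sketch; color.var is assumed to map line colors to a column
newTS %>%
  ts2tibble() %>%
  tidyTs(age.var = "age") %>%
  filter(between(age, 10, 30)) %>%
  plotTimeseriesStack(time.var = "age",
                      color.var = "paleoData_variableName")
```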
Take a look at your plot, and note the units and the scale of the variability. Since we're now dealing with variables in different units, we cannot use a covariance matrix.
```{r binTs2,results="hide",eval = TRUE,echo = FALSE, cache = TRUE}
binned.TS2 <- binTs(newTS,bin.vec = seq(10,30,by=.5),time.var = "ageEnsemble")
```
Now calculate the ensemble PCA, this time using a correlation matrix. Because the correlation matrix standardizes each record, records with larger variability in $\delta^{18}O$ won't have an outsized influence on the PCA, and records with little variability will still contribute. Whether standardizing is a good idea depends on the question, but it's the right option when the variables are in different units.
```{r pcaEns2,results="hide",warning = FALSE}
pcout2 <- pcaEns(binned.TS2,pca.type = "corr")
```
Once again, let's take a look at the scree plot:
```{r,fig.width = 4,fig.height = 4}
plotScreeEns(pcout2)
```
Once again, the first PC dominates the variability. The subsequent PCs all look insignificant, but let's include the second and third PCs in our plot this time as well.
:::: {.blackbox data-latex=""}
::: {.exercise #d18OMgCaPlotPca}
Create a PCA plot showing the output of your analysis. This time, let's explore the options.
1. Show the first 3 PCs
2. Select a different map background
3. Change the shapes
4. Change the color scale.
:::
::::
```{r,fig.width = 8,fig.height = 6,warning = FALSE,echo=FALSE,eval=FALSE}
plotPCA2 <- plotPcaEns(pcout2,
TS = newTS,
which.pcs = 1:3,
map.type = "stamen",
dot.size = 7,
shape.by.archive = FALSE,
high.color = "purple",
low.color = "green",
f=.2)
```
Again, the scree plot tells us that only the first EOF pattern is worth interpreting, so let's interpret it!
:::: {.blackbox data-latex=""}
::: {.exercise #interpret}
1. Is the timeseries more or less what you expected? How does it compare with what we got from the surface temperature data?
2. The spatial pattern shows both high positive and high negative values. How does this compare with our previous analysis? What might be the origin of this pattern?
:::
::::
## Chapter project
:::: {.blackbox data-latex=""}
::: {.exercise #pcaProject}
For your chapter project for PCA, we're going to take a bigger-picture look at planktonic $\delta^{18}O$. Specifically, I'd like you to conduct time-uncertain PCA on global planktonic $\delta^{18}O$ data that span from 130 ka up to the present. I've prepared a subset of the compilation that has at least 100 observations between 130 ka and the present, and that spans at least 100 kyr during that interval, which is a great start. [That subset is here](https://lipdverse.org/geoChronR-examples/Palmod-Global-d18O/Palmod-Global-d18O.zip); a starter snippet for loading it follows this exercise. You'll have to make some decisions along the way. Go ahead and make your choices, just have a good reason for each choice you make!
:::
::::
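A minimal starting point, reusing the `readLipd()` pattern from the top of this chapter:
```{r, eval = FALSE}
#load the prepared global planktonic d18O subset
gFD <- lipdR::readLipd("https://lipdverse.org/geoChronR-examples/Palmod-Global-d18O/Palmod-Global-d18O.zip")
```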