# Data pre-processing {#sec-preprocessing}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(MultiDataSet)
library(pcaMethods)
library(purrr)
library(rlang)
library(ggplot2)
library(RColorBrewer)
## For loading the transformations info
library(DESeq2)
library(S4Vectors)
library(stats4)
## to avoid the S3 overwrite message
library(GGally)
```
```{r setup-visible, eval = FALSE}
library(targets)
library(moiraine)
library(MultiDataSet)
## For PCA results
library(pcaMethods)
## For working with lists
library(purrr)
library(rlang)
## For custom colour palettes
library(ggplot2)
library(RColorBrewer)
## For displaying documentation
library(DESeq2)
library(S4Vectors)
library(stats4)
```
Once each omics dataset has been imported into R with its associated metadata and combined into a `MultiDataSet` object, there are a number of pre-processing steps that should be considered. In this chapter, we will show how to apply different transformations to the datasets, as well as how to run a PCA on each dataset and impute missing values if needed.
As a reminder, here is what the `_targets.R` script should look like so far:
<details>
<summary>`_targets.R` script</summary>
```{r previous-vignettes}
#| echo: false
#| results: asis
knitr::current_input() |>
get_previous_content() |>
cat()
```
</details>
## Datasets transformations
After inspecting the density plots for the different datasets (see @sec-inspecting-multidataset-summary-plots), it might be necessary to normalise or transform some or all datasets. This can mitigate the mean-variance trend that occurs in RNAseq data, for example, or simply bring the different features to a comparable scale. Transformation here refers to applying, to each feature (i.e. each row) of a dataset, a function that transforms the measurement values for that feature.
`moiraine` implements several options to transform an omics dataset:
- Variance Stabilising Normalisation (VSN) through the `vsn` package -- recommended for metabolomics datasets or other continuous datasets with a strong mean-variance trend;
- Variance Stabilising Transformation (VST) through the `DESeq2` package -- recommended for RNAseq data or any raw read count-type data;
- Automatic selection of the best normalisation method for each feature through the `bestNormalize` package -- recommended for phenotype data, and when the number of features is small (note that the selection of the normalisation method is done independently for each feature, so the same transformation might not be applied to all features);
- A selection of common normalisation methods through the `bestNormalize` package, including center/scale, log, exponential, square-root, arcsinh, Box Cox, Yeo-Johnson and ordered quantile transformations (see details in the [bestNormalize vignette](https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html)) -- recommended when applying the same transformation to all features, e.g. log2 transformation or centering.
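Before committing to a transformation, it can be worth re-checking the density plots of the raw datasets mentioned above. Here is a minimal sketch (not evaluated here), using the `plot_density_data()` function that we will also use later in this chapter on the transformed data:

```{r}
#| eval: false
## Density plot of each raw dataset, to help decide which
## transformation (if any) is needed
plot_density_data(
  mo_set_de,
  combined = FALSE,
  scales = "free"
)
```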
### Transforming a single dataset
The transformation of one dataset is done through the `transform_dataset()` function, which takes as input a `MultiDataSet` object, the name of the dataset to transform, and the name of the transformation to be applied, which should be one of `vsn`, `vst-deseq2`, `logx`, `best-normalize-auto` or `best-normalize-manual`. For the latter, the name of the normalisation method from the `bestNormalize` package to use must also be supplied through the `method` argument.
The `return_multidataset` argument determines whether a `MultiDataSet` object with the corresponding dataset transformed should be returned. If it is set to `FALSE`, the function will instead return a list containing the transformed dataset as a matrix, as well as other useful information returned by the transformation function applied. It is possible to return only the transformed matrix, by setting `return_matrix_only` to `TRUE`. This can be useful to quickly assess the effects of a transformation outside of the analysis pipeline.
For example, we can apply the Variance Stabilising Transformation to the transcriptomics dataset:
```{r apply-vst-example}
tar_load(mo_set_de)
rnaseq_vst <- transform_dataset(
mo_set_de,
"rnaseq",
"vst-deseq2",
return_multidataset = FALSE
)
```
The function returns a list, with the transformed dataset as a matrix in the `transformed_data` element. Information generated during the transformation by the `DESeq2` package is stored in the `info_transformation` element, and the name of the transformation applied is stored in the `transformation` element:
```{r print-vst-transformation}
names(rnaseq_vst)
rnaseq_vst$transformed_data[1:5, 1:5]
rnaseq_vst$info_transformation
rnaseq_vst$transformation
```
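If we only need the transformed matrix, for example for a quick check outside of the pipeline, we can set `return_matrix_only` to `TRUE`. A minimal sketch (not evaluated here):

```{r}
#| eval: false
## Return only the transformed matrix, without the transformation details
rnaseq_vst_mat <- transform_dataset(
  mo_set_de,
  "rnaseq",
  "vst-deseq2",
  return_multidataset = FALSE,
  return_matrix_only = TRUE
)
rnaseq_vst_mat[1:5, 1:5]
```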
If we instead want to apply a log2 transformation to the dataset, we will use the `logx` transformation option. In that case, we have to specify the log base to use (here 2) through the `log_base` argument, as well as the function that should be applied to the matrix prior to the log-transformation, through the `pre_log_function` argument. This function is mostly here to take care of zero values, which would cause an issue during the log-transformation. By default, the `zero_to_half_min()` function is used, which replaces zero values in the dataset with half of the minimum non-zero value found in the dataset. However, it is possible to pass any custom function instead; for example, `function(x) x` would leave the dataset unchanged before the log-transformation:
```{r apply-log2-example}
rnaseq_log2 <- transform_dataset(
mo_set_de,
"rnaseq",
"logx",
log_base = 2,
pre_log_function = function(x) x,
return_multidataset = TRUE
)
```
In that case, we asked the function to return a `MultiDataSet` object, in which the `rnaseq` dataset has been transformed:
```{r print-log2-transformation}
rnaseq_log2
get_dataset_matrix(rnaseq_log2, "rnaseq")[1:5, 1:5]
```
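To see what the default `zero_to_half_min()` pre-log function would do, we can apply it to a small toy matrix (made up here purely for illustration); a quick sketch, not evaluated here:

```{r}
#| eval: false
## Toy matrix with a zero value; based on its documented behaviour,
## zero_to_half_min() should replace the 0 with half of the smallest
## non-zero value in the matrix (here 2 / 2 = 1)
toy_mat <- matrix(c(0, 2, 4, 8), nrow = 2)
zero_to_half_min(toy_mat)
```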
The two other transformation options are `best-normalize-auto` and `best-normalize-manual`. Both rely on the `bestNormalize` package to apply a transformation **independently** to each feature in the dataset. With `best-normalize-auto`, the `bestNormalize::bestNormalize()` function automatically selects, for each feature, the transformation that results in a distribution of values as close to Gaussian as possible. Note that as the function is applied independently to each feature, a different transformation may be chosen for different features. With the `best-normalize-manual` option, the same transformation is applied to each feature. The transformation to use must be specified through the `method` argument, and should be the name of one of the transformations implemented in `bestNormalize`: `"arcsinh_x"`, `"boxcox"`, `"center_scale"`, `"exp_x"`, `"log_x"`, `"orderNorm"`, `"sqrt_x"` or `"yeojohnson"`. Note that with this option, even though the same transformation is applied to each feature, the parameters of the transformation, if not specified, will be estimated independently for each feature. For example, with the `log_x` method, the function automatically adds a small offset (`a`) to the feature's measurements if there are any zero values, and the value used for this offset might differ between features. We can instead set a value for `a` to be used for all features, as follows:
```{r bestnormalize-manual-example}
#| eval: false
transform_dataset(
mo_set_de,
"rnaseq",
"best-normalize-manual",
method = "log_x",
a = 0.1,
return_multidataset = TRUE
)
```
::: {.callout-note}
Before using the `best-normalize-manual` option, it is strongly recommended to have a look at the documentation of the corresponding function from the `bestNormalize` package, as these functions can have unexpected default behaviours. For example, by default the `bestNormalize::log_x()` function uses log base 10, and centres and scales the transformed values, which may not be what we want.
:::
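For instance, if we wanted a plain log (base 10) transformation of each feature without the centring and scaling applied by default, we could presumably pass `standardize = FALSE` through to `bestNormalize::log_x()`, in the same way that the `a` offset was passed above. This is only a sketch, not evaluated here:

```{r}
#| eval: false
## Assumes that extra arguments are forwarded to bestNormalize::log_x(),
## as done for `a` in the example above
transform_dataset(
  mo_set_de,
  "rnaseq",
  "best-normalize-manual",
  method = "log_x",
  standardize = FALSE,
  return_multidataset = TRUE
)
```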
### Transformation target factory
The target factory function `transformation_datasets_factory()` provides a wrapper to apply (potentially different) transformations to several datasets at once. The function takes as input the `MultiDataSet` object as well as a named character vector, in which each name is a dataset name and the corresponding value is the transformation to apply to that dataset. If a dataset is not present in the transformation vector, it will not be transformed (but it will still be present in the resulting `MultiDataSet` object).
Here, we would like to apply Variance Stabilising Transformation to the transcriptomics dataset, and a log2 transformation to the metabolomics dataset. Note that the VST and VSN transformations are very close to the log2 transformation, especially for features with high means.
::: {.targets-chunk}
```{targets transformation-datasets-factory}
#| message: true
transformation_datasets_factory(
mo_set_de,
c("rnaseq" = "vst-deseq2",
"metabolome" = "logx"),
log_bases = 2,
pre_log_functions = zero_to_half_min,
transformed_data_name = "mo_set_transformed"
)
```
:::
The `transformation_datasets_factory()` function works as follows:
- It creates a grouped tibble listing the transformation to apply to each dataset, stored in the `transformations_spec` target;
```{r transformations-spec}
tar_read(transformations_spec)
```
- It performs the required transformation on each dataset via [dynamic branching](https://books.ropensci.org/targets/dynamic.html). This is done through a call to the `transform_dataset()` function. The transformed datasets are stored in a list, in the `transformations_runs_list` target. Note that by default the function will store all details of the transformations, which can be useful for later inspection, but can be memory-intensive. It is possible to only store the transformed datasets instead, by setting the `return_matrix_only` argument to `TRUE` in the `transformation_datasets_factory()` call.
```{r show-transformations-run-list}
tar_load(transformations_runs_list)
names(transformations_runs_list)
map_chr(transformations_runs_list, attr, "dataset_name")
transformations_runs_list[["transformations_runs_list_a1c8db41"]] |> names()
```
- It creates a new `MultiDataSet` object, with the transformed version of the datasets. By default, this new `MultiDataSet` object is stored in a target called `transformed_set`, but a different name can be specified via the `transformed_data_name` argument (here we called it `mo_set_transformed`).
```{r print-mo-set-transformed}
tar_load(mo_set_transformed)
mo_set_transformed
get_dataset_matrix(mo_set_de, "metabolome")[1:5, 1:3]
get_dataset_matrix(mo_set_transformed, "metabolome")[1:5, 1:3]
```
<details>
<summary>Converting targets factory to R script</summary>
```{r transformation-datasets-factory-to-script}
#| eval: false
transformations_runs_list <- c(
  "rnaseq" = "vst-deseq2",
  "metabolome" = "logx"
) |>
  imap(\(x, y) {
    transform_dataset(
      mo_set_de,
      dataset = y,
      transformation = x,
      log_base = 2,
      pre_log_function = zero_to_half_min
    )
  })

mo_set_transformed <- get_transformed_data(mo_set_de, transformations_runs_list)
```
</details>
We can assess the effect of the transformations by generating density and mean-sd plots for the transformed datasets:
```{r plot-density-transformed-data}
#| fig.width: 9
#| fig.height: 4
plot_density_data(
mo_set_transformed,
combined = FALSE,
scales = "free"
)
```
Note how the relationship between the features' means and standard deviations has been reduced in both transformed datasets:
```{r plot-meansd-transformed-data, fig.width = 9, fig.height = 4}
plot_meansd_data(mo_set_transformed)
```
Finally, it can be useful to summarise which transformations have been applied to the datasets, for example when creating a report. The `get_table_transformations()` function is here for that. It takes as input the `transformations_runs_list` target generated by `transformation_datasets_factory()`, and returns a tibble indicating the transformation applied to each dataset:
```{r get-table-transformation}
get_table_transformations(transformations_runs_list)
```
## Running a PCA on each dataset
It is always best practice to run some exploratory analysis on a dataset before any formal analysis. This is largely outside the scope of this package, and we assume that any input dataset has been properly assessed before turning to the integration pipeline. However, running a Principal Component Analysis (PCA) on each of the omics datasets within the integration pipeline serves two purposes:
- as a last check to ensure that there are no obvious batch effects or problematic samples that should be addressed,
- as a missing data imputation method.
The `moiraine` package relies on the Bioconductor [`pcaMethods` package](https://bioconductor.org/packages/release/bioc/html/pcaMethods.html) to perform the PCA. In particular, `pcaMethods` implements the NIPALS (non-linear iterative partial least squares) algorithm for PCA, which tolerates missing values in the input dataset and imputes them based on the results of the PCA.
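For reference, this is roughly what happens under the hood with `pcaMethods`. The sketch below is not evaluated and uses a hypothetical `mat` matrix of features x samples; the exact arguments used by `moiraine` may differ:

```{r}
#| eval: false
library(pcaMethods)

## NIPALS PCA tolerates missing values; pcaMethods expects samples in rows,
## so the features x samples matrix is transposed first
pca_res <- pca(t(mat), method = "nipals", nPcs = 10, center = TRUE)

## The completed (imputed) data can be recovered from the PCA result
imputed_mat <- completeObs(pca_res)
```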
### Running the PCAs
The `pca_complete_data_factory()` function uses dynamic branching to perform a PCA on each omics dataset within a `MultiDataSet` object. It takes as input the `MultiDataSet` object (in our case, `mo_set_transformed`), and, optionally, the names of the datasets on which a PCA should be run. This is useful if one dataset is very large and has no missing values, and we want to avoid running a PCA on it. It then proceeds as follows:
- It creates a target called `dataset_names_pca`, which stores a vector of dataset names on which a PCA should be applied;
- For each value in `dataset_names_pca`, it extracts the omics dataset as a matrix with features as rows and samples as columns, using the `get_dataset_matrix()` function. This is done via dynamic branching, and the results are stored as a list in the `pca_mats_list` target. Note that the names of this list are not meaningful; to check which element of the list corresponds to which dataset, you can run `map_chr(pca_mats_list, attr, "dataset_name")`;
- For each matrix in `pca_mats_list`, it applies the `run_pca_matrix()` function to the corresponding dataset. This is done via dynamic branching; it results in a list where each element is the PCA result (i.e. a `pcaMethods::pcaRes` object) for a given dataset. This list is stored in the `pca_runs_list` target. Note that the names of this list are not meaningful; to check which element of the list corresponds to which dataset, you can run `map_chr(pca_runs_list, attr, "dataset_name")`;
- It extracts from the result of each PCA the complete dataset, i.e. with missing values imputed, and uses this information to construct a new `MultiDataSet` object, in which the datasets are complete (i.e. contain no missing values). This is done by calling the `get_complete_data()` function. If no PCA was run on a dataset, the dataset will still be present in the new `MultiDataSet` object, but its missing values will not be imputed. The resulting complete `MultiDataSet` object is stored by default in a target called `complete_set`; this name can be changed via the `complete_data_name` argument.
Let's apply this to our multi-omics dataset:
::: {.targets-chunk}
```{targets pca-complete-data-factory}
pca_complete_data_factory(
mo_set_transformed,
complete_data_name = "mo_set_complete"
)
```
:::
We can have a look at the different targets constructed. By default, a PCA was run on all datasets:
```{r print-dataset-names-pca}
tar_read(dataset_names_pca)
```
```{r show-pcapmats-list-names}
tar_load(pca_mats_list)
map_chr(pca_mats_list, attr, "dataset_name")
map(pca_mats_list, ~.x[1:5, 1:5])
```
```{r show-pca-runs-list-names}
tar_load(pca_runs_list)
names(pca_runs_list)
map_chr(pca_runs_list, attr, "dataset_name")
```
The result of the PCA run on the genomics dataset looks like this:
```{r show-pca-result}
tar_read(pca_runs_list_74d71ae8)
```
Notice that the output includes information about the number of principal components computed, and whether the dataset was centred and scaled before applying the PCA. This is handled by the default arguments of `run_pca_matrix()`, but can be customised by passing the corresponding arguments to `pca_complete_data_factory()`. For example, to scale the datasets before performing a PCA, we could use:
::: {.targets-chunk}
```{targets pca-complete-data-factory-with-scaling}
pca_complete_data_factory(
mo_set_transformed,
complete_data_name = "mo_set_complete",
scale = TRUE
)
```
:::
For convenience, the `run_pca()` function can be used to run a PCA on one of the omics datasets directly from a `MultiDataSet` object. It is a wrapper around the `run_pca_matrix()` function, and takes as input a `MultiDataSet` object as well as the name of the omics dataset on which a PCA should be run, e.g.:
```{r example-run-pca}
#| eval: false
run_pca(mo_set_de, "rnaseq")
```
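The returned object is a `pcaMethods` `pcaRes` object, so the usual `pcaMethods` accessors should work on it. A sketch, not evaluated here:

```{r}
#| eval: false
rnaseq_pca <- run_pca(mo_set_de, "rnaseq")

## Samples coordinates and features loadings via the pcaMethods accessors
scores(rnaseq_pca)[1:5, 1:3]
loadings(rnaseq_pca)[1:5, 1:3]
```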
<details>
<summary>Converting targets factory to R script</summary>
```{r pca-complete-data-factory-to-script}
#| eval: false
pca_mats_list <- names(mo_set_transformed) |>
set_names() |>
map(\(x) get_dataset_matrix(mo_set_transformed, x, keep_dataset_name = TRUE))
pca_runs_list <- pca_mats_list |>
map(run_pca_matrix)
mo_set_complete <- get_complete_data(mo_set_transformed, pca_runs_list)
```
</details>
### Visualising the PCA results
It is possible to get an overview of the results of each PCA. First, the function `plot_screeplot_pca()` displays the percentage of variance explained by the principal components computed for each dataset. It takes as input the `pca_runs_list` target constructed in the previous step. Note that by default, 10 components are computed for each dataset.
```{r plot-screeplot-pca}
#| fig.width: 8
#| fig.height: 8
plot_screeplot_pca(pca_runs_list)
```
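The variance explained can also be read directly from each `pcaRes` object using the `pcaMethods` accessors; for example, as a sketch (not evaluated here), `R2cum()` gives the cumulative proportion of variance explained:

```{r}
#| eval: false
## Cumulative variance explained for the first dataset's PCA
R2cum(pca_runs_list[[1]])
```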
In addition, the `plot_samples_coordinates_pca()` function allows us to display the samples in the reduced principal components space (the typical PCA sample plot). The function returns a list of plots (one plot per dataset). By default, it shows all principal components computed for each dataset, but for clarity we will only look at the first three:
```{r plot-samples-coordinates-pca}
#| fig.width: 8
#| fig.height: 7
#| warning: false
plot_samples_coordinates_pca(
pca_runs_list,
pcs = 1:3
)
```
Note that it is possible to look at a different set of principal components for each dataset. For that, the index of the principal components should be passed to the `pcs` argument as a named list (where the name of each element corresponds to a dataset name), e.g.:
```{r plot-samples-coordinates-pca-diff-pcs}
#| eval: false
plot_samples_coordinates_pca(
pca_runs_list,
pcs = list(
"snps" = 1:4,
"rnaseq" = 1:2,
"metabolome" = 1:3
)
)
```
By default, the points in the sample plots are not coloured. It is however possible to colour the samples according to the information contained in the sample metadata tables available through the `MultiDataSet` object. We can set different colours and shapes for the upper and lower plots of the scatterplot matrix; see the `plot_samples_score()` function for more information. For example, we can assess whether the first three principal components show any clustering of the samples according to the cluster computed from genomic similarity, disease status or feedlot (we'll only show the results for the SNPs dataset here):
```{r plot-samples-coordinates-pca-colours}
#| fig.width: 8
#| fig.height: 7
#| warning: false
plot_samples_coordinates_pca(
pca_runs_list,
datasets = "snps",
pcs = 1:3,
mo_data = mo_set_de,
colour_upper = "geno_comp_cluster",
shape_upper = "status",
colour_lower = "feedlot"
) +
theme(legend.box = "vertical")
```
### Missing values imputation {#sec-preprocessing-missing-values}
We can check that the complete multi-omics object we constructed contains no more missing values:
```{r print-mo-set-complete}
tar_load(mo_set_complete)
mo_set_complete
```
```{r check-missing-values-mo-set-complete}
check_missing_values(mo_set_complete)
```
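If we want to see which values were imputed, we can compare the transformed and complete versions of a dataset side by side. A quick sketch (not evaluated here, and assuming the metabolomics dataset contained missing values):

```{r}
#| eval: false
## Same corner of the metabolomics matrix before and after imputation
get_dataset_matrix(mo_set_transformed, "metabolome")[1:5, 1:3]
get_dataset_matrix(mo_set_complete, "metabolome")[1:5, 1:3]
```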
## Recap -- targets list
For convenience, here is the list of targets that we created in this section:
<details>
<summary>Targets list for datasets preprocessing</summary>
::: {.targets-chunk}
```{targets recap-targets-list}
list(
## Applying transformations to the datasets
transformation_datasets_factory(
mo_set_de,
c("rnaseq" = "vst-deseq2",
"metabolome" = "logx"),
log_bases = 2,
pre_log_functions = zero_to_half_min,
transformed_data_name = "mo_set_transformed"
),
## Density plot for each transformed dataset
tar_target(
density_plots_transformed,
plot_density_data(
mo_set_transformed,
combined = FALSE,
scales = "free"
)
),
## Plotting the mean-SD trend for each transformed dataset
tar_target(
mean_sd_plots_transformed,
plot_meansd_data(mo_set_transformed)
),
## Summary table of the transformations applied
tar_target(
transformation_summary,
get_table_transformations(transformations_runs_list)
),
## Running a PCA on each dataset
pca_complete_data_factory(
mo_set_transformed,
complete_data_name = "mo_set_complete"
),
## PCA screeplots
tar_target(
pca_screeplots,
plot_screeplot_pca(pca_runs_list)
),
## PCA sample plots
tar_target(
pca_sample_plots,
plot_samples_coordinates_pca(
pca_runs_list,
datasets = "snps",
pcs = 1:3,
mo_data = mo_set_de,
colour_upper = "geno_comp_cluster",
shape_upper = "status",
colour_lower = "feedlot"
)
)
)
```
:::
</details>