# Data pre-processing {#sec-preprocessing}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(MultiDataSet)
library(pcaMethods)
library(purrr)
library(rlang)
library(ggplot2)
library(RColorBrewer)
## For loading the transformations info
library(DESeq2)
library(S4Vectors)
library(stats4)
## to avoid the S3 overwrite message
library(GGally)
```
```{r setup-visible, eval = FALSE}
library(targets)
library(moiraine)
library(MultiDataSet)
## For PCA results
library(pcaMethods)
## For working with lists
library(purrr)
library(rlang)
## For custom colour palettes
library(ggplot2)
library(RColorBrewer)
## For displaying documentation
library(DESeq2)
library(S4Vectors)
library(stats4)
```
Once each omics dataset has been imported into R with its associated metadata and combined into a `MultiDataSet` object, there are a number of pre-processing steps that should be considered. In this chapter, we will show how to apply different transformations to the datasets, as well as how to run a PCA on each dataset and impute missing values if needed.
As a reminder, here is what the `_targets.R` script should look like so far:
<details>
<summary>`_targets.R` script</summary>
```{r previous-vignettes}
#| echo: false
#| results: asis
knitr::current_input() |>
get_previous_content() |>
cat()
```
</details>
## Datasets transformations
After inspecting the density plots for the different datasets (see @sec-inspecting-multidataset-summary-plots), it might be necessary to normalise or transform some or all datasets. This can mitigate the mean-variance trend that occurs in RNAseq data, for example, or simply bring the different features to a comparable scale. Transformation here refers to applying, to each feature (i.e. each row) of a dataset, a function that transforms the measurement values for that feature.
`moiraine` implements several options to transform an omics dataset:
- Variance Stabilising Normalisation (VSN) through the `vsn` package -- recommended for metabolomics datasets or other continuous datasets with a strong mean-variance trend;
- Variance Stabilising Transformation (VST) through the `DESeq2` package -- recommended for RNAseq data or any raw read count-type data;
- Automatic selection of the best normalisation method for each feature through the `bestNormalize` package -- recommended for phenotype data, and when the number of features is small (note that the selection of the normalisation method is done independently for each feature, so the same transformation might not be applied to all features);
- A selection of common normalisation methods through the `bestNormalize` package, including center/scale, log, exponential, square-root, arcsinh, Box Cox, Yeo-Johnson and ordered quantile transformations (see details in the [bestNormalize vignette](https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html)) -- recommended when applying the same transformation to all features, e.g. log2 transformation or centering.
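Before committing to a transformation, it can be worth re-checking the density plots of the raw datasets mentioned above. Here is a minimal sketch (not evaluated here), using the `plot_density_data()` function that we will also use later in this chapter on the transformed data:

```{r}
#| eval: false
## Density plot of each raw dataset, to help decide which
## transformation (if any) is needed
plot_density_data(
  mo_set_de,
  combined = FALSE,
  scales = "free"
)
```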
### Transforming a single dataset
The transformation of one dataset is done through the `transform_dataset()` function, which takes as input a `MultiDataSet` object, the name of the dataset to transform, and the name of the transformation to be applied, which should be one of `vsn`, `vst-deseq2`, `logx`, `best-normalize-auto` or `best-normalize-manual`. For the latter, the name of the normalisation method from the `bestNormalize` package to use must also be supplied through the `method` argument.
The `return_multidataset` argument determines whether a `MultiDataSet` object with the corresponding dataset transformed should be returned. If it is set to `FALSE`, the function will instead return a list containing the transformed dataset as a matrix, as well as other useful information returned by the transformation function applied. It is possible to return only the transformed matrix, by setting `return_matrix_only` to `TRUE`. This can be useful to quickly assess the effects of a transformation outside of the analysis pipeline.
For example, we can apply the Variance Stabilising Transformation to the transcriptomics dataset:
```{r apply-vst-example}
tar_load(mo_set_de)
rnaseq_vst <- transform_dataset(
mo_set_de,
"rnaseq",
"vst-deseq2",
return_multidataset = FALSE
)
```
The function returns a list, with the transformed dataset as a matrix in the `transformed_data` element. Information generated during the transformation by the `DESeq2` package is stored in the `info_transformation` element, and the name of the transformation applied is stored in the `transformation` element:
```{r print-vst-transformation}
names(rnaseq_vst)
rnaseq_vst$transformed_data[1:5, 1:5]
rnaseq_vst$info_transformation
rnaseq_vst$transformation
```
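If we only need the transformed matrix, for example for a quick check outside of the pipeline, we can set `return_matrix_only` to `TRUE`. A minimal sketch (not evaluated here):

```{r}
#| eval: false
## Return only the transformed matrix, without the transformation details
rnaseq_vst_mat <- transform_dataset(
  mo_set_de,
  "rnaseq",
  "vst-deseq2",
  return_multidataset = FALSE,
  return_matrix_only = TRUE
)
rnaseq_vst_mat[1:5, 1:5]
```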
If we instead want to apply a log2 transformation to the dataset, we will use the `logx` transformation option. In that case, we have to specify the log base to use (here 2) through the `log_base` argument, as well as the function that should be applied to the matrix prior to the log-transformation, through the `pre_log_function` argument. This function is mostly here to take care of zero values, which would cause an issue during the log-transformation. By default, the `zero_to_half_min()` function is used, which replaces zero values in the dataset with half of the minimum non-zero value found in the dataset. However, it is possible to pass any custom function instead; for example, `function(x) x` would leave the dataset unchanged before the log-transformation:
```{r apply-log2-example}
rnaseq_log2 <- transform_dataset(
mo_set_de,
"rnaseq",
"logx",
log_base = 2,
pre_log_function = function(x) x,
return_multidataset = TRUE
)
```
In that case, we asked the function to return a `MultiDataSet` object, in which the `rnaseq` dataset has been transformed:
```{r print-log2-transformation}
rnaseq_log2
get_dataset_matrix(rnaseq_log2, "rnaseq")[1:5, 1:5]
```
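To see what the default `zero_to_half_min()` pre-log function would do, we can apply it to a small toy matrix (made up here purely for illustration); a quick sketch, not evaluated here:

```{r}
#| eval: false
## Toy matrix with a zero value; based on its documented behaviour,
## zero_to_half_min() should replace the 0 with half of the smallest
## non-zero value in the matrix (here 2 / 2 = 1)
toy_mat <- matrix(c(0, 2, 4, 8), nrow = 2)
zero_to_half_min(toy_mat)
```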
The two other transformation options are `best-normalize-auto` and `best-normalize-manual`. Both rely on the `bestNormalize` package to apply a transformation **independently** to each feature in the dataset. With `best-normalize-auto`, the `bestNormalize::bestNormalize()` function automatically selects, for each feature, the transformation that results in a distribution of values as close to Gaussian as possible. Note that as the function is applied independently to each feature, a different transformation may be chosen for different features. With the `best-normalize-manual` option, the same transformation is applied to each feature. The transformation to use must be specified through the `method` argument, and should be the name of one of the transformations implemented in `bestNormalize`: `"arcsinh_x"`, `"boxcox"`, `"center_scale"`, `"exp_x"`, `"log_x"`, `"orderNorm"`, `"sqrt_x"` or `"yeojohnson"`. Note that with this option, even though the same transformation is applied to each feature, the parameters of the transformation, if not specified, will be estimated independently for each feature. For example, with the `log_x` method, the function automatically adds a small offset (`a`) to the feature's measurements if there are any zero values, and the value used for this offset might differ between features. We can instead set a value for `a` to be used for all features, as follows:
```{r bestnormalize-manual-example}
#| eval: false
transform_dataset(
mo_set_de,
"rnaseq",
"best-normalize-manual",
method = "log_x",
a = 0.1,
return_multidataset = TRUE
)
```
::: {.callout-note}
Before using the `best-normalize-manual` option, it is strongly recommended to have a look at the documentation of the corresponding function from the `bestNormalize` package, as these functions can have unexpected default behaviours. For example, by default the `bestNormalize::log_x()` function uses log base 10, and centres and scales the transformed values, which may not be what we want.
:::
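For instance, if we wanted a plain log (base 10) transformation of each feature without the centring and scaling applied by default, we could presumably pass `standardize = FALSE` through to `bestNormalize::log_x()`, in the same way that the `a` offset was passed above. This is only a sketch, not evaluated here:

```{r}
#| eval: false
## Assumes that extra arguments are forwarded to bestNormalize::log_x(),
## as done for `a` in the example above
transform_dataset(
  mo_set_de,
  "rnaseq",
  "best-normalize-manual",
  method = "log_x",
  standardize = FALSE,
  return_multidataset = TRUE
)
```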
### Transformation target factory
The target factory function `transformation_datasets_factory()` provides a wrapper to apply (potentially different) transformations to several datasets at once. The function takes as input the `MultiDataSet` object as well as a named character vector, in which each name is a dataset name and the corresponding value is the transformation to apply to that dataset. If a dataset is not present in the transformation vector, it will not be transformed (but it will still be present in the resulting `MultiDataSet` object).
Here, we would like to apply Variance Stabilising Transformation to the transcriptomics dataset, and a log2 transformation to the metabolomics dataset. Note that the VST and VSN transformations are very close to the log2 transformation, especially for features with high means.
::: {.targets-chunk}
```{targets transformation-datasets-factory}
#| message: true
transformation_datasets_factory(
mo_set_de,
c("rnaseq" = "vst-deseq2",
"metabolome" = "logx"),
log_bases = 2,
pre_log_functions = zero_to_half_min,
transformed_data_name = "mo_set_transformed"
)
```
:::
The `transformation_datasets_factory()` function works as follows:
- It creates a grouped tibble listing the transformation to apply to each dataset, stored in the `transformations_spec` target;
```{r transformations-spec}
tar_read(transformations_spec)
```
- It performs the required transformation on each dataset via [dynamic branching](https://books.ropensci.org/targets/dynamic.html). This is done through a call to the `transform_dataset()` function. The transformed datasets are stored in a list, in the `transformations_runs_list` target. Note that by default the function will store all details of the transformations, which can be useful for later inspection, but can be memory-intensive. It is possible to only store the transformed datasets instead, by setting the `return_matrix_only` argument to `TRUE` in the `transformation_datasets_factory()` call.
```{r show-transformations-run-list}
tar_load(transformations_runs_list)
names(transformations_runs_list)
map_chr(transformations_runs_list, attr, "dataset_name")
transformations_runs_list[["transformations_runs_list_a1c8db41"]] |> names()
```
- It creates a new `MultiDataSet` object, with the transformed version of the datasets. By default, this new `MultiDataSet` object is stored in a target called `transformed_set`, but a different name can be specified via the `transformed_data_name` argument (here we called it `mo_set_transformed`).
```{r print-mo-set-transformed}
tar_load(mo_set_transformed)
mo_set_transformed
get_dataset_matrix(mo_set_de, "metabolome")[1:5, 1:3]
get_dataset_matrix(mo_set_transformed, "metabolome")[1:5, 1:3]
```
<details>
<summary>Converting targets factory to R script</summary>
```{r transformation-datasets-factory-to-script}
#| eval: false
transformations_runs_list <- c(
  "rnaseq" = "vst-deseq2",
  "metabolome" = "logx"
) |>
  imap(\(x, y) {
    transform_dataset(
      mo_set_de,
      dataset = y,
      transformation = x,
      log_base = 2,
      pre_log_function = zero_to_half_min
    )
  })

mo_set_transformed <- get_transformed_data(mo_set_de, transformations_runs_list)
```
</details>
We can assess the effect of the transformations by generating density and mean-sd plots for the transformed datasets:
```{r plot-density-transformed-data}
#| fig.width: 9
#| fig.height: 4
plot_density_data(
mo_set_transformed,
combined = FALSE,
scales = "free"
)
```
Note how the relationship between the features' means and standard deviations has been reduced in both transformed datasets:
```{r plot-meansd-transformed-data, fig.width = 9, fig.height = 4}
plot_meansd_data(mo_set_transformed)
```
Finally, it can be useful to summarise which transformations have been applied to the datasets, for example when creating a report. The `get_table_transformations()` function is here for that. It takes as input the `transformations_runs_list` target generated by `transformation_datasets_factory()`, and returns a tibble indicating the transformation applied to each dataset:
```{r get-table-transformation}
get_table_transformations(transformations_runs_list)
```
## Running a PCA on each dataset
It is always best practice to run some exploratory analysis on a dataset before any formal analysis. This is largely outside the scope of this package, and we assume that any input dataset has been properly assessed before turning to the integration pipeline. However, running a Principal Component Analysis (PCA) on each of the omics datasets within the integration pipeline serves two purposes:
- as a last check to ensure that there are no obvious batch effects or problematic samples that should be addressed,
- as a missing data imputation method.
The `moiraine` package relies on the Bioconductor [`pcaMethods` package](https://bioconductor.org/packages/release/bioc/html/pcaMethods.html) to perform the PCA. In particular, `pcaMethods` implements the NIPALS (non-linear iterative partial least squares) algorithm for PCA, which tolerates missing values in the input dataset and imputes them based on the results of the PCA.
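For reference, this is roughly what happens under the hood with `pcaMethods`. The sketch below is not evaluated and uses a hypothetical `mat` matrix of features x samples; the exact arguments used by `moiraine` may differ:

```{r}
#| eval: false
library(pcaMethods)

## NIPALS PCA tolerates missing values; pcaMethods expects samples in rows,
## so the features x samples matrix is transposed first
pca_res <- pca(t(mat), method = "nipals", nPcs = 10, center = TRUE)

## The completed (imputed) data can be recovered from the PCA result
imputed_mat <- completeObs(pca_res)
```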
### Running the PCAs
The `pca_complete_data_factory()` function uses dynamic branching to perform a PCA on each omics dataset within a `MultiDataSet` object. It takes as input the `MultiDataSet` object (in our case, `mo_set_transformed`), and, optionally, the names of the datasets on which a PCA should be run. This is useful if one dataset is very large and has no missing values, and we want to avoid running a PCA on it. It then proceeds as follows:
- It creates a target called `dataset_names_pca`, which stores a vector of dataset names on which a PCA should be applied;
- For each value in `dataset_names_pca`, it extracts the omics dataset as a matrix with features as rows and samples as columns, using the `get_dataset_matrix()` function. This is done via dynamic branching, and the results are stored as a list in the `pca_mats_list` target. Note that the names of this list are not meaningful; to check which element of the list corresponds to which dataset, you can run `map_chr(pca_mats_list, attr, "dataset_name")`;
- For each matrix in `pca_mats_list`, it applies the `run_pca_matrix()` function to the corresponding dataset. This is done via dynamic branching; it results in a list where each element is the PCA result (i.e. a `pcaMethods::pcaRes` object) for a given dataset. This list is stored in the `pca_runs_list` target. Note that the names of this list are not meaningful; to check which element of the list corresponds to which dataset, you can run `map_chr(pca_runs_list, attr, "dataset_name")`;
- It extracts from the result of each PCA the complete dataset, i.e. with missing values imputed, and uses this information to construct a new `MultiDataSet` object, in which the datasets are complete (i.e. contain no missing values). This is done by calling the `get_complete_data()` function. If no PCA was run on a dataset, the dataset will still be present in the new `MultiDataSet` object, but its missing values will not be imputed. The resulting complete `MultiDataSet` object is stored by default in a target called `complete_set`; this name can be changed via the `complete_data_name` argument.
Let's apply this to our multi-omics dataset:
::: {.targets-chunk}
```{targets pca-complete-data-factory}
pca_complete_data_factory(
mo_set_transformed,
complete_data_name = "mo_set_complete"
)
```
:::
We can have a look at the different targets constructed. By default, a PCA was run on all datasets:
```{r print-dataset-names-pca}
tar_read(dataset_names_pca)
```
```{r show-pcapmats-list-names}
tar_load(pca_mats_list)
map_chr(pca_mats_list, attr, "dataset_name")
map(pca_mats_list, ~.x[1:5, 1:5])
```
```{r show-pca-runs-list-names}
tar_load(pca_runs_list)
names(pca_runs_list)
map_chr(pca_runs_list, attr, "dataset_name")
```
The result of the PCA run on the genomics dataset looks like this:
```{r show-pca-result}
tar_read(pca_runs_list_74d71ae8)
```
Notice that the output includes information about the number of principal components computed, and whether the dataset was centred and scaled before applying the PCA. This is handled by the default arguments of `run_pca_matrix()`, but can be customised by passing the corresponding arguments to `pca_complete_data_factory()`. For example, to scale the datasets before performing a PCA, we could use:
::: {.targets-chunk}
```{targets pca-complete-data-factory-with-scaling}
pca_complete_data_factory(
mo_set_transformed,
complete_data_name = "mo_set_complete",
scale = TRUE
)
```
:::
For convenience, the `run_pca()` function can be used to run a PCA on one of the omics datasets directly from a `MultiDataSet` object. It is a wrapper around the `run_pca_matrix()` function, and takes as input a `MultiDataSet` object as well as the name of the omics dataset on which a PCA should be run, e.g.:
```{r example-run-pca}
#| eval: false
run_pca(mo_set_de, "rnaseq")
```
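The returned object is a `pcaMethods` `pcaRes` object, so the usual `pcaMethods` accessors should work on it. A sketch, not evaluated here:

```{r}
#| eval: false
rnaseq_pca <- run_pca(mo_set_de, "rnaseq")

## Samples coordinates and features loadings via the pcaMethods accessors
scores(rnaseq_pca)[1:5, 1:3]
loadings(rnaseq_pca)[1:5, 1:3]
```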
<details>
<summary>Converting targets factory to R script</summary>
```{r pca-complete-data-factory-to-script}
#| eval: false
pca_mats_list <- names(mo_set_transformed) |>
set_names() |>
map(\(x) get_dataset_matrix(mo_set_transformed, x, keep_dataset_name = TRUE))
pca_runs_list <- pca_mats_list |>
map(run_pca_matrix)
mo_set_complete <- get_complete_data(mo_set_transformed, pca_runs_list)
```
</details>
### Visualising the PCA results
It is possible to get an overview of the results of each PCA. First, the function `plot_screeplot_pca()` displays the percentage of variance explained by the principal components computed for each dataset. It takes as input the `pca_runs_list` target constructed in the previous step. Note that by default, 10 components are computed for each dataset.
```{r plot-screeplot-pca}
#| fig.width: 8
#| fig.height: 8
plot_screeplot_pca(pca_runs_list)
```
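The variance explained can also be read directly from each `pcaRes` object using the `pcaMethods` accessors; for example, as a sketch (not evaluated here), `R2cum()` gives the cumulative proportion of variance explained:

```{r}
#| eval: false
## Cumulative variance explained for the first dataset's PCA
R2cum(pca_runs_list[[1]])
```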
In addition, the `plot_samples_coordinates_pca()` function allows us to display the samples in the reduced principal components space (the typical PCA sample plot). The function returns a list of plots (one plot per dataset). By default, it shows all principal components computed for each dataset, but for clarity we will only look at the first three:
```{r plot-samples-coordinates-pca}
#| fig.width: 8
#| fig.height: 7
#| warning: false
plot_samples_coordinates_pca(
pca_runs_list,
pcs = 1:3
)
```
Note that it is possible to look at a different set of principal components for each dataset. For that, the index of the principal components should be passed to the `pcs` argument as a named list (where the name of each element corresponds to a dataset name), e.g.:
```{r plot-samples-coordinates-pca-diff-pcs}
#| eval: false
plot_samples_coordinates_pca(
pca_runs_list,
pcs = list(
"snps" = 1:4,
"rnaseq" = 1:2,
"metabolome" = 1:3
)
)
```
By default, the points in the sample plots are not coloured. It is however possible to colour the samples according to the information contained in the sample metadata tables available through the `MultiDataSet` object. We can set different colours and shapes for the upper and lower plots of the scatterplot matrix; see the `plot_samples_score()` function for more information. For example, we can assess whether the first three principal components show any clustering of the samples according to the cluster computed from genomic similarity, disease status or feedlot (we'll only show the results for the SNPs dataset here):
```{r plot-samples-coordinates-pca-colours}
#| fig.width: 8
#| fig.height: 7
#| warning: false
plot_samples_coordinates_pca(
pca_runs_list,
datasets = "snps",
pcs = 1:3,
mo_data = mo_set_de,
colour_upper = "geno_comp_cluster",
shape_upper = "status",
colour_lower = "feedlot"
) +
theme(legend.box = "vertical")
```
### Missing values imputation {#sec-preprocessing-missing-values}
We can check that the complete multi-omics object we constructed contains no more missing values:
```{r print-mo-set-complete}
tar_load(mo_set_complete)
mo_set_complete
```
```{r check-missing-values-mo-set-complete}
check_missing_values(mo_set_complete)
```
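If we want to see which values were imputed, we can compare the transformed and complete versions of a dataset side by side. A quick sketch (not evaluated here, and assuming the metabolomics dataset contained missing values):

```{r}
#| eval: false
## Same corner of the metabolomics matrix before and after imputation
get_dataset_matrix(mo_set_transformed, "metabolome")[1:5, 1:3]
get_dataset_matrix(mo_set_complete, "metabolome")[1:5, 1:3]
```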
## Recap -- targets list
For convenience, here is the list of targets that we created in this section:
<details>
<summary>Targets list for datasets preprocessing</summary>
::: {.targets-chunk}
```{targets recap-targets-list}
list(
## Applying transformations to the datasets
transformation_datasets_factory(
mo_set_de,
c("rnaseq" = "vst-deseq2",
"metabolome" = "logx"),
log_bases = 2,
pre_log_functions = zero_to_half_min,
transformed_data_name = "mo_set_transformed"
),
## Density plot for each transformed dataset
tar_target(
density_plots_transformed,
plot_density_data(
mo_set_transformed,
combined = FALSE,
scales = "free"
)
),
## Plotting the mean-SD trend for each transformed dataset
tar_target(
mean_sd_plots_transformed,
plot_meansd_data(mo_set_transformed)
),
## Summary table of the transformations applied
tar_target(
transformation_summary,
get_table_transformations(transformations_runs_list)
),
## Running a PCA on each dataset
pca_complete_data_factory(
mo_set_transformed,
complete_data_name = "mo_set_complete"
),
## PCA screeplots
tar_target(
pca_screeplots,
plot_screeplot_pca(pca_runs_list)
),
## PCA sample plots
tar_target(
pca_sample_plots,
plot_samples_coordinates_pca(
pca_runs_list,
datasets = "snps",
pcs = 1:3,
mo_data = mo_set_de,
colour_upper = "geno_comp_cluster",
shape_upper = "status",
colour_lower = "feedlot"
)
)
)
```
:::
</details>