02-quality_control.qmd

# Quality Control

```{r, include = FALSE}
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now, units = "secs")
      # return a character string to show the time
      paste("Time for this code chunk to run with ", nCores, " cores: ", round(res, 2), " seconds")
    }
  }
}))
```

With cell segmentation complete, we can move on to performing quality control checks. In many spatial imaging protocols, there tends to be a degree of variability in the intensity of each image. For example, in one image, the CD3 stain may be too strong, but in another image the CD3 staining may be particularly weak. This variability is often inevitable and can be hard to correct for during the imaging process. Hence, it is important that we identify when such variance occurs and correct it. The simpleSeg package includes additional functions for detecting batch effects and normalising expression data to minimize their impact.

<!-- Steps: -->

<!-- 1.  Evaluate quality of cell segmentation with CellSPA (simpleSeg/cellSPA?) -->

<!-- 2.  How to qc image batch-effect (simpleSeg::normalizeCells) -->

<!-- 3.  How to qc patient batch-effect (simpleSeg::normalizeCells) -->

<!-- 4.  How to qc batch effects (scMerge) -->

```{r 02-loadLibraries, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
  library(tidySingleCellExperiment)
  library(simpleSeg)
  # library(scMerge)
  library(scater)
  library(ggplot2)
})
```

```{r, eval = FALSE}
# load required libraries
library(tidySingleCellExperiment)
library(simpleSeg)
library(scater)
library(ggplot2)
```

```{r, setParam}
# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
  nCores <- max(parallel::detectCores()/2, 1)
} else {
  nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())
```

<!-- ## CellSPA: How do I determine segmentation quality? -->

<!-- [CellSPA](https://sydneybiox.github.io/CellSPA/) is an R package that provides an evaluation framework for *in situ* cell segmentation results. -->

## simpleSeg: Do my images have a batch effect?

First, let's load in the images we previously segmented out in the last section. The `SpatialDatasets` package conveniently provides the segmented out images for the Ferguson 2022 dataset.

```{r 02-loadFerguson, warning=FALSE, message=FALSE}
# load in segmented data
fergusonSPE <- SpatialDatasets::spe_Ferguson_2022()
```

Next, we can check if the marker intensities of each cell require some form of transformation or normalisation. The reason we do this is two-fold:\

1.  The intensities of images are often highly skewed, preventing any meaningful downstream analysis.\
2.  The intensities across different images are often different, meaning that what is considered "positive" can be different across images.

By transforming and normalising the data, we aim to reduce these two effects. Below, we extract the marker intensities from the `counts` assay and take a closer look at the CD3 marker, which should be expressed in the majority of T cells.

```{r 02-plotCD3Density, fig.width=5, fig.height=5}
# plot densities of CD3 for each image
fergusonSPE |> 
  join_features(features = rownames(fergusonSPE), shape = "wide", assay = "counts") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")
```

::: callout-note
**What we're looking for**

1.  Do the CD3+ and CD3- peaks clearly separate out in the density plot? To ensure that downstream clustering goes smoothly, we want our cell type specific markers to show 2 distinct peaks representing our CD3+ and CD3- cells. If we see 3 or more peaks where we don't expect, this might be an indicator that further normalization is required.
2.  Are our CD3+ and CD3- peaks consistent across our images? We want to make sure that our density plots for CD3 are largely the same across images so that a CD3+ cell in 1 image is equivalent to a CD3+ cell in another image.
:::

Here, we can see that the intensities are very clearly skewed, and it is difficult to distinguish a CD3- cell from a CD3+ cell. Further, we can clearly see some image-level batch effect, where across images, the intensity peaks differ drastically.

Another method of visualising batch effects is using a dimensionality reduction technique and visualising how the images separate out on a 2D plot. If no batch effect is expected, we should see the images largely overlap with each other.

```{r 02-plotUMAP, time_it = TRUE}
set.seed(51773)

# specify a subset of informative markers for UMAP and clustering
ct_markers <- c("podoplanin", "CD13", "CD31",
                "panCK", "CD3", "CD4", "CD8a",
                "CD20", "CD68", "CD16", "CD14", 
                "HLADR", "CD66a")

# perform dimension reduction using UMAP
fergusonSPE <- scater::runUMAP(
  fergusonSPE,
  subset_row = ct_markers,
  exprs_values = "counts"
)

# select a subset of images to plot
someImages <- unique(fergusonSPE$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID
scater::plotReducedDim(
  fergusonSPE[, fergusonSPE$imageID %in% someImages],
  dimred = "UMAP",
  colour_by = "imageID"
)
```

The UMAP also indicates that some level of batch effect exists in our dataset.

`simpleSeg` provides the `normalizeCells` function for correcting image-level batch effects. We specify the following parameters -

-   `transformation` is an optional argument which specifies the function to be applied to the data. We do not apply an arcsinh transformation here, as we already apply a square root transform in the `simpleSeg` function.
-   `method = c("trim99", "mean", PC1")` is an optional argument which specifies the normalisation method(s) to be performed. A comprehensive table of methods is provided below.
-   `assayIn = "counts"` is a required argument which specifies the name of the assay that contains our intensity data. In our context, this is called `counts`.

```{r 02-normalizeCells, fig.width=5, fig.height=5}
# leave out the nuclei markers from our normalisation process
useMarkers <- rownames(fergusonSPE)[!rownames(fergusonSPE) %in% c("DNA1", "DNA2", "HH3")]

# transform and normalise the marker expression of each cell type
fergusonSPE <- normalizeCells(fergusonSPE,
                        markers = useMarkers,
                        transformation = NULL,
                        method = c("trim99", "mean", "PC1"),
                        assayIn = "counts",
                        cores = nCores)
```

This modified data is then stored in the `norm` assay by default, but can be changed using the `assayOut` parameter.

#### `method` Parameters

| Method | Description |   |
|------------------|:---------------------------------:|:----------------:|
| ["mean"]{style="font-family: 'Courier New', monospace;"} | [Divides the marker cellular marker intensities by their mean.]{style="font-family: 'Courier New', monospace;"} |  |
| ["minMax"]{style="font-family: 'Courier New', monospace;"} | [Subtracts the minimum value and scales markers between 0 and 1.]{style="font-family: 'Courier New', monospace;"} |  |
| ["trim99"]{style="font-family: 'Courier New', monospace;"} | [Sets the highest 1% of values to the value of the 99th percentile.\`]{style="font-family: 'Courier New', monospace;"} |  |
| ["PC1"]{style="font-family: 'Courier New', monospace;"} | [Removes the 1st principal component) can be performed with one call of the function, in the order specified by the user.]{style="font-family: 'Courier New', monospace;"} |  |

We can then plot the same density curve for the CD3 marker using the normalised data.

```{r 02-plotCD3normalised}
# plot densities of CD3 for each image
fergusonSPE |> 
  join_features(features = rownames(fergusonSPE), shape = "wide", assay = "norm") |> 
  ggplot(aes(x = CD3, colour = imageID)) + 
  geom_density() + 
  theme(legend.position = "none")
```

::: callout-note
**Questions revisited**

1.  Do the CD3+ and CD3- peaks clearly separate out in the density plot? If not, we can try optimizing the transform if the distribution looks heavily skewed.
2.  Are our CD3+ and CD3- peaks consistent across our images? We can try to be more stringent in our normalization, such as by removing the 1st PC (`method = c(..., "PC1")`) or scaling the values for all images between 0 and 1 (`method = c(..., "minMax")`).
:::

Here, we can see that the normalised data appears more bimodal, and we can clearly observe a CD3+ peak at 5.00, and a CD3- peak at around 3.00. Image-level batch effects also appear to have been mitigated.

We can also visualise the effect of normalisation on the UMAP, which shows that the images now overlap with each other to a much greater extent.

```{r 02-normUMAP, time_it = TRUE}
set.seed(51773)

# perform dimension reduction using UMAP
fergusonSPE <- scater::runUMAP(
  fergusonSPE,
  subset_row = ct_markers,
  exprs_values = "norm",
  name = "normUMAP"
)

someImages <- unique(fergusonSPE$imageID)[c(1, 5, 10, 20, 30, 40)]

# UMAP by imageID
scater::plotReducedDim(
  fergusonSPE[, fergusonSPE$imageID %in% someImages],
  dimred = "normUMAP",
  colour_by = "imageID"
)
```

<!-- ## scMerge: Combining multiple spatial datasets -->

<!-- A common question that pops up when analysing spatial datasets is: can I combine multiple spatial datasets? -->

Now that we have completed quality control checks and normalised the expression data to address variability and batch effects, we can proceed to the next step: cell annotation.

## sessionInfo

```{r}
sessionInfo()
```