ClassifyR3.Rmd

---
title: "ClassifyR 3 Figures and Results"
author: "Dario Strbenac"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---

The data set which will be used is the METABRIC breast cancer data set. Imaging mass cytometry is a spatial technique which measures about 40 proteins. There is also RNA-seq data which more broadly measures the transcriptome. The dataset is stored on Albona server. Import the omics data. The first five columns contain metadata.

## Global Settings

Plot style.

```{r}
library(ggplot2)
theme_set(theme_bw() + theme(axis.text = element_text(colour = "black"), plot.title = element_text(hjust = 0.5)))
```

Code and result text width.

```{r}
options(width = 140)
```

## Data Import

Imaging mass cytometry.

```{r, include = FALSE}
library(SingleCellExperiment)
```

```{r}
datasetFolder <- "/dski/nobackup/biostat/datasets/spatial/IMC_BreastCancer_metabric_Ali2020"
IMC <- read.csv(file.path(datasetFolder, "Data", "single_cell_data.csv"))
IMC[, "description"] <- factor(IMC[, "description"]) 
IMC <- SingleCellExperiment(colData = DataFrame(IMC[, -(6:44)]),
                            assays = list(counts = t(IMC[, 6:44]))
                           )
IMC
```

The data is in a SingleCellExperiment and has 38 proteins and 479844 cells.

Import the gene expression microarrays.

```{r}
RNAarrays <- read.delim(file.path(datasetFolder, "Data", "brca_metabric",
                        "data_mrna_agilent_microarray.txt"),
                        check.names = FALSE)
RNAarrays[1:5, 1:5]
RNAarrays <- RNAarrays[!duplicated(RNAarrays[, "Hugo_Symbol"]), ]
rownames(RNAarrays) <- RNAarrays[, "Hugo_Symbol"]
RNAarrays <- RNAarrays[, -(1:2)]
RNAarrays <- t(RNAarrays)
RNAarrays[1:5, 1:5]
topHVG <- order(apply(RNAarrays, 2, var), decreasing = TRUE)[1:2000]
RNAarrays <- RNAarrays[, topHVG]
```

There has also been updated clinical data made available for METABRIC patients with longer follow-up times. Import the data and derive the recurrence-free survival variable.

```{r}
clinical <- read.delim(file.path(datasetFolder, "Data", "NIHMS1520488-supplement-Supp_Table_5.txt"),
                       check.names = FALSE)
clinical <- subset(clinical, clinical$Lymph.Nodes.Positive == 0)
clinical$metabricId <- clinical$METABRIC.ID # Column name identical for joining to omics data.
clinical$timeRFS <- apply(clinical[, c("T", "TLR", "TDR")], 1, min)
clinical$eventRFS <- apply(clinical[, c("DeathBreast", "LR", "DR")], 1, max)
rownames(clinical) <- clinical[, "METABRIC.ID"]
usefulFeatures <- c("Breast.Tumour.Laterality", "ER.Status", "Inferred.Menopausal.State",
                    "Grade", "Size", "Stage")
clinical$Breast.Tumour.Laterality <- factor(clinical$Breast.Tumour.Laterality)
clinical$ER.Status <- factor(clinical$ER.Status)
clinical$Inferred.Menopausal.State <- factor(clinical$Inferred.Menopausal.State)
head(clinical[, usefulFeatures])
```

There is a small amount of missing data in the clinical table. Impute it using random forest. There are also two genes with one sample each with a missing value in the RNA abundance table. Use mean imputation because missRanger doesn't work on such a large matrix, due to a limitation on the number of terms which can be in R's formula specification.

```{r}
set.seed(1111)
library(missRanger)
clinical <- missRanger(clinical, . - MATCHED.NORMAL.METABRIC.ID ~ . - METABRIC.ID - MATCHED.NORMAL.METABRIC.ID - Cohort - Date.Of.Diagnosis - Complete.Rec.History - metabricId, pmm.k = 5)
table(is.na(RNAarrays)) # Only ten numbers in the entire matrix. Forest fails.
whichMissing <- which(apply(RNAarrays, 2, anyNA))
RNAarrays[, whichMissing] <- apply(RNAarrays[, whichMissing], 2, function(feature) 
    feature[is.na(feature)] <- mean(feature, na.rm = TRUE))
```

View the distribution of censored and uncensored event times.

```{r}
invisible(by(clinical, clinical[, "eventRFS"], function(samplesByCensored)
{
  message("Samples in Category ", samplesByCensored[1, "eventRFS"],  " : ", nrow(samplesByCensored))
  print(summary(samplesByCensored[, "timeRFS"]))
}))
```

Subset to samples in common to RNA arrays, IMC and clinical data. **Note:** Some patients have two or three images. Arbitrarily pick one image per sample for those that have two or three images.

```{r}
commonIDs <- Reduce(intersect, list(clinical$metabricId, IMC$metabricId, rownames(RNAarrays)))
clinical <- clinical[match(commonIDs, clinical$metabricId), ]
IMC <- IMC[, colData(IMC)$metabricId %in% commonIDs]
RNAarrays <- RNAarrays[match(commonIDs, rownames(RNAarrays)), ]
patientIDToImgID <- by(colData(IMC), colData(IMC)$metabricId, function(patientColData)
                    {
                      unique(patientColData$ImageNumber)[1]
                    })[commonIDs]
IMC <- IMC[, colData(IMC)$ImageNumber %in% patientIDToImgID]
```

## Metafeature Generation

Cell type proportions ignoring the spatial location of cells.

```{r}
library(spicyR)
proportions <- getProp(IMC, "description", "ImageNumber")
rownames(proportions) <- names(patientIDToImgID)[match(rownames(proportions), patientIDToImgID)]
proportions[1:5, 1:5]
```

The row names are image numbers and the column names are cell types.

Calculate statistics for colocalisation between all pairs of cell types.

```{r}
library(BiocParallel)
pairsColocated <- getPairwise(IMC, sigma = 50, BPPARAM = MulticoreParam(16),
                              imageID = "ImageNumber", cellType = "description",
                              spatialCoordCols = c("Location_Center_X", "Location_Center_Y"))
rownames(pairsColocated) <- names(patientIDToImgID)[match(rownames(pairsColocated), patientIDToImgID)]
pairsColocated[1:5, 1:5]
```

Microenvironments by clustering.

```{r}
library(lisaClust)
IMC <- lisaClust(IMC, k = 10, sigma = 50, BPPARAM = MulticoreParam(16),
                 imageID = "ImageNumber", cellType = "description",
                 spatialCoords = c("Location_Center_X", "Location_Center_Y"))
regionsColocal <- getProp(IMC, feature = "region", "ImageNumber") # Extract newly-added information.
rownames(regionsColocal) <- names(patientIDToImgID)[match(rownames(regionsColocal), patientIDToImgID)]
regionsColocal[1:5, 1:5]
```

Average highly-variable gene expression per cell type.

```{r}
library(scFeatures)
colData(IMC)$x_cord <- colData(IMC)$Location_Center_X
colData(IMC)$y_cord <- colData(IMC)$Location_Center_Y
HVGaverages <- scFeatures(assay(IMC, "counts"), feature_types = "gene_mean_celltype", type = "spatial_p",
                  sample = colData(IMC)$ImageNumber,
                  celltype = colData(IMC)$description,
                  ncores = 16)[["gene_mean_celltype"]]
rownames(HVGaverages) <- names(patientIDToImgID)[match(rownames(HVGaverages), patientIDToImgID)]
HVGaverages[1:5, 1:5]
```

Proportions relative to a parent cell type.

```{r}
library(treekoR)

tree <- getClusterTree(t(assay(IMC, "counts")), colData(IMC)$description,
                       hierarchy_method = "hopach")

proportionsParent <- getCellProp(tree$clust_tree, colData(IMC)$description, colData(IMC)$metabricId,
                                 colData(IMC)$metabricId)
rownames(proportionsParent) <- proportionsParent$sample_id
proportionsParent[is.na(proportionsParent)] <- 0
proportionsParent <- proportionsParent[rownames(proportions), -c(1:2)]
proportionsParent <- proportionsParent[, grep("parent", colnames(proportionsParent))]
proportionsParent[1:5, 1:5]
```

## Survival Time Modelling and C-index Evaluation

Create a list of all data tables.

```{r}
measurements <- list(clinical = clinical, `RNA Microarray` = RNAarrays, `Type Proportions` = proportions,
                     `Type Protein Mean` = HVGaverages, `Type Pairs Colocated` = pairsColocated,
                     `Colocated in Regions` = regionsColocal, `Proportion of Parent` = proportionsParent)
sapply(measurements, dim)
```

Cox survival model for each assay individually.

```{r, include = FALSE}
library(ClassifyR)
```

```{r, fig.width = 12}
orderAssays <- c("clinical", "Type Proportions", "RNA Microarray",
                 "Proportion of Parent", "Colocated in Regions", "Type Protein Mean", "Type Pairs Colocated")
nFeatures = list(clinical = 1:3, `RNA Microarray` = c(5, 10, 20, 50), `Type Proportions` = 1:5,
                 `Proportion of Parent` = 1:5, `Colocated in Regions` = 1:5,
                 `Type Protein Mean` = 1:5, `Type Pairs Colocated` = 1:5)

coxPredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
                             extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
                                                select = list(tuneParams = list(nFeatures = nFeatures)),
                                                tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
                             selectionMethod = "CoxPH", classifier = "CoxPH", nCores = 20, verbose = 1)
performancePlot(coxPredicts, orderingList = list(`Assay Name` = orderAssays)) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + ggtitle("Concordance of Individual Assays")
```

Evaluate the features being repeatedly chosen for the well-performing clinical data.

```{r}
table(unlist(do.call(rbind, chosenFeatureNames(coxPredicts[["clinical.CoxPH.CoxPH"]]))[, "feature"]))
```

Grade, tumour size and stage are all frequently chosen.

Cox survival model for concatenation of assays.

```{r, fig.width = 12}
coxMergePredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
                                  extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
                                                select = list(tuneParams = list(nFeatures = nFeatures)),
                                                tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
                                  multiViewMethod = "merge",
                                  selectionMethod = "CoxPH", classifier = "CoxPH", nRepeats = 20, nCores = 20, verbose = 1)
performancePlot(coxMergePredicts, orderingList = list("Assay Name" = "performanceDescending")) +
  theme(axis.text.x = element_text(angle = 45, size = 6, vjust = 1, hjust = 1)) + ggtitle("METABRIC All Merges Comparison")
```

Clinical, Type Protein Mean features.

```{r}
mergeFeatureCounts <- table(unlist(do.call(rbind, chosenFeatureNames(coxMergePredicts[[10]]))[, "feature"]))
mergeFeatureCounts <- sort(mergeFeatureCounts, decreasing = TRUE)
head(mergeFeatureCounts, 10)
```

In addition to the clinical data, Ki67 is a repeatedly selected omics feature. According to Wikipedia, "Ki-67 is an excellent marker to determine the growth fraction of a given cell population. The fraction of Ki-67-positive tumor cells (the Ki-67 labeling index) is often correlated with the clinical course of cancer.". So, it is an unsurprising and to-be-expected predictor variable.

Cox survival model for prevalidation (concatenate each assay as one column to the clinical data).

```{r, fig.width = 12}
coxPrevalPredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
                     extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
                                                select = list(tuneParams = list(nFeatures = nFeatures)),
                                                tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
                     multiViewMethod = "prevalidation", nRepeats = 20, nCores = 20,
                     selectionMethod = "CoxPH", classifier = "CoxPH", verbose = 1)
performancePlot(coxPrevalPredicts) + ggtitle("METABRIC Prevalidation Comparison")
```

Cox survival model for Principal Components Analysis dimensionality reduction (concatenate each principal component as one column to the clinical data).

```{r, fig.width = 12}
nFeatures = list(clinical = 1:3, `RNA Microarray` = 2, `Type Proportions` = 2,
                 `Proportion of Parent` = 2, `Colocated in Regions` = 2,
                 `Type Protein Mean` = 2, `Type Pairs Colocated` = 2)
PCApredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
               multiViewMethod = "PCA", nRepeats = 20, nCores = 20,
               extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
                                                select = list(tuneParams = list(nFeatures = nFeatures)),
                                                tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
               selectionMethod = "CoxPH", classifier = "CoxPH", verbose = 1)
performancePlot(PCApredicts) + ggtitle("METABRIC PCA Comparison")
```

## Conclusion

Clinical data alone is the best or nearly the best. Lymph nodes invaded and tumour diameter are important for patient survival.