-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathClassifyR3.Rmd
275 lines (219 loc) · 12.5 KB
/
ClassifyR3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
title: "ClassifyR 3 Figures and Results"
author: "Dario Strbenac"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
The data set which will be used is the METABRIC breast cancer data set. Imaging mass cytometry is a spatial technique which measures about 40 proteins. There is also RNA-seq data which more broadly measures the transcriptome. The dataset is stored on Albona server. Import the omics data. The first five columns contain metadata.
## Global Settings
Plot style.
```{r}
library(ggplot2)
theme_set(theme_bw() + theme(axis.text = element_text(colour = "black"), plot.title = element_text(hjust = 0.5)))
```
Code and result text width.
```{r}
options(width = 140)
```
## Data Import
Imaging mass cytometry.
```{r, include = FALSE}
library(SingleCellExperiment)
```
```{r}
datasetFolder <- "/dski/nobackup/biostat/datasets/spatial/IMC_BreastCancer_metabric_Ali2020"
IMC <- read.csv(file.path(datasetFolder, "Data", "single_cell_data.csv"))
IMC[, "description"] <- factor(IMC[, "description"])
IMC <- SingleCellExperiment(colData = DataFrame(IMC[, -(6:44)]),
assays = list(counts = t(IMC[, 6:44]))
)
IMC
```
The data is in a SingleCellExperiment and has 38 proteins and 479844 cells.
Import the gene expression microarrays.
```{r}
RNAarrays <- read.delim(file.path(datasetFolder, "Data", "brca_metabric",
"data_mrna_agilent_microarray.txt"),
check.names = FALSE)
RNAarrays[1:5, 1:5]
RNAarrays <- RNAarrays[!duplicated(RNAarrays[, "Hugo_Symbol"]), ]
rownames(RNAarrays) <- RNAarrays[, "Hugo_Symbol"]
RNAarrays <- RNAarrays[, -(1:2)]
RNAarrays <- t(RNAarrays)
RNAarrays[1:5, 1:5]
topHVG <- order(apply(RNAarrays, 2, var), decreasing = TRUE)[1:2000]
RNAarrays <- RNAarrays[, topHVG]
```
There has also been updated clinical data made available for METABRIC patients with longer follow-up times. Import the data and derive the recurrence-free survival variable.
```{r}
clinical <- read.delim(file.path(datasetFolder, "Data", "NIHMS1520488-supplement-Supp_Table_5.txt"),
check.names = FALSE)
clinical <- subset(clinical, clinical$Lymph.Nodes.Positive == 0)
clinical$metabricId <- clinical$METABRIC.ID # Column name identical for joining to omics data.
clinical$timeRFS <- apply(clinical[, c("T", "TLR", "TDR")], 1, min)
clinical$eventRFS <- apply(clinical[, c("DeathBreast", "LR", "DR")], 1, max)
rownames(clinical) <- clinical[, "METABRIC.ID"]
usefulFeatures <- c("Breast.Tumour.Laterality", "ER.Status", "Inferred.Menopausal.State",
"Grade", "Size", "Stage")
clinical$Breast.Tumour.Laterality <- factor(clinical$Breast.Tumour.Laterality)
clinical$ER.Status <- factor(clinical$ER.Status)
clinical$Inferred.Menopausal.State <- factor(clinical$Inferred.Menopausal.State)
head(clinical[, usefulFeatures])
```
There is a small amount of missing data in the clinical table. Impute it using random forest. There are also two genes with one sample each with a missing value in the RNA abundance table. Use mean imputation because missRanger doesn't work on such a large matrix, due to a limitation on the number of terms which can be in R's formula specification.
```{r}
set.seed(1111)
library(missRanger)
clinical <- missRanger(clinical, . - MATCHED.NORMAL.METABRIC.ID ~ . - METABRIC.ID - MATCHED.NORMAL.METABRIC.ID - Cohort - Date.Of.Diagnosis - Complete.Rec.History - metabricId, pmm.k = 5)
table(is.na(RNAarrays)) # Only ten numbers in the entire matrix. Forest fails.
whichMissing <- which(apply(RNAarrays, 2, anyNA))
RNAarrays[, whichMissing] <- apply(RNAarrays[, whichMissing], 2, function(feature)
feature[is.na(feature)] <- mean(feature, na.rm = TRUE))
```
View the distribution of censored and uncensored event times.
```{r}
invisible(by(clinical, clinical[, "eventRFS"], function(samplesByCensored)
{
message("Samples in Category ", samplesByCensored[1, "eventRFS"], " : ", nrow(samplesByCensored))
print(summary(samplesByCensored[, "timeRFS"]))
}))
```
Subset to samples in common to RNA arrays, IMC and clinical data. **Note:** Some patients have two or three images. Arbitrarily pick one image per sample for those that have two or three images.
```{r}
commonIDs <- Reduce(intersect, list(clinical$metabricId, IMC$metabricId, rownames(RNAarrays)))
clinical <- clinical[match(commonIDs, clinical$metabricId), ]
IMC <- IMC[, colData(IMC)$metabricId %in% commonIDs]
RNAarrays <- RNAarrays[match(commonIDs, rownames(RNAarrays)), ]
patientIDToImgID <- by(colData(IMC), colData(IMC)$metabricId, function(patientColData)
{
unique(patientColData$ImageNumber)[1]
})[commonIDs]
IMC <- IMC[, colData(IMC)$ImageNumber %in% patientIDToImgID]
```
## Metafeature Generation
Cell type proportions ignoring the spatial location of cells.
```{r}
library(spicyR)
proportions <- getProp(IMC, "description", "ImageNumber")
rownames(proportions) <- names(patientIDToImgID)[match(rownames(proportions), patientIDToImgID)]
proportions[1:5, 1:5]
```
The row names are image numbers and the column names are cell types.
Calculate statistics for colocalisation between all pairs of cell types.
```{r}
library(BiocParallel)
pairsColocated <- getPairwise(IMC, sigma = 50, BPPARAM = MulticoreParam(16),
imageID = "ImageNumber", cellType = "description",
spatialCoordCols = c("Location_Center_X", "Location_Center_Y"))
rownames(pairsColocated) <- names(patientIDToImgID)[match(rownames(pairsColocated), patientIDToImgID)]
pairsColocated[1:5, 1:5]
```
Microenvironments by clustering.
```{r}
library(lisaClust)
IMC <- lisaClust(IMC, k = 10, sigma = 50, BPPARAM = MulticoreParam(16),
imageID = "ImageNumber", cellType = "description",
spatialCoords = c("Location_Center_X", "Location_Center_Y"))
regionsColocal <- getProp(IMC, feature = "region", "ImageNumber") # Extract newly-added information.
rownames(regionsColocal) <- names(patientIDToImgID)[match(rownames(regionsColocal), patientIDToImgID)]
regionsColocal[1:5, 1:5]
```
Average highly-variable gene expression per cell type.
```{r}
library(scFeatures)
colData(IMC)$x_cord <- colData(IMC)$Location_Center_X
colData(IMC)$y_cord <- colData(IMC)$Location_Center_Y
HVGaverages <- scFeatures(assay(IMC, "counts"), feature_types = "gene_mean_celltype", type = "spatial_p",
sample = colData(IMC)$ImageNumber,
celltype = colData(IMC)$description,
ncores = 16)[["gene_mean_celltype"]]
rownames(HVGaverages) <- names(patientIDToImgID)[match(rownames(HVGaverages), patientIDToImgID)]
HVGaverages[1:5, 1:5]
```
Proportions relative to a parent cell type.
```{r}
library(treekoR)
tree <- getClusterTree(t(assay(IMC, "counts")), colData(IMC)$description,
hierarchy_method = "hopach")
proportionsParent <- getCellProp(tree$clust_tree, colData(IMC)$description, colData(IMC)$metabricId,
colData(IMC)$metabricId)
rownames(proportionsParent) <- proportionsParent$sample_id
proportionsParent[is.na(proportionsParent)] <- 0
proportionsParent <- proportionsParent[rownames(proportions), -c(1:2)]
proportionsParent <- proportionsParent[, grep("parent", colnames(proportionsParent))]
proportionsParent[1:5, 1:5]
```
## Survival Time Modelling and C-index Evaluation
Create a list of all data tables.
```{r}
measurements <- list(clinical = clinical, `RNA Microarray` = RNAarrays, `Type Proportions` = proportions,
`Type Protein Mean` = HVGaverages, `Type Pairs Colocated` = pairsColocated,
`Colocated in Regions` = regionsColocal, `Proportion of Parent` = proportionsParent)
sapply(measurements, dim)
```
Cox survival model for each assay individually.
```{r, include = FALSE}
library(ClassifyR)
```
```{r, fig.width = 12}
orderAssays <- c("clinical", "Type Proportions", "RNA Microarray",
"Proportion of Parent", "Colocated in Regions", "Type Protein Mean", "Type Pairs Colocated")
nFeatures = list(clinical = 1:3, `RNA Microarray` = c(5, 10, 20, 50), `Type Proportions` = 1:5,
`Proportion of Parent` = 1:5, `Colocated in Regions` = 1:5,
`Type Protein Mean` = 1:5, `Type Pairs Colocated` = 1:5)
coxPredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
select = list(tuneParams = list(nFeatures = nFeatures)),
tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
selectionMethod = "CoxPH", classifier = "CoxPH", nCores = 20, verbose = 1)
performancePlot(coxPredicts, orderingList = list(`Assay Name` = orderAssays)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + ggtitle("Concordance of Individual Assays")
```
Evaluate the features being repeatedly chosen for the well-performing clinical data.
```{r}
table(unlist(do.call(rbind, chosenFeatureNames(coxPredicts[["clinical.CoxPH.CoxPH"]]))[, "feature"]))
```
Grade, tumour size and stage are all frequently chosen.
Cox survival model for concatenation of assays.
```{r, fig.width = 12}
coxMergePredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
select = list(tuneParams = list(nFeatures = nFeatures)),
tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
multiViewMethod = "merge",
selectionMethod = "CoxPH", classifier = "CoxPH", nRepeats = 20, nCores = 20, verbose = 1)
performancePlot(coxMergePredicts, orderingList = list("Assay Name" = "performanceDescending")) +
theme(axis.text.x = element_text(angle = 45, size = 6, vjust = 1, hjust = 1)) + ggtitle("METABRIC All Merges Comparison")
```
Clinical, Type Protein Mean features.
```{r}
mergeFeatureCounts <- table(unlist(do.call(rbind, chosenFeatureNames(coxMergePredicts[[10]]))[, "feature"]))
mergeFeatureCounts <- sort(mergeFeatureCounts, decreasing = TRUE)
head(mergeFeatureCounts, 10)
```
In addition to the clinical data, Ki67 is a repeatedly selected omics feature. According to Wikipedia, "Ki-67 is an excellent marker to determine the growth fraction of a given cell population. The fraction of Ki-67-positive tumor cells (the Ki-67 labeling index) is often correlated with the clinical course of cancer.". So, it is an unsurprising and to-be-expected predictor variable.
Cox survival model for prevalidation (concatenate each assay as one column to the clinical data).
```{r, fig.width = 12}
coxPrevalPredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
select = list(tuneParams = list(nFeatures = nFeatures)),
tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
multiViewMethod = "prevalidation", nRepeats = 20, nCores = 20,
selectionMethod = "CoxPH", classifier = "CoxPH", verbose = 1)
performancePlot(coxPrevalPredicts) + ggtitle("METABRIC Prevalidation Comparison")
```
Cox survival model for Principal Components Analysis dimensionality reduction (concatenate each principal component as one column to the clinical data).
```{r, fig.width = 12}
nFeatures = list(clinical = 1:3, `RNA Microarray` = 2, `Type Proportions` = 2,
`Proportion of Parent` = 2, `Colocated in Regions` = 2,
`Type Protein Mean` = 2, `Type Pairs Colocated` = 2)
PCApredicts <- crossValidate(measurements, c("timeRFS", "eventRFS"),
multiViewMethod = "PCA", nRepeats = 20, nCores = 20,
extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures)),
select = list(tuneParams = list(nFeatures = nFeatures)),
tuneCross = list(performanceType = "auto", tuneMode = "Resubstitution")),
selectionMethod = "CoxPH", classifier = "CoxPH", verbose = 1)
performancePlot(PCApredicts) + ggtitle("METABRIC PCA Comparison")
```
## Conclusion
Clinical data alone is the best or nearly the best. Lymph nodes invaded and tumour diameter are important for patient survival.