forked from AlexsLemonade/OpenPBTA-analysis
-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy path01-add-ploidy-cnvkit.Rmd
82 lines (67 loc) · 3.17 KB
/
01-add-ploidy-cnvkit.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
---
title: "Add ploidy column, status to CNVkit output"
output: html_notebook
author: J. Taroni for ALSF CCDL
date: 2019
---
The `histologies.tsv` file contains a `tumor_ploidy` column, which is tumor ploidy as inferred by ControlFreeC.
The copy number information should be interpreted in the light of this information (see: [current version of Data Formats section of README](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/390f1e08e481da5ec0b2c62d886d5fd298bbf017#data-formats)).
This notebook adds ploidy information to the CNVkit results and adds a status column that defines gain and loss broadly.
```{r}
library(dplyr)
```
### Read in data
```{r}
cnvkit_file <- file.path("..", "..", "data", "cnv-cnvkit.seg.gz")
cnvkit_df <- readr::read_tsv(cnvkit_file)
```
```{r}
histologies_file <- file.path("..", "..", "data", "histologies.tsv")
histologies_df <- readr::read_tsv(histologies_file, guess_max = 100000)
```
### Add inferred ploidy information to CNVkit results
```{r}
add_ploidy_df <- histologies_df %>%
select(Kids_First_Biospecimen_ID, tumor_ploidy, germline_sex_estimate) %>%
inner_join(cnvkit_df, by = c("Kids_First_Biospecimen_ID" = "ID")) %>%
select(-tumor_ploidy, -germline_sex_estimate, everything())
```
### Add status column
This is intended to mirror the information contained in the ControlFreeC output.
```{r}
add_autosomes_df <- add_ploidy_df %>%
# x and y chromosomes must be handled differently
filter(!(chrom %in% c("chrX", "chrY"))) %>%
mutate(status = case_when(
# when the copy number is less than inferred ploidy, mark this as a loss
copy.num < tumor_ploidy ~ "loss",
# if copy number is higher than ploidy, mark as a gain
copy.num > tumor_ploidy ~ "gain",
copy.num == tumor_ploidy ~ "neutral"
))
# this logic is consistent with what is observed in the controlfreec file
# specifically, in samples where germline sex estimate = Female, X chromosomes
# appear to be treated the same way as autosomes
add_xy_df <- add_ploidy_df %>%
filter(chrom %in% c("chrX", "chrY")) %>%
mutate(status = case_when(
germline_sex_estimate == "Male" & copy.num < (0.5 * tumor_ploidy) ~ "loss",
germline_sex_estimate == "Male" & copy.num > (0.5 * tumor_ploidy) ~ "gain",
germline_sex_estimate == "Male" & copy.num == (0.5 * tumor_ploidy) ~ "neutral",
# there are some instances where there are chromosome Y segments are in
# samples where the germline_sex_estimate is Female
chrom == "chrY" & germline_sex_estimate == "Female" & copy.num > 0 ~ "unknown",
chrom == "chrY" & germline_sex_estimate == "Female" & copy.num == 0 ~ "neutral",
# mirroring controlfreec, X treated same as autosomes
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num < tumor_ploidy ~ "loss",
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num > tumor_ploidy ~ "gain",
chrom == "chrX" & germline_sex_estimate == "Female" & copy.num == tumor_ploidy ~ "neutral"
))
add_status_df <- dplyr::bind_rows(add_autosomes_df, add_xy_df) %>%
dplyr::select(-germline_sex_estimate)
```
### Write to `scratch`
```{r}
output_file <- file.path("..", "..", "scratch", "cnvkit_with_status.tsv")
readr::write_tsv(add_status_df, output_file)
```