Generate merged RNA isoform files #341

runjin326 · 2022-06-02T20:15:20Z

What data file(s) does this issue pertain to?

Generate a merged RNA isoform tile for v11 data release

What release are you using?

v11

Put your question or report your issue here.

Please merge all the RNA isoform files to one (maybe call it rna-isoform.tsv?) and add to v11.
The RNA BS ids are here.

v11_rna_bsids.txt

The text was updated successfully, but these errors were encountered:

jharenza · 2022-06-03T12:58:04Z

@zhangb1 can someone from your team do this merge today? I also haven't checked the files, but presumably this merge file would have the ENSEMBL transcript ids as well as the gene id, correct?

zhangb1 · 2022-06-03T14:15:21Z

@jharenza , I just came back from vacation today.... I will manager that. won't be finished by today though.

jharenza · 2022-06-03T14:29:32Z

Ok. Can it be done by monday? Also, do you know if the merges are a "set it and forget it" type of thing at this point (eg the files aren't needing to be downloaded)? If still downloading, maybe we can brainstorm to make this more real-time push of a button? Cc @yuankunzhu

zhangb1 · 2022-06-03T14:48:52Z

@jharenza , we used the code from github repo to merge the RSEM gene files:
https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/dev/analyses/collapse-rnaseq/00-create-rsem-files.R

Also we made the cwl tool based on that to merge the monthly files : https://cavatica.sbgenomics.com/u/d3b-bixu-ops/tumor-broad-monthly-release/apps/#d3b-bixu-ops/tumor-broad-monthly-release/merge-rsem-gene/0

Do you think we can have the R script somewhere for merging the isoform? so we can make the cwl tool to merge monthly like RSEM gene files?

jharenza · 2022-06-03T16:10:20Z

We don't have one for isoform. In the past, for openPBTA, I think bix eng had done a merge. We had not included that yet for OpenPedCan, but would like to now.

jharenza · 2022-06-03T16:12:13Z

@zhangb1 maybe @kgaonkar6 had done this in the past- I found this code: https://github.com/d3b-center/D3b-codes/blob/master/OpenPBTA_v20_release_QC/code/00-create-and-add-rsem-isoforms-files.R

But this also entails a download, which I think we want to get away from so this can be more automated.

zhangb1 · 2022-06-03T17:43:39Z

@HuangXiaoyan0106 ^ maybe you can check the R script Jo Lynne post above?

Seems very similar to the rsem gene one..

HuangXiaoyan0106 · 2022-06-05T13:38:03Z

@zhangb1 @jharenza
Have generated the merged RNA isoforms.
https://cavatica.sbgenomics.com/u/d3b-bixu-ops/tumor-broad-monthly-release/tasks/bd92b68b-51b5-45f8-8c58-0da30ba03b5d/

jharenza · 2022-06-06T12:58:51Z

Thank you @HuangXiaoyan0106 !! @runjin326 can you check this file and add to s3 please

runjin326 · 2022-06-06T13:06:09Z

Sure - there are 3 files:

Could you please inform which one is the one to use or do we want to release all 3 of them?

jharenza · 2022-06-06T15:12:05Z

@afarrel can you inform please

afarrel · 2022-06-07T14:31:07Z

TPM

runjin326 · 2022-06-07T19:09:08Z

@jharenza - TPM file is uploaded + md5sum updated.

jharenza · 2022-06-17T16:56:09Z

@zhangb1 @jharenza Have generated the merged RNA isoforms. https://cavatica.sbgenomics.com/u/d3b-bixu-ops/tumor-broad-monthly-release/tasks/bd92b68b-51b5-45f8-8c58-0da30ba03b5d/

@HuangXiaoyan0106 can you update your code to only add ENST id to the transcript_id column and not merge those with gene symbols and then regenerate this matrix? Alternatively for this once, you can remove the gene symbols, but word of caution, there will have to be two splits because of _PAR_Y_ pseudoautosomal regions, which are appended in between some ENST and gene symbols. Example code to get rid of these would be:

# split gene id and symbol
isoforms <- isoforms %>% 
  mutate(transcript_id = str_replace(transcript_id, "_PAR_Y_", "_"))  %>%
  separate(transcript_id, c("transcript_id", "gene_symbol"), sep = "\\_", extra = "merge") %>%
  unique()

and then we would remove the gene symbol column completely.

Can you update this matrix when you have a chance please? cc @zhangb1 @afarrel @sangeetashukla

HuangXiaoyan0106 · 2022-06-20T02:54:52Z

@HuangXiaoyan0106 can you update your code to only add ENST id to the transcript_id column and not merge those with gene symbols and then regenerate this matrix? Alternatively for this once, you can remove the gene symbols, but word of caution, there will have to be two splits because of _PAR_Y_ pseudoautosomal regions, which are appended in between some ENST and gene symbols. Example code to get rid of these would be:

@jharenza I only merged them with transcript_id, and the merged isoform format is like this

> head(merged_isoform)
                  transcript_id BS_05CHNS17 BS_0AK4F99X
1    ENST00000000233.9_ARF5-201       12.07       19.76
2    ENST00000000412.7_M6PR-201       13.35        6.41
3  ENST00000000442.10_ESRRA-201        3.66        0.17

And here is my script:

# load libraries
suppressPackageStartupMessages(library(optparse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(stringr))

# read params
option_list <- list(
  make_option(c("--rsemlist"),
    type = "character",
    help = "list of RSEM files"
  ),
  make_option(c("--manifest"),
    type = "character",
    help = "Manifest file "
  ),
  make_option(c("--outdir"),
    type = "character",
    help = "Path to output directory"
  ),
  make_option(c("-o", "--outname_prefix"),
    type = "character",
    help = "outname prefix"
  )
)

# parse the parameters
opt <- parse_args(OptionParser(option_list = option_list))
resemdir <- opt$rsemlist
manifest <- opt$manifest
outdir <- opt$outdir
outname_prefix <- opt$outname_prefix

# read manifest file
manifest <- read.table(manifest, sep=",",head = TRUE)
manifest <- manifest %>%
    select(biospecimen_id, file_name) %>% dplyr::rename(Kids_First_Biospecimen_ID = biospecimen_id, name = file_name)


manifest <- manifest[grep("[.]rsem[.]isoforms[.]results[.]gz", manifest$name), ]

# read and merge RSEM isoform files
read.rsem <- function(x) {
  dat <- data.table::fread(file.path(resemdir, x))
  filename <- x
  sample.id <- manifest[which(manifest$name %in% filename), "Kids_First_Biospecimen_ID"]
  dat$Sample <- sample.id
  return(dat)
}

lfiles <- list.files(path = resemdir, pattern = "*.rsem.isoforms.results.gz", recursive = TRUE)
expr <- lapply(lfiles, read.rsem)
expr <- data.table::rbindlist(expr)

expr.counts <- dcast(expr, transcript_id ~ Sample, value.var = "expected_count") %>%
  # counts
  as.data.frame()
expr.tpm <- dcast(expr, transcript_id ~ Sample, value.var = "TPM") %>%
  # TPM
  as.data.frame()
expr.fpkm <- dcast(expr, transcript_id ~ Sample, value.var = "FPKM") %>%
  # TPM
  as.data.frame()

saveRDS(expr.counts, file = file.path(outdir, paste0(outname_prefix, "-isoform-counts-rsem-expected_count.rds")))
saveRDS(expr.tpm, file = file.path(outdir, paste0(outname_prefix, "-isoform-expression-rsem-tpm.rds")))
saveRDS(expr.fpkm, file = file.path(outdir, paste0(outname_prefix, "-isoform-expression-rsem-fpkm.rds")))

print("Done!")

jharenza · 2022-06-20T12:37:15Z

Ah, then we would like to remove gene symbols. Can you use the above code to remove those and only keep ENST ids please?

HuangXiaoyan0106 · 2022-06-20T13:37:57Z

Ah, then we would like to remove gene symbols. Can you use the above code to remove those and only keep ENST ids please?

Ah? The merged isoform file doesn't contain the gene symbols, the format is as follows.

> head(merged_isoform)
                  transcript_id BS_05CHNS17 BS_0AK4F99X
1    ENST00000000233.9_ARF5-201       12.07       19.76
2    ENST00000000412.7_M6PR-201       13.35        6.41
3  ENST00000000442.10_ESRRA-201        3.66        0.17

afarrel · 2022-06-20T13:49:38Z

Can there be a separate column for gene symbols?

jharenza · 2022-06-20T14:26:31Z

@HuangXiaoyan0106 please hold on this until we regroup

jharenza · 2022-06-21T16:39:59Z

@HuangXiaoyan0106 will you please add the following to your code, which will split ENSEMBL trancript ids (ENST ids) and the Hugo Gene Symbols) and then also de-duplicate those in pseudoautosomal regions (PAR labels).

# split gene id and symbol
merged_isoform <- merged_isoform %>% 
  mutate(transcript_id = str_replace(transcript_id, "_PAR_Y_", "_"))  %>%
  separate(transcript_id, c("transcript_id", "gene_symbol"), sep = "\\_", extra = "merge") %>%
  unique()

Also, if it is easier, we only need the tpm file.

Thank you

HuangXiaoyan0106 · 2022-06-22T07:02:42Z

@jharenza The task has been completed as per your request.
https://cavatica.sbgenomics.com/u/d3b-bixu-ops/tumor-broad-monthly-release/tasks/d3951a90-ee38-4d91-a470-70403a4a1a12/

runjin326 · 2022-06-24T18:44:03Z

@jharenza - I checked the file, it is as expected with transcript and gene separated. Also _PAR_Y_ were removed. I uploaded the file to S3 and updated the md5sum.txt in the s3 as well.

jharenza · 2022-07-08T17:59:07Z

Closed with d3b-center/OpenPedCan-analysis#188

runjin326 mentioned this issue Jun 2, 2022

v11 release #239

Closed

30 tasks

jharenza assigned zhangb1, HuangXiaoyan0106 and bmennis Jun 2, 2022

jharenza added the v11 label Jun 2, 2022

jharenza added the high priority label Jun 3, 2022

jharenza added the ready label Jun 8, 2022

jharenza added the bix-ops label Jun 17, 2022

zhangb1 removed their assignment Jun 23, 2022

zhangb1 unassigned bmennis Jun 23, 2022

HuangXiaoyan0106 removed their assignment Jun 30, 2022

jharenza mentioned this issue Jul 7, 2022

V11 Release (1/N) d3b-center/OpenPedCan-analysis#188

Merged

5 tasks

jharenza closed this as completed Jul 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate merged RNA isoform files #341

Generate merged RNA isoform files #341

runjin326 commented Jun 2, 2022

jharenza commented Jun 3, 2022

zhangb1 commented Jun 3, 2022

jharenza commented Jun 3, 2022

zhangb1 commented Jun 3, 2022 •

edited

Loading

jharenza commented Jun 3, 2022

jharenza commented Jun 3, 2022

zhangb1 commented Jun 3, 2022

HuangXiaoyan0106 commented Jun 5, 2022

jharenza commented Jun 6, 2022

runjin326 commented Jun 6, 2022

jharenza commented Jun 6, 2022

afarrel commented Jun 7, 2022

runjin326 commented Jun 7, 2022

jharenza commented Jun 17, 2022

HuangXiaoyan0106 commented Jun 20, 2022

jharenza commented Jun 20, 2022

HuangXiaoyan0106 commented Jun 20, 2022

afarrel commented Jun 20, 2022

jharenza commented Jun 20, 2022

jharenza commented Jun 21, 2022

HuangXiaoyan0106 commented Jun 22, 2022

runjin326 commented Jun 24, 2022

jharenza commented Jul 8, 2022

Generate merged RNA isoform files #341

Generate merged RNA isoform files #341

Comments

runjin326 commented Jun 2, 2022

What data file(s) does this issue pertain to?

What release are you using?

Put your question or report your issue here.

jharenza commented Jun 3, 2022

zhangb1 commented Jun 3, 2022

jharenza commented Jun 3, 2022

zhangb1 commented Jun 3, 2022 • edited Loading

jharenza commented Jun 3, 2022

jharenza commented Jun 3, 2022

zhangb1 commented Jun 3, 2022

HuangXiaoyan0106 commented Jun 5, 2022

jharenza commented Jun 6, 2022

runjin326 commented Jun 6, 2022

jharenza commented Jun 6, 2022

afarrel commented Jun 7, 2022

runjin326 commented Jun 7, 2022

jharenza commented Jun 17, 2022

HuangXiaoyan0106 commented Jun 20, 2022

jharenza commented Jun 20, 2022

HuangXiaoyan0106 commented Jun 20, 2022

afarrel commented Jun 20, 2022

jharenza commented Jun 20, 2022

jharenza commented Jun 21, 2022

HuangXiaoyan0106 commented Jun 22, 2022

runjin326 commented Jun 24, 2022

jharenza commented Jul 8, 2022

zhangb1 commented Jun 3, 2022 •

edited

Loading