Skip to content
This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Generate merged RNA isoform files #341

Closed
runjin326 opened this issue Jun 2, 2022 · 23 comments
Closed

Generate merged RNA isoform files #341

runjin326 opened this issue Jun 2, 2022 · 23 comments

Comments

@runjin326
Copy link

What data file(s) does this issue pertain to?

Generate a merged RNA isoform tile for v11 data release

What release are you using?

v11

Put your question or report your issue here.

Please merge all the RNA isoform files to one (maybe call it rna-isoform.tsv?) and add to v11.
The RNA BS ids are here.

v11_rna_bsids.txt

@jharenza
Copy link
Collaborator

jharenza commented Jun 3, 2022

@zhangb1 can someone from your team do this merge today? I also haven't checked the files, but presumably this merge file would have the ENSEMBL transcript ids as well as the gene id, correct?

@zhangb1
Copy link

zhangb1 commented Jun 3, 2022

@jharenza , I just came back from vacation today.... I will manager that. won't be finished by today though.

@jharenza
Copy link
Collaborator

jharenza commented Jun 3, 2022

Ok. Can it be done by monday? Also, do you know if the merges are a "set it and forget it" type of thing at this point (eg the files aren't needing to be downloaded)? If still downloading, maybe we can brainstorm to make this more real-time push of a button? Cc @yuankunzhu

@zhangb1
Copy link

zhangb1 commented Jun 3, 2022

@jharenza , we used the code from github repo to merge the RSEM gene files:
https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/dev/analyses/collapse-rnaseq/00-create-rsem-files.R

Also we made the cwl tool based on that to merge the monthly files : https://cavatica.sbgenomics.com/u/d3b-bixu-ops/tumor-broad-monthly-release/apps/#d3b-bixu-ops/tumor-broad-monthly-release/merge-rsem-gene/0

Do you think we can have the R script somewhere for merging the isoform? so we can make the cwl tool to merge monthly like RSEM gene files?

@jharenza
Copy link
Collaborator

jharenza commented Jun 3, 2022

We don't have one for isoform. In the past, for openPBTA, I think bix eng had done a merge. We had not included that yet for OpenPedCan, but would like to now.

@jharenza
Copy link
Collaborator

jharenza commented Jun 3, 2022

@zhangb1 maybe @kgaonkar6 had done this in the past- I found this code: https://github.com/d3b-center/D3b-codes/blob/master/OpenPBTA_v20_release_QC/code/00-create-and-add-rsem-isoforms-files.R

But this also entails a download, which I think we want to get away from so this can be more automated.

@zhangb1
Copy link

zhangb1 commented Jun 3, 2022

@HuangXiaoyan0106 ^ maybe you can check the R script Jo Lynne post above?

Seems very similar to the rsem gene one..

@HuangXiaoyan0106
Copy link

@jharenza
Copy link
Collaborator

jharenza commented Jun 6, 2022

Thank you @HuangXiaoyan0106 !! @runjin326 can you check this file and add to s3 please

@runjin326
Copy link
Author

Sure - there are 3 files:
image
Could you please inform which one is the one to use or do we want to release all 3 of them?

@jharenza
Copy link
Collaborator

jharenza commented Jun 6, 2022

@afarrel can you inform please

@afarrel
Copy link

afarrel commented Jun 7, 2022

TPM

@runjin326
Copy link
Author

@jharenza - TPM file is uploaded + md5sum updated.

@jharenza jharenza added the ready label Jun 8, 2022
@jharenza
Copy link
Collaborator

@zhangb1 @jharenza Have generated the merged RNA isoforms. https://cavatica.sbgenomics.com/u/d3b-bixu-ops/tumor-broad-monthly-release/tasks/bd92b68b-51b5-45f8-8c58-0da30ba03b5d/

@HuangXiaoyan0106 can you update your code to only add ENST id to the transcript_id column and not merge those with gene symbols and then regenerate this matrix? Alternatively for this once, you can remove the gene symbols, but word of caution, there will have to be two splits because of _PAR_Y_ pseudoautosomal regions, which are appended in between some ENST and gene symbols. Example code to get rid of these would be:

# split gene id and symbol
isoforms <- isoforms %>% 
  mutate(transcript_id = str_replace(transcript_id, "_PAR_Y_", "_"))  %>%
  separate(transcript_id, c("transcript_id", "gene_symbol"), sep = "\\_", extra = "merge") %>%
  unique() 

and then we would remove the gene symbol column completely.

Can you update this matrix when you have a chance please? cc @zhangb1 @afarrel @sangeetashukla

@HuangXiaoyan0106
Copy link

@HuangXiaoyan0106 can you update your code to only add ENST id to the transcript_id column and not merge those with gene symbols and then regenerate this matrix? Alternatively for this once, you can remove the gene symbols, but word of caution, there will have to be two splits because of _PAR_Y_ pseudoautosomal regions, which are appended in between some ENST and gene symbols. Example code to get rid of these would be:

@jharenza I only merged them with transcript_id, and the merged isoform format is like this

> head(merged_isoform)
                  transcript_id BS_05CHNS17 BS_0AK4F99X
1    ENST00000000233.9_ARF5-201       12.07       19.76
2    ENST00000000412.7_M6PR-201       13.35        6.41
3  ENST00000000442.10_ESRRA-201        3.66        0.17

And here is my script:

# load libraries
suppressPackageStartupMessages(library(optparse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(stringr))

# read params
option_list <- list(
  make_option(c("--rsemlist"),
    type = "character",
    help = "list of RSEM files"
  ),
  make_option(c("--manifest"),
    type = "character",
    help = "Manifest file "
  ),
  make_option(c("--outdir"),
    type = "character",
    help = "Path to output directory"
  ),
  make_option(c("-o", "--outname_prefix"),
    type = "character",
    help = "outname prefix"
  )
)

# parse the parameters
opt <- parse_args(OptionParser(option_list = option_list))
resemdir <- opt$rsemlist
manifest <- opt$manifest
outdir <- opt$outdir
outname_prefix <- opt$outname_prefix

# read manifest file
manifest <- read.table(manifest, sep=",",head = TRUE)
manifest <- manifest %>%
    select(biospecimen_id, file_name) %>% dplyr::rename(Kids_First_Biospecimen_ID = biospecimen_id, name = file_name)


manifest <- manifest[grep("[.]rsem[.]isoforms[.]results[.]gz", manifest$name), ]

# read and merge RSEM isoform files
read.rsem <- function(x) {
  dat <- data.table::fread(file.path(resemdir, x))
  filename <- x
  sample.id <- manifest[which(manifest$name %in% filename), "Kids_First_Biospecimen_ID"]
  dat$Sample <- sample.id
  return(dat)
}

lfiles <- list.files(path = resemdir, pattern = "*.rsem.isoforms.results.gz", recursive = TRUE)
expr <- lapply(lfiles, read.rsem)
expr <- data.table::rbindlist(expr)

expr.counts <- dcast(expr, transcript_id ~ Sample, value.var = "expected_count") %>%
  # counts
  as.data.frame()
expr.tpm <- dcast(expr, transcript_id ~ Sample, value.var = "TPM") %>%
  # TPM
  as.data.frame()
expr.fpkm <- dcast(expr, transcript_id ~ Sample, value.var = "FPKM") %>%
  # TPM
  as.data.frame()

saveRDS(expr.counts, file = file.path(outdir, paste0(outname_prefix, "-isoform-counts-rsem-expected_count.rds")))
saveRDS(expr.tpm, file = file.path(outdir, paste0(outname_prefix, "-isoform-expression-rsem-tpm.rds")))
saveRDS(expr.fpkm, file = file.path(outdir, paste0(outname_prefix, "-isoform-expression-rsem-fpkm.rds")))

print("Done!")

@jharenza
Copy link
Collaborator

Ah, then we would like to remove gene symbols. Can you use the above code to remove those and only keep ENST ids please?

@HuangXiaoyan0106
Copy link

Ah, then we would like to remove gene symbols. Can you use the above code to remove those and only keep ENST ids please?

Ah? The merged isoform file doesn't contain the gene symbols, the format is as follows.

> head(merged_isoform)
                  transcript_id BS_05CHNS17 BS_0AK4F99X
1    ENST00000000233.9_ARF5-201       12.07       19.76
2    ENST00000000412.7_M6PR-201       13.35        6.41
3  ENST00000000442.10_ESRRA-201        3.66        0.17

@afarrel
Copy link

afarrel commented Jun 20, 2022

Can there be a separate column for gene symbols?

@jharenza
Copy link
Collaborator

@HuangXiaoyan0106 please hold on this until we regroup

@jharenza
Copy link
Collaborator

@HuangXiaoyan0106 will you please add the following to your code, which will split ENSEMBL trancript ids (ENST ids) and the Hugo Gene Symbols) and then also de-duplicate those in pseudoautosomal regions (PAR labels).

# split gene id and symbol
merged_isoform <- merged_isoform %>% 
  mutate(transcript_id = str_replace(transcript_id, "_PAR_Y_", "_"))  %>%
  separate(transcript_id, c("transcript_id", "gene_symbol"), sep = "\\_", extra = "merge") %>%
  unique() 

Also, if it is easier, we only need the tpm file.

Thank you

@HuangXiaoyan0106
Copy link

@zhangb1 zhangb1 removed their assignment Jun 23, 2022
@runjin326
Copy link
Author

@jharenza - I checked the file, it is as expected with transcript and gene separated. Also _PAR_Y_ were removed. I uploaded the file to S3 and updated the md5sum.txt in the s3 as well.

@jharenza
Copy link
Collaborator

jharenza commented Jul 8, 2022

Closed with d3b-center/OpenPedCan-analysis#188

@jharenza jharenza closed this as completed Jul 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants