This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Created merged matrices of gene raw counts for data release #389

Closed
jharenza opened this issue Aug 2, 2022 · 14 comments

Comments

@jharenza
Collaborator

jharenza commented Aug 2, 2022

What data file(s) does this issue pertain to?

We currently only have gene expected counts and not raw counts in the data releases.

What release are you using?

v12

Put your question or report your issue here.

Create:

gene-counts-star-counts-collapsed.rds (containing all RNA-Seq, including gtex)
tcga-gene-counts-star-counts-collapsed.rds

cc @afarrel in case I am missing anything

@zhangb1

zhangb1 commented Aug 16, 2022

@jharenza where can we find the GTEx RNA-seq counts file? Also the TCGA one?

I don't think we have those.

@jharenza
Collaborator Author

@afarrel can you advise, please? Perhaps the GTEx counts (we are using v8) can be the GENCODE raw counts from here: https://gtexportal.org/home/datasets

For TCGA, you will have to get from GDC (maybe the ones we previously released are raw counts and not expected, but I'm not sure). I believe you had queried the portal. Maybe @yuankunzhu can help with this.

It's possible that only PBTA+GMKF were using expected but we want to be sure all datasets are using raw. Thanks!

@HuangXiaoyan0106

HuangXiaoyan0106 commented Aug 22, 2022

@jharenza I have prepared the CWL tool, and once the GTEx and TCGA data are ready I can start the process. One thing: please clarify which count value you want to keep in the merged matrices.

As explained in the STAR manual, STAR outputs read counts per gene into the ReadsPerGene.out.tab file with 4 columns, which correspond to different strandedness options:

column 1: gene ID

column 2: counts for unstranded RNA-seq

column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)

column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
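For illustration, the four-column layout described above could be read as follows. This is only a minimal sketch, not the CWL tool mentioned in this comment: the `EXAMPLE` content, the gene names, and the helper `read_reads_per_gene` are all hypothetical. It assumes the file begins with STAR's summary rows whose names start with `N_` (for example `N_unmapped`), which are skipped.

```python
import csv
import io

# Hypothetical file content for illustration only; gene names and numbers are made up.
EXAMPLE = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t30\t400\t35\n"
    "N_ambiguous\t20\t5\t6\n"
    "geneA\t120\t2\t118\n"
    "geneB\t0\t0\t0\n"
)


def read_reads_per_gene(handle):
    """Return {gene_id: (col2, col3, col4)} from a ReadsPerGene.out.tab stream.

    col2 = unstranded counts, col3 = 1st read strand, col4 = 2nd read strand.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("N_"):
            continue  # skip blank lines and the N_* summary rows
        gene_id, unstranded, first_strand, second_strand = row[:4]
        counts[gene_id] = (int(unstranded), int(first_strand), int(second_strand))
    return counts


genes = read_reads_per_gene(io.StringIO(EXAMPLE))
print(genes["geneA"])
```

For a real file, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` wrapper.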

@jharenza
Collaborator Author

Hi @HuangXiaoyan0106. I suspect we will need to use a custom column based on the library types, of which we have:

# A tibble: 5 × 2
  RNA_library         n
  <chr>           <int>
1 exome_capture      11
2 poly-A          27900
3 poly-A stranded   759
4 stranded         1749

@yuankunzhu or @zhangb1 or @afarrel can you tell us which columns go with which library types?

@zhangb1

zhangb1 commented Aug 22, 2022

Only the poly-A samples are unstranded, so they use column 2.

@jharenza why does poly-A have 27900 samples?

The others are RF-stranded, so they need column 4.

@jharenza
Collaborator Author

Ha! TCGA + GTEX + TARGET, I believe!

@jharenza
Collaborator Author

@HuangXiaoyan0106 can you set a rule, if RNA_library == "poly-A", use column 2, else use column 4?
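As a rough sketch of that rule (the function name `count_column_for` and the constants are hypothetical, and this is not the project's actual implementation), the column choice could look like this. The comparison is an exact string match, so "poly-A stranded" falls into the column 4 branch, consistent with the library types listed earlier in the thread.

```python
# 0-based positions within a ReadsPerGene.out.tab row (position 0 is the gene ID).
UNSTRANDED_COL = 1  # "column 2": counts for unstranded RNA-seq
REVERSE_COL = 3     # "column 4": counts for the 2nd read strand


def count_column_for(rna_library):
    """Return the 0-based column index: poly-A uses column 2, all else column 4."""
    return UNSTRANDED_COL if rna_library == "poly-A" else REVERSE_COL


for library in ["exome_capture", "poly-A", "poly-A stranded", "stranded"]:
    # Add 1 to report the column the way the thread numbers it.
    print(library, "-> column", count_column_for(library) + 1)
```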

@HuangXiaoyan0106

@HuangXiaoyan0106 can you set a rule, if RNA_library == "poly-A", use column 2, else use column 4?

Sure, I can set it, but I need a manifest that includes all related sample_id and RNA_library info.
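To make the manifest requirement concrete, here is a small sketch of how a manifest with `sample_id` and `RNA_library` could drive the merge into a gene-by-sample matrix. The sample IDs, gene names, counts, and the `merge_counts` helper are hypothetical placeholders; per-sample counts are assumed to be already parsed into tuples of (column 2, column 3, column 4).

```python
# Hypothetical manifest: one entry per sample, with the two fields requested above.
manifest = [
    {"sample_id": "S1", "RNA_library": "poly-A"},
    {"sample_id": "S2", "RNA_library": "stranded"},
]

# Hypothetical parsed counts: sample_id -> gene_id -> (column 2, column 3, column 4).
per_sample = {
    "S1": {"geneA": (120, 60, 60), "geneB": (7, 3, 4)},
    "S2": {"geneA": (90, 1, 89), "geneB": (0, 0, 0)},
}


def merge_counts(manifest, per_sample):
    """Build {gene_id: {sample_id: count}}, picking the column by library type."""
    matrix = {}
    for entry in manifest:
        sample_id = entry["sample_id"]
        # Tuple index 0 holds column 2 (unstranded); index 2 holds column 4.
        idx = 0 if entry["RNA_library"] == "poly-A" else 2
        for gene_id, cols in per_sample[sample_id].items():
            matrix.setdefault(gene_id, {})[sample_id] = cols[idx]
    return matrix


matrix = merge_counts(manifest, per_sample)
print(matrix["geneA"])
```

A sample missing from the manifest is simply left out of the matrix here; a real pipeline would probably want to report such samples instead.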

@jharenza
Collaborator Author

@yuankunzhu can you advise or gather a link for @zhangb1 to generate a raw count matrix for TCGA?

@chinwallaa

Not able to find raw counts on GDC; @chinwallaa will also look and explore GDC for this.

@chinwallaa

@yuankunzhu @zhangb1 Are these the star-count (raw count, Gene Expression Quantification) data from GDC that we were looking to download?
https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22op%22%3A%22and%22[…]trategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D

@chinwallaa

Matrices need to be created and subsetted (waiting for new matrix generation) to the other expression files for the v12 (GENCODE 36) folder in the S3 bucket. Mark as blocked until the PBTA X01 release.

@zhangb1

zhangb1 commented Oct 10, 2022

@chinwallaa, the files I downloaded for v11 are GENCODE 36; see the project here: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files ...
So the merged files in v11 are already in GENCODE 36. #285 (comment)


4 participants