Created merged matrices of gene raw counts for data release #389

jharenza · 2022-08-02T17:00:43Z

What data file(s) does this issue pertain to?

We currently only have gene expected counts and not raw counts in the data releases.

What release are you using?

v12

Put your question or report your issue here.

Create:

gene-counts-star-counts-collapsed.rds (containing all RNA-Seq, including gtex)
tcga-gene-counts-star-counts-collapsed.rds

cc @afarrel in case I am missing anything


zhangb1 · 2022-08-16T12:59:48Z

@jharenza where can we find the GTEx RNA-seq counts file? And the TCGA one?

I don't think we have those.

jharenza · 2022-08-16T13:04:50Z

@afarrel can you advise, please? Perhaps the GTEx counts (we are using v8) can be the GENCODE raw counts from here: https://gtexportal.org/home/datasets

For TCGA, you will have to get them from GDC (maybe the ones we previously released are raw counts and not expected counts, but I'm not sure). I believe you had queried the portal. Maybe @yuankunzhu can help with this.

It's possible that only PBTA+GMKF were using expected but we want to be sure all datasets are using raw. Thanks!

HuangXiaoyan0106 · 2022-08-22T03:49:34Z

@jharenza I have prepared the CWL tool, and once the GTEx and TCGA data are ready I can start the process. One thing: please clarify which count value you want to keep in the merged matrices.

As explained in the STAR manual, STAR outputs read counts per gene into the ReadsPerGene.out.tab file with 4 columns, which correspond to different strandedness options:

column 1: gene ID

column 2: counts for unstranded RNA-seq

column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)

column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
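For reference, here is a minimal sketch of reading such a file. It assumes the usual layout, in which the gene rows are preceded by four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous). The function name `parse_reads_per_gene` and the example values are hypothetical and are not taken from the CWL tool.

```python
import csv
import io

# Summary rows that STAR writes at the top of ReadsPerGene.out.tab
SUMMARY_ROWS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}


def parse_reads_per_gene(handle):
    """Return {gene_id: (unstranded, first_strand, second_strand)}.

    The tuple holds columns 2, 3 and 4 of the file, in that order.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] in SUMMARY_ROWS:
            continue  # skip blank lines and the N_* summary rows
        counts[row[0]] = tuple(int(value) for value in row[1:4])
    return counts


# Made-up example content, not real data
example = io.StringIO(
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t10\t200\t20\n"
    "N_ambiguous\t5\t1\t4\n"
    "ENSG00000000003.15\t120\t2\t118\n"
    "ENSG00000000005.6\t0\t0\t0\n"
)
parsed = parse_reads_per_gene(example)
```

Keeping all three count columns at parse time leaves the choice of column to a later step, once the library type of each sample is known.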

jharenza · 2022-08-22T17:07:06Z

Hi @HuangXiaoyan0106. I suspect we will need to use a custom column based on the library types, of which we have:

# A tibble: 5 × 2
  RNA_library         n
  <chr>           <int>
1 exome_capture      11
2 poly-A          27900
3 poly-A stranded   759
4 stranded         1749

@yuankunzhu or @zhangb1 or @afarrel can you confirm which columns go with which library types?

zhangb1 · 2022-08-22T17:52:14Z

Only the poly-A samples are unstranded, so they use column 2.

@jharenza why does poly-A have 27900 samples?

The others are rf-stranded, which need to use column 4.

jharenza · 2022-08-22T17:55:06Z

Ha! TCGA + GTEX + TARGET, I believe!

jharenza · 2022-08-22T17:56:09Z

@HuangXiaoyan0106 can you set a rule, if RNA_library == "poly-A", use column 2, else use column 4?
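The rule could be sketched roughly as follows. This is only an illustration: the function names `star_count_column` and `pick_count` are hypothetical, and the row values are made up.

```python
def star_count_column(rna_library):
    """Return the 1-based ReadsPerGene.out.tab column to keep."""
    # Rule from this thread: "poly-A" is unstranded (column 2);
    # every other library type is treated as rf-stranded (column 4).
    return 2 if rna_library == "poly-A" else 4


def pick_count(row, rna_library):
    """Pick one count from a parsed row [gene_id, col2, col3, col4]."""
    column = star_count_column(rna_library)
    return int(row[column - 1])  # convert 1-based column to 0-based index


row = ["ENSG00000000003.15", "120", "2", "118"]  # made-up values
unstranded_count = pick_count(row, "poly-A")
stranded_count = pick_count(row, "stranded")
```

Because the comparison is an exact string match, "poly-A stranded" and "exome_capture" both fall through to column 4.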

HuangXiaoyan0106 · 2022-08-23T08:31:30Z

> @HuangXiaoyan0106 can you set a rule, if RNA_library == "poly-A", use column 2, else use column 4?

Sure, I can set it, but I need a manifest that includes all related sample_id and RNA_library info.

jharenza · 2022-08-23T11:07:10Z

You can use the v11 histologies file for this @HuangXiaoyan0106 https://cavatica.sbgenomics.com/u/cavatica/opentarget/files/62cc6541baf2a418322dd179/
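A minimal sketch of turning a histologies-style TSV into a sample-to-library lookup is below. The column name `Kids_First_Biospecimen_ID` is an assumption about the file layout (only `RNA_library` is named in this thread), and `library_by_sample` and the sample IDs are hypothetical.

```python
import csv
import io


def library_by_sample(handle, id_column="Kids_First_Biospecimen_ID",
                      library_column="RNA_library"):
    """Map sample ID -> RNA_library from a tab-separated manifest."""
    mapping = {}
    for record in csv.DictReader(handle, delimiter="\t"):
        library = record.get(library_column, "")
        if library and library != "NA":  # skip rows without a library type
            mapping[record[id_column]] = library
    return mapping


# Made-up manifest rows for illustration only
manifest = io.StringIO(
    "Kids_First_Biospecimen_ID\tRNA_library\n"
    "SAMPLE_A\tpoly-A\n"
    "SAMPLE_B\tstranded\n"
    "SAMPLE_C\tNA\n"
)
libraries = library_by_sample(manifest)
```

Rows without a library type (for example, non-RNA samples) are skipped, so the lookup only contains samples for which a count column can be chosen.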

jharenza · 2022-08-29T19:42:51Z

@yuankunzhu can you inform / gather a link for @zhangb1 to generate a raw count matrix for TCGA?

chinwallaa · 2022-10-03T19:50:05Z

Not able to find raw counts from GDC; @chinwallaa will also look and explore GDC for this.

chinwallaa · 2022-10-04T13:36:20Z

@yuankunzhu @zhangb1 Are these the STAR count (raw count - Gene Expression Quantification) data from GDC that we were looking to download?
https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22op%22%3A%22and%22[…]trategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D

chinwallaa · 2022-10-10T19:16:50Z

Matrices need to be created and subsetted (waiting for new matrix generation) to the other expression files for the v12 (GENCODE 36) folder in the S3 bucket. Marking as blocked until the PBTA X01 release.

zhangb1 · 2022-10-10T19:31:41Z

@chinwallaa, the files I downloaded for v11 are GENCODE 36; see the project here: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files ...
So the merged files in v11 are already in GENCODE 36: #285 (comment)

jharenza added the v12 label Aug 2, 2022

jharenza assigned zhangb1 Aug 15, 2022

zhangb1 assigned HuangXiaoyan0106 Aug 16, 2022

zhangb1 added the bix-ops label Aug 18, 2022

jharenza added the blocked label Oct 10, 2022

jharenza closed this as completed Oct 10, 2022
