Get TCGA count matrix #285

runjin326 · 2022-04-11T20:30:39Z

What data file(s) does this issue pertain to?

tcga-gene-counts-rsem-expected_count-collapsed.rds
tcga-gene-expression-rsem-tpm-collapsed.rds

What release are you using?

The current v10 release includes TCGA data from UCSC, produced as part of their Toil 20k project.
TCGA recently published a new release mapped to GENCODE v36, and we want to get those data and combine them into releasable gene count matrices.

Put your question or report your issue here.

Please download the STAR count files from the GDC portal and combine them for the data release.


jharenza · 2022-04-12T00:57:00Z

Hi @zhangb1. Assigned to you. As mentioned in our meeting today, @taylordm's query can be found here and counts can be converted to TPM using this R function.
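For readers who do not have the linked R function at hand, the counts-to-TPM conversion can be sketched in Python. This is a minimal sketch and not the linked function: `counts_to_tpm` and its arguments are hypothetical names, and it assumes a genes x samples count table plus one gene length in base pairs per gene.

```python
import pandas as pd


def counts_to_tpm(counts: pd.DataFrame, gene_lengths_bp: pd.Series) -> pd.DataFrame:
    """Convert a genes x samples count table to TPM.

    Hypothetical helper: `gene_lengths_bp` is assumed to be indexed by the
    same gene IDs as `counts`.
    """
    lengths_kb = gene_lengths_bp.reindex(counts.index) / 1_000
    # Reads per kilobase: divide each gene's counts by its length in kb.
    rpk = counts.div(lengths_kb, axis=0)
    # Scale every sample (column) so that its values sum to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1_000_000
```

Normalizing by length before scaling to one million is what distinguishes TPM from simple counts-per-million.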

zhangb1 · 2022-04-12T12:34:34Z

Hi @jharenza, I tried the query in Cavatica, but the files it returns seem to be old ones, not the same files shown in the GDC portal.

I have not downloaded the files from the GDC portal either. Can someone show me how to download the counts files, or could someone else do that?

taylordm · 2022-04-12T12:39:14Z

Here’s the page on the GDC transfer tool. If you need me to install or run it, let me know: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool


zhangb1 · 2022-04-12T14:39:54Z

Thanks @taylordm. I am launching a Cavatica project to download the data now.

Do you know how to download the sample ID information associated with each file? When doing the merge, I think we need the sample IDs in the big merged file. I tried searching the portal, but no luck. If you can get a manifest that includes the sample ID in the TSV file, that would be good.

zhangb1 · 2022-04-13T12:54:20Z

All the GENCODE v36 RNA-seq star_gene_counts files are in this Cavatica project: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/

There are 17814 files in total. I am still trying to work out how to get the sample IDs associated with these files. Then I will merge the files.

jharenza · 2022-04-13T13:06:57Z

@yuankunzhu can you help?

jharenza · 2022-04-18T19:41:14Z

@yuankunzhu is helping @zhangb1 on this cc @afarrel

yuankunzhu · 2022-04-18T20:33:33Z

@zhangb1 you can use this script and modify line 41 to query cases.samples.submitter_id by file name

yuankunzhu · 2022-04-18T20:50:44Z

I ran the script below:

import requests
import json
import pandas as pd

data = pd.read_csv('gdc_manifest.2022-04-18.txt',sep='\t')
tcga_filenames = data['filename'].tolist()

gdc_url = 'https://api.gdc.cancer.gov/files'
headers = {'Content-Type': 'application/json'}

fields = [
    'file_name',
    'cases.samples.submitter_id'
]
fields = ','.join(fields)

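## API request body: match any file name listed in the manifest
payload = {
        'filters':{
            'op':'in', # 'in' matches a field against a list of values
            'content':{
                'field':'file_name',
                'value':tcga_filenames}},
        'format':'json',
        'fields':fields,
        'size':len(tcga_filenames) # one hit per file name, so no results are cut off
}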
payload = json.dumps(payload)

## hit GDC API file endpoint
gdc_response = requests.post(gdc_url, headers=headers, data=payload)
gdc_response = gdc_response.json()

## iterate .data.hits entity manifest
for i in gdc_response['data']['hits']:
    for j in i['cases']:
        for k in j['samples']:
            print(k['submitter_id']+"\t"+i['file_name'])

and got the results in gdc-sample-id-return.txt

@zhangb1 you might want to double-check, as I see some sample IDs that look like they are for WES. You can modify the script to query other fields; more details are at the GDC API endpoint
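If the manifest lists many thousands of files, one way to keep each request small is to query in batches. The following is a minimal sketch under stated assumptions: `chunked` and `build_payload` are hypothetical helper names, the batch size is arbitrary, and it assumes each file name matches at most one GDC file record.

```python
import json
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payload(file_names: List[str], fields: List[str]) -> str:
    """Build a JSON request body for the GDC files endpoint (hypothetical helper)."""
    return json.dumps({
        "filters": {
            "op": "in",  # match file_name against any value in the list
            "content": {"field": "file_name", "value": file_names},
        },
        "format": "json",
        "fields": ",".join(fields),
        "size": len(file_names),  # assumes at most one hit per file name
    })
```

Each payload can then be posted to the files endpoint the same way as in the script above, and the hits from all batches concatenated.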

zhangb1 · 2022-04-18T22:57:01Z

Okay, I modified the script to get the aliquot ID from GDC, since the sample ID is not unique per file.

all_samples_name_aliquot_id.txt
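The modified script is not shown in this thread. A rough sketch of what such a change could look like follows, assuming the aliquot ID is requested through the field `cases.samples.portions.analytes.aliquots.submitter_id` and that the response nests in the same order as that field path. `iter_file_aliquots` is a hypothetical helper name.

```python
from typing import Dict, Iterator, Tuple

# Assumed field path for the aliquot submitter ID; add it to the `fields` list.
ALIQUOT_FIELD = "cases.samples.portions.analytes.aliquots.submitter_id"


def iter_file_aliquots(hit: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (file_name, aliquot submitter_id) pairs from one files-endpoint hit."""
    for case in hit.get("cases", []):
        for sample in case.get("samples", []):
            for portion in sample.get("portions", []):
                for analyte in portion.get("analytes", []):
                    for aliquot in analyte.get("aliquots", []):
                        yield hit["file_name"], aliquot["submitter_id"]
```

A file attached to more than one aliquot yields more than one pair, which is how repeated file names end up in the output list.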

However, 4 files still do not map to a unique aliquot ID; each has two aliquots attached to it.

0b66c95c-a103-4c7e-99c5-431e89ee1cb3.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PANKFE-01A-01R
0b66c95c-a103-4c7e-99c5-431e89ee1cb3.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAPUAR-01A-01R
6718fb2d-efbe-4fd1-b86a-6c37c515041d.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAPEAV-01A-01R
6718fb2d-efbe-4fd1-b86a-6c37c515041d.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAPTFZ-01A-01R
752f448c-c9c8-4de8-9fbc-f488ef8a1580.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PANUKV-01A-01R
752f448c-c9c8-4de8-9fbc-f488ef8a1580.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PASUML-01A-01R
a53de3f2-9fa3-4099-a5bc-6103e98ba587.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAIXIF-01A-01R
a53de3f2-9fa3-4099-a5bc-6103e98ba587.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PASYPX-01A-01R
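A check like this can be automated once the file-to-aliquot mapping is loaded into a table. The sketch below assumes a pandas DataFrame with two columns named `file_name` and `aliquot_id`; those column names and the function name are hypothetical, since the attached text file has no header shown here.

```python
import pandas as pd


def files_with_multiple_aliquots(mapping: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `mapping` whose file maps to more than one aliquot ID."""
    # Count distinct aliquot IDs per file name.
    n_aliquots = mapping.groupby("file_name")["aliquot_id"].nunique()
    ambiguous = n_aliquots[n_aliquots > 1].index
    return mapping[mapping["file_name"].isin(ambiguous)]
```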

@jharenza

jharenza · 2022-04-18T23:15:21Z

@zhangb1 those are TARGET. Are all TCGA matching? If so, 👍🏻

zhangb1 · 2022-04-19T13:05:42Z

But which ID should I use for these 4 samples? @jharenza

jharenza · 2022-04-19T13:19:38Z

I don't think we want TARGET in this batch, as we harmonized them on our own.

zhangb1 · 2022-04-19T13:28:47Z

Oh, so we only need the files whose aliquot ID is a TCGA sample?
That would be 11123 samples (gene counts TSV files) out of the 17814, then.

jharenza · 2022-04-19T14:38:51Z

Yes
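Restricting the mapping to TCGA can be done by ID prefix. This is a minimal sketch, assuming the aliquot IDs sit in a column named `aliquot_id` (a hypothetical name) and that TCGA IDs begin with `TCGA-`, as the TARGET IDs above begin with `TARGET-`.

```python
import pandas as pd


def keep_tcga(mapping: pd.DataFrame, id_column: str = "aliquot_id") -> pd.DataFrame:
    """Keep only rows whose ID starts with the 'TCGA-' prefix."""
    # na=False treats missing IDs as non-matching instead of propagating NaN.
    is_tcga = mapping[id_column].str.startswith("TCGA-", na=False)
    return mapping[is_tcga].reset_index(drop=True)
```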

zhangb1 · 2022-04-21T13:22:28Z

The merged files are in the Cavatica project here:

tcga-gene-expression-rsem-tpm-collapsed.rds

https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/62614f524d85bc2e024aafea/

tcga-gene-counts-rsem-expected_count-collapsed.rds

https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/626157224d85bc2e024abc5f/
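The thread does not show how the merge was done. One possible shape for it, as a sketch only: it assumes each per-sample TSV has a leading `#` comment line, a `gene_id` column, a count column named `unstranded`, and STAR summary rows whose `gene_id` starts with `N_`. The function name and the sample-to-path dictionary are hypothetical.

```python
from typing import Dict

import pandas as pd


def merge_star_counts(sample_to_path: Dict[str, str],
                      count_column: str = "unstranded") -> pd.DataFrame:
    """Combine per-sample count files into one genes x samples table."""
    columns = {}
    for sample_id, path in sample_to_path.items():
        # comment="#" skips the assumed leading gene-model comment line.
        table = pd.read_csv(path, sep="\t", comment="#")
        # Drop summary rows such as N_unmapped, which are not genes.
        table = table[~table["gene_id"].str.startswith("N_")]
        columns[sample_id] = table.set_index("gene_id")[count_column]
    # Series are aligned on gene_id; one column per sample.
    return pd.DataFrame(columns)
```

Keying the columns by aliquot or sample ID rather than by file name is what makes the ID lookup earlier in the thread necessary.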

runjin326 · 2022-05-04T12:34:21Z

@zhangb1 - could you please put the files in s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v11/ and update the md5sum file? For some reason, Cavatica kept giving me error messages - maybe I do not have access or something.
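For the md5sum update, the checksum of each new file can be computed as in the sketch below. The layout of the release's md5sum file is not shown in this thread, so the sketch assumes the conventional `md5sum` output of digest, two spaces, then file name; both function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large matrices need not fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def md5sum_line(path: str) -> str:
    """Format one entry the way the `md5sum` command prints it (assumed layout)."""
    return f"{md5_of_file(path)}  {os.path.basename(path)}"
```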

jharenza · 2022-07-08T18:11:39Z

d3b-center/OpenPedCan-analysis#188

jharenza assigned zhangb1 Apr 12, 2022

jharenza added the v11 label Apr 12, 2022

jharenza added the ready label May 2, 2022

runjin326 mentioned this issue May 25, 2022

v11 release #239

Closed

30 tasks

jharenza mentioned this issue Jul 7, 2022

V11 Release (1/N) d3b-center/OpenPedCan-analysis#188

Merged

5 tasks

jharenza closed this as completed Jul 8, 2022

zhangb1 mentioned this issue Oct 10, 2022

Created merged matrices of gene raw counts for data release #389

Closed
