This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Get TCGA count matrix #285

Closed
runjin326 opened this issue Apr 11, 2022 · 18 comments

@runjin326

runjin326 commented Apr 11, 2022

What data file(s) does this issue pertain to?

tcga-gene-counts-rsem-expected_count-collapsed.rds
tcga-gene-expression-rsem-tpm-collapsed.rds

What release are you using?

In the current v10 release, we have TCGA data from UCSC, processed as part of their Toil 20k project.
TCGA recently published a new release mapped to GENCODE v36, and we want to get those data and combine them into releasable gene count matrices.

Put your question or report your issue here.

Please download the STAR count matrices from the GDC portal and combine them for the data release.

@jharenza
Collaborator

jharenza commented Apr 12, 2022

Hi @zhangb1. Assigned to you. As mentioned in our meeting today, @taylordm's query can be found here and counts can be converted to TPM using this R function.
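
For readers without the linked R function at hand, here is a rough Python sketch of the counts-to-TPM step mentioned above. It applies the standard TPM formula: divide counts by gene length, then scale each sample so it sums to one million. The function name `counts_to_tpm` and the `gene_length_bp` input are assumptions made for illustration; the linked R function may differ in its details.

```python
import pandas as pd


def counts_to_tpm(counts: pd.DataFrame, gene_length_bp: pd.Series) -> pd.DataFrame:
    """Convert a genes x samples count matrix to TPM.

    gene_length_bp is a hypothetical per-gene length in bases,
    indexed by the same gene identifiers as counts.
    """
    lengths_kb = gene_length_bp.reindex(counts.index) / 1_000
    # Reads per kilobase: normalize each gene by its length.
    rpk = counts.div(lengths_kb, axis=0)
    # Scale each sample (column) so that its values sum to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1_000_000


# Toy data, not taken from TCGA.
counts = pd.DataFrame(
    {"sample_a": [10, 20, 30], "sample_b": [0, 50, 50]},
    index=["g1", "g2", "g3"],
)
lengths = pd.Series({"g1": 1000, "g2": 2000, "g3": 3000})
tpm = counts_to_tpm(counts, lengths)
```

Every column of `tpm` sums to one million, so values are comparable across samples regardless of sequencing depth.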

jharenza added the v11 label Apr 12, 2022
@zhangb1

zhangb1 commented Apr 12, 2022

> Hi @zhangb1. Assigned to you. As mentioned in our meeting today, @taylordm's query can be found here and counts can be converted to TPM using this R function.

Hi @jharenza, I tried the query in Cavatica, but the files it returns seem to be old ones, not the same files shown in the GDC portal.

I also could not download the files from the GDC portal. Can someone show me how to download the count files, or could someone else do that?

@taylordm

taylordm commented Apr 12, 2022 via email

@zhangb1

zhangb1 commented Apr 12, 2022

Thanks @taylordm. I am launching a Cavatica project to download the data now.

Do you know how to download the sample ID information associated with each file? I think we need the sample IDs when merging everything into one big file. I searched the portal but had no luck. If you can get a manifest that includes the sample ID in the TSV file, that would be good.

@zhangb1

zhangb1 commented Apr 13, 2022

All the GENCODE v36 RNA-seq star_gene_counts files are in this Cavatica project: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/

There are 17814 files in total. I am still trying to work out how to get the sample IDs associated with these files. After that I will merge the files.
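
As an illustration of the merge step described above, here is a minimal pandas sketch that combines per-file count tables into one genes x samples matrix. The column names `gene_id` and `unstranded` are assumptions about the layout of the downloaded TSV files, and `merge_count_tables` and `id_map` are hypothetical names, not part of any script in this thread.

```python
import pandas as pd


def merge_count_tables(tables, id_map, value_col="unstranded"):
    """Combine per-file count tables into one genes x samples matrix.

    tables maps file name -> DataFrame with a 'gene_id' column (assumed layout).
    id_map maps file name -> the ID to use as the column name.
    """
    columns = {}
    for file_name, table in tables.items():
        columns[id_map[file_name]] = table.set_index("gene_id")[value_col]
    # concat aligns rows on gene_id; a gene missing from a file becomes NaN.
    return pd.concat(columns, axis=1)


# Toy tables standing in for two downloaded files.
tables = {
    "file_1.tsv": pd.DataFrame({"gene_id": ["ENSG1", "ENSG2"], "unstranded": [5, 0]}),
    "file_2.tsv": pd.DataFrame({"gene_id": ["ENSG1", "ENSG2"], "unstranded": [3, 7]}),
}
id_map = {"file_1.tsv": "SAMPLE-A", "file_2.tsv": "SAMPLE-B"}
merged = merge_count_tables(tables, id_map)
```

In practice each table would come from `pd.read_csv(path, sep='\t', comment='#')` or similar, depending on how the files are formatted.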

@jharenza
Collaborator

@yuankunzhu can you help?

@jharenza
Collaborator

@yuankunzhu is helping @zhangb1 on this cc @afarrel

@yuankunzhu
Member

@zhangb1 you can use this script and modify line 41 to query cases.samples.submitter_id by file name

@yuankunzhu
Member

I ran the script below:

import requests
import json
import pandas as pd

data = pd.read_csv('gdc_manifest.2022-04-18.txt',sep='\t')
tcga_filenames = data['filename'].tolist()

gdc_url = 'https://api.gdc.cancer.gov/files'
headers = {'Content-Type': 'application/json'}

fields = [
    'file_name',
    'cases.samples.submitter_id'
]
fields = ','.join(fields)

## API request body
payload = {
        'filters':{
            'op':'in', # 'in' matches any file name in the list
            'content':{
                'field':'file_name',
                'value':tcga_filenames}},
        'format':'json',
        'fields':fields,
        'size':len(tcga_filenames) # one hit per file name, so no results are cut off
}
payload = json.dumps(payload)

## hit GDC API file endpoint; stop early on an HTTP error
gdc_response = requests.post(gdc_url, headers=headers, data=payload)
gdc_response.raise_for_status()
gdc_response = gdc_response.json()

## print one "sample submitter ID <tab> file name" line per sample of each hit
for i in gdc_response['data']['hits']:
    for j in i.get('cases', []):
        for k in j.get('samples', []):
            print(k['submitter_id']+"\t"+i['file_name'])

and got the results in gdc-sample-id-return.txt

@zhangb1 you might want to double-check, as some of the sample IDs look like they are for WES. You can modify the script to query other fields; more details are at the GDC API endpoint
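
Since a manifest can list more files than a single request's `size` covers, one option is to split the file names into batches and send one request per batch. The sketch below only builds the JSON bodies. `build_payloads` and `batch_size` are hypothetical names, and setting `size` to the batch length assumes each file name matches at most one file record.

```python
import json


def build_payloads(file_names, fields, batch_size=1000):
    """Yield one JSON request body per batch of file names."""
    for start in range(0, len(file_names), batch_size):
        batch = file_names[start:start + batch_size]
        yield json.dumps({
            "filters": {
                "op": "in",  # match any file name in this batch
                "content": {"field": "file_name", "value": batch},
            },
            "format": "json",
            "fields": ",".join(fields),
            "size": len(batch),  # assumes one hit per file name
        })


# Placeholder file names, used only to show the batching.
example_names = [f"file_{i}.tsv" for i in range(2500)]
payloads = list(build_payloads(example_names, ["file_name", "cases.samples.submitter_id"]))
```

Each body can then be sent with `requests.post(gdc_url, headers=headers, data=body)` as in the script above, collecting the `hits` from every response.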

@zhangb1

zhangb1 commented Apr 18, 2022

Okay, I modified the script to get the aliquot ID from GDC, since the sample ID is not unique to each file.

all_samples_name_aliquot_id.txt
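
The modified script is not shown in this thread. As a sketch of what the aliquot lookup could look like, the function below walks the nested hits returned when the request's `fields` include `cases.samples.portions.analytes.aliquots.submitter_id`. The function name `aliquot_pairs` and the example hit are made up for illustration.

```python
def aliquot_pairs(hits):
    """Return (file_name, aliquot submitter ID) pairs from GDC file hits."""
    pairs = []
    for hit in hits:
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                for portion in sample.get("portions", []):
                    for analyte in portion.get("analytes", []):
                        for aliquot in analyte.get("aliquots", []):
                            pairs.append((hit["file_name"], aliquot["submitter_id"]))
    return pairs


# Made-up hit with the same nesting as the requested field path.
example_hits = [{
    "file_name": "example.rna_seq.augmented_star_gene_counts.tsv",
    "cases": [{"samples": [{"portions": [{"analytes": [{"aliquots": [
        {"submitter_id": "EXAMPLE-ALIQUOT-1"},
    ]}]}]}]}],
}]
pairs = aliquot_pairs(example_hits)
```

With a real response, `hits` would be `gdc_response['data']['hits']` from the script above.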

But there are still 4 files that do not map to a unique aliquot ID; each has two aliquots attached to it.

0b66c95c-a103-4c7e-99c5-431e89ee1cb3.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PANKFE-01A-01R
0b66c95c-a103-4c7e-99c5-431e89ee1cb3.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAPUAR-01A-01R
6718fb2d-efbe-4fd1-b86a-6c37c515041d.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAPEAV-01A-01R
6718fb2d-efbe-4fd1-b86a-6c37c515041d.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAPTFZ-01A-01R
752f448c-c9c8-4de8-9fbc-f488ef8a1580.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PANUKV-01A-01R
752f448c-c9c8-4de8-9fbc-f488ef8a1580.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PASUML-01A-01R
a53de3f2-9fa3-4099-a5bc-6103e98ba587.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PAIXIF-01A-01R
a53de3f2-9fa3-4099-a5bc-6103e98ba587.rna_seq.augmented_star_gene_counts.tsv	 TARGET-30-PASYPX-01A-01R
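
A check like the one that surfaced the rows above can be sketched as follows: group the (file name, aliquot ID) pairs by file and keep only the files with more than one aliquot. The function name is hypothetical. The first two pairs are copied from the listing above; the third is a made-up single-aliquot file for contrast.

```python
from collections import defaultdict


def files_with_multiple_aliquots(pairs):
    """Map each file name to its aliquot IDs, keeping only files with more than one."""
    by_file = defaultdict(set)
    for file_name, aliquot_id in pairs:
        by_file[file_name].add(aliquot_id)
    return {name: sorted(ids) for name, ids in by_file.items() if len(ids) > 1}


pairs = [
    ("0b66c95c-a103-4c7e-99c5-431e89ee1cb3.rna_seq.augmented_star_gene_counts.tsv", "TARGET-30-PANKFE-01A-01R"),
    ("0b66c95c-a103-4c7e-99c5-431e89ee1cb3.rna_seq.augmented_star_gene_counts.tsv", "TARGET-30-PAPUAR-01A-01R"),
    ("single_aliquot_example.tsv", "EXAMPLE-ALIQUOT-1"),  # made-up row
]
ambiguous = files_with_multiple_aliquots(pairs)
```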

@jharenza

@jharenza
Collaborator

@zhangb1 those are TARGET. Are all TCGA matching? If so, 👍🏻

@zhangb1

zhangb1 commented Apr 19, 2022

> @zhangb1 those are TARGET. Are all TCGA matching? If so, 👍🏻

But which ID should I use for these 4 samples? @jharenza

@jharenza
Collaborator

I don't think we want TARGET in this batch, as we harmonized them on our own.

@zhangb1

zhangb1 commented Apr 19, 2022

Oh, so we only need the files whose aliquot ID contains TCGA?
That will be 11123 samples (gene counts TSV files) out of the 17814, then.

@jharenza
Collaborator

Yes
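
The TCGA-only selection agreed on above can be sketched as a simple prefix filter on the aliquot ID. The `keep_tcga` name is hypothetical, and the TCGA-style ID in the example is a placeholder rather than a real barcode.

```python
def keep_tcga(pairs, prefix="TCGA-"):
    """Keep only (file_name, aliquot_id) pairs whose aliquot ID starts with the prefix."""
    return [(file_name, aliquot_id) for file_name, aliquot_id in pairs
            if aliquot_id.startswith(prefix)]


pairs = [
    ("target_example.tsv", "TARGET-30-PANKFE-01A-01R"),
    ("tcga_example.tsv", "TCGA-XX-XXXX-01A-01R"),  # placeholder ID
]
tcga_only = keep_tcga(pairs)
```

Filtering on the prefix also drops the four TARGET files with two aliquots each, so no ID choice is needed for them.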

@zhangb1

zhangb1 commented Apr 21, 2022

The merged files are in the Cavatica project here:

tcga-gene-expression-rsem-tpm-collapsed.rds

https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/62614f524d85bc2e024aafea/

tcga-gene-counts-rsem-expected_count-collapsed.rds

https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/626157224d85bc2e024abc5f/

jharenza added the ready label May 2, 2022
@runjin326
Author

@zhangb1 - could you please put the files in s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v11/ and update the md5sum file? For some reason, Cavatica kept giving me error messages - maybe I do not have access or something.
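
For the md5sum update requested above, a small Python sketch that produces lines in the usual `md5sum` layout (hex digest, two spaces, file name) might look like this. The function names are hypothetical, and the layout of the project's actual md5sum file is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_lines(paths):
    """Return one '<digest>  <file name>' line per path."""
    return [f"{md5_of_file(path)}  {Path(path).name}" for path in paths]
```

Reading in chunks keeps memory use flat even for large `.rds` files. The resulting lines can be appended to the existing md5sum file, for example with `open(..., "a")`.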

@jharenza
Collaborator

jharenza commented Jul 8, 2022
