Get TCGA count matrix #285
Comments
Hi @jharenza, I tried to query in Cavatica, but the files there seem to be old, not the same ones shown in the GDC portal. I also have not been able to download the files from the GDC portal. Can someone show me how to download the counts files, or can someone else do that? |
Here’s the page on the GDC Data Transfer Tool. If you need me to install it or run it, let me know.
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
… On Apr 12, 2022, at 8:34 AM, Bo Zhang ***@***.***> wrote:
Hi @zhangb1. Assigned to you. As mentioned in our meeting today, @taylordm's query can be found here and counts can be converted to TPM using this R function.
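For reference, the counts-to-TPM conversion mentioned here can be sketched in Python roughly as follows. This is not the linked R function; the name `counts_to_tpm` and the `gene_length_bp` input (one length in base pairs per gene) are assumptions made for illustration.

```python
import pandas as pd


def counts_to_tpm(counts: pd.DataFrame, gene_length_bp: pd.Series) -> pd.DataFrame:
    """Convert a genes x samples count matrix to TPM (hypothetical helper)."""
    # Align gene lengths to the rows of the count matrix; a gene with no
    # length entry raises a KeyError rather than being silently dropped.
    lengths_kb = gene_length_bp.loc[counts.index] / 1000.0
    # Reads per kilobase: divide each gene's counts by its length in kb.
    rpk = counts.div(lengths_kb, axis=0)
    # Rescale every sample (column) so that it sums to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1e6
```

Each sample column of the result sums to 1e6, so values are comparable within a sample after accounting for gene length.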
|
Thanks @taylordm. I am launching a Cavatica project to download the data now. Do you know how to download the sample_id information associated with each file? When doing the merge, I think we need the sample ID in the big merged file. I tried searching in the portal, but no luck. If you can get a manifest that includes the sample ID in the TSV file, that would be good. |
All the GENCODE v36 RNA-seq star_gene_counts files are in the Cavatica project: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/files/ (17814 files in total). I am still trying to work out how to get the sample ID associated with these files; then I will merge the files. |
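A rough sketch of the planned file merge, in case it helps. It assumes each star_gene_counts file is a tab-separated table with a `#` comment line, a `gene_id` column, an `unstranded` count column, and STAR summary rows whose IDs start with `N_`; please check these column names against the actual files. The helper names are hypothetical.

```python
import pandas as pd


def read_star_counts(handle, count_column="unstranded"):
    """Read one STAR gene-count table into a Series indexed by gene_id."""
    # Assumed layout: leading '#' comment line, then a tab-separated table.
    table = pd.read_csv(handle, sep="\t", comment="#")
    # Drop STAR summary rows such as N_unmapped or N_ambiguous.
    table = table[~table["gene_id"].str.startswith("N_")]
    return table.set_index("gene_id")[count_column]


def merge_counts(handles_by_id, count_column="unstranded"):
    """Build a genes x samples matrix from {column_id: path or file handle}."""
    columns = {
        column_id: read_star_counts(handle, count_column)
        for column_id, handle in handles_by_id.items()
    }
    # pandas aligns the Series on gene_id; genes missing from a file become NaN.
    return pd.DataFrame(columns)
```

The dictionary keys become the column names, so once the file-to-sample mapping is available they can be sample or aliquot IDs instead of file names.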
@yuankunzhu can you help? |
@yuankunzhu is helping @zhangb1 on this cc @afarrel |
@zhangb1 you can use this script and modify the line 41 to query |
Run the script below:

    import requests
    import json
    import pandas as pd

    # GDC manifest downloaded from the portal; file names are in the 'filename' column
    data = pd.read_csv('gdc_manifest.2022-04-18.txt', sep='\t')
    tcga_filenames = data['filename'].tolist()

    gdc_url = 'https://api.gdc.cancer.gov/files'
    headers = {'Content-Type': 'application/json'}
    fields = [
        'file_name',
        'cases.samples.submitter_id'
    ]
    fields = ','.join(fields)

    ## API request body
    payload = {
        'filters': {
            'op': 'in',  # 'in' is the GDC operator for matching a list of values
            'content': {
                'field': 'file_name',
                'value': tcga_filenames}},
        'format': 'json',
        'fields': fields,
        'size': len(tcga_filenames)  # make sure we get all the returns
    }
    payload = json.dumps(payload)

    ## hit GDC API file endpoint
    gdc_response = requests.post(gdc_url, headers=headers, data=payload)
    gdc_response = gdc_response.json()

    ## iterate .data.hits entity manifest
    for i in gdc_response['data']['hits']:
        for j in i['cases']:
            for k in j['samples']:
                # one line per (sample, file) pair
                print(k['submitter_id'] + "\t" + i['file_name'])

and got returns as gdc-sample-id-return.txt. @zhangb1, you might want to double-check, as I see some sample IDs that look like they are for WES. You can modify the script to query other fields; more details are at the GDC API endpoint. |
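Since the Cavatica project holds 17814 files, one option is to send the file names in batches rather than in a single request. The sketch below only builds the request bodies; `build_payloads` and the batch size of 1000 are assumptions for illustration, and it assumes each file name matches at most one GDC file, so `size` equal to the batch length is enough.

```python
import json


def build_payloads(file_names, fields, batch_size=1000):
    """Yield one JSON request body per batch of file names (hypothetical helper)."""
    for start in range(0, len(file_names), batch_size):
        batch = file_names[start:start + batch_size]
        yield json.dumps({
            "filters": {
                "op": "in",  # match any file name in this batch
                "content": {"field": "file_name", "value": batch},
            },
            "format": "json",
            "fields": ",".join(fields),
            # Assumes at most one hit per file name in the batch.
            "size": len(batch),
        })
```

Each body can then be posted to the files endpoint the same way as in the script, and the `hits` lists from all responses concatenated.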
Okay, I modified the script to get the aliquot ID from GDC, since the sample ID is not unique to the files: all_samples_name_aliquot_id.txt. However, 4 files still do not map to a unique aliquot ID; each has two aliquots attached to it.
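A sketch of how the file-to-aliquot mapping and the duplicate check could look. It assumes the query requested `cases.samples.portions.analytes.aliquots.submitter_id`, so that each hit nests cases, samples, portions, analytes, and aliquots; this is not necessarily the exact modification used, and the function names are hypothetical.

```python
import pandas as pd


def aliquot_rows(hits):
    """Flatten GDC file hits into one row per (file_name, aliquot_id) pair."""
    rows = []
    for hit in hits:
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                for portion in sample.get("portions", []):
                    for analyte in portion.get("analytes", []):
                        for aliquot in analyte.get("aliquots", []):
                            rows.append({
                                "file_name": hit["file_name"],
                                "aliquot_id": aliquot["submitter_id"],
                            })
    return pd.DataFrame(rows, columns=["file_name", "aliquot_id"])


def files_with_multiple_aliquots(mapping):
    """Return the file names that map to more than one distinct aliquot."""
    per_file = mapping.groupby("file_name")["aliquot_id"].nunique()
    return sorted(per_file[per_file > 1].index)
```

Running `files_with_multiple_aliquots` on the full mapping lists the files that need a manual decision before the merge.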
|
@zhangb1 those are TARGET. Are all TCGA matching? If so, 👍🏻 |
I don't think we want TARGET in this batch, as we harmonized them on our own. |
Oh, we only need the aliquot ID including the name |
Yes |
The merged files are in the Cavatica project here: tcga-gene-expression-rsem-tpm-collapsed.rds and tcga-gene-counts-rsem-expected_count-collapsed.rds |
@zhangb1 - could you please put the files in s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v11/ and update the md5sum file? For some reason, Cavatica kept giving me error messages - maybe I do not have access or something. |
What data file(s) does this issue pertain to?
What release are you using?
Currently in v10 release, we have TCGA data from UCSC as part of their toil 20k project.
TCGA recently had a new release mapped to GENCODE v36, and we want to get the data and combine them into releasable gene counts.
Put your question or report your issue here.
Please download the STAR count matrices from the GDC portal and combine them for data release.