add key for genomic reference to template #352

Closed
jaybee84 opened this issue Oct 11, 2023 · 6 comments
@jaybee84
Contributor

List the new key
Using OLS or a similar resource, please find a standard term that fits your needs. OLS-listed NCIT and OBI dictionaries are preferred sources. Non-OLS sources of ground truth for specific items such as ONCOTREE for cancer types, or GEO for platform definitions are also appropriate.
Key: reference_genome_build

Provide definition and source for the new key
Definition: The source-specific version of the published genome assembly. [NCI]
Source url: https://www.ebi.ac.uk/ols/ontologies/ncit/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCIT_C164815

Please describe why this concept is necessary
For aligned read files, it is important to know which reference genome the reads were aligned to. This helps users decide on downstream analytical steps (e.g., whether to re-align or lift over to a newer build) when re-using the data.

Please describe the situations in which this concept may apply
This concept is important if a user 1) is looking for additional data that matches their own analytical dataset (i.e. aligned to the same genome build), or 2) needs to decide on downstream analytical steps after downloading a dataset from the NF Data Portal.

@jaybee84 jaybee84 added the key label Oct 11, 2023
@anngvu
Collaborator

anngvu commented Oct 12, 2023

We do actually have this key already, so I think we just need to add it to the template:

genomicReference:
  description: Version of genome reference used for alignment in processing workflow
  notes:
    - Currently used with nextflow-processed data.
  required: false
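
As a rough illustration of the use cases in the issue description (finding data aligned to the same build, or deciding whether to re-align), here is a minimal Python sketch. Only the `genomicReference` key name comes from the schema entry above; the file names, the build values, and the `needs_realignment` helper are hypothetical, and the check allows for a missing value because the key is marked `required: false`.

```python
from typing import Optional


def needs_realignment(annotations: dict, target_build: str) -> Optional[bool]:
    """Compare a file's genomicReference annotation with the build the user wants.

    Returns True if the file was aligned to a different build, False if it
    matches, and None if the (optional) annotation is missing.
    """
    build = annotations.get("genomicReference")
    if build is None:
        return None
    return build != target_build


# Hypothetical annotations for three aligned-read files.
files = {
    "sample_a.bam": {"genomicReference": "GRCh38"},
    "sample_b.bam": {"genomicReference": "GRCh37"},
    "sample_c.bam": {},  # annotation not provided
}

# Files that can be combined with a GRCh38 dataset without re-alignment.
matching = [
    name for name, ann in files.items()
    if needs_realignment(ann, "GRCh38") is False
]
```

A plain string comparison like this only works if builds are annotated with consistent names, which is what the rest of this thread is about.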

@anngvu anngvu changed the title from "add key: [reference_genome_build]" to "add key for genomic reference to template" Oct 12, 2023
@allaway
Contributor

allaway commented Oct 12, 2023

What value should we be using here? There are a lot of standard genome values, but for my recent annotation (which inspired this issue), the reference is less standard and hosted on Synapse: https://www.synapse.org/#!Synapse:syn50670703

I suppose in this case the name we could use is GRCh38_no_alt_plus_hs38d1 or GRCh38_no_alt_analysis_set_GCA_000001405.15

@jaybee84
Contributor Author

Looks like this reference is called GRCh38_Verily_v1 : https://cloud.google.com/life-sciences/docs/resources/public-datasets/reference-genomes (Verily's GRCh38)

@allaway
Contributor

allaway commented Oct 12, 2023

Yah, I saw that, but I think that the Verily flavor has some additional changes.

@jaybee84
Contributor Author

jaybee84 commented Oct 12, 2023

The file stored under that Synapse ID seems to be similar to the following file listed in Verily's GRCh38 README:

D. GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz

A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
7.  human decoy sequences from hs38d1 (GCA_000786075.2)

I wish Verily did a better job and assigned unique IDs for the different builds. They released 4 slightly different builds in one version! :(

I guess you could also just link to the SynapseID where this specific build is stored.

@allaway
Contributor

allaway commented Oct 12, 2023

Ah haha, this is very confusing. I didn't read the README, I just looked at this part:

The base assembly is GRCh38_no_alt_plus_hs38d1, which was created specifically for analysis. Its rationale and exact genomic modifications are documented in its README file.

Verily applied the following modifications to the base assembly:

Reference segment names are prefixed with chr. Many of the additional data files are provided by GENCODE, which uses the "chr" naming convention.
All 74 extended IUPAC codes are converted to the first matching alphabetical base pair as recommended in the VCF 4.3 specification.
This release of the genome reference is named GRCh38_Verily_v1.

I just read this as saying that "GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz" is the 'vanilla' genome before Verily did anything to it. But yeah, the README makes things less clear...
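
For anyone comparing the two builds, the two Verily modifications quoted above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Verily's actual pipeline: the `normalize_record` helper and the example record are hypothetical, the mapping covers the standard IUPAC ambiguity codes other than N, and preserving lowercase (soft-masked) bases is an assumption of this sketch.

```python
from typing import Tuple

# Each IUPAC ambiguity code mapped to the alphabetically first base it can
# represent (e.g. R = A/G becomes A). N is left unchanged.
IUPAC_FIRST_BASE = {
    "R": "A", "Y": "C", "S": "C", "W": "A", "K": "G", "M": "A",
    "B": "C", "D": "A", "H": "A", "V": "A",
}

# Translation table covering upper- and lowercase codes.
_TABLE = str.maketrans({
    **IUPAC_FIRST_BASE,
    **{code.lower(): base.lower() for code, base in IUPAC_FIRST_BASE.items()},
})


def normalize_record(name: str, sequence: str) -> Tuple[str, str]:
    """Prefix the segment name with 'chr' (if missing) and resolve ambiguity codes."""
    new_name = name if name.startswith("chr") else "chr" + name
    return new_name, sequence.translate(_TABLE)


# Hypothetical record: segment "1" with two ambiguity codes (R, Y) and an N.
name, seq = normalize_record("1", "ACGRYN")
```

If these are the only differences, the sequences should be identical everywhere except at the converted positions, which is one way to check which build a given FASTA file corresponds to.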

3 participants