add key for genomic reference to template #352

jaybee84 · 2023-10-11T22:58:17Z

List the new key
Using OLS or a similar resource, please find a standard term that fits your needs. OLS-listed NCIT and OBI dictionaries are preferred sources. Non-OLS sources of ground truth for specific items such as ONCOTREE for cancer types, or GEO for platform definitions are also appropriate.
Key: reference_genome_build

Provide definition and source for the new key
Definition: The source-specific version of the published genome assembly. [ NCI ]
Source url: https://www.ebi.ac.uk/ols/ontologies/ncit/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCIT_C164815

Please describe why this concept is necessary
For aligned read files, it is important to know which reference genome the reads were aligned to. This helps users decide on downstream analytical steps (e.g. whether to re-align or lift over to a newer build) when re-using the data.

Please describe the situations in which this concept may apply
This concept is important if a user 1) is looking for additional data that matches their own analytical dataset (e.g. aligned to the same genome build), or 2) needs to decide on downstream analytical steps after downloading a dataset from the NF Data Portal.
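As a rough illustration of the first situation, matching files by genome build could look like the sketch below. The records and the `find_matching_files` helper are hypothetical, and the key name is passed in as a parameter because the final name is whatever ends up in the template; `reference_genome_build` is only the name proposed in this issue.

```python
from typing import Dict, Iterable, List


def find_matching_files(records: Iterable[Dict], build: str,
                        key: str = "reference_genome_build") -> List[Dict]:
    """Return the records whose genome-build annotation equals `build`.

    The comparison ignores case and surrounding whitespace. Records with
    no string annotation under `key` are skipped rather than guessed at.
    """
    wanted = build.strip().lower()
    return [
        r for r in records
        if isinstance(r.get(key), str) and r[key].strip().lower() == wanted
    ]


# Hypothetical annotations, for illustration only.
files = [
    {"name": "sample1.bam", "reference_genome_build": "GRCh38"},
    {"name": "sample2.bam", "reference_genome_build": "GRCh37"},
    {"name": "sample3.bam"},  # not annotated
]

matches = find_matching_files(files, "GRCh38")
print([r["name"] for r in matches])
```

Exact string matching like this only works if the annotation values are controlled, which is part of why a standard term for the key and its values matters.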


anngvu · 2023-10-12T15:02:22Z

We do actually have this key already, so I think we just need to add it to the template:

nf-metadata-dictionary/modules/props.yaml

Lines 295 to 299 in 91a33e6

    genomicReference:
      description: Version of genome reference used for alignment in processing workflow
      notes:
        - Currently used with nextflow-processed data.
      required: false

allaway · 2023-10-12T16:31:52Z

What value should we be using here? There are a lot of standard genome values, but for my recent annotation (which inspired this issue), the reference is less standard and is hosted on Synapse: https://www.synapse.org/#!Synapse:syn50670703

I suppose in this case the name we could use is GRCh38_no_alt_plus_hs38d1 or GRCh38_no_alt_analysis_set_GCA_000001405.15

jaybee84 · 2023-10-12T17:04:23Z

Looks like this reference is called GRCh38_Verily_v1 : https://cloud.google.com/life-sciences/docs/resources/public-datasets/reference-genomes (Verily's GRCh38)

allaway · 2023-10-12T20:47:09Z

Yah, I saw that, but I think that the Verily flavor has some additional changes.

jaybee84 · 2023-10-12T20:54:54Z

The file linked under the Synapse ID seems to be similar to the following file listed in the README for Verily's GRCh38:

D. GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz

A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
7.  human decoy sequences from hs38d1 (GCA_000786075.2)

I wish Verily did a better job and assigned unique IDs for the different builds. They released 4 slightly different builds in one version! :(

I guess you could also just link to the SynapseID where this specific build is stored.
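One way to check whether two FASTA files really are the same build, rather than going by file names, is to compare per-sequence checksums. The sketch below is only an illustration with toy data: `sequence_checksums` and `compare_builds` are hypothetical helpers, and the MD5-of-uppercased-bases convention is an assumption borrowed from the M5 tag in SAM headers.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def sequence_checksums(fasta_lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence name to the MD5 of its uppercased bases."""
    checksums: Dict[str, str] = {}
    name = None
    digest = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                checksums[name] = digest.hexdigest()
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first token of the header
            digest = hashlib.md5()
        elif name is not None:
            digest.update(line.upper().encode("ascii"))
    if name is not None:
        checksums[name] = digest.hexdigest()
    return checksums


def compare_builds(a: Dict[str, str],
                   b: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (names only in a, names only in b, shared names that differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(n for n in set(a) & set(b) if a[n] != b[n])
    return only_a, only_b, differing


# Toy data: the second "build" is the first plus one extra decoy sequence.
base = [">chr1 toy", "ACGT", "acgt", ">chr2", "GGCC"]
with_decoy = base + [">decoy1", "TTTT"]

only_a, only_b, differing = compare_builds(
    sequence_checksums(base), sequence_checksums(with_decoy)
)
print(only_a, only_b, differing)
```

For real files, the lines could come from `gzip.open(path, "rt")`. Note that any renaming of sequences between builds would show up as names present on one side only, even when the bases are identical.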

allaway · 2023-10-12T22:04:11Z

Ah haha, this is very confusing. I didn't read the README, I just looked at this part:

The base assembly is GRCh38_no_alt_plus_hs38d1, which was created specifically for analysis. Its rationale and exact genomic modifications are documented in its README file.

Verily applied the following modifications to the base assembly:

Reference segment names are prefixed with chr. Many of the additional data files are provided by GENCODE, which uses the "chr" naming convention.
All 74 extended IUPAC codes are converted to the first matching alphabetical base pair as recommended in the VCF 4.3 specification.
This release of the genome reference is named GRCh38_Verily_v1.

I just read this as "GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz" is the 'vanilla' genome before Verily did anything to it. But yeah, the README makes things less clear....
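To make the two modifications quoted above concrete, here is a minimal sketch of what they amount to. This is not Verily's actual tooling: the helper names are hypothetical, the mapping is my reading of "first matching alphabetical base" for each ambiguity code, and leaving `N` untouched and preserving lowercase (soft-masked) bases are assumptions.

```python
# First alphabetical base for each IUPAC ambiguity code; N is left as-is.
FIRST_BASE = {
    "R": "A",  # A/G
    "Y": "C",  # C/T
    "S": "C",  # C/G
    "W": "A",  # A/T
    "K": "G",  # G/T
    "M": "A",  # A/C
    "B": "C",  # C/G/T
    "D": "A",  # A/G/T
    "H": "A",  # A/C/T
    "V": "A",  # A/C/G
}

# Translation table covering both upper- and lowercase codes.
_TABLE = str.maketrans(
    {**FIRST_BASE, **{k.lower(): v.lower() for k, v in FIRST_BASE.items()}}
)


def add_chr_prefix(name: str) -> str:
    """Prefix a reference segment name with 'chr' unless it already has it."""
    return name if name.startswith("chr") else "chr" + name


def resolve_ambiguity_codes(seq: str) -> str:
    """Replace each ambiguity code with its first alphabetical base."""
    return seq.translate(_TABLE)


print(add_chr_prefix("1"))
print(resolve_ambiguity_codes("ACRYN"))
```

Either change alone is enough to make a checksum or name comparison against the unmodified assembly fail, which is why the exact flavor matters for the annotation value.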

jaybee84 added the key label Oct 11, 2023

anngvu changed the title ~~add key: [reference_genome_build]~~ add key for genomic reference to template Oct 12, 2023

anngvu mentioned this issue Oct 12, 2023

Patch/seq and processing #354

Merged

anngvu closed this as completed Nov 29, 2023
