-
Related issues:
-
NCBI stores
Here are some highlights from the Biosamples in our BBOP/NMDC PostgreSQL database:
select
value,
count(1)
from
ncbi_attributes_all_long naal
where
harmonized_name = 'host_taxid'
group by
value
having
count(1) > 1000
order by
count(1) desc
;
Results
-
Here is the GSC specification. (Thanks @mslarae13)
https://genomicsstandardsconsortium.github.io/mixs/0000250/#linkml-source
name: host_taxid
annotations:
Expected_value:
tag: Expected_value
value: NCBI taxon identifier
description: NCBI taxon id of the host, e.g. 9606
title: host taxid
from_schema: https://w3id.org/mixs
keywords:
- host
- host.
- taxon
slot_uri: MIXS:0000250
alias: host_taxid
domain_of:
- Agriculture
- FoodFarmEnvironment
- HostAssociated
- PlantAssociated
- SymbiontAssociated
range: string
-
You can see that this is one of the many MIxS slots for which I never really finished the LinkML conversion. Maybe because I wasn't sure about any of the same stuff we're discussing now! Ideally we would remove the
-
Here's the nmdc-schema specification. In retrospect, its semantics are quite different from GSC's.
https://microbiomedata.github.io/nmdc-schema/host_taxid/#linkml-source
name: host_taxid
annotations:
expected_value:
tag: expected_value
value: NCBI taxon identifier
occurrence:
tag: occurrence
value: '1'
description: NCBI taxon id of the host, e.g. 9606
title: host taxid
comments:
- Homo sapiens [NCBITaxon:9606] would be a reasonable has_raw_value
from_schema: https://w3id.org/nmdc/nmdc
aliases:
- host taxid
rank: 1000
is_a: core field
slot_uri: MIXS:0000250
alias: host_taxid
domain_of:
- Biosample
range: ControlledIdentifiedTermValue
multivalued: false
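The comment in the nmdc-schema slot suggests a raw value of the form `Homo sapiens [NCBITaxon:9606]`. As a rough illustration only, such a string could be split into a label and a CURIE like this. The function name `split_raw_value`, the returned dictionary keys, and the accepted CURIE shape are assumptions made for this sketch, not part of nmdc-schema.

```python
import re

# Assumed shape: free-text label, then a CURIE in square brackets.
# The CURIE is restricted here to letters/underscores, a colon, then digits.
RAW_VALUE = re.compile(
    r"\s*(?P<label>.*?)\s*\[(?P<curie>[A-Za-z_]+:[0-9]+)\]\s*"
)


def split_raw_value(raw):
    """Split 'label [PREFIX:123]' into its parts, or return None if it does not fit."""
    match = RAW_VALUE.fullmatch(raw)
    if match is None:
        return None
    return {"label": match.group("label"), "curie": match.group("curie")}


print(split_raw_value("Homo sapiens [NCBITaxon:9606]"))
print(split_raw_value("10114"))
```

A bare integer such as `10114`, which is what the GSC definition's example (9606) looks like, does not fit this shape and comes back as `None`, which is one concrete way to see the difference in semantics between the two definitions.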
-
Here's the web display of a Biosample with an attribute that can be harmonized to 'host_taxid':
https://www.ncbi.nlm.nih.gov/biosample/?term=17663
It corresponds to this object in NCBI's storage:
<BioSampleSet>
<BioSample submission_date="2010-07-27T11:58:04.090" last_update="2013-06-12T08:58:21.973" publication_date="2010-07-27T11:58:06.347" access="public" id="17663" accession="SAMN00017663">
<Ids>
<Id db="BioSample" is_primary="1">SAMN00017663</Id>
<Id db="HUVH" db_label="Sample name">CA1_2_FIUYTAS</Id>
<Id db="SRA">SRS085536</Id>
</Ids>
<Description>
<Title>rat gut microbiome</Title>
<Organism taxonomy_id="410656" taxonomy_name="organismal metagenomes"/>
<Comment>
<Paragraph>Rat Gut microbiome before/after antibiotic treatment</Paragraph>
</Comment>
</Description>
<Owner>
<Name abbreviation="CCME">Center for Comparative Microbial Ecology</Name>
<Contacts>
<Contact/>
</Contacts>
</Owner>
<Models>
<Model>Generic</Model>
</Models>
<Package display_name="Generic">Generic.1.0</Package>
<Attributes>
<Attribute attribute_name="biological_specimen">Rat</Attribute>
<Attribute attribute_name="country" harmonized_name="geo_loc_name" display_name="geographic location">Spain</Attribute>
<Attribute attribute_name="host_subject_id" harmonized_name="host_subject_id" display_name="host subject id">Rat</Attribute>
<Attribute attribute_name="host_taxid" harmonized_name="host_taxid" display_name="host taxonomy ID">10114</Attribute>
<Attribute attribute_name="project_name" harmonized_name="project_name" display_name="project name">Rat_transplant_AB</Attribute>
<Attribute attribute_name="sample_name" harmonized_name="sample_name" display_name="sample name">CA1_2_FIUYTAS</Attribute>
<Attribute attribute_name="sample_name_in_paper">A1_D3</Attribute>
</Attributes>
<Status status="live" when="2010-08-19T15:32:06.497"/>
</BioSample>
</BioSampleSet>
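For anyone who wants to pull harmonized attributes out of records like the one above, here is a minimal sketch using only Python's standard library. The XML is a trimmed copy of the record (most elements dropped for brevity), and the helper name `harmonized_values` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the BioSample record above; only two attributes are kept.
BIOSAMPLE_XML = """
<BioSampleSet>
  <BioSample id="17663" accession="SAMN00017663">
    <Attributes>
      <Attribute attribute_name="biological_specimen">Rat</Attribute>
      <Attribute attribute_name="host_taxid" harmonized_name="host_taxid" display_name="host taxonomy ID">10114</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""


def harmonized_values(xml_text, harmonized_name):
    """Return (accession, value) pairs for attributes with the given harmonized_name."""
    root = ET.fromstring(xml_text)
    pairs = []
    for biosample in root.iter("BioSample"):
        accession = biosample.get("accession")
        for attribute in biosample.iter("Attribute"):
            # Attributes that NCBI could not harmonize have no harmonized_name.
            if attribute.get("harmonized_name") == harmonized_name:
                pairs.append((accession, (attribute.text or "").strip()))
    return pairs


print(harmonized_values(BIOSAMPLE_XML, "host_taxid"))
```

Note that the value stored here is a bare integer (10114) with no prefix at all.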
-
@mslarae13 you said
What was the accession number for the Biosample that had the
-
It might also be good to think about where in our technical stack normalization of identifiers (more generally) should happen. For example, if we want users to enter what they are familiar with, or are encouraged to enter at NCBI (NCBI:txid12345), do we have submission-level code, or validation/fixing code, that translates it to the ontology form (NCBITaxon:12345)? Or do we require the normalized ID up front, with a pattern match on the host_taxon field in the submission and some on-the-fly validation that requires this format? I imagine we'll have this issue for other identifiers as well. It may not be exactly the question of which prefix to use, but rather which entire ID space to use: if a user submits a ChemicalEntity with a PubChem identifier but we want ChEBI identifiers in our database, where do we do the work of "normalizing" that ID?
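To make the "validation/fixing code" option concrete, here is a minimal sketch of a normalizer that could run at whichever layer we pick. The function name `normalize_host_taxid` and the set of accepted spellings are assumptions for illustration; nothing like this is known to exist in the portal or schema code today. Bare integers are deliberately left unresolved in this sketch.

```python
import re

# Assumed input spellings: "NCBI:txid12345", "NCBITaxon:12345", "txid12345".
_TAXID_FORMS = re.compile(r"(?:NCBI:txid|NCBITaxon:|txid)([0-9]+)", re.IGNORECASE)


def normalize_host_taxid(raw):
    """Return 'NCBITaxon:<integer>' for a recognized spelling, else None."""
    match = _TAXID_FORMS.fullmatch(raw.strip())
    if match is None:
        return None
    return f"NCBITaxon:{int(match.group(1))}"


for candidate in ["NCBI:txid12345", "NCBITaxon:12345", "12345", "Homo sapiens"]:
    print(candidate, "->", normalize_host_taxid(candidate))
```

Returning `None` rather than guessing keeps the decision about unrecognized input with the caller, which could reject the submission or flag it for curation.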
-
@cmungall has some new tricks in https://github.com/linkml/linkml-store that can help us infer/repair problematic values like the ecofab
-
My two cents: given the overlapping ID spaces at NCBI, we should ask for a prefix in the submission portal/submission-schema to avoid confusion and user error. For example, 2821538 as a tax ID resolves to Chloracidobacterium sp. S, but 2821538 is also a biosample UID (Populus trichocarpa SKWF-24-3). This would be potentially difficult to untangle after the fact. If we wanted to, we could accept txid2821538, since https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid2821538 resolves correctly. This would require updating the prefix or having our own prefix, as these IDs don't resolve with ontological views. For nmdc-schema, between NCBI:txid$integer and NCBITaxon:$integer I prefer NCBITaxon:$integer, but I would be okay with either as long as we make all the records consistent. Existing bioscales records, e.g. nmdc:bsm-11-6zd5nb38, use NCBITaxon. As discussed in some of the tickets @sierra-moxon linked, I do think it is undesirable for NCBITaxon prefix expansion to resolve to ontobee.org instead of NCBI. Whichever prefix we use I would like NMDC to use
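A small sketch of what "require a prefix" plus a configurable expansion could look like. The names `check_host_taxid`, `expand` and `PREFIX_EXPANSIONS` are hypothetical. The OBO PURL form is the standard OBO Foundry pattern, and the NCBI URL shape is the one quoted in the comment above. Which expansion NMDC should actually use is exactly the open question.

```python
import re

# Strict form preferred in the comment above: NCBITaxon:<integer>.
REQUIRED_PATTERN = re.compile(r"NCBITaxon:([0-9]+)")

# Two candidate expansions for the same CURIE (illustrative only).
PREFIX_EXPANSIONS = {
    "obo": "http://purl.obolibrary.org/obo/NCBITaxon_{id}",
    "ncbi": "https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid{id}",
}


def check_host_taxid(value):
    """Return None if the value is acceptable, otherwise an error message."""
    if REQUIRED_PATTERN.fullmatch(value):
        return None
    if re.fullmatch(r"[0-9]+", value):
        # A bare integer is ambiguous across NCBI ID spaces.
        return f"'{value}' has no prefix; it could be a taxon ID or a BioSample UID"
    return f"'{value}' does not match NCBITaxon:<integer>"


def expand(value, target):
    """Expand a valid CURIE to a URL using the chosen expansion."""
    match = REQUIRED_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(check_host_taxid(value))
    return PREFIX_EXPANSIONS[target].format(id=match.group(1))


print(check_host_taxid("2821538"))
print(expand("NCBITaxon:2821538", "ncbi"))
```

Keeping the expansion in one lookup table means the CURIEs stored in records would not need to change if the resolution target is switched later.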
-
FWIW, the MIxS TWG is scheduled to discuss clarifying what the allowed values are for the MIxS term:
Of course, we can have a different range in our schema, as @mslarae13 says:
But we should monitor to make sure we don't go in a completely different direction from MIxS
-
@mslarae13 did this get discussed at the Sept GSC meeting?
-
It was recently discovered from looking at the EcoFab data that host_taxid has no pattern constraint and is a string, which apparently caused some failures because the submitter formatted the metadata field the way NCBI displays it (for example, "NCBI:txid2821538").
In short, the host_taxid slot in nmdc-schema requires a CURIE. In the NMDC submission portal, users provide the ID as it's written by NCBI. nmdc-schema doesn't recognize NCBI as a valid prefix, and it's been pointed out that writing the host_taxid as "NCBI:txid2821538" isn't a valid ID/CURIE structure.
To decide
host_taxid will become MIxS modified
Please add your thoughts on how we should proceed and what we can do.
As a note, definitely as a LATER goal, the submission portal should validate these IDs via an API or some check with NCBI. But, this is not a small task & isn't feasible if we need a NOW resolution.