Update GOLD NMDC ingest pipeline to ingest bioscales #113
Above is an example of what a Bioscales biosample record dump looks like in the JSON output. A few things to address here:
Question: Is it okay to modify the json_dumper module in linkml-runtime so that it does not output key-value pairs where the value for a given key, or nested key, is empty? For example, depth is
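One way to get this effect without touching linkml-runtime is to prune empty values from the plain dict before it is serialized. The following is a minimal sketch under that assumption; `prune_empty` is a hypothetical helper, not part of linkml-runtime, and the `depth` value in the example record is made up for illustration.

```python
import json
from typing import Any


def _is_empty(value: Any) -> bool:
    """True for None, empty string, empty list, or empty dict (0 and False are kept)."""
    return value is None or value == "" or value == [] or value == {}


def prune_empty(value: Any) -> Any:
    """Recursively drop dict keys and list items whose value is empty.

    Hypothetical helper: it operates on plain Python data, so it can run
    on a record dict just before it is written out as JSON.
    """
    if isinstance(value, dict):
        pruned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in pruned.items() if not _is_empty(item)}
    if isinstance(value, list):
        pruned_items = [prune_empty(item) for item in value]
        return [item for item in pruned_items if not _is_empty(item)]
    return value


# Illustrative record; the empty "depth" slot is an assumption for the example.
record = {
    "id": "nmdc:bsm-11-gy2fxa47",
    "depth": {"has_raw_value": None},
    "elev": {"has_raw_value": "62", "has_unit": "meters"},
    "part_of": [],
}
cleaned = prune_empty(record)
print(json.dumps(cleaned, indent=2))
```

Because a nested object is pruned before its parent is checked, a slot such as `depth` disappears entirely once all of its own values are empty.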
Here is a file with GOLD biosample records that have missing annotations: biosamples_missing_annotations.txt. The course of action here is to relax the schema for some of the slots on
We need to merge in data from https://docs.google.com/spreadsheets/d/1A6bynpzssAUpnDzoAQPZ-8L5HU2y3IuWX7mRiRrasgk/edit#gid=195687079
{
"id": "nmdc:bsm-11-gy2fxa47",
"name": "Rhizosphere soil microbial communities from poplar common garden site in Corvallis, Oregon, USA - BESC-904-Co3_16_51 rhizosphere",
"description": "Rhizosphere soil microbial communities from poplar common garden site in Corvallis, Oregon, USA",
"part_of": [
"nmdc:sty-11-namt4020"
],
"env_broad_scale": {
"has_raw_value": "ENVO_00000446",
"term": {
"id": "ENVO:00000446",
"name": "terrestrial biome"
}
},
"env_local_scale": {
"has_raw_value": "ENVO_00005801",
"term": {
"id": "ENVO:00005801",
"name": "rhizosphere"
}
},
"env_medium": {
"has_raw_value": "ENVO_00001998",
"term": {
"id": "ENVO:00001998",
"name": "soil"
}
},
"collected_from": "gold:BESC-904-Co3_16_51_rhizosphere",
"type": "nmdc:Biosample",
"gold_biosample_identifiers": [
"gold:Gb0291773"
],
"elev": {
"has_raw_value": "62",
"has_unit": "meters"
},
"geo_loc_name": {
"has_raw_value": "USA: Oregon"
},
"lat_lon": {
"has_raw_value": "44.5881 -123.1926",
"latitude": 44.5881,
"longitude": -123.1926
},
"samp_taxon_id": "rhizosphere metagenome [NCBITaxon:939928]",
"ecosystem": "Host-associated",
"ecosystem_category": "Plants",
"ecosystem_type": "Roots",
"ecosystem_subtype": "Rhizosphere",
"specific_ecosystem": "Soil",
"add_date": "2021-05-03T00:00:00",
"habitat": "Rhizosphere soil",
"host_name": "Populus",
"location": "USA",
"mod_date": "2021-05-03T00:00:00",
"ncbi_taxonomy_name": "rhizosphere metagenome",
"sample_collection_site": "Rhizosphere Soil"
}
Here is an example biosample record output by the pipeline.
@sujaypatil96 "collected_from": "gold:BESC-904-Co3_16_51_rhizosphere" still uses a GOLD identifier. I was under the impression that this was now only allowed as an alternative identifier. I think this is supposed to be the site; if so, we need site identifiers, and the rule for that should be a match on the tree name, so 'BESC-904-Co3_16_51'. There will be some exceptions where samples or analyses failed, but there should be on average 3 biosamples per site (tree).
What version of the nmdc-schema is this code working against? You say in your original comment that it is 2.0.0; we'll need nmdc-schema version 7.4 in order to use MassIVE identifiers for metabolomics data at the study level.
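The tree-name matching rule suggested above could be sketched roughly as follows. This assumes the sample name is the tree name followed by an underscore and a single compartment token (as in `BESC-904-Co3_16_51_rhizosphere`); `tree_name` and `group_by_tree` are hypothetical helpers, and the record ids and the `_endosphere` and `_bulk` suffixes in the example are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def tree_name(collected_from: str) -> str:
    """Derive the site (tree) name from a collected_from value.

    Assumption: the value looks like '<prefix>:<tree>_<compartment>', so the
    tree name is everything before the last underscore of the local part.
    """
    local_id = collected_from.split(":", 1)[-1]
    name, _, _compartment = local_id.rpartition("_")
    return name or local_id  # fall back to the whole id if there is no underscore


def group_by_tree(biosamples: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group biosample ids by the tree name derived from collected_from."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample in biosamples:
        groups[tree_name(sample["collected_from"])].append(sample["id"])
    return dict(groups)


# Hypothetical records: only the first collected_from value appears in the thread.
samples = [
    {"id": "example-1", "collected_from": "gold:BESC-904-Co3_16_51_rhizosphere"},
    {"id": "example-2", "collected_from": "gold:BESC-904-Co3_16_51_endosphere"},
    {"id": "example-3", "collected_from": "gold:BESC-904-Co3_16_51_bulk"},
]
sites = group_by_tree(samples)
```

Grouping this way also makes the exceptions easy to spot: any tree with fewer than three biosamples is a candidate for a failed sample or analysis.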
I don't see host taxon in the output JSON.
I don't think host is modeled in the schema yet. @turbomam is that correct?
@sujaypatil96 are you working on this? Can I add you as the assignee?
@sujaypatil96 This looks good aside from my question about dev vs prod for the NMDC runtime API
.env.example
Outdated
@@ -0,0 +1,4 @@
BASE_URL=https://api.dev.microbiomedata.org
should this call prod instead of dev? https://api.microbiomedata.org/
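A minimal sketch of one way to make the dev/prod choice explicit: read `BASE_URL` from the environment and fall back to the prod API when it is unset. The `BASE_URL` name and both URLs come from this thread; `resolve_base_url` is a hypothetical helper, and defaulting to prod is an assumption rather than the pipeline's actual behavior.

```python
import os
from typing import Mapping

PROD_BASE_URL = "https://api.microbiomedata.org"
DEV_BASE_URL = "https://api.dev.microbiomedata.org"


def resolve_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the runtime API base URL, defaulting to prod when BASE_URL is unset or empty.

    Hypothetical helper; a trailing slash is stripped so paths can be
    appended consistently.
    """
    return (env.get("BASE_URL") or PROD_BASE_URL).rstrip("/")


# Explicitly pointing at dev (for example via a local .env file) still works.
dev_url = resolve_base_url({"BASE_URL": DEV_BASE_URL + "/"})
default_url = resolve_base_url({})
```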
@aclum and @sujaypatil96 do you need any more input from me regarding #113 (comment)?
@turbomam Sujay will update the API to prod, which is the last unresolved item that I had. Feel free to look over the JSON output in the grow and bioscales Slack channels for another set of eyes.
@@ -14,13 +14,14 @@ funowl = "^0.1.11"
 git-root = "^0.1"
 googlemaps = "^4.6.0"
 linkml = "^1.1.18"
-nmdc-schema = "^2.0.0"
+nmdc-schema = "^7.0.0"
ideally this would happen in a separate PR
this looks good but I am worried about the code split between this repo and the runtime repo. We need to prioritize moving across ASAP
@sujaypatil96 can this one be closed now?
Make GOLD NMDC dataset ingest pipeline generic enough to import all studies from GOLD.
For now, the pipeline has been tested on EMP500 and Bioscales, and the JSON output it produces is compatible with v2.0.0 of the NMDC Schema.
Updates to come: v7 of the schema
CC: @cmungall @emileyfadrosh @turbomam @mslarae13 @aclum