Update GOLD NMDC ingest pipeline to ingest bioscales #113

Merged: sujaypatil96 merged 28 commits into main from gold-bioscales-ingest on Mar 14, 2023

Conversation

@sujaypatil96 sujaypatil96 commented Dec 15, 2022

Make GOLD NMDC dataset ingest pipeline generic enough to import all studies from GOLD.

For now, the pipeline has been tested on EMP500 and Bioscales, and the JSON output it produces is compatible with v2.0.0 of the NMDC Schema.

Updates to come:

  • Make pipeline compatible with the latest version, i.e., v7 of the schema

CC: @cmungall @emileyfadrosh @turbomam @mslarae13 @aclum

@sujaypatil96 sujaypatil96 marked this pull request as ready for review December 19, 2022 22:15
@sujaypatil96
Contributor Author

{
            "id": "gold:Gb0291745",
            "name": "Rhizosphere soil microbial communities from poplar common garden site in Corvallis, Oregon, USA - BESC-56-Co3_6_53",
            "description": "Rhizosphere soil microbial communities from poplar common garden site in Corvallis, Oregon, USA",
            "part_of": [
                "gold:Gs0154044"
            ],
            "env_broad_scale": {
                "has_raw_value": "",
                "term": {
                    "id": ""
                }
            },
            "env_local_scale": {
                "has_raw_value": "ENVO_00005801",
                "term": {
                    "id": "ENVO:00005801"
                }
            },
            "env_medium": {
                "has_raw_value": "ENVO_00001998",
                "term": {
                    "id": "ENVO:00001998"
                }
            },
            "collected_from": "gold:BESC-56-Co3_6_53",
            "type": "nmdc:Biosample",
            "gold_biosample_identifiers": [
                "gold:Gb0291745"
            ],
            "alt": {
                "has_unit": ""
            },
            "depth": {
                "has_unit": ""
            },
            "elev": {
                "has_raw_value": "62",
                "has_unit": "meters"
            },
            "geo_loc_name": {
                "has_raw_value": "USA: Oregon"
            },
            "lat_lon": {
                "has_raw_value": "44.5882 -123.1925",
                "latitude": 44.5882,
                "longitude": -123.1925
            },
            "ecosystem": "Host-associated",
            "ecosystem_category": "Plants",
            "ecosystem_type": "Roots",
            "ecosystem_subtype": "Rhizosphere",
            "specific_ecosystem": "Soil",
            "add_date": "2021-05-03T00:00:00",
            "habitat": "Rhizosphere soil",
            "host_name": "Populus",
            "location": "USA",
            "mod_date": "2022-09-03T00:00:00",
            "ncbi_taxonomy_name": "rhizosphere metagenome",
            "sample_collection_site": "Rhizosphere Soil",
            "subsurface_depth": {
                "has_unit": ""
            }
        }

Above is an example of what a Bioscales biosample record dump looks like in the JSON output.

A few things to address here:

  • empty properties are emitted when values for depth, elev, subsurface_depth, etc. are not found in GOLD
    • if we were to set those properties to None in Python, the entire record would be considered invalid, because None is not within the range of any of the slots
  • FieldResearchSite is not being transformed properly in the json_dumper output

Question: Is it okay to modify the json_dumper module in linkml-runtime so that it does not output key-value pairs whose value, for a top-level or nested key, is empty? For example, depth comes out as depth: { has_unit: "" }, so should it be left out of the output?
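One alternative to changing json_dumper itself would be to prune empty values from the plain dict before it is serialized. The following is a minimal sketch of that idea, not the linkml-runtime implementation; `prune_empty` is a hypothetical helper name, and the sample record is a shortened copy of the one above.

```python
from typing import Any

# Values treated as "empty". Zero and False are deliberately kept.
_EMPTY = (None, "", {}, [])


def prune_empty(value: Any) -> Any:
    """Recursively drop dict keys and list items whose value is empty.

    Hypothetical helper for illustration; it is not part of linkml-runtime.
    """
    if isinstance(value, dict):
        pruned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in pruned.items() if item not in _EMPTY}
    if isinstance(value, list):
        pruned_items = [prune_empty(item) for item in value]
        return [item for item in pruned_items if item not in _EMPTY]
    return value


record = {
    "id": "gold:Gb0291745",
    "env_broad_scale": {"has_raw_value": "", "term": {"id": ""}},
    "depth": {"has_unit": ""},
    "elev": {"has_raw_value": "62", "has_unit": "meters"},
}
cleaned = prune_empty(record)
```

Because children are pruned before their parent is checked, a nested object such as env_broad_scale disappears entirely once all of its leaves are empty.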

@sujaypatil96
Contributor Author

sujaypatil96 commented Jan 5, 2023

Here is a file with GOLD biosample records that have missing annotations.

biosamples_missing_annotations.txt

The course of action here is to relax the schema for some of the slots on the Biosample class by changing them from required: true to recommended: true.
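To decide which slots would need relaxing, it may help to tabulate which annotations are missing per record. Below is a rough sketch under stated assumptions: `is_blank` and `missing_slots` are hypothetical helpers, and the slot list passed in is a placeholder rather than the schema's actual set of required slots.

```python
from typing import Any, Dict, Iterable, List


def is_blank(value: Any) -> bool:
    """True if the value is None, "", or a container holding only blanks."""
    if value is None or value == "":
        return True
    if isinstance(value, dict):
        return all(is_blank(item) for item in value.values())
    if isinstance(value, list):
        return all(is_blank(item) for item in value)
    return False


def missing_slots(record: Dict[str, Any], slots: Iterable[str]) -> List[str]:
    """Return the names of slots that are absent or blank in the record."""
    return [slot for slot in slots if is_blank(record.get(slot))]


# Placeholder slot list, chosen only for illustration.
slots_to_check = ["env_broad_scale", "env_local_scale", "env_medium"]
record = {
    "id": "gold:Gb0291745",
    "env_broad_scale": {"has_raw_value": "", "term": {"id": ""}},
    "env_medium": {"has_raw_value": "ENVO_00001998", "term": {"id": "ENVO:00001998"}},
}
report = missing_slots(record, slots_to_check)
```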

@cmungall
Collaborator

cmungall commented Jan 9, 2023

We need to merge in data from https://docs.google.com/spreadsheets/d/1A6bynpzssAUpnDzoAQPZ-8L5HU2y3IuWX7mRiRrasgk/edit#gid=195687079

@sujaypatil96
Contributor Author

sujaypatil96 commented Jan 26, 2023

        {
            "id": "nmdc:bsm-11-gy2fxa47",
            "name": "Rhizosphere soil microbial communities from poplar common garden site in Corvallis, Oregon, USA - BESC-904-Co3_16_51 rhizosphere",
            "description": "Rhizosphere soil microbial communities from poplar common garden site in Corvallis, Oregon, USA",
            "part_of": [
                "nmdc:sty-11-namt4020"
            ],
            "env_broad_scale": {
                "has_raw_value": "ENVO_00000446",
                "term": {
                    "id": "ENVO:00000446",
                    "name": "terrestrial biome"
                }
            },
            "env_local_scale": {
                "has_raw_value": "ENVO_00005801",
                "term": {
                    "id": "ENVO:00005801",
                    "name": "rhizosphere"
                }
            },
            "env_medium": {
                "has_raw_value": "ENVO_00001998",
                "term": {
                    "id": "ENVO:00001998",
                    "name": "soil"
                }
            },
            "collected_from": "gold:BESC-904-Co3_16_51_rhizosphere",
            "type": "nmdc:Biosample",
            "gold_biosample_identifiers": [
                "gold:Gb0291773"
            ],
            "elev": {
                "has_raw_value": "62",
                "has_unit": "meters"
            },
            "geo_loc_name": {
                "has_raw_value": "USA: Oregon"
            },
            "lat_lon": {
                "has_raw_value": "44.5881 -123.1926",
                "latitude": 44.5881,
                "longitude": -123.1926
            },
            "samp_taxon_id": "rhizosphere metagenome [NCBITaxon:939928]",
            "ecosystem": "Host-associated",
            "ecosystem_category": "Plants",
            "ecosystem_type": "Roots",
            "ecosystem_subtype": "Rhizosphere",
            "specific_ecosystem": "Soil",
            "add_date": "2021-05-03T00:00:00",
            "habitat": "Rhizosphere soil",
            "host_name": "Populus",
            "location": "USA",
            "mod_date": "2021-05-03T00:00:00",
            "ncbi_taxonomy_name": "rhizosphere metagenome",
            "sample_collection_site": "Rhizosphere Soil"
        }

Here is an example of a biosample record output by the pipeline.

@sujaypatil96 sujaypatil96 requested a review from cmungall January 26, 2023 23:18
@aclum aclum self-requested a review January 26, 2023 23:39

@aclum aclum left a comment

@sujaypatil96 "collected_from": "gold:BESC-904-Co3_16_51_rhizosphere" still uses a GOLD identifier. I was under the impression that GOLD identifiers were now only allowed as alternative identifiers. I think this is supposed to be the site; if so, we need site identifiers, and the rule for assigning them should be a match on the tree name, e.g. 'BESC-904-Co3_16_51'. There will be some exceptions where samples or analyses failed, but on average there should be 3 biosamples per site (tree).
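The tree-name rule could be sketched roughly as follows. The regular expression is an assumption inferred only from the two names seen in this thread (BESC-56-Co3_6_53 and BESC-904-Co3_16_51), and `tree_name` and `group_by_tree` are hypothetical helpers, so other naming patterns may need different handling.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Optional

# Assumed pattern for a tree name, e.g. "BESC-904-Co3_16_51".
TREE_NAME = re.compile(r"BESC-\d+-Co\d+_\d+_\d+")


def tree_name(biosample: Dict[str, Any]) -> Optional[str]:
    """Extract the tree name from the biosample's name, if one is present."""
    match = TREE_NAME.search(biosample.get("name", ""))
    return match.group(0) if match else None


def group_by_tree(biosamples: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each tree name (site) to the ids of the biosamples taken from it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for biosample in biosamples:
        key = tree_name(biosample)
        if key is not None:
            groups[key].append(biosample["id"])
    return dict(groups)


sample = {
    "id": "nmdc:bsm-11-gy2fxa47",
    "name": "Rhizosphere soil microbial communities from poplar common garden "
    "site in Corvallis, Oregon, USA - BESC-904-Co3_16_51 rhizosphere",
}
sites = group_by_tree([sample])
```

Each key of the resulting mapping would then correspond to one site, and a site identifier could be assigned per key instead of reusing the GOLD-prefixed value.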

@aclum

aclum commented Jan 27, 2023

What version of the nmdc-schema is this code working against? You say in your original comment that it is 2.0.0; we'll need nmdc-schema version 7.4 to use MassIVE identifiers for metabolomics data at the study level.

@cmungall
Collaborator

I don't see host taxon in the output json

@aclum

aclum commented Jan 27, 2023

I don't think host is modeled in the schema yet. @turbomam, is that correct?

@ssarrafan

@sujaypatil96 are you working on this? Can I add you as the assignee?


@aclum aclum left a comment

@sujaypatil96 This looks good aside from my question about dev vs prod for the NMDC runtime API

.env.example Outdated
@@ -0,0 +1,4 @@
+BASE_URL=https://api.dev.microbiomedata.org
should this call prod instead of dev? https://api.microbiomedata.org/
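One way to make the dev/prod choice explicit is to read BASE_URL from the environment and fall back to prod when it is unset. This is only a sketch: `get_base_url` is a hypothetical helper, and defaulting to prod is an assumption rather than something decided in this PR.

```python
import os
from typing import Mapping, Optional

# Prod URL as given in the review comment; the dev URL comes from .env.example.
PROD_BASE_URL = "https://api.microbiomedata.org"


def get_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API base URL from BASE_URL, defaulting to prod if unset."""
    source = os.environ if env is None else env
    # Strip any trailing slash so endpoint paths can be appended uniformly.
    return source.get("BASE_URL", PROD_BASE_URL).rstrip("/")


dev_url = get_base_url({"BASE_URL": "https://api.dev.microbiomedata.org/"})
default_url = get_base_url({})
```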

@turbomam
Member

@aclum and @sujaypatil96 do you need any more input from me regarding #113 (comment)?

@aclum

aclum commented Feb 10, 2023

@turbomam Sujay will update the API to prod, which is the last unresolved item that I had. Feel free to look over the JSON output in the grow and bioscales Slack channels for another set of eyes.

@@ -14,13 +14,14 @@ funowl = "^0.1.11"
 git-root = "^0.1"
 googlemaps = "^4.6.0"
 linkml = "^1.1.18"
-nmdc-schema = "^2.0.0"
+nmdc-schema = "^7.0.0"
Collaborator

ideally this would happen in a separate PR

Collaborator

@cmungall cmungall left a comment

This looks good, but I am worried about the code split between this repo and the runtime repo. We need to prioritize moving the code across ASAP.

@ssarrafan

@sujaypatil96 can this one be closed now?

@sujaypatil96 sujaypatil96 merged commit ee481a6 into main Mar 14, 2023
@sujaypatil96 sujaypatil96 deleted the gold-bioscales-ingest branch March 14, 2023 17:54
Status: Done
5 participants