GOLD ecosystem pathway enumerations are out of date #154

Closed
2 tasks
aclum opened this issue Sep 12, 2023 · 19 comments
@aclum
Collaborator

aclum commented Sep 12, 2023

I'm not sure when our enumerations were last updated, but GOLD's last release of ecosystem pathways was in Sept 2023. I noticed this because the value Peat in the specific ecosystem column does not validate, and I confirmed that it is not listed in the enumeration SpecificEcosystemEnum.

We should

@turbomam @pkalita-lbl @mslarae13 @shreddd

@mslarae13
Contributor

I agree. See comment in microbiomedata/nmdc-schema#1108 (comment)

@mslarae13
Contributor

We are missing Bulk Soil, which is the 'specific_ecosystem' that Hugh wants to list the NEON samples as.

@aclum
Collaborator Author

aclum commented Jan 11, 2024

@turbomam do you have time to update the GOLD pathway enumerations? Current values in GOLD can be found here: https://gold.jgi.doe.gov/ecosystem_classification

@turbomam
Member

Where can I find a textual representation of the GOLD pathway elements?

@turbomam
Member

Maybe here? "GOLD's 5-Level Ecosystem Classification Paths" (Excel), last generated: 11 Jan, 2024

Clicking the link downloaded this file: GOLDs5levelEcosystemClassificationPaths.xlsx

This should be noted in the schema

@turbomam
Member

turbomam commented Jan 11, 2024

Are we adding all values from all five categories into the enums? Here's a list of all five, ranked by the number of paths they appear in. I could report it some other way if you want.

Deleting long list for now. Will post somewhere else soon.
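
A minimal sketch of how such a ranking could be computed, assuming the GOLD paths have already been loaded as 5-tuples in GOLD's column order. The sample rows are illustrative only and have not been checked against GOLD; `rank_values` is a hypothetical helper name.

```python
from collections import Counter

# GOLD's column names for the five path levels, in path order.
LEVELS = (
    "ECOSYSTEM",
    "ECOSYSTEM CATEGORY",
    "ECOSYSTEM TYPE",
    "ECOSYSTEM SUBTYPE",
    "SPECIFIC ECOSYSTEM",
)

# Illustrative rows only (assumption): real rows would come from the GOLD download.
SAMPLE_PATHS = [
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Terrestrial", "Soil", "Floodplain", "Bulk soil"),
    ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
]


def rank_values(paths):
    """Return {level: [(value, n_paths), ...]}, most frequent value first."""
    counters = {level: Counter() for level in LEVELS}
    for path in paths:
        for level, value in zip(LEVELS, path):
            counters[level][value] += 1
    return {level: counter.most_common() for level, counter in counters.items()}


if __name__ == "__main__":
    for level, ranked in rank_values(SAMPLE_PATHS).items():
        print(level, ranked)
```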

@aclum
Collaborator Author

aclum commented Jan 11, 2024

@turbomam yes please

@turbomam
Member

turbomam commented Jan 11, 2024

How are the GOLD path elements modeled in the nmdc-schema and the submission schema?

Here's the definition of SpecificEcosystemEnum in the compiled submission schema

And the other four enums, which are contiguous at this point in time.

An example value for EcosystemSubtypeEnum is Floodplain and is currently modeled in this style

      Floodplain:
        text: Floodplain
        description: placeholder PV descr

Floodplain doesn't appear anywhere in the nmdc-schema

I think these enumerations originate in https://github.com/microbiomedata/submission-schema/blob/main/schemasheets/tsv_in/enums.tsv, which has been hand-curated up until now.

@turbomam
Member

@pkalita-lbl can you please help me think about the GOLD path enum lifecycle?

  • What are the consequences of changing them?
  • Do you still have code in the SubmissionPortal that checks the value in one column against the values in the other four columns?

@turbomam
Member

turbomam commented Jan 11, 2024

schemasheets/tsv_in/enums.tsv has the following columns:

  • DH pulldown column
  • DH pulldown option
  • description
  • term_id
  • old SNTC name
  • MIxS see also
  • notes

For all practical purposes, we're just asserting the enum name and the permissible value name in DH pulldown column and DH pulldown option. I have been asserting 'placeholder PV descr' as the description for some black-magic reason that I can't remember.
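
As a rough illustration of what asserting an enum name and permissible value in those columns looks like, here is a sketch that emits tab-separated rows with the seven column names listed above. It is an assumption that a plain header row is enough; the real enums.tsv may carry additional schemasheets header rows that this ignores. `enum_rows` and `to_tsv` are hypothetical helper names.

```python
import csv
import io

# Column names as listed for schemasheets/tsv_in/enums.tsv.
COLUMNS = [
    "DH pulldown column",
    "DH pulldown option",
    "description",
    "term_id",
    "old SNTC name",
    "MIxS see also",
    "notes",
]


def enum_rows(enum_name, values, description="placeholder PV descr"):
    """Build one row per permissible value; only three columns are populated."""
    return [
        {
            "DH pulldown column": enum_name,
            "DH pulldown option": value,
            "description": description,
        }
        for value in values
    ]


def to_tsv(rows):
    """Serialize rows as TSV text, leaving unpopulated columns empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tsv(enum_rows("SpecificEcosystemEnum", ["Peat", "Bulk Soil"])))
```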

@turbomam
Member

turbomam commented Jan 11, 2024

Fetching ecosystem path data from GOLD

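# Download GOLD's ecosystem path spreadsheet; the URL is quoted so the shell does not glob on "?".
assets/GOLDs5levelEcosystemClassificationPaths.xlsx:
	curl -o $@ "https://gold.jgi.doe.gov/download?mode=ecosystempaths"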

GOLD's source file calls the path elements

  • ECOSYSTEM
  • ECOSYSTEM CATEGORY
  • ECOSYSTEM TYPE
  • ECOSYSTEM SUBTYPE
  • SPECIFIC ECOSYSTEM

We are calling the enums

  • EcosystemCategoryEnum
  • EcosystemEnum
  • EcosystemSubtypeEnum
  • EcosystemTypeEnum
  • SpecificEcosystemEnum
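
The two naming schemes line up mechanically, which a script could exploit. The sketch below derives the enum name from GOLD's column header; `enum_name_for` is a hypothetical helper, and the rule (capitalize each word, append `Enum`) is inferred from the two lists above rather than taken from the schema code.

```python
GOLD_COLUMNS = (
    "ECOSYSTEM",
    "ECOSYSTEM CATEGORY",
    "ECOSYSTEM TYPE",
    "ECOSYSTEM SUBTYPE",
    "SPECIFIC ECOSYSTEM",
)


def enum_name_for(gold_column):
    """Map a GOLD column header to the matching enum name.

    Example: "ECOSYSTEM SUBTYPE" becomes "EcosystemSubtypeEnum".
    """
    words = gold_column.split()
    return "".join(word.capitalize() for word in words) + "Enum"


if __name__ == "__main__":
    for column in GOLD_COLUMNS:
        print(f"{column} -> {enum_name_for(column)}")
```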

@turbomam
Member

I started working on this out of nmdc-schema. We can move this later if it does what you want.

@pkalita-lbl
Collaborator

pkalita-lbl commented Jan 11, 2024

Right, submission-schema has an enum for each of the GOLD pathway levels. They are definitely not complete. For instance, the first three levels only allow one permissible value each (EcosystemEnum, EcosystemCategoryEnum, EcosystemTypeEnum). The other two offer more options, but again are definitely not complete (EcosystemSubtypeEnum, SpecificEcosystemEnum). I assume the incompleteness was deliberate because there sure are a lot of options, but that decision predates my time on this project.

There is also custom code in the submission portal that alters the behavior of those five columns so that you only get suggestions for valid paths. The logic is driven in part by this file: https://gold.jgi.doe.gov/download?mode=biosampleEcosystemsJson (we bake a copy into the submission portal code; we don't constantly re-fetch it). So for example, when you go to fill in the specific_ecosystem column, the options that get presented to you are determined by what the GOLD JSON file says are valid values based on the values in the 4 previous columns, and then we subset that by what's permissible according to SpecificEcosystemEnum.
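
A minimal sketch of that two-step filtering, not the portal's actual code. It assumes the valid paths have been flattened into 5-tuples (the real GOLD JSON layout is not described here), and `suggest_values` is a hypothetical name.

```python
def suggest_values(valid_paths, previous_values, permissible_values):
    """Suggest options for the next column.

    Step 1: keep paths whose leading elements match the columns already filled in.
    Step 2: keep only next-level values that the enum also permits.
    """
    depth = len(previous_values)
    prefix = tuple(previous_values)
    candidates = {
        path[depth] for path in valid_paths if tuple(path[:depth]) == prefix
    }
    return sorted(candidates & set(permissible_values))


if __name__ == "__main__":
    # Illustrative paths and enum contents only (assumption), not real GOLD data.
    paths = [
        ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
        ("Environmental", "Terrestrial", "Soil", "Wetlands", "Bog"),
        ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
    ]
    filled_in = ("Environmental", "Terrestrial", "Soil", "Wetlands")
    print(suggest_values(paths, filled_in, {"Peat", "Sediment"}))
```

With this shape, a value that GOLD considers valid for the path but that is missing from the enum (Bog in the illustrative data) is never offered, which is the same kind of gap the Peat report exposed.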

I see two options going forward:

  1. We could get rid of the 5 enums in submission-schema and make the range of the 5 slots string (mimicking what we do in nmdc-schema). Then the logic for the dropdowns in the submission portal would only be driven by the GOLD JSON file. That makes updating to get the latest GOLD terms easy; it's just that one file. The potential downside is that you lose the ability to exclude GOLD terms that we deem irrelevant.
  2. We write some kind of script to inject all of the GOLD terms into the corresponding enums in submission-schema. The update process would then be: run that script and commit the changes to submission-schema, then update the GOLD JSON file in nmdc-server. Depending on how sophisticated we make that script, we could potentially exclude certain GOLD paths if that's desired.
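
To make option 2 concrete, here is a hedged sketch of the injection step only. It assumes the GOLD paths are available as 5-tuples and builds LinkML-style `permissible_values` mappings in memory; writing them back into submission-schema is left out. `build_enums` and `excluded_prefixes` are hypothetical names, and exclusion by path prefix is just one possible rule.

```python
ENUM_NAMES = (
    "EcosystemEnum",
    "EcosystemCategoryEnum",
    "EcosystemTypeEnum",
    "EcosystemSubtypeEnum",
    "SpecificEcosystemEnum",
)


def build_enums(paths, excluded_prefixes=()):
    """Collect permissible values per enum, skipping paths under an excluded prefix."""
    enums = {name: {"permissible_values": {}} for name in ENUM_NAMES}
    excluded = [tuple(prefix) for prefix in excluded_prefixes]
    for path in paths:
        if any(tuple(path[: len(prefix)]) == prefix for prefix in excluded):
            continue  # this GOLD path was deemed irrelevant
        for name, value in zip(ENUM_NAMES, path):
            enums[name]["permissible_values"][value] = {"text": value}
    return enums


if __name__ == "__main__":
    # Illustrative paths only (assumption), not real GOLD data.
    paths = [
        ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
        ("Host-associated", "Plants", "Roots", "Rhizosphere", "Soil"),
    ]
    result = build_enums(paths, excluded_prefixes=[("Host-associated",)])
    for name, definition in result.items():
        print(name, sorted(definition["permissible_values"]))
```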

@pkalita-lbl
Collaborator

Also a long time ago I tried generating a LinkML schema that encoded the valid pathways as rules (code here https://github.com/pkalita-lbl/gold-ecosystems-linkml). The result was so unwieldy that it was basically unusable. So no one suggest doing that!

@turbomam
Member

turbomam commented Jan 11, 2024

Thanks @pkalita-lbl !

I have implemented at least half of option 2. from above as

I don't mind if you decide to go with option 1. instead

@turbomam
Member

@pkalita-lbl (or anyone): Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?

@pkalita-lbl
Collaborator

Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?

I'm not sure but I think that's another thing that will influence how we implement the long-term process for keeping us in sync with GOLD. So I'm not sure we're ready to jump into implementing anything quite yet.

@mslarae13
Contributor

mslarae13 commented Jan 13, 2024

@pkalita-lbl

They are definitely not complete. Like, the first three levels only allow one permissible value each (EcosystemEnum, EcosystemCategoryEnum, EcosystemTypeEnum).

We did intentionally limit this. That said, how it's limited will vary from sample type to sample type (environmental extension to extension)

The other two offer more options, but again definitely not complete (EcosystemSubtypeEnum, SpecificEcosystemEnum).

The missing 'lower level' ecosystem terms are cuz GOLD updated and we didn't get the updates.

So for example, when you go to fill in the specific_ecosystem column the options that get presented to you are determined by what the GOLD JSON file says are valid values based on the values in the 4 previous columns and then we subset that by what's permissible according to SpecificEcosystemEnum.

Yes, we don't want to lose this because it should build the same way the GOLD ecosystem tree does: https://gold.jgi.doe.gov/ecosystemtree

Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?

@turbomam pretty sure that's a yes, but we haven't done it. It's really just identifying where in the tree we would limit.
So, for water, going by https://gold.jgi.doe.gov/ecosystemtree, the limit would be Environmental > Aquatic (then the other 3 levels are any child of that).
@aclum please confirm.
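
A small sketch of limiting the tree per environmental context, assuming paths are 5-tuples and that each context maps to a single path prefix. Only the water prefix comes from this thread, and it is still awaiting confirmation; `CONTEXT_PREFIXES` and `paths_for_context` are hypothetical names.

```python
# Context name -> GOLD path prefix that bounds it.
# Only "water" is taken from the discussion above; other contexts are not defined here.
CONTEXT_PREFIXES = {
    "water": ("Environmental", "Aquatic"),
}


def paths_for_context(paths, context):
    """Return the GOLD paths that sit under the prefix for the given context."""
    prefix = CONTEXT_PREFIXES[context]
    return [path for path in paths if tuple(path[: len(prefix)]) == prefix]


if __name__ == "__main__":
    # Illustrative paths only (assumption), not real GOLD data.
    paths = [
        ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
        ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ]
    print(paths_for_context(paths, "water"))
```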

@pkalita-lbl
Collaborator
