Determine if and how rel_to_oxygen will be used in the submission schema #58

Closed
Tracked by #587
turbomam opened this issue Mar 8, 2022 · 21 comments
Labels
Prod code release needed Updates made, production not currently using this version. Code release required to see in prod

Comments

@turbomam
Member

turbomam commented Mar 8, 2022

This issue illustrates approaches for repairing columns whose values come from an enumeration of permissible values, also known as a controlled vocabulary. Look for 'enumeration' in the MIxS Expected value column, or a range matching '*_enum' in the LinkML model. See the reference material below.

Related code: sample_annotator/rel_to_oxygen_example.py

Permissible values

  • aerobe
  • anaerobe
  • facultative
  • microaerophilic
  • microanaerobe
  • obligate aerobe
  • obligate anaerobe

Reference material

Observed, with matches

| rel_to_oxygen | r2o_count | lc_trimmed_r2o | match | interpretation |
| --- | --- | --- | --- | --- |
| None | 44666 | none | | |
| aerobe | 3940 | aerobe | aerobe | |
| obligate anaerobe | 66 | obligate anaerobe | obligate anaerobe | |
| oxic | 59 | oxic | | |
| anaerobe | 29 | anaerobe | anaerobe | |
| facultative anaerobes | 21 | facultative anaerobes | | facultative |
| Aerobic | 20 | aerobic | aerobe | |
| aerobic | 18 | aerobic | aerobe | |
| anaerobic | 18 | anaerobic | anaerobe | |
| Oxic | 13 | oxic | | |
| microaerophilic | 11 | microaerophilic | microaerophilic | |
| hypoxic | 6 | hypoxic | | |
| normal oxic seawater | 4 | normal oxic seawater | | |
| oxic/anoxic boundary | 4 | oxic/anoxic boundary | | |
| 22 mg/l | 3 | 22 mg/l | | oxic |
| 6.0-6.5 mg/l | 3 | 6.0-6.5 mg/l | | |
| Hypoxic | 3 | hypoxic | | |
| 0 mg/l | 1 | 0 mg/l | | anoxic |
| 1.0-2.2 mg/l | 1 | 1.0-2.2 mg/l | | |
| 23.5 mg/l | 1 | 23.5 mg/l | | oxic |
| 25 mg/l | 1 | 25 mg/l | | oxic |
| aerobic-anaerobic | 1 | aerobic-anaerobic | | |
| facultative | 1 | facultative | facultative | |
| facultative anaerobe | 1 | facultative anaerobe | | facultative |
| obligate | 1 | obligate | | |

Easy fixes:

  • normalize capitalization and trim leading, trailing, and extra internal whitespace
  • allow matches between noun and adjective forms (aerobic -> aerobe)
  • allow matches between facultative anaerobe and facultative
    • see background information below
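
The easy fixes listed above could be sketched roughly as follows. This is a minimal illustration, not the code in sample_annotator/rel_to_oxygen_example.py: the names `repair_rel_to_oxygen`, `ADJECTIVE_TO_NOUN`, and `INTERPRETATIONS` are hypothetical, and the mappings cover only the variants that appear in the table.

```python
import re

# Permissible values copied from the rel_to_oxygen list in this issue.
PERMISSIBLE = {
    "aerobe", "anaerobe", "facultative", "microaerophilic",
    "microanaerobe", "obligate aerobe", "obligate anaerobe",
}

# Assumed adjective-to-noun map; only the forms observed in the table.
ADJECTIVE_TO_NOUN = {"aerobic": "aerobe", "anaerobic": "anaerobe"}

# Assumed interpretations for the "facultative anaerobe(s)" variants.
INTERPRETATIONS = {
    "facultative anaerobe": "facultative",
    "facultative anaerobes": "facultative",
}


def normalize(raw):
    """Lowercase, trim, and collapse internal runs of whitespace."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def repair_rel_to_oxygen(raw):
    """Return a permissible value, or None when no easy fix applies."""
    value = normalize(raw)
    if value in PERMISSIBLE:
        return value
    if value in ADJECTIVE_TO_NOUN:
        return ADJECTIVE_TO_NOUN[value]
    return INTERPRETATIONS.get(value)
```

Values such as oxic or the mg/l concentrations deliberately fall through to None, so they stay visible for manual review instead of being silently coerced.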

Trickier!

Probably not justified when the count is really low, like 1

Gotchas:

  • aerobe is a noun: a microorganism that requires the presence of oxygen
  • aerobic is an adjective which can be applied to an organism
  • I believe that oxic is an adjective that describes the water an organism lives in, not the organism itself
@mslarae13

mslarae13 commented Jan 30, 2023

Can we run a query that asks "of the samples captured in NMDC (mongoDB), do any of the Biosample objects have this slot (or oxy_stat_samp) filled out? If so, what is there?"

@mslarae13 mslarae13 moved this from 🔖 Ready to 🏗 In progress in SubPort Squad Issues Jan 30, 2023
@mslarae13

Once @turbomam has made a query from NMDC mongoDB, reassign to Montana to check

@mslarae13 mslarae13 moved this from 🏗 In progress to 📋 Backlog in SubPort Squad Issues Feb 9, 2023
@mslarae13 mslarae13 moved this from 📋 Backlog to 🔖 Ready in SubPort Squad Issues May 5, 2023
@mslarae13

Only keep rel_to_oxygen. Note in rel_to_oxygen that this is applicable to "Column: oxygenation status of sample".

@turbomam
Member Author

db.getCollection("biosample_set").find( { part_of : { $exists : true } } );

2449

db.getCollection("biosample_set").find( { rel_to_oxygen : { $exists : true } } );

0

db.getCollection("biosample_set").find( { oxy_stat_samp : { $exists : true } } );

0

@turbomam
Member Author

Neither rel_to_oxygen nor oxy_stat_samp has been provided for any biosample in the production MongoDB as of this date.
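
For anyone repeating this check without database access, the same presence test can be approximated over exported records. The sketch below assumes the biosamples have already been loaded as a list of Python dicts. `count_with_slot` is a hypothetical helper, and the records are made up for illustration; they are not NMDC data.

```python
def count_with_slot(biosamples, slot):
    """Count records in which `slot` is present as a key.

    This mirrors a {slot: {$exists: true}} filter: a key whose value is
    None still counts as present.
    """
    return sum(1 for doc in biosamples if slot in doc)


# Made-up records for illustration only.
sample_docs = [
    {"id": "example:1", "part_of": ["example:study-1"]},
    {"id": "example:2", "part_of": ["example:study-1"], "depth": 1.0},
]

for slot in ("part_of", "rel_to_oxygen", "oxy_stat_samp"):
    print(slot, count_with_slot(sample_docs, slot))
```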

@turbomam
Member Author

| Structured comment name | Item (rdfs:label) | Definition | Expected value | Value syntax | Example | Section | migs_eu | migs_ba | migs_pl | migs_vi | migs_org | mims | mimarks_s | mimarks_c | misag | mimag | miuvig | Preferred unit | Occurrence | MIXS ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rel_to_oxygen | relationship to oxygen | Is this organism an aerobe, anaerobe? Please note that aerobic and anaerobic are valid descriptors for microbial environments | enumeration | [aerobe\|anaerobe\|facultative\|microaerophilic\|microanaerobe\|obligate aerobe\|obligate anaerobe] | aerobe | nucleic acid sequence source | - | C | - | - | - | X | X | C | X | X | - | | 1 | MIXS:0000015 |

| Environmental package | Structured comment name | Package item | Definition | Expected value | Value syntax | Example | Requirement | Preferred unit | Occurrence | MIXS ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| agriculture | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic] | | C | | 1 | MIXS:0000753 |
| air | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| host-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-gut | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-oral | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-skin | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| human-vaginal | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| hydrocarbon resources-cores | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| hydrocarbon resources-fluids/swabs | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| microbial mat/biofilm | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| miscellaneous natural or artificial environment | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| plant-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| sediment | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| symbiont-associated | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample. | enumeration | [aerobic\|anaerobic] | aerobic | X | | 1 | MIXS:0000753 |
| wastewater/sludge | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |
| water | oxy_stat_samp | oxygenation status of sample | Oxygenation status of sample | enumeration | [aerobic\|anaerobic\|other] | aerobic | X | | 1 | MIXS:0000753 |

@turbomam
Member Author

@mslarae13 I agree that we should only use one of rel_to_oxygen or oxy_stat_samp for NMDC biosamples. I would be more inclined to use oxy_stat_samp since it is supposed to be about samples and rel_to_oxygen is supposed to be about organisms (for checklists like MIGS-ba?)

After deciding that, we should combine the values from the rel_to_oxygen enumeration, the oxy_stat_samp enumeration, and the values found in NCBI's biosample_set into one reasonable NMDC enumeration.

@turbomam turbomam moved this from 🔖 Ready to 🏗 In progress in SubPort Squad Issues May 21, 2023
@turbomam
Member Author

turbomam commented May 21, 2023

Here are the oxy_stat_samp values in BBOP's relational version of NCBI's biosample_set

select
	value,
	count(1)
from
	all_attribs aa
where
	aa.harmonized_name = 'oxy_stat_samp'
group by
	value
order by
	count(1) desc ;
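
| value | count |
| --- | --- |
| aerobic | 4420 |
| NA | 4332 |
| anaerobic | 3890 |
| not collected | 2730 |
| not applicable | 2165 |
| missing | 1757 |
| anaerobe | 535 |
| NOT APPLICABLE | 418 |
| 0,00 | 144 |
| none | 123 |
| N/A | 39 |
| aerobe | 36 |
| Unknown | 26 |
| not collecte | 24 |
| Not collected | 17 |
| Not available | 16 |
| 5,03 | 6 |
| 7,37 | 6 |
| 14,75 | 6 |
| 10,45 | 6 |
| 4,89 | 6 |
| 7,07 | 6 |
| unknown | 6 |
| 7,50 | 6 |
| 4,84 | 6 |
| 10,56 | 6 |
| 10,06 | 6 |
| 7,76 | 6 |
| 3,53 | 6 |
| 9,67 | 6 |
| 2,34 | 6 |
| 6,80 | 6 |
| 5,22 | 6 |
| 2,45 | 6 |
| 4,67 | 6 |
| 13,60 | 6 |
| 5,28 | 6 |
| 3,75 | 6 |
| 15,51 | 6 |
| 8.62 | 3 |
| Not applicable | 2 |
| 7,29 | 2 |
| 17,24 | 2 |
| 17,24 mg/L | 1 |
| not provided | 1 |
| the sediment is anoxic but the water colum contains O2 (6.78-7.66 mg/L) | 1 |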

@turbomam
Member Author

turbomam commented May 21, 2023

If we use oxy_stat_samp, maybe the following really would be adequate

  • aerobic
  • anaerobic
  • other

The other oxy_stat_samp values boil down to either some variant of NA or a concentration of oxygen, presumably in mg/L
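
The observation that the remaining values are either NA variants or concentrations can be checked with a small classifier. This is a rough sketch under stated assumptions: `classify_oxy_stat_samp`, `NA_VARIANTS`, and the bucket names "missing", "concentration", and "unclassified" are hypothetical review labels, not schema values, and folding the noun forms aerobe and anaerobe into the adjective forms is an assumption rather than something decided in this thread.

```python
import re

# Assumed spellings of "no value", lowercased, taken from the observed counts.
NA_VARIANTS = {
    "na", "n/a", "none", "missing", "unknown", "not collected",
    "not collecte", "not applicable", "not available", "not provided",
}

# A number using "." or "," as the decimal separator, optionally with mg/L.
CONCENTRATION = re.compile(r"^\d+(?:[.,]\d+)?(?:\s*mg/l)?$")

# Assumption: noun forms seen in the data map to the enum's adjective forms.
NOUN_TO_ADJECTIVE = {"aerobe": "aerobic", "anaerobe": "anaerobic"}


def classify_oxy_stat_samp(raw):
    """Sort a raw value into aerobic, anaerobic, or a review bucket."""
    value = raw.strip().lower()
    value = NOUN_TO_ADJECTIVE.get(value, value)
    if value in ("aerobic", "anaerobic"):
        return value
    if value in NA_VARIANTS:
        return "missing"
    if CONCENTRATION.match(value):
        return "concentration"
    return "unclassified"
```

Under these assumptions only the free-text sediment note lands in "unclassified"; every other observed value resolves to aerobic, anaerobic, a missing-value variant, or a bare concentration.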

@turbomam turbomam changed the title subject matter knowledge for rel_to_oxygen Determine if and how rel_to_oxygen will be used in the submission schema May 21, 2023
@ssarrafan

Adding to current sprint per Mark. Need feedback from @mslarae13

@mslarae13

mslarae13 commented May 24, 2023

I would be more inclined to use oxy_stat_samp since it is supposed to be about samples and rel_to_oxygen is supposed to be about organisms (for checklists like MIGS-ba?)

  • I'm good with that!

After deciding that, we should combine the values from the rel_to_oxygen enumeration, the oxy_stat_samp enumeration, and the values found in NCBI's biosample_set into one reasonable NMDC enumeration.

  • Also agree with providing the full enumeration list.
    @turbomam

@turbomam
Member Author

turbomam commented Jun 1, 2023

@mslarae13 I'm starting this now. I will provide the list of enumerated values soon.

@turbomam
Member Author

turbomam commented Jun 1, 2023

src/schema/mixs.yaml already has this

rel_to_oxygen_enum:
  from_schema: http://w3id.org/mixs/terms
  permissible_values:
    aerobe: {}
    anaerobe: {}
    facultative: {}
    microaerophilic: {}
    microanaerobe: {}
    obligate aerobe: {}
    obligate anaerobe: {}

and

oxy_stat_samp_enum:
  from_schema: http://w3id.org/mixs/terms
  permissible_values:
    aerobic: {}
    anaerobic: {}
    other: {}
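
To make the difference between the two enumerations concrete, here is a small membership check. It is only a sketch: it assumes exact, case-sensitive comparison against the permissible values, and the `ENUMS` dict and `check_slot` function are hypothetical stand-ins, not part of any schema tooling.

```python
# Permissible values copied from the two enum definitions shown in this comment.
ENUMS = {
    "rel_to_oxygen": {
        "aerobe", "anaerobe", "facultative", "microaerophilic",
        "microanaerobe", "obligate aerobe", "obligate anaerobe",
    },
    "oxy_stat_samp": {"aerobic", "anaerobic", "other"},
}


def check_slot(slot, value):
    """Return True when `value` is exactly a permissible value for `slot`."""
    return value in ENUMS[slot]


# The same word can be valid for one slot and invalid for the other.
print(check_slot("oxy_stat_samp", "aerobic"))
print(check_slot("rel_to_oxygen", "aerobic"))
```

The adjective aerobic passes for oxy_stat_samp but not for rel_to_oxygen, which expects the noun aerobe.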

@turbomam
Member Author

turbomam commented Jun 1, 2023

Let's leave the range of oxy_stat_samp as the existing oxy_stat_samp_enum. I don't think it makes sense to describe a sample as any of these

  • facultative
  • microaerophilic
  • microanaerobe
  • obligate aerobe
  • obligate anaerobe

I guess if we found some decisive cutoffs between different oxygenation states, we could update oxy_stat_samp_enum.

@ssarrafan

Based on recent update will move to new sprint to be closed

@mslarae13

@turbomam I'm good with that. Will we leave the 'other' option?

@turbomam
Member Author

turbomam commented Jun 7, 2023

Yes, I included 'other'. This should be in nmdc-schema 7.6.0 and submission-schema 7.6.0 now. I'll confirm in a few minutes.

@turbomam
Member Author

turbomam commented Jun 7, 2023

confirmed: submission schema 7.6.0 updated as described

@turbomam turbomam moved this from 🏗 In progress to 👀 In review/Pending Release in SubPort Squad Issues Jun 7, 2023
@mslarae13

Thanks @turbomam

@pkalita-lbl can we get this change propagated to the submission schema?

@pkalita-lbl

If I'm reading Mark's comments correctly these changes went into submission schema v7.6.0. A later version of the submission schema (v7.6.5) is already used by the portal codebase but it hasn't been released to production yet. So I would expect you'd be able to see this in dev right now.

@mslarae13 mslarae13 added the Prod code release needed Updates made, production not currently using this version. Code release required to see in prod label Jul 7, 2023
@ssarrafan

Schema updates have been done since so closing this issue.

@github-project-automation github-project-automation bot moved this from 👀 In review/Pending Release to ✅ Done in SubPort Squad Issues Dec 21, 2023
@ssarrafan ssarrafan removed the backlog label Dec 21, 2023
Projects
Status: ✅ SubPort 1 - Done

4 participants