Update water content validation #143

Closed
mslarae13 opened this issue Jul 26, 2022 · 10 comments
Labels
interim fix add to issues that should be revisited, may require strategic refactoring invalid This doesn't seem right

Comments

@mslarae13

Water content and water content method are 2 MIxS fields used in the NMDC submission template.

Currently, MIxS says
water content method

  • Reference or method used in determining the water content of soil
  • PMID,DOI or url | {PMID}|{DOI}|{URL}

water content

  • Water content measurement
  • measurement value | {float}
  • example: gram per gram or cubic centimeter per cubic centimeter

Water content (in soils and sediment) can be measured in a variety of ways, hence the water content method field.
You can use

  • percent dry or wet weight % (75%, 75 %, .75)
  • g of water / g dry soil (5 g water / g dry soil)
  • cubic centimeter per cubic centimeter
  • Water holding capacity (0.75, 75% water, .75 g water per g soil WHC)
  • water filled pore space (60% WFPS)
  • & more, these are just the ones I've used in my research.

Each of these is formatted slightly differently.

How do we validate this?

@mslarae13
Author

@turbomam FYI

@mslarae13 mslarae13 added the invalid This doesn't seem right label Jul 26, 2022
@turbomam
Member

Thanks for the examples.

MIxS specifies that values for water_content (aka "water content", aka MIXS:0000185) should be measurement values. I don't think measurement value is actually defined in the MIxS Sheets, but @cmungall's team has equated them with NMDC's quantity values, which have the following sub-attributes

DataHarmonizer takes input that's flattened, not structured, so we have translated the MIxS Value syntax of {float} {unit} into a requirement for

  1. a floating point number
  2. followed by exactly one whitespace
  3. followed by a unit string that doesn't include any whitespaces

Informally speaking

That allows us to parse the flattened string from DataHarmonizer into the quantity value structure described above, which should make searching (and possibly even unit conversion) more fruitful.
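
As a rough illustration of the three-part rule above, here is a minimal sketch of strict validation plus parsing. The regular expression, the function name `parse_strict`, and the dictionary keys are all assumptions made for this example; they are not the actual pattern or slot names used by DataHarmonizer or the NMDC schema.

```python
import re

# Hypothetical strict pattern: a float, exactly one whitespace character,
# then a unit string that contains no whitespace.
STRICT_QUANTITY = re.compile(r"([-+]?(?:\d+\.?\d*|\.\d+))\s(\S+)")


def parse_strict(raw):
    """Return a dict with the value and unit, or None if `raw` is invalid."""
    match = STRICT_QUANTITY.fullmatch(raw)
    if match is None:
        return None
    # Key names are illustrative only, not the NMDC quantity value slots.
    return {
        "raw_value": raw,
        "numeric_value": float(match.group(1)),
        "unit": match.group(2),
    }


for example in ["5 g/g", "75 %", "75%", "5 g water / g dry soil"]:
    print(repr(example), "->", parse_strict(example))
```

Under this sketch, "5 g/g" and "75 %" pass, while "75%" (no whitespace) and "5 g water / g dry soil" (whitespace inside the unit) are rejected.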

But it's not compatible with values that you and other scientists use!

I think you are suggesting that we turn all validation off, allowing any string. That would be a quick fix, but it would lead to worse search and unit conversion results.

I would prefer to globally revise quantity value to allow whitespace in the unit portion and/or even allow zero or more whitespaces between the value and the unit. Do you think allowing that flexibility would have a bad impact on any of the other fields/columns/slots?
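
A relaxed rule along those lines could look like the sketch below. The pattern and the name `parse_relaxed` are assumptions for illustration, not the validation that was eventually adopted. It still requires a unit, so a bare ".75" would not pass, and it does not handle scientific notation.

```python
import re

# Hypothetical relaxed pattern:
#   1. a float (no exponent support in this sketch)
#   2. zero or more whitespace characters
#   3. a unit that may contain internal whitespace but must start and end
#      with a non-whitespace character
# The negative lookahead stops the regex from splitting "75" into 7 and "5".
RELAXED_QUANTITY = re.compile(
    r"([-+]?(?:\d+\.?\d*|\.\d+))(?![\d.])\s*(\S(?:.*\S)?)"
)


def parse_relaxed(raw):
    """Return (value, unit) if `raw` matches the relaxed rule, else None."""
    match = RELAXED_QUANTITY.fullmatch(raw)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


for example in ["75%", "75 %", "5 g water / g dry soil", "60% WFPS", ".75"]:
    print(repr(example), "->", parse_relaxed(example))
```

Accepting a string this way only says that it has a value/unit shape; it says nothing about whether the unit text can later be interpreted.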

Here's what your examples (plus one of my own) get parsed into if we send them directly to the quantulum3 parser without any additional validation. Most but not all of them can be parsed faithfully into values and units.

from quantulum3 import parser

examples = [
    "75%",
    "75 %",
    ".75",
    "5 g water / g dry soil",
    "5 cc per cc",
    "5 cc/cc",
    ".75",
    "75% water",
    ".75 g water per g soil WHC",
    "60% WFPS",
    "5 g/g",
]

# Parse each example string and print what quantulum3 extracts from it
for ex in examples:
    ex_parsed = parser.parse(ex)
    print(f'"{ex}" is parsed into {ex_parsed}')
  • "75%" is parsed into [Quantity(75, "Unit(name="percentage", entity=Entity("dimensionless"), uri=Percentage)")]
  • "75 %" is parsed into [Quantity(75, "Unit(name="percentage", entity=Entity("dimensionless"), uri=Percentage)")]
  • ".75" is parsed into [Quantity(0.75, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]
  • "5 g water / g dry soil" is parsed into [Quantity(5, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]
  • "5 cc per cc" is parsed into [Quantity(5, "Unit(name="cubic centimetre per cubic centimetre", entity=Entity("unknown"), uri=None)")]
  • "5 cc/cc" is parsed into [Quantity(5, "Unit(name="cubic centimetre per cubic centimetre", entity=Entity("unknown"), uri=None)")]
  • ".75" is parsed into [Quantity(0.75, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")]
  • "75% water" is parsed into [Quantity(75, "Unit(name="percentage", entity=Entity("dimensionless"), uri=Percentage)")]
  • ".75 g water per g soil WHC" is parsed into [Quantity(0.75, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]
  • "60% WFPS" is parsed into [Quantity(60, "Unit(name="percentage", entity=Entity("dimensionless"), uri=Percentage)")]
  • "5 g/g" is parsed into [Quantity(5, "Unit(name="gram per gram", entity=Entity("unknown"), uri=None)")]

PS depth is a quantity value, too. That's why we can retire depth2 now, after converting the current MongoDB contents. I think we may already have some emails or GH issues on that, and I will follow up there.

@turbomam
Member

Would also address #140 from @pvangay

@mslarae13
Author

@turbomam I'm good with that proposed solution, as long as it will validate. We don't need to make it an open string.
@cmungall do you have an opinion?

@pvangay

pvangay commented Aug 4, 2022

@turbomam catching up on this issue. I think your proposed solution here makes a ton of sense.

allow whitespace in the unit portion and/or even allow zero or more whitespaces between the value and the unit

Am I reading this correctly that these two would incorrectly parse? Is there a way to address this?

"5 g water / g dry soil" is parsed into [Quantity(5, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]
".75 g water per g soil WHC" is parsed into [Quantity(0.75, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]

Agree that this would definitely also fix #140

@turbomam
Member

turbomam commented Aug 4, 2022

I'll make a regexr test page containing my proposed validation and you can try some values that you think should pass and values that you think shouldn't pass. Even better, you could make a list of three or four of each in advance.

The two parsing results you provided are the real output from the value/unit parser we use, quantulum3. Getting those compound units to parse out would require us to write our own custom NMDC value/unit parser or to retrain the quantulum3 parser.

Note that unit parsing and value/unit validation are two different things.

@ssarrafan

Based on discussion at Infrastructure sync meeting, adding to the August sprint

@mslarae13
Author

Will update the submission portal schema to allow validation to pass. However, the chosen solution makes it difficult to parse the results and will need to be revisited. Marking this as the interim fix.

See #148 for next step in correcting this.

@mslarae13 mslarae13 added the interim fix add to issues that should be revisited, may require strategic refactoring label Aug 5, 2022
@turbomam turbomam moved this from To Do to In Progress in NMDC August 2022 Sprint Aug 16, 2022
@ssarrafan

@mslarae13 is the interim fix done? Can this issue be closed?

@mslarae13
Author

Yes. Water content validates now.

Repository owner moved this from Todo to Done in NMDC September 1-16 2022 Sprint Sep 2, 2022
4 participants