This repository has been archived by the owner on Jul 12, 2024. It is now read-only.

bugs in data.tsv (and upstream yaml) from use_modular_gd.py #24

Closed
27 of 31 tasks
turbomam opened this issue Jan 7, 2022 · 6 comments

@turbomam
Member

turbomam commented Jan 7, 2022

  • if the pattern looks like a list, make it an enumeration/pulldown

    • solved by processing MIxS after NMDC because the list-like patterns came from NMDC and get overwritten by MIxS?
    • they are still in MIxS string serialization
  • add hierarchical indentation of enumerated values

  • Add support for partial date columns and time columns

  • align section composition and ordering with @mslarae13's Example Use tab

    • that info will be added to other more structured tabs
  • tidy the descriptions

  • add meanings for enumerated values

    • lookup with enum_annotator (see example in Makefile)
    • expose as Ontology IDs
    • some enum labels from MIxS have added parenthetic content that makes string matching difficult: meadows (grasses,alfalfa,fescue,bromegrass,timothy)
  • add more patterns based on

    • string_serialization
    • slot's range... pretty thorough at this point
  • populate examples column?

  • are terms being included even though they are marked skip on nmdc_biosample_slots ?

  • Ontology ID

  • terse labels (from apparent prioritization of NMDC over MIxS annotations?)

  • parent classes with "https:" prefixes

    • long-term solution: align section composition and ordering below
    • short-term solution comes from full URLs (prefer prefixed) above
  • the number of required fields seems too low

    • added requirements from slot usages
  • elaborate on the use of regular expressions in the guidance column. Also include the string serialization?

  • Where is the default PV in the sample_type enum coming from?

    • @click.option('--default_data_status', default="default", show_default=True)
  • what does the Null values section in the double-click header help mean? see cidgoh/DataHarmonizer#244 ("What's the Null values section in the double-click-the-header help?")

    • shows the contents of the data status column in data.tsv, which I was populating with --default_data_status
  • take advantage of min and max values for pH (anything else?)

  • whose id-like fields should be used? The ones from NMDC or the ones created by @mslarae13?

    • using identifiers from biosample_identification_slots
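
For the item above about taking advantage of min and max values for pH, here is a minimal sketch of how declared bounds could feed both a guidance string and a range check. The slot dictionary, its key names, and the 0 to 14 bounds are illustrative assumptions, not values read from the NMDC or MIxS schemas.

```python
# Hypothetical slot metadata; the bounds are illustrative assumptions,
# not values taken from the NMDC or MIxS schemas.
PH_SLOT = {"name": "ph", "minimum_value": 0.0, "maximum_value": 14.0}


def range_guidance(slot: dict) -> str:
    """Build a guidance string from whichever bounds the slot declares."""
    parts = []
    if slot.get("minimum_value") is not None:
        parts.append(f"minimum: {slot['minimum_value']}")
    if slot.get("maximum_value") is not None:
        parts.append(f"maximum: {slot['maximum_value']}")
    return "; ".join(parts)


def in_range(value: float, slot: dict) -> bool:
    """Return True if value respects every bound the slot declares."""
    lo = slot.get("minimum_value")
    hi = slot.get("maximum_value")
    return (lo is None or value >= lo) and (hi is None or value <= hi)
```

The same two helpers would apply to any other slot that declares one or both bounds; a slot with no bounds yields an empty guidance string and accepts every value.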
@turbomam
Member Author

turbomam commented Jan 7, 2022

MIxS slots with NMDC URLs like https://microbiomedata/schema/mixs/ph

just put the MIxS term requests second in the tasks dict

This also mostly resolves

terse labels (from apparent prioritization of NMDC over MIxS annotations?)

Improvements to the NMDC terms could be made through changes to the nmdc schema, or through curation in the nmdc_biosample_slots tab (as we will probably do for the descriptions).
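
A rough sketch of why the order of entries in the tasks dict matters, assuming that later entries overwrite earlier ones field by field (which is how the fix above reads). The record layout, the `merge_in_order` helper, and the field values are hypothetical; the real structures in use_modular_gd.py are not shown in this issue.

```python
# Hypothetical per-source slot records; field names and values are made up.
nmdc_slots = {"ph": {"title": "ph", "source": "NMDC"}}
mixs_slots = {"ph": {"title": "pH", "source": "MIxS"}}


def merge_in_order(tasks: dict) -> dict:
    """Merge slot records source by source; later sources win on conflicts."""
    merged = {}
    # Dicts preserve insertion order (Python 3.7+), so the order in which
    # the sources were added to `tasks` decides which annotation survives.
    for _source_name, slots in tasks.items():
        for slot_name, record in slots.items():
            merged[slot_name] = {**merged.get(slot_name, {}), **record}
    return merged


mixs_second = merge_in_order({"nmdc": nmdc_slots, "mixs": mixs_slots})
mixs_first = merge_in_order({"mixs": mixs_slots, "nmdc": nmdc_slots})
```

With MIxS second, `mixs_second["ph"]` carries the MIxS record; with MIxS first, the NMDC record wins instead.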

@turbomam
Member Author

turbomam commented Jan 7, 2022

Ontology ID/full URLs (prefer prefixed)

example: https://microbiomedata/schema/ecosystem

would prefer for that to appear as nmdc:ecosystem

Current definition of nmdc prefix:

nmdc:
    prefix_prefix: nmdc
    prefix_reference: https://microbiomedata/meta/
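
A small sketch of prefix contraction by plain string matching, which may help explain the full URL: the example URL sits under `/schema/`, while the nmdc prefix_reference above ends in `/meta/`, so a prefix-based lookup has nothing to match. The `contract` helper and the second, `/schema/`-based mapping are hypothetical and shown only for contrast; this is not necessarily how the schema tooling contracts URIs.

```python
def contract(uri: str, prefixes: dict) -> str:
    """Return 'prefix:local' if the URI starts with a known reference, else the URI unchanged."""
    # Try the longest references first so the most specific prefix wins.
    ordered = sorted(prefixes.items(), key=lambda kv: len(kv[1]), reverse=True)
    for prefix, reference in ordered:
        if uri.startswith(reference):
            return f"{prefix}:{uri[len(reference):]}"
    return uri


uri = "https://microbiomedata/schema/ecosystem"

# The prefix map as defined above.
current = {"nmdc": "https://microbiomedata/meta/"}
# Hypothetical alternative, for illustration only.
hypothetical = {"nmdc": "https://microbiomedata/schema/"}

unchanged = contract(uri, current)
contracted = contract(uri, hypothetical)
```

Here `unchanged` is still the full URL, while `contracted` is `nmdc:ecosystem`.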

@turbomam
Member Author

turbomam commented Jan 7, 2022

enums included so far, with enrichment success

  • cur_land_use_enum medium, especially if \(.*$ is removed before searching
  • drainage_class_enum poor
  • fao_class_enum good
  • profile_position_enum errors out on first term, backslope
  • soil_horizon_enum poor
  • tillage_enum errors out on first term, chisel
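
A minimal sketch of the label cleanup mentioned for cur_land_use_enum: dropping everything from the first opening parenthesis onward before the label is used as a search string. The `\(.*$` pattern is the one quoted above; the leading `\s*` and the `search_label` helper name are additions for illustration.

```python
import re

# The quoted pattern, with optional leading whitespace added so that no
# trailing space is left behind after the parenthetic content is removed.
TRAILING_PAREN = re.compile(r"\s*\(.*$")


def search_label(label: str) -> str:
    """Strip trailing parenthetic content from an enum label."""
    return TRAILING_PAREN.sub("", label).strip()


cleaned = search_label("meadows (grasses,alfalfa,fescue,bromegrass,timothy)")
```

Here `cleaned` is `meadows`; labels without parentheses, such as `backslope`, pass through unchanged.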

cmungall added a commit to microbiomedata/nmdc-schema that referenced this issue Jan 7, 2022
turbomam added a commit that referenced this issue Jan 10, 2022
turbomam added a commit that referenced this issue Jan 10, 2022
many improvements from #24

Self-merging in order to see in GH pages
@turbomam
Member Author

Tabulation of ranges for NMDC and MIxS as-is only

To-do

external identifier       3
double                    1

enums

cur_land_use_enum         1
drainage_class_enum       1
fao_class_enum            1
profile_position_enum     1
soil_horizon_enum         1
tillage_enum              1

Handled already

string                   39
    xsd:token by default
quantity value           16
date                      3
    xsd:date
    see notes above regarding partial dates and times
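
One possible shape for the partial date and time patterns, as a sketch only: it accepts YYYY, YYYY-MM, or YYYY-MM-DD for dates and HH:MM:SS on a 24-hour clock for times. The pattern names and helper functions are hypothetical, and the choice of which partial forms to allow is an assumption rather than something settled in this issue.

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD
PARTIAL_DATE = re.compile(
    r"\d{4}(?:-(?:0[1-9]|1[0-2])(?:-(?:0[1-9]|[12]\d|3[01]))?)?"
)
# HH:MM:SS on a 24-hour clock
TIME_OF_DAY = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d")


def is_partial_date(value: str) -> bool:
    """True if the whole string is a year, year-month, or full date."""
    return PARTIAL_DATE.fullmatch(value) is not None


def is_time_of_day(value: str) -> bool:
    """True if the whole string is a 24-hour HH:MM:SS time."""
    return TIME_OF_DAY.fullmatch(value) is not None
```

These patterns check shape only: a value such as 2021-02-31 still passes, so calendar validity would need a separate check.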

turbomam added a commit that referenced this issue Jan 10, 2022
@turbomam
Member Author

New tabulations:

TABULATION OF SLOT RANGES, for prioritizing range->regex conversion
string                           38
quantity value                   16
external identifier               3
date                              3
oxygen_relationship_enum          1
storage_condt_enum                1
soil_horizon_enum                 1
sample_type_enum                  1
samp_biotic_relationship_enum     1
profile_position_enum             1
fao_class_enum                    1
growth_facility_enum              1
env_package_enum                  1
drainage_class_enum               1
cur_land_use_enum                 1
analysis_type_enum                1
tillage_enum                      1
dtype: int64


TABULATION OF STRING SERIALIZATIONS, for prioritizing serialization->regex conversion
<none>                                                                                                 30
{PMID}|{DOI}|{URL}                                                                                     20
{text}                                                                                                  8
enumeration                                                                                             7
{float} {unit}                                                                                          4
{termLabel} {[termID]}                                                                                  4
{text};{float} {unit}                                                                                   3
{integer}                                                                                               2
{timestamp}                                                                                             2
{text}:{text}                                                                                           2
{text};{timestamp}                                                                                      1
[summit|shoulder|backslope|footslope|toeslope]                                                          1
{PMID}|{DOI}|{URL}|{text}                                                                               1
{termLabel} {[termID]}|{text}                                                                           1
{{text}|{float} {unit}};{float} {unit}                                                                  1
[O horizon|A horizon|E horizon|B horizon|C horizon|R layer|Permafrost]                                  1
{float}                                                                                                 1
{float} C                                                                                               1
{float} {float}                                                                                         1
{termLabel} {[termID]}; {timestamp}                                                                     1
{term}: {term}, {text}                                                                                  1
[Acrisols|Andosols|Arenosols|Cambisols|Chernozems|Ferralsols|Fluvisols|Gleysols|Greyzems|Gypsisols|     1
[very poorly|poorly|somewhat poorly|moderately well|well|excessively drained]                           1
[cities|farmstead|industrial areas|roads/railroads|rock|sand|gravel|mudflats|salt flats|badlands|pe     1
{boolean};{Rn/start_time/end_time/duration}                                                             1
HH:MM:SS                                                                                                1
{text};{float} {unit};{timestamp}                                                                       1
[drill|cutting disc|ridge till|strip tillage|zonal tillage|chisel|tined|mouldboard|disc plough]         1
dtype: int64
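
The two conversions these tabulations are meant to prioritize can be sketched as follows: a serialization that looks like `[a|b|c]` is split into enumeration choices, and a serialization built only from known placeholders is translated into a regular expression. The regex fragments chosen for each placeholder and both function names are assumptions for illustration; placeholders such as `{termLabel}` or `{timestamp}` are deliberately left unhandled here.

```python
import re
from typing import List, Optional

# Assumed regex fragments for a few placeholders; not taken from MIxS or NMDC.
PLACEHOLDER_PATTERNS = {
    "{float}": r"[-+]?\d+(?:\.\d+)?",
    "{integer}": r"[-+]?\d+",
    "{unit}": r"\S+",
    "{text}": r".+",
}


def parse_choices(serialization: str) -> Optional[List[str]]:
    """Return the choices if the serialization looks like [a|b|c], else None."""
    s = serialization.strip()
    if s.startswith("[") and s.endswith("]") and "|" in s:
        return [choice.strip() for choice in s[1:-1].split("|")]
    return None


def serialization_to_regex(serialization: str) -> Optional[str]:
    """Translate known placeholders to a regex; return None if anything is unrecognized."""
    parts = []
    # The capturing group keeps the placeholders in the split result.
    for token in re.split(r"(\{[^{}]*\})", serialization):
        if token in PLACEHOLDER_PATTERNS:
            parts.append(PLACEHOLDER_PATTERNS[token])
        elif "{" in token or "}" in token:
            # Unknown placeholder or nested braces: leave for manual handling.
            return None
        else:
            parts.append(re.escape(token))
    return "^" + "".join(parts) + "$"


positions = parse_choices("[summit|shoulder|backslope|footslope|toeslope]")
quantity_regex = serialization_to_regex("{float} {unit}")
```

Serializations that were truncated in the tabulation have no closing bracket, so `parse_choices` returns None for them rather than guessing at the missing choices.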

@turbomam turbomam pinned this issue Jan 24, 2022
@turbomam turbomam self-assigned this Feb 7, 2022
@pkalita-lbl
Collaborator

Looks like all tasks here are complete except for the one about hierarchical enums which is covered by other issues. Closing.
