Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: New configs #33

Closed
wants to merge 23 commits into from
Closed

WIP: New configs #33

wants to merge 23 commits into from

Conversation

jessicaaustin
Copy link

@jessicaaustin jessicaaustin commented Mar 20, 2020

Major refactor of how test configurations are specified. Allows for storage of generic configuration objects that specify test configs based on variable, region, and time window. This is useful for applying test configs to moving platforms, such as gliders, animal tracks, etc.

The previous config object, QcConfig, still exists and functions the same, but is now known as StreamConfig.

TODO what are the major changes around storage of results?

@kwilcox
Copy link
Member

kwilcox commented Aug 20, 2020

@jessicaaustinaxiom The PR is in decent shape for a usability review. Please take a look!

@jessicaaustin
Copy link
Author

@kwilcox awesome! I will take a look next week and please yell at me if I don't.

@kwilcox
Copy link
Member

kwilcox commented Sep 8, 2020

🙀

@jessicaaustin
Copy link
Author

@kwilcox can you please rebase from master? Looks like this is 10 commits behind and I got conflicts when I tried.

@jessicaaustin
Copy link
Author

jessicaaustin commented Oct 20, 2020

@kwilcox some general thoughts on the configs:

  • StreamConfig (formely QcConfig) -- this is what i'm using currently for sensor qc
    • i have my own database table that stores a streamconfig per parameter (https://oikos.axds.co/rest/sensors/qcConfigs)
    • so while the individual configs are fairly portable, if i handed this db to someone they would have to understand what a "parameter" is. in other words, the store itself is not portable
    • but that's exactly what ContextConfig is meant to help with...
  • ContextConfig -- this defines StreamConfig's per variable
    • i was thinking about how i would use this instead of our custom db... and for the most part it would work. but the one problem is the fact that stream_id is the variable name
    • this prevents this config from being truly generic, since if i changed the name of my variable (which doesn't really change the representation of the underlying data, and thus shouldn't affect the config) then i need to update my config too
    • i think using the variable name is a valid approach, and would certainly work for some situations -- for example if you are already maintaining a database of variables somewhere and just want to key off of that (like we do, or like ooi does)
    • but it would be nice to specify other attributes that should "categorize" a config -- attributes that define what the variable is. for example, standard_name + units. or instrument make_model

For example, say you have the following config, based on variable names:

    config = """
        streams:
            atmp:
                qartod:
                    gross_range_test:
                        suspect_span: [-30, 30]
                        fail_span: [-80, 140]
            sal:
                qartod:
                    gross_range_test:
                        suspect_span: [0, 30]
                        fail_span: [0, 40]
    """

Here's a version that's more generic:

    config = """
        streams:
            some_stream_id_that_doesnt_really_matter_1:
                attrs:
                    standard_name: air_temperature
                    units: degree_Celsius
                qartod:
                    gross_range_test:
                        suspect_span: [-30, 30]
                        fail_span: [-80, 140]
            some_stream_id_that_doesnt_really_matter_2:
                attrs:
                    standard_name: sea_water_practical_salinity
                    units: 1
                qartod:
                    gross_range_test:
                        suspect_span: [0, 30]
                        fail_span: [0, 40]
    """

This could be implemented by having the StreamConfig constructor look for a specific attrs key and store it separately from the test config. Then specific stream implementations, like NetcdfStream, could use the attrs to match a var, rather than the var name. A fallback could be if the attrs are not specified, then it uses the stream_id as the variable name just like you have already.

Thoughts?

@kwilcox
Copy link
Member

kwilcox commented Oct 20, 2020

oooo great points, this is exactly what we needed to discuss. Few counter points for discussion:

Here's a version that's more generic:

I see this as less generic because it is now specific to a stream format that understands what a standard_name is.

but it would be nice to specify other attributes that should "categorize" a config -- attributes that define what the variable is. for example, standard_name + units. or instrument make_model

I see that as a job for someone using/implementing the ioos_qc library. The QC functions don't know what the data represents. The keys for each stream in a ContextConfig should be used to locate/read the data from any Stream based object. That ends up being variable names in netCDF files, column names in dataframes, and potentially a topic name in a kafka stream.

I do like the idea of an optional attrs field that any Stream based object could take advantage of (and would be required to document any usage of the field) but it wouldn't be required and would fall back to using they lookup key. It also becomes easier if we just make streams a list, I never made the jump because time.

streams:
  - key: some_stream_id
    attrs:
      name: My Fancy Description
      summary: Some summary of the checks/variable/whatever 
      standard_name: air_temperature
      units: degree_Celsius
    checks:
      qartod:
          gross_range_test:
          suspect_span: [-30, 30]
          fail_span: [-80, 140]
  - key: some_stream_id
    attrs:
      standard_name: sea_water_practical_salinity
      units: 1
    checks:
      qartod:
        gross_range_test:
          suspect_span: [0, 30]
          fail_span: [0, 40]

@jessicaaustin
Copy link
Author

jessicaaustin commented Oct 21, 2020

I see that as a job for someone using/implementing the ioos_qc library. The QC functions don't know what the data represents. The keys for each stream in a ContextConfig should be used to locate/read the data from any Stream based object

Right, this is the main point that we need to clarify -- how much is this the job of ioos_qc vs the user.

In master right now, QcConfig handles defining a set of tests, but it's outside the scope to figure out which QcConfig to apply to a given stream of data. I saw this PR as an opportunity to expand what ioos_qc is capable of, so that you could pull a config from one place, the data from another place, and they would just work together. I think this is really powerful, and builds upon the community standards that promote self-describing datasets. For example, a PI or a manufacturer like Saildrone or an IOOS RA could publish a "default" config for their data that people could start with and build upon. I mean, right now ContextConfig allows you to specify configs based on spatial and temporal constraints, so why is it such a stretch to have it based on variable attributes as well?

Now this this is expanding the scope of this PR, which is primarily about making ioos_qc capable of running on moving platforms. So maybe we should table the conversation. BUT I would like to leave that option open for the future... so then the question is, are we painting ourselves into a corner? Specifically, do we make streams a list now? Or are we ok with adding a special attrs key someday, or implementing some other way of expanding upon which configs are applied where?

Some other points:

I do like the idea of an optional attrs field that any Stream based object could take advantage of (and would be required to document any usage of the field) but it wouldn't be required and would fall back to using they lookup key

I agree with this -- attrs wouldn't have to be required, and it would fall back to just using the variable name as the lookup key (which is the current behavior in this PR).

I see this as less generic because it is now specific to a stream format that understands what a standard_name is.

The stream format doesn't need to know what a standard_name is, it just needs to have a concept of attrs and match those up to the attrs in the config. So it could be easily implemented for NetcdfStream and for XarrayStream, but not for PandasStream or NumpyStream. (though I suppose we could modify those constructors to pass in a dict of attrs if we really wanted)

@kwilcox
Copy link
Member

kwilcox commented Oct 22, 2020

Let's add support for the attrs key now and leave everything else the same. I think there is lots of value in adding some additional context and seeing how useful when implementing the streams. It does solve the use case of having "generic" configs that can be applied to certain types of streams but doesn't prevent a QC config from being "just metadata" and not generic.

@jessicaaustin
Copy link
Author

@kwilcox where do we stand with this?

I think this are the outstanding tasks:

  • rebase branch
  • fix notebooks
    • reenable nbsphinx and make sure everything runs
    • update references to QcConfig etc in existing notebooks
    • add new notebooks to the docs
  • finish up docs -- nothing for Stores yet
  • Add support for attrs -- this could be a subsequent PR

Ability to pass JSON/YAML strings and String.IO objects into loader
Starting to move to the new style YAML loading in `ruamel.yaml`
This is for backwards compat. No test changes... it works the same way.
This allows for namedtuple defaults, which are my fav
@kwilcox
Copy link
Member

kwilcox commented Nov 25, 2020

Picked back up in #39

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants