Generic interface #70

sherimickelson · 2020-04-20T15:56:07Z

This version aims to work on generic datasets, including time-slice files that contain multiple time-variant variables.

The current interface takes in a yaml file and produces the catalog information needed by intake-esm (a csv file and a json file).

Here's an example of the proposed yaml interface:

my_experiment1:
    experiment_name: cesm_experiment_1
    member_id: '001'
    data_sources:
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/atm/hist/*.cam.h0.*
      model_name: cam
      time_freq: month_1
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/lnd/hist/*clm2.h0.*
      model_name: clm
      time_freq: month_1
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/ocn/hist/*pop.h.*
      model_name: pop
      time_freq: month_1
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/ice/hist/*cice.h.*
      model_name: cice
      time_freq: month_1

my_experiment2:
    experiment_name: mpas_experiment_1
    member_id: '001'
    data_sources:
    - glob_string: /the/path/to/data/MPAS_DATA/mpas_experiment_1/*
      model_name: mpas
      time_freq: model_step

The interface has two tiers and follows a specific format. The first tier (one entry per experiment) must contain the key data_sources. Any other key added at this tier is treated as a global key/value pair (a pandas df column) and is applied to all data sources under that experiment. The second tier (one entry per data source) must contain the key glob_string, which should glob all files that belong to this "chunk". Every file it matches is assigned all of the key/value pairs (pandas df columns) listed under that data source. As in the first tier, these keys can be whatever the user would like.
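A minimal sketch of how the two tiers could be flattened into catalog rows. The dictionary below stands in for the parsed yaml, and `flatten_experiment` is a hypothetical helper, not the code in builders/tslice.py; it only illustrates how first-tier keys propagate to each data source.

```python
def flatten_experiment(experiment):
    """Return one dict per data source, with first-tier keys merged in."""
    # Everything except data_sources is treated as a global key/value pair.
    global_keys = {k: v for k, v in experiment.items() if k != "data_sources"}
    rows = []
    for source in experiment["data_sources"]:
        if "glob_string" not in source:
            raise KeyError("each data source must contain glob_string")
        # Second-tier keys override first-tier keys of the same name (an assumption).
        rows.append({**global_keys, **source})
    return rows


# Stand-in for one parsed experiment from the yaml file (paths are illustrative).
experiment = {
    "experiment_name": "cesm_experiment_1",
    "member_id": "001",
    "data_sources": [
        {"glob_string": "/the/path/to/data/atm/hist/*.cam.h0.*",
         "model_name": "cam", "time_freq": "month_1"},
        {"glob_string": "/the/path/to/data/ocn/hist/*pop.h.*",
         "model_name": "pop", "time_freq": "month_1"},
    ],
}

rows = flatten_experiment(experiment)
```

Each entry in `rows` would then correspond to one data source, ready to be expanded into one catalog row per matched file.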

These changes will not work yet under intake-esm. We still need to commit the changes needed to get lists of variables working. This work has also been started.

sherimickelson · 2020-04-20T17:11:25Z

This helps address issue #55.

andersy005

Thank you for putting this together, @sherimickelson! I left some minor comments

andersy005 · 2020-04-21T21:10:14Z

builders/tslice.py

+        look_for_unlim=[d.dimensions[dim].isunlimited() for dim in dims]
+        unlim=[i for i, x in enumerate(look_for_unlim) if x]
+        unlim_dim=dims[unlim[0]]
+        fileparts['time_range'] = str(d[unlim_dim][0])+'-'+str(d[unlim_dim][-1])


I tried running this code section using one of the WRF files: /glade/collections/cdg/work/cordex/raw/wrf-era-25/1979/2D/wrfout_d01_1979-03-28_00:00:00, but it fails:

Do you know what's happening?

It would be nice to have a set of files from different models that we could use to set up a testing framework for this. I am happy to help set this up.

I have not tested this on WRF files, but from what I remember, I think this isn't working because Time is just a coordinate and not a variable. So I'll need another way to get the time range for WRF files.

Thanks @andersy005 for looking at this draft. That would be great if we can create a set of files we can test with.

@sherimickelson When you say "Time is just a coordinate," are you saying "Time is just a dimension"?

I ask because in xarray (and according to the CF conventions) a "coordinate" is a type of "variable." And a "dimension" is just a name with a size.

@kmpaul, yes, I'm sorry. It is just a dimension.

andersy005 · 2020-04-21T23:22:13Z

builders/tslice.py

+        d = nc.Dataset(filepath,'r')
+        # find what the time (unlimited) dimension is
+        dims = list(dict(d.dimensions).keys())
+        look_for_unlim=[d.dimensions[dim].isunlimited() for dim in dims]
+        unlim=[i for i, x in enumerate(look_for_unlim) if x]
+        unlim_dim=dims[unlim[0]]


This is going to fail when unlim is an empty list:

@andersy005 good catch. I'll look at catching this.
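One way to guard against the empty list, sketched here with a stand-in class so that it runs without a netCDF file. `find_unlimited_dim` is a hypothetical helper; with netCDF4 the `dimensions` argument would be `d.dimensions`, whose values provide `isunlimited()`.

```python
def find_unlimited_dim(dimensions):
    """Return the name of the first unlimited dimension, or None if there is none."""
    for name, dim in dimensions.items():
        if dim.isunlimited():
            return name
    return None


class FakeDim:
    """Stand-in for netCDF4.Dimension, used only for this illustration."""

    def __init__(self, unlimited):
        self._unlimited = unlimited

    def isunlimited(self):
        return self._unlimited


with_time = {"lat": FakeDim(False), "time": FakeDim(True)}
no_unlimited = {"lat": FakeDim(False), "lon": FakeDim(False)}

found = find_unlimited_dim(with_time)
missing = find_unlimited_dim(no_unlimited)
```

A caller could then skip the time_range entry, or look for it some other way, whenever the result is None.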

@andersy005 I'm using this code to include only time variant fields in the variable list. Should it include all variables instead of limiting what is included?

> Should it include all variables instead of limiting what is included?

Thinking about this a bit, I don't really know whether we should exclude time invariant fields or not. I am going to let others chime in. Cc @kmpaul

If we include only time variant fields in the variable, is there any guarantee that the unlimited dimensions are always specified in the netCDF file?

Usually, time-invariant fields are "pseudo-coordinates"...or metadata fields. But I don't think we can guarantee that is true for all data. Maybe we should conform to xarray's standard practice, which I think is to include everything in the data variables unless it can be unambiguously identified as a "coordinate".

I don't think you can guarantee that "unlimited" dimensions will even exist. My understanding is that some data, for reasons I cannot remember right now, explicitly removes the "unlimited" dimension from all files.

I'll implement @kmpaul 's suggestion.
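A rough sketch of that rule: treat a variable as a coordinate only when its name matches a dimension name, and record everything else. This is a simplification of what xarray does, and `data_variable_names` is a hypothetical helper; with netCDF4 the two arguments would come from `d.variables` and `d.dimensions`.

```python
def data_variable_names(variable_names, dimension_names):
    """Return the variables that are not dimension coordinates, in file order."""
    dims = set(dimension_names)
    return [name for name in variable_names if name not in dims]


# Names as they might appear in a file; T2 and HGT are illustrative.
variables = ["time", "lat", "lon", "T2", "HGT"]
dimensions = ["time", "lat", "lon"]

selected = data_variable_names(variables, dimensions)
```

Under this rule a time-invariant field such as HGT stays in the variable list, and the result does not depend on the file having an unlimited dimension.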

mnlevy1981 · 2020-04-30T20:46:23Z

@sherimickelson and I just chatted, and she asked if I could share one of my existing YAML templates and the corresponding csv.gz file. Links are to versions of the file available on github:

/glade/work/mlevy/codes/cesm2-marbl/notebooks/intake-esm-collection-defs/glade-cesm2-cmip6-collection.yaml

was used to produce

/glade/work/mlevy/intake-esm-collection/csv.gz/campaign-cesm2-cmip6-timeseries.csv.gz

The process there was to use a legacy version of intake-esm to generate a netCDF file, and then

/glade/work/mlevy/codes/cesm2-marbl/notebooks/build intake collections.ipynb (it's possible that the cesm2-cmip6 portion of that notebook has been commented out on disk in the time that has passed since generating the files, but I linked an older version)

Some comments from our chat that I think are worth putting in writing here:

- Overall, I think this sort of generic interface is exactly what we need to be able to generate catalogs in a consistent manner
- It would be useful to have multiple ensemble members be able to share glob_string. If some output is only available for some members (such as ocean BGC output in the large ensemble), intake-esm should use missing values for the members without the data so averaging and things will still work
- I'd like a way to add columns to the csv file. E.g. with ensembles, it's useful to track provenance of each run -- knowing what run it split off from, in what year, etc etc

Below is the block defining the SSP5-8.5 ensemble generated by CESM. We include case for each ensemble member, which lets us know what to look for in urlpath (rather than repeating urlpath for each member), and we can see that 001 spun off from historical run 010 while 002 spun off from 011 -- this is very useful for making plots like cell 10 of forcing_iron_flux.ipynb:

SSP5-8.5:
  locations:
    - name: glade
      loc_type: posix
      direct_access: True
      urlpath: /glade/campaign/collections/cmip/CMIP6/timeseries-cmip6
      exclude_dirs: ['*.nc_temp_.nc']
  extra_attributes:
    component_attrs:
      ocn:
        grid: POP_gx1v7
    case_members:
      - case: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.001
        ctrl_member_id: 10
        ctrl_experiment: historical
        ctrl_branch_year: 2015
      - case: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.002
        ctrl_member_id: 11
        ctrl_experiment: historical
        ctrl_branch_year: 2015

sherimickelson · 2020-05-04T21:58:25Z

Look into adding a YAML validator. Possible choices:
https://github.com/23andMe/Yamale
https://pykwalify.readthedocs.io/en/master/
Others?

kmpaul

This is looking really nice. Is there a test for this?

kmpaul · 2020-05-14T15:23:49Z

builders/tslice.py

                filelist = get_asset_list(stream_info['glob_string'], depth=0)
+                stream_info.pop('glob_string')


Make use of the pop call:

Suggested change

-                filelist = get_asset_list(stream_info['glob_string'], depth=0)
-                stream_info.pop('glob_string')
+                glob_string = stream_info.pop('glob_string')
+                filelist = get_asset_list(glob_string, depth=0)

There are no tests for this yet. I just wanted to push my latest working version.

Thanks for the review @kmpaul

Thanks for catching this. This was left over from an earlier version. Co-authored-by: Kevin Paul <[email protected]>

kmpaul

Missed something.

sherimickelson · 2020-05-20T23:15:38Z

Example yaml file that contains the nested globs:

SSP5-8.5:
    ctrl_experiment: historical
    ensemble:
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.101/*/*/*/*/*.nc
      experiment_name: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.101
      member_id: '001'
      ctrl_member_id: 10
      ctrl_branch_year: 2015
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.102/*/*/*/*/*.nc
      experiment_name: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.102
      member_id: '002'
      ctrl_member_id: 11
      ctrl_branch_year: 2015
    data_sources:
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/atm/proc/tseries/month_1/*.cam.h0.*
      model_name: cam
      time_freq: month_1
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/lnd/proc/tseries/month_1/*clm2.h0.*
      model_name: clm
      time_freq: month_1
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/ocn/proc/tseries/month_1/*pop.h.*
      model_name: pop
      time_freq: month_1
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/ice/proc/tseries/month_1/*cice.h.*
      model_name: cice
      time_freq: month_1

This format requires the key ensemble if an ensemble exists, and it always requires the key data_sources. Both keys expect a list of dictionaries, and each dictionary must contain the key glob_string.
Beyond these requirements, users can add whatever key/value pairs they would like. Each pair is assigned to everything lower in the data structure. For example, the pair ctrl_experiment: historical in the above yaml file is assigned to all rows under SSP5-8.5.
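A sketch of how the nested globs could be combined: each file picks up the attributes of the ensemble entry and the data_sources entry whose globs match it. It uses `fnmatch` on path strings so that it runs without touching the file system; note that `fnmatch` lets `*` match `/`, unlike shell globbing. `attributes_for` is a hypothetical helper rather than the code in builders/tslice.py, and the short paths are illustrative.

```python
from fnmatch import fnmatch


def attributes_for(path, experiment):
    """Collect the key/value pairs that apply to one file path."""
    # Top-level keys (other than the two lists) apply to every row.
    attrs = {k: v for k, v in experiment.items()
             if k not in ("ensemble", "data_sources")}
    for group in ("ensemble", "data_sources"):
        for entry in experiment.get(group, []):
            if fnmatch(path, entry["glob_string"]):
                attrs.update({k: v for k, v in entry.items() if k != "glob_string"})
                break  # first matching entry in each group wins (an assumption)
    return attrs


experiment = {
    "ctrl_experiment": "historical",
    "ensemble": [
        {"glob_string": "/data/case.101/*", "member_id": "001", "ctrl_member_id": 10},
        {"glob_string": "/data/case.102/*", "member_id": "002", "ctrl_member_id": 11},
    ],
    "data_sources": [
        {"glob_string": "/data/case.*/ocn/*pop.h.*", "model_name": "pop"},
    ],
}

row = attributes_for("/data/case.102/ocn/case.102.pop.h.2015-01.nc", experiment)
```

Here the file belongs to the second ensemble member, so `row` carries that member's provenance keys alongside the shared data-source keys.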

sherimickelson · 2020-05-28T16:45:07Z

Created a schema from the yaml interface that validates with Yamale.
It validates on the command line

yamale -s generic_schema.yaml my_catalog.yaml

Example yaml:

experiment:
    ctrl_experiment: piControl
    ensemble:
    - glob_string: /home/user/CESM_DATA/historical.001/*/*/*.nc
      experiment_name: historical.001 
      member_id: '001'
      ctrl_branch_year: 631
    - glob_string: /home/user/CESM_DATA/historical.002/*/*/*.nc
      experiment_name: historical.002
      member_id: '002'
      ctrl_branch_year: 661
    data_sources:
    - glob_string: /home/user/CESM_DATA/historical.*/atm/hist/*.cam.h0.*
      model_name: cam
      time_freq: month_1
    - glob_string: /home/user/CESM_DATA/historical.*/lnd/hist/*clm2.h0.*
      model_name: clm
      time_freq: month_1
    - glob_string: /home/user/CESM_DATA/historical.*/ocn/hist/*pop.h.*
      model_name: pop
      time_freq: month_1
    - glob_string: /home/user/CESM_DATA/historical.*/ice/hist/*cice.h.*
      model_name: cice
      time_freq: month_1

experiment:
    experiment_name: mountain_wave
    member_id: '001'
    data_sources:
    - glob_string: /home/user/MPAS_DATA/mountain_wave/*
      model_name: mpas
      time_freq: model_step

experiment:
    experiment_name: wrf_test
    member_id: '001'
    data_sources:
    - glob_string: /home/user/WRF_DATA/wrf-era-25/1979/*/wrfout_*
      model_name: wrf
      time_freq: model_step

Validates with this schema:

experiment: 
    ensemble: list(include('ensembleL'), required=False)
    data_sources: list(include('data_sourcesL'))
---
ensembleL:
    glob_string: str()
data_sourcesL:
    glob_string: str()

The validation step will be added to the Python code in builders/tslice.py in a future version.

sherimickelson · 2020-05-29T16:34:51Z

While adding the schema validation into the code, a couple of issues showed up. This required a modification to the example yaml as well as the schema. The new versions follow:

Example yaml that defines what should be in the catalog

catalog:
- experiment:
  ctrl_experiment: piControl
  ensemble:
  - glob_string: /home/user/CESM_DATA/historical.001/*/*/*.nc
    experiment_name: historical.001 
    member_id: '001'
    ctrl_branch_year: 631
  - glob_string: /home/user/CESM_DATA/historical.002/*/*/*.nc
    experiment_name: historical.002
    member_id: '002'
    ctrl_branch_year: 661
  data_sources:
  - glob_string: /home/user/CESM_DATA/historical.*/atm/hist/*.cam.h0.*
    model_name: cam
    time_freq: month_1
  - glob_string: /home/user/CESM_DATA/historical.*/lnd/hist/*.clm2.h0.*
    model_name: clm
    time_freq: month_1
  - glob_string: /home/user/CESM_DATA/historical.*/ocn/hist/*.pop.h.*
    model_name: pop
    time_freq: month_1
  - glob_string: /home/user/CESM_DATA/historical.*/ice/hist/*.cice.h.*
    model_name: cice
    time_freq: month_1

- experiment:
  experiment_name: mountain_wave
  member_id: '001'
  data_sources:
  - glob_string: /home/user/MPAS_DATA/mountain_wave/*
    model_name: mpas
    time_freq: model_step

- experiment:
  experiment_name: wrf_test
  member_id: '001'
  data_sources:
  - glob_string: /home/user/WRF_DATA/wrf-era-25/1979/*/wrfout_*
    model_name: wrf
    time_freq: model_step

The new schema used to validate

catalog: list(include('experiment'))
---
experiment:
    ensemble: list(include('ensembleL'), required=False)
    data_sources: list(include('data_sourcesL'))
ensembleL:
    glob_string: str()
data_sourcesL:
    glob_string: str()

sherimickelson · 2020-05-29T16:37:01Z

The code can now validate the input yaml against the schema with Yamale, or internally if Yamale is not available. It checks whether Yamale imports successfully; if the import fails, it performs the validation internally.
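The import-with-fallback pattern could look roughly like the sketch below. The function names `validate_internal` and `validate_catalog` are hypothetical, the internal checks only mirror the schema shown above (a catalog list, a required data_sources list, an optional ensemble list, and a glob_string string in every entry), and the fallback assumes PyYAML is installed for reading the file.

```python
try:
    import yamale  # optional dependency
except ImportError:
    yamale = None


def validate_internal(doc):
    """Return a list of error strings; an empty list means the document is valid."""
    catalog = doc.get("catalog") if isinstance(doc, dict) else None
    if not isinstance(catalog, list):
        return ["catalog: expected a list of experiments"]
    errors = []
    for i, experiment in enumerate(catalog):
        if not isinstance(experiment, dict):
            errors.append(f"catalog[{i}]: expected a mapping")
            continue
        for group, required in (("ensemble", False), ("data_sources", True)):
            entries = experiment.get(group)
            if entries is None:
                if required:
                    errors.append(f"catalog[{i}].{group}: required key is missing")
                continue
            if not isinstance(entries, list):
                errors.append(f"catalog[{i}].{group}: expected a list")
                continue
            for j, entry in enumerate(entries):
                if not isinstance(entry, dict) or not isinstance(entry.get("glob_string"), str):
                    errors.append(f"catalog[{i}].{group}[{j}].glob_string: required string is missing")
    return errors


def validate_catalog(yaml_path, schema_path):
    """Validate with Yamale when it is importable, otherwise with the internal checks."""
    if yamale is not None:
        schema = yamale.make_schema(schema_path)
        data = yamale.make_data(yaml_path)
        # strict=False so that extra user-defined keys are tolerated (an assumption
        # about the intended behaviour); validate raises if the data is invalid.
        yamale.validate(schema, data, strict=False)
        return
    import yaml  # PyYAML, assumed to be available when Yamale is not

    with open(yaml_path) as f:
        doc = yaml.safe_load(f)
    errors = validate_internal(doc)
    if errors:
        raise ValueError("\n".join(errors))


# The internal path can be exercised directly on already-parsed data.
good = {"catalog": [{"experiment": None, "experiment_name": "wrf_test",
                     "data_sources": [{"glob_string": "/home/user/WRF_DATA/*"}]}]}
bad = {"catalog": [{"experiment": None, "ensemble": [{"member_id": "001"}]}]}

good_errors = validate_internal(good)
bad_errors = validate_internal(bad)
```

In this sketch `bad` fails twice: its ensemble entry has no glob_string, and data_sources is missing altogether.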

sherimickelson · 2020-06-02T22:51:06Z

Based on discussions I've added the ability to use netcdf file/variable attributes within the yaml file. This is related to #64

The yaml syntax to use this feature is:

  - glob_string: /home/user/CESM_DATA/historical.*/ocn/hist/*.pop.h.*
    model_name: pop
    time_freq: <time_period_freq>
    long_name: <<long_name>>
    units: <<units>>

<var> means: use the value of the global attribute 'var'
<<var>> means: use the value of the variable attribute 'var'
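A minimal sketch of how such placeholders could be resolved. `resolve_value` is a hypothetical helper, plain dictionaries stand in for the attributes read from the file (with netCDF4 they could be built from `ncattrs()` and `getncattr()` on the dataset and on each variable), and returning None for a missing attribute is an assumption.

```python
import re

# <<name>> refers to a variable attribute, <name> to a global attribute.
VARIABLE_ATTR = re.compile(r"^<<(\w+)>>$")
GLOBAL_ATTR = re.compile(r"^<(\w+)>$")


def resolve_value(value, global_attrs, variable_attrs):
    """Replace a placeholder with the matching attribute; return other values unchanged."""
    if not isinstance(value, str):
        return value
    match = VARIABLE_ATTR.match(value)
    if match:
        return variable_attrs.get(match.group(1))
    match = GLOBAL_ATTR.match(value)
    if match:
        return global_attrs.get(match.group(1))
    return value


# Illustrative attribute values, not taken from a real file.
global_attrs = {"time_period_freq": "month_1"}
variable_attrs = {"long_name": "Potential Temperature", "units": "degC"}

entry = {
    "model_name": "pop",
    "time_freq": "<time_period_freq>",
    "long_name": "<<long_name>>",
    "units": "<<units>>",
}

resolved = {k: resolve_value(v, global_attrs, variable_attrs) for k, v in entry.items()}
```

Literal values such as model_name pass through untouched, so the same yaml entry can mix fixed strings with values pulled from each file.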

andersy005 · 2020-09-23T18:37:52Z

Closing this as it appears to have been addressed in ncar-xdev/ecgtools#5

sherimickelson added 4 commits April 1, 2020 16:37

Modeled off of the other scripts. This version opens the netcdf files in a path and records all variables in that file.

325c3ef

Swtich to using netcdf4 instead of xarray and more features.

29ae2af

Modifications to work with yaml input.

0228881

Add error handling and more comments

8f551c6

kmpaul requested a review from a team April 20, 2020 20:24

andersy005 reviewed Apr 21, 2020

andersy005 reviewed Apr 21, 2020

sherimickelson added 2 commits April 22, 2020 14:38

add all vars that are not coords and catch for files w no unlimited dim

be398ff

Merge remote-tracking branch 'upstream/master' into generic-interface

89cebcf

Added code to start assigning attributes based on globs within yaml file.

a8d2a80

kmpaul suggested changes May 14, 2020

sherimickelson and others added 5 commits May 14, 2020 11:15

Update builders/tslice.py

d700fa7

Thanks for catching this. This was left over from an earlier version. Co-authored-by: Kevin Paul <[email protected]>

Update builders/tslice.py

a1c3f27

Co-authored-by: Kevin Paul <[email protected]>

Update builders/tslice.py

d5f26f0

Co-authored-by: Kevin Paul <[email protected]>

Update builders/tslice.py

5796e60

Co-authored-by: Kevin Paul <[email protected]>

Update builders/tslice.py

7fe8763

Co-authored-by: Kevin Paul <[email protected]>

kmpaul suggested changes May 14, 2020

Update builders/tslice.py

69c7026

Co-authored-by: Kevin Paul <[email protected]>

Add schema for validation. Minor fixes to tslice.py

77e7f5c

sherimickelson added 2 commits May 29, 2020 08:46

Added yaml schema checking into code. Required mods to the schema committed yesterday.

a2e2d52

Add to the internal validation so that it matches and checks the same criteria that exists within the schema

7200087

Added in ability to pull in netcdf attribute information.

d743c99

This was referenced Jun 3, 2020

Add core functionality for parsing attributes from an open file ncar-xdev/ecgtools#3

Merged

Gen cesm catalog NCAR/cesm-catalog#5

Draft

andersy005 closed this Sep 23, 2020
