support for cpl.hi files #46

Closed
klindsay28 opened this issue Jun 8, 2021 · 9 comments

Comments

@klindsay28

I'm inspecting high-frequency output from a CESM experiment that @mvertens ran.
There is output every timestep from different components of CESM.
I decided to try ecgtools and intake-esm to get multi-file Datasets for each component.
I'm getting tripped up on cpl.hi files.

The experiment was run without short-term archiving. So I soft-linked the history files to another directory, and ran Builder on that directory. (This was to avoid dealing with non-history files in the run directory.)

The directory where I soft-linked the history files is
/glade/scratch/klindsay/SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03/hist
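For anyone reproducing this setup, the soft-linking step could be scripted roughly as follows. This is a minimal sketch, not what was actually run: the function name `link_history_files` and the default glob pattern `*.h*.nc` are assumptions chosen to match names such as `<case>.cam.h0.<date>.nc` and `<case>.cpl.hi.<date>.nc`, and the pattern may need adjusting for other runs.

```python
from pathlib import Path


def link_history_files(run_dir, hist_dir, patterns=("*.h*.nc",)):
    """Symlink files in run_dir that match any glob pattern into hist_dir.

    The default pattern is an assumption about history file naming
    (e.g. <case>.cam.h0.<date>.nc); pass your own patterns if it is too
    broad or too narrow for your run directory.
    """
    run_dir, hist_dir = Path(run_dir), Path(hist_dir)
    hist_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for pattern in patterns:
        for src in sorted(run_dir.glob(pattern)):
            dst = hist_dir / src.name
            # Skip names that were already linked by an earlier pattern or run.
            if not dst.exists() and not dst.is_symlink():
                dst.symlink_to(src.resolve())
                linked.append(dst)
    return linked
```

Calling `link_history_files(run_directory, hist_directory)` returns the list of links it created, which makes it easy to check that restart and pointer files were left out.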

The command I'm using for Builder is

from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_history

CASE = "SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03"
USER = "klindsay"
DIR = f"/glade/scratch/{USER}/{CASE}/hist"
b = Builder(
    # Directory with the output
    DIR,
    # Depth of 1 since all of the history files are in a flat directory
    depth=1,
    # Use the parse_cesm_history parsing function
    parsing_func=parse_cesm_history,
)
print(b)
b = b.build()
print(b)

I'm getting invalid assets for the cpl.hi files. An example traceback for the cpl.hi files is

Traceback (most recent call last):
  File "/glade/work/klindsay/analysis/notebooks/ecgtools/ecgtools/parsers/cesm.py", line 143, in parse_cesm_history
    if info['stream'] is None:
KeyError: 'stream'

What needs to be done to get support for these files into ecgtools? I'm guessing that it's easy for someone that understands the code in ecgtools, but I don't fall into that category.

@mgrover1
Contributor

mgrover1 commented Jun 9, 2021

@klindsay28 we would need to add that to the list of streams to look for - which can be found in the following lines
https://github.com/NCAR/ecgtools/blob/main/ecgtools/parsers/cesm.py#L12-L50

I can go ahead and submit a PR for this. The process would be to fork this repository, and add another dictionary entry in the following format:
'cpl.hi': {'component': 'some_component', 'frequency': 'some_frequency'},

So the main questions to answer are:

  • Which component is cpl (e.g., lnd, atm, ocn)?
  • What is the frequency?

Taking a look at the dataset, it looks like it is missing the time_period_freq attribute which is typically in CESM2 history files - is this an instantaneous file?

@mgrover1
Contributor

mgrover1 commented Jun 9, 2021

Also, I noticed that the pop.h stream in this case has a time frequency of step_1. What does this mean? Is this something that will be in future releases of CESM?

@mnlevy1981

@mgrover1 I'll try to hit all of the questions in one go:

> Which component is cpl?

It's the flux coupler -- components don't pass fluxes directly to each other, instead they are passed to the coupler which then passes them on to the correct component (the coupler is also responsible for mapping from one grid to another if necessary).

> What is the frequency?

From the time stamps, it looks like it is being written every 1800 seconds:

$ ls *.cpl.*
SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03.cpl.hi.0001-01-01-01800.nc
SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03.cpl.hi.0001-01-01-03600.nc
SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03.cpl.hi.0001-01-01-05400.nc
SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03.cpl.hi.0001-01-01-07200.nc
SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03.cpl.hi.0001-01-01-09000.nc
SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03.cpl.hi.0001-01-01-10800.nc

This isn't an optimal way to determine the frequency, since it's possible to put multiple time slices in a single file (e.g., the pop.h.nday1 file typically has daily averages but combines 28-31 of them into a single monthly file). It works here because the cpl.hi files have a single time level.
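As a rough illustration of reading the write interval off the file names, here is a minimal sketch. It assumes the trailing time stamp has the form YYYY-MM-DD-SSSSS (seconds of day), as in the listing above, and that each file holds a single time level; the helper name `seconds_between_files` and the shortened case name are made up for the example.

```python
import re

# Assumed time stamp at the end of a history file name: YYYY-MM-DD-SSSSS
_STAMP = re.compile(r"\.(\d{4})-(\d{2})-(\d{2})-(\d{5})\.nc$")


def seconds_between_files(filenames):
    """Return the sorted distinct gaps (seconds) between consecutive same-day files."""
    stamps = []
    for name in filenames:
        match = _STAMP.search(name)
        if match is None:
            raise ValueError(f"no YYYY-MM-DD-SSSSS time stamp in {name!r}")
        year, month, day, sec = match.groups()
        stamps.append(((year, month, day), int(sec)))
    stamps.sort()
    gaps = {
        later_sec - earlier_sec
        for (earlier_day, earlier_sec), (later_day, later_sec) in zip(stamps, stamps[1:])
        if earlier_day == later_day  # only compare files written on the same model day
    }
    return sorted(gaps)


# Six files written 1800 s apart, mirroring the listing above with a short case name.
names = [f"case.cpl.hi.0001-01-01-{sec:05d}.nc" for sec in range(1800, 10801, 1800)]
print(seconds_between_files(names))  # prints [1800]
```

A single value in the result suggests a constant write interval; if files contain several time slices, the gaps describe file spacing rather than output frequency.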

> Taking a look at the dataset, it looks like it is missing the time_period_freq attribute which is typically in CESM2 history files - is this an instantaneous file?

I'll let @klindsay28 verify, but I would assume so. A common reason to look at high-frequency output with coupler history files turned on is to ensure the mapping is done right and that the component models are receiving the fluxes correctly (if things look okay on the source grid in the coupler history file but not the destination grid then something happened in the mapping process; if it looks okay on the destination grid but not in the component that received the flux then something happened when unpacking). That's easiest to diagnose with instantaneous output. Also, I think the i in cpl.hi (as well as cam.i) stands for instantaneous.

> Also, I noticed that the pop.h stream in this case has a time frequency of step_1. What does this mean? Is this something that will be in future releases of CESM?

It means POP is writing output every time step. It's not common in production runs, but it's pretty typical when trying to track down a bug. I don't think ecgtools needs to do anything special with them, just let users search the catalog for it the same way they would pull out monthly, daily, or annual output.

@klindsay28
Author

Thanks @mnlevy1981.

@mgrover1, the entries in _STREAMS_DICT make me wonder if ecgtools supports the use case that I'm in.
Take for instance the line

    'cam.h0': {'component': 'atm', 'frequency': 'month_1'},

What is ecgtools doing with this frequency information?

In this experiment I'm looking at, the cam.h0 files are not monthly. The fields are written every timestep, which is every 30 minutes in this case. Is ecgtools assuming that cam.h0 files are always monthly?

The same question applies to clm2.h0 and pop.h files.

@mgrover1
Contributor

mgrover1 commented Jun 9, 2021

@klindsay28 the frequency information is used for CESM1 output - in CESM2, the frequency information is written in the attributes in the file. In your case, the frequency is read from those file attributes for all of the streams you listed:

  • cam.h0
  • clm2.h0
  • pop.h

@mnlevy1981

> the frequency information is used for CESM1 output - in CESM2, the frequency information is written in the attributes in the file

@mgrover1 it might be clearer if you renamed frequency? cesm1_assumed_frequency is kind of long, and default_frequency doesn't capture when non-default values are used... hmm, maybe fallback_frequency to denote it's what the tool falls back to if there is not a time_period_freq attribute?

@klindsay28
Author

Here's my understanding. Please correct me if I'm off.

If files have the time_period_freq as a file attribute, frequency is set to that.
Otherwise, ecgtools looks up frequency in _STREAMS_DICT.

In the cpl.hi files that I'm looking at, time_period_freq is not present. Additionally, there is no cpl.hi entry in _STREAMS_DICT. So ecgtools categorizes the files as invalid assets.
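The lookup order described above can be sketched in a few lines. This is an illustration of the described behavior only, not the ecgtools implementation: `resolve_frequency` and its arguments are hypothetical names, and the file attributes are modeled as a plain dict.

```python
def resolve_frequency(file_attrs, stream, streams_dict):
    """Pick a frequency for one history file, or raise if none can be found."""
    # 1. Prefer the attribute written by the model (present in CESM2 output).
    freq = file_attrs.get("time_period_freq")
    if freq is not None:
        return freq
    # 2. Otherwise fall back to the per-stream table (used for CESM1 output).
    if stream in streams_dict:
        return streams_dict[stream]["frequency"]
    # 3. No attribute and no table entry: the file cannot be cataloged.
    raise KeyError(f"no frequency known for stream {stream!r}")


streams = {"cam.h0": {"component": "atm", "frequency": "month_1"}}

print(resolve_frequency({"time_period_freq": "step_1"}, "cam.h0", streams))  # prints step_1
print(resolve_frequency({}, "cam.h0", streams))  # prints month_1
```

With these inputs, a cpl.hi file that has no time_period_freq attribute and no table entry reaches step 3, which corresponds to the invalid-asset case in this issue.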

For me to proceed, I need to edit parsers/cesm.py, adding a cpl.hi entry to _STREAMS_DICT, with frequency set to what it is in these experiments. This assumes I have an editable install of ecgtools in my environment. Additionally, if I looked at another experiment with a different frequency for cpl.hi, I would need to re-edit parsers/cesm.py. That seems like a pain.

What do you think of adding an optional dict argument to build that allows the user to specify additional entries for _STREAMS_DICT? I'm not sure how to make this play nice with other parsers.
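One way such an optional argument might behave is sketched below. The helper `merge_streams` is hypothetical, and the choice to override individual keys (rather than replace a whole entry) is an assumption made for the example; the built-in table is never modified.

```python
def merge_streams(default_streams, user_streams=None):
    """Return a new table in which user entries add to or override the defaults."""
    # Copy each entry so the defaults are left untouched.
    merged = {stream: dict(info) for stream, info in default_streams.items()}
    for stream, info in (user_streams or {}).items():
        # New streams get a fresh entry; existing ones are updated key by key.
        merged.setdefault(stream, {}).update(info)
    return merged


defaults = {"cam.h0": {"component": "atm", "frequency": "month_1"}}
user = {
    "cpl.hi": {"component": "coupler", "frequency": "instantaneous"},
    "cam.h0": {"frequency": "step_1"},
}
merged = merge_streams(defaults, user)
print(merged["cam.h0"])  # prints {'component': 'atm', 'frequency': 'step_1'}
```

Because the merge happens per call, a different experiment can supply a different cpl.hi frequency without editing the installed package.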

@mgrover1
Contributor

mgrover1 commented Jun 10, 2021

@klindsay28 that process is correct - I submitted a PR changing the API so that you can pass a YAML file to the parser, which makes it easy to add additional streams or individually change the default_frequency for CESM1.

Thanks for taking the time to test this out - this feedback is helpful!

@mgrover1
Contributor

@klindsay28 the recent PR #50 fixes this problem! We added an update to the documentation, with the new method of adding new streams being the following:

from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_history

# Create a dictionary with stream information
new_streams = {'cpl.hi': {'component': 'coupler', 'frequency': 'instantaneous'}}

CASE = "SMS_Vmct_Ln6.f19_g17.1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_SGLC_WW3_BGC%BPRP.cheyenne_intel.validate_beta03"
USER = "klindsay"
DIR = f"/glade/scratch/{USER}/{CASE}/hist"

# Set up your builder
b = Builder(
    # Directory with the output
    DIR,
    # Depth of 1 since all of the history files are in a flat directory
    depth=1,
)

# Build the catalog, passing the parser along with the kwarg that lets you enter new or updated streams
b = b.build(
    # Use the parse_cesm_history parsing function
    parse_cesm_history,
    parsing_func_kwargs={'user_streams_dict': new_streams},
)


3 participants