Update API for CaseClass #30

mnlevy1981 · 2020-09-25T20:53:56Z

The goal for this PR is to expand the API of CaseClass to take advantage of the fact that time-series data is now available on campaign. It would be great if this included support for intake-esm catalogs, but the first pass might not be that sophisticated.

I hope the first commit (361cd79) contains all the actual API changes; I think I'll be able to add the rest of the required functionality deeper in CaseClass.

This first commit is done in anticipation of supporting time series files in addition to history files, and also using intake-esm catalogs behind the scenes. 1. start_date & end_date are not part of the class constructor -- basically all the __init__ routine should do is set up a catalog -- _find_hist_files() also doesn't rely on these variables either; this routine is essentially building the catalog, and we don't restrict the files listed until we are ready to call open_mfdataset() 2. _open_history_files() -> gen_dataset(); this routine is where start_date and end_date are specified, and it returns a dataset rather than updating a class member variable. Logic to apply start_date and end_date at this stage is much simpler than doing it in _find_hist_files() (or I figured out a much easier way to do it). This routine also now expects a list of variable names to include in the dataset. 3. compare_fields_at_lat_lon expects array of DataArrays, not array of cases 4. Added get_varnames_from_metadata_list() to utils/utils.py because I need this function to generate the list of variable names for gen_dataset() in several notebooks. I also created a script that submits jobs to the slurm queue to re-run notebooks, though it does not work for the trend_maps or plot_suite_00[34] notebooks (I think because of the reliance on dask and NCAR_jobqueue).

mnlevy1981 · 2020-09-25T20:55:46Z

Side note: I don't know why github thinks the plots in the notebook have changed. I suspect it's minor version differences in packages like matplotlib between my environment and Keith's? Now that my changes have been committed, I'll rerun some of the notebooks on master and verify that I see the same changes in the images.

gen_dataset() first reads time series from campaign, then reads any additional data from history files (from archive or run directory). There's an API change where, instead of expecting start_date and end_date (strings formatted as 'YYYY-MM'), the function wants integers start_year and end_year. Figuring out better default values will be necessary to extend this to other CESM runs but it's a good start. I split plot_suite_004.ipynb into 4-year segments, as it was choking on 8 years of output. I also added plot_suite_map notebooks for 004 for years 0006 and 0007 (we're one month shy of 0008).

mnlevy1981 · 2020-10-06T17:12:03Z

As of 4c43880 there's a simple interface for reading time series data off of /glade/campaign and what I think is a useful message when merging a dataset from time series files with one from history files:

Time series ends at 0007-01-01 00:00:00, history files begin at 0007-01-01 00:00:00

I ended up splitting plot_suite_004.ipynb into four year chunks because the code was stalling trying to process almost 8 years of data at once; I should probably add 0001-0004 to the file name (and to plot_suite_003.ipynb) to be consistent with the new plot_suite_0005-0008_004.ipynb file.

I wish it was clearer why git is claiming there are diffs in some of the plots, but a quick eyeball check didn't pick up on differences (outside of the change to just 48 months of data in plot_suite_004.ipynb). I'll talk about this PR at the software meeting this afternoon and schedule a more rigorous review in the next day or two.

I had left the "import yaml" statement in from when I toyed with the idea of passing the metadata yaml file to __init__

Concatenating time series dataset with history file dataset now works as expected

Also reran plot_suite notebooks in an environment that will match previous master (note that this changes plots in Sanity Check). Next commit will be updated trend_maps, and the plan is to squash-merge this PR to remove the commits that changed the plots and then changed them back.

Re-ran in an environment that matches the previous master to minimize diffs in the PR.

I didn't realize this notebook hadn't been run in previous commits

mnlevy1981 · 2020-10-08T18:23:01Z

@klindsay28 and I compared notes, and I thought we came up with an environment that would let me recreate the plots on master identically... but it seems to have worked for some notebooks and not others? The following shows how many times image appears in the git diff between master and the version of the notebook on this branch, split between - and +:

plot_suite_maps_0001_003.ipynb (480 images)
----
-: 0
+: 5
plot_suite_maps_0002_003.ipynb (480 images)
----
-: 0
+: 2
plot_suite_maps_0003_003.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0004_003.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0001_004.ipynb (480 images)
----
-: 252
+: 252
plot_suite_maps_0002_004.ipynb (480 images)
----
-: 96
+: 96
plot_suite_maps_0003_004.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0004_004.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0005_004.ipynb (480 images)
----
-: 470
+: 480
plot_suite_003.ipynb (184 images)
----
-: 0
+: 0
trend_maps.003.ipynb (60 images)
----
-: 0
+: 0
plot_suite_004.ipynb (224 images)
----
-: 0
+: 0
trend_maps.004.ipynb (60 images)
----
-: 0
+: 0

From this, it looks like the plots in the plot_suite_CASE and trend_maps notebooks were identical (hooray!). Notebooks like plot_suite_maps_0001_003.ipynb and plot_suite_maps_0002_003.ipynb) were missing a couple of images on master (perhaps they didn't finish running before the last commit?), but the images in common are identical. But then notebooks like plot_suite_maps_0001_004.ipynb and plot_suite_maps_0002_004.ipynb had some images match, but then many that don't. And the rest of the notebooks seem to have changed every image.

I'm at a loss to explain this behavior, since all the plot_suite_maps files were updated in the same commit (ca8458a). The seemingly obvious answer is that something happened in the time series generation process and the fields are no longer the same, but I confirmed that is not the case in three ways. First, the trend_maps and plot_suite_CASE notebooks are using time series data. Also, I directly compared one month of one of the plotted variables the time series values and the history values:

>>> import xarray as xr ; import numpy as np
>>> casename = 'g.e22.G1850ECO_JRA_HR.TL319_t13.004'
>>> year = '0003'
>>> varname = 'spChl'
>>> ds_hist = xr.open_dataset(f'/glade/scratch/mlevy/archive/{casename}/ocn/hist/{casename}.pop.h.{year}-01.nc')
>>> ds_ts = xr.open_dataset(f'/glade/campaign/cesm/development/bgcwg/projects/hi-res_JRA/cases/{casename}/output/ocn/proc/tseries/month_1/{casename}.pop.h.{varname}.{year}01-{year}12.nc')
>>> da_hist = ds_hist[varname].isel(time=0)
>>> da_ts = ds_ts[varname].isel(time=0)
>>> da_diff = da_hist - da_ts
>>> (da_diff.min().data, da_diff.max().data) # min and max both 0 if values match
(array(0.), array(0.))
>>> np.sum(np.isnan(da_diff) != np.isnan(da_ts)).data # if masks are identical, this will be 0
array(0)

Lastly, I re-ran the notebook, forcing it to use history output with the line case._timeseries_filenames[stream] = []:

plot_suite_maps_0003_004.ipynb (480 images)
----
-: 480
+: 480

To add to the strangeness, it looks like running with history files instead of time series modified 96 plots in my local sandbox (including January spChl that I included in the example above):

$ git diff | grep image | grep ^- | wc -l
96
$ git diff | grep image | grep ^+ | wc -l
96

So 384 plots are identical when my branch reads them from either time series or history files, but all 384 are different than what is on master. For the other 96 plots, there are differences between time series, history, AND master.

I think the path forward might involve

accepting these changes from master even if we can't explain them, and
adding a new notebook that verifies the time series fields match the history file output exactly via checks like what I included above.

mnlevy1981 · 2020-10-08T18:31:25Z

For plot_suite_maps_0003_004.ipynb it looks like I copy / pasted start_year=2, end_year=2 rather than getting year three data. Something similar is likely the case for the other notebooks that are showing every single plot as different (I'll double check after leaving this comment), though it doesn't explain the notebooks where only some of the plots are different.

dcherian · 2020-10-08T18:49:07Z

Are the images visibly different? https://app.reviewnb.com/ could be useful in this repo...

(I know that matplotlib commonly runs in to tiny pixel-level differences the font was rendered very slightly different.)

mnlevy1981 · 2020-10-08T18:57:27Z

Are the images visibly different? https://app.reviewnb.com/ could be useful in this repo...

When I first started making these notebooks, the plots looked the same in the eyeball norm but git claimed they were different. Then I tried to recreate the environment that @klindsay28 used and in the first few notebooks git stopped seeing changes. Then I refactored and got some of the years wrong, so I'm currently re-running with one more refactor that makes it harder to recreate that copy / paste error :) I'll take a closer look at the cases like

plot_suite_maps_0002_004.ipynb (480 images)
----
-: 96
+: 96

After re-running all the notebooks and see how the images compare in the eyeball norm.

mnlevy1981 · 2020-10-08T23:21:38Z

Re-running the script from before, it looks like getting the year correct makes a pretty big difference:

plot_suite_maps_0001_003.ipynb (480 images)
----
-: 0
+: 5
plot_suite_maps_0002_003.ipynb (480 images)
----
-: 154
+: 156
plot_suite_maps_0003_003.ipynb (480 images)
----
-: 156
+: 156
plot_suite_maps_0004_003.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0001_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0002_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0003_004.ipynb (480 images)
----
-: 156
+: 156
plot_suite_maps_0004_004.ipynb (480 images)
----
-: 96
+: 96
plot_suite_maps_0005_004.ipynb (480 images)
----
-: 0
+: 10
plot_suite_003.ipynb (184 images)
----
-: 0
+: 0
trend_maps.003.ipynb (60 images)
----
-: 0
+: 0
plot_suite_004.ipynb (224 images)
----
-: 0
+: 0
trend_maps.004.ipynb (60 images)
----
-: 0
+: 0

It's curious to me that the notebooks showing differences in plots all seem to have 324 plots that match and 156 that don't (plot_suite_maps_0004_004.ipynb is an exception to this). If I'm reading the diffs right, it's the NH4 plots and all that follow (whereas for plot_suite_maps_0004_004.ipynb the first different plot is spChl and then there are some matches afterwards?). So I'll see if anything stands out in NH4 on either my branch or the current master.

In the meantime, I'll push notebooks that use da.indentical() to show that the history files and time series files do match (at least for the fields we are plotting)

Some of these were plotting the wrong year; I refactored all of them to be more clear about what is being plotted.

mnlevy1981 · 2020-10-08T23:43:27Z

Okay, I think the mystery might be solved... compare these two images; the first is a plot from master, the second is a plot from my branch:

(hint: look at the title) I have no idea what would cause these strings to flip position, and the weird part is that other plots from 3D fields (e.g. spChl) in the same notebook are the same and match the order of terms in NH4 on master:

So I'm not going to compare 562 plots that have changed, but I suspect they only vary in the order that time and depth appear in the title.

For now the comparisons are done based on variables in diag_metadata.yaml. Required a flag to suppress some CaseClass output.

mnlevy1981 · 2020-10-09T00:32:03Z

@dcherian -- any idea why _title_for_slice() would give different results on successive calls? Can I specify an order for da.coords at some point in the process or does it need to be determined automatically? I can recreate the issue because

import utils

casename = "g.e22.G1850ECO_JRA_HR.TL319_t13.003"
case = utils.CaseClass(casename)
ds = case.gen_dataset('NH4', 'pop.h', start_year=1, end_year=1)

one_dims=[]
for dim, coord in ds['NH4'].isel(z_t=0, time=0).coords.items():
    if coord.size == 1:
        one_dims.append(
            "{dim} = {v}".format(dim=dim, v=coord.values)
        )

print(one_dims)

sometimes prints['time = 0001-01-16 12:15:50', 'z_t = 500.0'] and other times prints ['z_t = 500.0', 'time = 0001-01-16 12:15:50']

All plot_suite_maps now match master

mnlevy1981 · 2020-10-09T06:38:42Z

Well, I updated xarray to 16.1 (from 16.0), and things seem to match master now:

plot_suite_maps_0001_003.ipynb (480 images)
----
-: 0
+: 5
plot_suite_maps_0002_003.ipynb (480 images)
----
-: 0
+: 2
plot_suite_maps_0003_003.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0004_003.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0001_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0002_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0003_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0004_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0005_004.ipynb (480 images)
----
-: 0
+: 10
plot_suite_003.ipynb (184 images)
----
-: 0
+: 0
trend_maps.003.ipynb (60 images)
----
-: 0
+: 0
plot_suite_004.ipynb (224 images)
----
-: 0
+: 0
trend_maps.004.ipynb (60 images)
----
-: 0
+: 0

I upgraded because I thought that pydata/xarray#4409 looked promising (although it's not clear to me when coords were getting shuffled, given I was using open_mfdataset to read a single time series file and not doing any merging or concatenating) and as of 53c1f9f the diffs of the .ipynb files are meaningful.

klindsay28 · 2020-10-09T12:14:40Z

@mnlevy1981 , statements of the form ds_subset = ds[varnamelist] can trigger the non-determinism described in pydata/xarray#4409. See my comment at link.

I think that you are doing this with code like xr.open_mfdataset(...)[[varname] + _vars_to_keep].

mnlevy1981 · 2020-10-09T14:39:39Z

@klindsay28 Thanks for pointing that out! I thought I had read the entire PR conversation, but I must have gone through one of the issue tickets addressed by the PR instead. I'm a little confused about why your plot titles seem to be consistent despite the comment you linked to - did you end up using pip to install xarray from github after this PR was merged? Maybe conda wasn't smart enough to realize that and dumped the old version when you ran conda export. Not a pressing matter, I think it's just the last step in figuring out how our environments differed.

dcherian · 2020-10-09T15:44:32Z

glad this is fixed.

* run_notebook.sh launches a single notebook via slurm, and run_all.sh is now run_all.py (python-script) that calls run_notebook.sh * _find_timeseries_files() now includes pop.h.nyear1 files * gen_dataset(): 1. no longer assume time_bound is the name of the variable (look at bounds property of ds["time"] 2. if bounds are not decoded, still update start_year before looking for history files (via decoding time, which may not be best solution) * use list.extend() instead of += operator

mnlevy1981 added 3 commits September 28, 2020 12:26

Downgrade nbconvert, remove some notebook metadata

390ecc2

Add list of time series files to CaseClass

5309e40

mnlevy1981 marked this pull request as ready for review October 6, 2020 17:06

mnlevy1981 added 5 commits October 6, 2020 11:28

CaseClass doesn't need yaml

3ee0d33

I had left the "import yaml" statement in from when I toyed with the idea of passing the metadata yaml file to __init__

Improve xr.concat call

b3d38bb

Concatenating time series dataset with history file dataset now works as expected

Re-run trend_maps

ebcdf70

Re-ran in an environment that matches the previous master to minimize diffs in the PR.

run plot_suite_maps_0001_004

f9657d8

I didn't realize this notebook hadn't been run in previous commits

Reran plot_suite_maps notebooks

3ae8225

Some of these were plotting the wrong year; I refactored all of them to be more clear about what is being plotted.

New: compare time series to history files

cf9c15a

For now the comparisons are done based on variables in diag_metadata.yaml. Required a flag to suppress some CaseClass output.

Re-run notebooks with xarray 0.16.1

53c1f9f

All plot_suite_maps now match master

mnlevy1981 added 3 commits October 9, 2020 14:23

Clean up run_notebooks.sh

c013454

A little more documentation in run_notebooks.sh

e2c58b7

mnlevy1981 merged commit a58369e into marbl-ecosys:master Oct 9, 2020

mnlevy1981 deleted the time-series-API branch November 20, 2020 05:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update API for CaseClass #30

Update API for CaseClass #30

mnlevy1981 commented Sep 25, 2020

mnlevy1981 commented Sep 25, 2020

mnlevy1981 commented Oct 6, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

dcherian commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 9, 2020

mnlevy1981 commented Oct 9, 2020

klindsay28 commented Oct 9, 2020

mnlevy1981 commented Oct 9, 2020

dcherian commented Oct 9, 2020

Update API for CaseClass #30

Update API for CaseClass #30

Conversation

mnlevy1981 commented Sep 25, 2020

mnlevy1981 commented Sep 25, 2020

mnlevy1981 commented Oct 6, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

dcherian commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 8, 2020

mnlevy1981 commented Oct 9, 2020

mnlevy1981 commented Oct 9, 2020

klindsay28 commented Oct 9, 2020

mnlevy1981 commented Oct 9, 2020

dcherian commented Oct 9, 2020