Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update API for CaseClass #30

Merged
merged 15 commits into from
Oct 9, 2020
Merged

Conversation

mnlevy1981
Copy link
Contributor

The goal for this PR is to expand the API of CaseClass to take advantage of the fact that time-series data is now available on campaign. It would be great if this included support for intake-esm catalogs, but the first pass might not be that sophisticated.

I hope the first commit (361cd79) contains all the actual API changes; I think I'll be able to add the rest of the required functionality deeper in CaseClass.

This first commit is done in anticipation of supporting time series files in
addition to history files, and also using intake-esm catalogs behind the
scenes.

1. start_date & end_date are not part of the class constructor
   -- basically all the __init__ routine should do is set up a catalog
   -- _find_hist_files() also doesn't rely on these variables either; this
      routine is essentially building the catalog, and we don't restrict the
      files listed until we are ready to call open_mfdataset()
2. _open_history_files() -> gen_dataset(); this routine is where start_date and
   end_date are specified, and it returns a dataset rather than updating a
   class member variable. Logic to apply start_date and end_date at this stage
   is much simpler than doing it in _find_hist_files() (or I figured out a much
   easier way to do it). This routine also now expects a list of variable names
   to include in the dataset.
3. compare_fields_at_lat_lon expects array of DataArrays, not array of cases
4. Added get_varnames_from_metadata_list() to utils/utils.py because I need
   this function to generate the list of variable names for gen_dataset() in
   several notebooks.

I also created a script that submits jobs to the slurm queue to re-run
notebooks, though it does not work for the trend_maps or plot_suite_00[34]
notebooks (I think because of the reliance on dask and NCAR_jobqueue).
@mnlevy1981
Copy link
Contributor Author

Side note: I don't know why github thinks the plots in the notebook have changed. I suspect it's minor version differences in packages like matplotlib between my environment and Keith's? Now that my changes have been committed, I'll rerun some of the notebooks on master and verify that I see the same changes in the images.

gen_dataset() first reads time series from campaign, then reads any additional
data from history files (from archive or run directory). There's an API change
where, instead of expecting start_date and end_date (strings formatted as
'YYYY-MM'), the function wants integers start_year and end_year. Figuring out
better default values will be necessary to extend this to other CESM runs but
it's a good start.

I split plot_suite_004.ipynb into 4-year segments, as it was choking on 8 years
of output. I also added plot_suite_map notebooks for 004 for years 0006 and
0007 (we're one month shy of 0008).
@mnlevy1981 mnlevy1981 marked this pull request as ready for review October 6, 2020 17:06
@mnlevy1981
Copy link
Contributor Author

As of 4c43880 there's a simple interface for reading time series data off of /glade/campaign and what I think is a useful message when merging a dataset from time series files with one from history files:

Time series ends at 0007-01-01 00:00:00, history files begin at 0007-01-01 00:00:00

I ended up splitting plot_suite_004.ipynb into four year chunks because the code was stalling trying to process almost 8 years of data at once; I should probably add 0001-0004 to the file name (and to plot_suite_003.ipynb) to be consistent with the new plot_suite_0005-0008_004.ipynb file.

I wish it was clearer why git is claiming there are diffs in some of the plots, but a quick eyeball check didn't pick up on differences (outside of the change to just 48 months of data in plot_suite_004.ipynb). I'll talk about this PR at the software meeting this afternoon and schedule a more rigorous review in the next day or two.

I had left the "import yaml" statement in from when I toyed with the idea of
passing the metadata yaml file to __init__
Concatenating time series dataset with history file dataset now works as
expected
Also reran plot_suite notebooks in an environment that will match previous
master (note that this changes plots in Sanity Check). Next commit will be
updated trend_maps, and the plan is to squash-merge this PR to remove the
commits that changed the plots and then changed them back.
Re-ran in an environment that matches the previous master to minimize diffs in
the PR.
I didn't realize this notebook hadn't been run in previous commits
@mnlevy1981
Copy link
Contributor Author

@klindsay28 and I compared notes, and I thought we came up with an environment that would let me recreate the plots on master identically... but it seems to have worked for some notebooks and not others? The following shows how many times image appears in the git diff between master and the version of the notebook on this branch, split between - and +:

plot_suite_maps_0001_003.ipynb (480 images)
----
-: 0
+: 5
plot_suite_maps_0002_003.ipynb (480 images)
----
-: 0
+: 2
plot_suite_maps_0003_003.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0004_003.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0001_004.ipynb (480 images)
----
-: 252
+: 252
plot_suite_maps_0002_004.ipynb (480 images)
----
-: 96
+: 96
plot_suite_maps_0003_004.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0004_004.ipynb (480 images)
----
-: 480
+: 480
plot_suite_maps_0005_004.ipynb (480 images)
----
-: 470
+: 480
plot_suite_003.ipynb (184 images)
----
-: 0
+: 0
trend_maps.003.ipynb (60 images)
----
-: 0
+: 0
plot_suite_004.ipynb (224 images)
----
-: 0
+: 0
trend_maps.004.ipynb (60 images)
----
-: 0
+: 0

From this, it looks like the plots in the plot_suite_CASE and trend_maps notebooks were identical (hooray!). Notebooks like plot_suite_maps_0001_003.ipynb and plot_suite_maps_0002_003.ipynb) were missing a couple of images on master (perhaps they didn't finish running before the last commit?), but the images in common are identical. But then notebooks like plot_suite_maps_0001_004.ipynb and plot_suite_maps_0002_004.ipynb had some images match, but then many that don't. And the rest of the notebooks seem to have changed every image.

I'm at a loss to explain this behavior, since all the plot_suite_maps files were updated in the same commit (ca8458a). The seemingly obvious answer is that something happened in the time series generation process and the fields are no longer the same, but I confirmed that is not the case in three ways. First, the trend_maps and plot_suite_CASE notebooks are using time series data. Also, I directly compared one month of one of the plotted variables the time series values and the history values:

>>> import xarray as xr ; import numpy as np
>>> casename = 'g.e22.G1850ECO_JRA_HR.TL319_t13.004'
>>> year = '0003'
>>> varname = 'spChl'
>>> ds_hist = xr.open_dataset(f'/glade/scratch/mlevy/archive/{casename}/ocn/hist/{casename}.pop.h.{year}-01.nc')
>>> ds_ts = xr.open_dataset(f'/glade/campaign/cesm/development/bgcwg/projects/hi-res_JRA/cases/{casename}/output/ocn/proc/tseries/month_1/{casename}.pop.h.{varname}.{year}01-{year}12.nc')
>>> da_hist = ds_hist[varname].isel(time=0)
>>> da_ts = ds_ts[varname].isel(time=0)
>>> da_diff = da_hist - da_ts
>>> (da_diff.min().data, da_diff.max().data) # min and max both 0 if values match
(array(0.), array(0.))
>>> np.sum(np.isnan(da_diff) != np.isnan(da_ts)).data # if masks are identical, this will be 0
array(0)

Lastly, I re-ran the notebook, forcing it to use history output with the line case._timeseries_filenames[stream] = []:

plot_suite_maps_0003_004.ipynb (480 images)
----
-: 480
+: 480

To add to the strangeness, it looks like running with history files instead of time series modified 96 plots in my local sandbox (including January spChl that I included in the example above):

$ git diff | grep image | grep ^- | wc -l
96
$ git diff | grep image | grep ^+ | wc -l
96

So 384 plots are identical when my branch reads them from either time series or history files, but all 384 are different than what is on master. For the other 96 plots, there are differences between time series, history, AND master.

I think the path forward might involve

  1. accepting these changes from master even if we can't explain them, and
  2. adding a new notebook that verifies the time series fields match the history file output exactly via checks like what I included above.

@mnlevy1981
Copy link
Contributor Author

For plot_suite_maps_0003_004.ipynb it looks like I copy / pasted start_year=2, end_year=2 rather than getting year three data. Something similar is likely the case for the other notebooks that are showing every single plot as different (I'll double check after leaving this comment), though it doesn't explain the notebooks where only some of the plots are different.

@dcherian
Copy link

dcherian commented Oct 8, 2020

Are the images visibly different? https://app.reviewnb.com/ could be useful in this repo...

(I know that matplotlib commonly runs in to tiny pixel-level differences the font was rendered very slightly different.)

@mnlevy1981
Copy link
Contributor Author

Are the images visibly different? https://app.reviewnb.com/ could be useful in this repo...

When I first started making these notebooks, the plots looked the same in the eyeball norm but git claimed they were different. Then I tried to recreate the environment that @klindsay28 used and in the first few notebooks git stopped seeing changes. Then I refactored and got some of the years wrong, so I'm currently re-running with one more refactor that makes it harder to recreate that copy / paste error :) I'll take a closer look at the cases like

plot_suite_maps_0002_004.ipynb (480 images)
----
-: 96
+: 96

After re-running all the notebooks and see how the images compare in the eyeball norm.

@mnlevy1981
Copy link
Contributor Author

Re-running the script from before, it looks like getting the year correct makes a pretty big difference:

plot_suite_maps_0001_003.ipynb (480 images)
----
-: 0
+: 5
plot_suite_maps_0002_003.ipynb (480 images)
----
-: 154
+: 156
plot_suite_maps_0003_003.ipynb (480 images)
----
-: 156
+: 156
plot_suite_maps_0004_003.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0001_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0002_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0003_004.ipynb (480 images)
----
-: 156
+: 156
plot_suite_maps_0004_004.ipynb (480 images)
----
-: 96
+: 96
plot_suite_maps_0005_004.ipynb (480 images)
----
-: 0
+: 10
plot_suite_003.ipynb (184 images)
----
-: 0
+: 0
trend_maps.003.ipynb (60 images)
----
-: 0
+: 0
plot_suite_004.ipynb (224 images)
----
-: 0
+: 0
trend_maps.004.ipynb (60 images)
----
-: 0
+: 0

It's curious to me that the notebooks showing differences in plots all seem to have 324 plots that match and 156 that don't (plot_suite_maps_0004_004.ipynb is an exception to this). If I'm reading the diffs right, it's the NH4 plots and all that follow (whereas for plot_suite_maps_0004_004.ipynb the first different plot is spChl and then there are some matches afterwards?). So I'll see if anything stands out in NH4 on either my branch or the current master.

In the meantime, I'll push notebooks that use da.indentical() to show that the history files and time series files do match (at least for the fields we are plotting)

Some of these were plotting the wrong year; I refactored all of them to be more
clear about what is being plotted.
@mnlevy1981
Copy link
Contributor Author

Okay, I think the mystery might be solved... compare these two images; the first is a plot from master, the second is a plot from my branch:

NH4_0003_003_master
NH4_0003_003_branch

(hint: look at the title) I have no idea what would cause these strings to flip position, and the weird part is that other plots from 3D fields (e.g. spChl) in the same notebook are the same and match the order of terms in NH4 on master:

spChl_0003_003_master
spChl_0003_003_branch

So I'm not going to compare 562 plots that have changed, but I suspect they only vary in the order that time and depth appear in the title.

For now the comparisons are done based on variables in diag_metadata.yaml.
Required a flag to suppress some CaseClass output.
@mnlevy1981
Copy link
Contributor Author

@dcherian -- any idea why _title_for_slice() would give different results on successive calls? Can I specify an order for da.coords at some point in the process or does it need to be determined automatically? I can recreate the issue because

import utils

casename = "g.e22.G1850ECO_JRA_HR.TL319_t13.003"
case = utils.CaseClass(casename)
ds = case.gen_dataset('NH4', 'pop.h', start_year=1, end_year=1)

one_dims=[]
for dim, coord in ds['NH4'].isel(z_t=0, time=0).coords.items():
    if coord.size == 1:
        one_dims.append(
            "{dim} = {v}".format(dim=dim, v=coord.values)
        )

print(one_dims)

sometimes prints['time = 0001-01-16 12:15:50', 'z_t = 500.0'] and other times prints ['z_t = 500.0', 'time = 0001-01-16 12:15:50']

All plot_suite_maps now match master
@mnlevy1981
Copy link
Contributor Author

Well, I updated xarray to 16.1 (from 16.0), and things seem to match master now:

plot_suite_maps_0001_003.ipynb (480 images)
----
-: 0
+: 5
plot_suite_maps_0002_003.ipynb (480 images)
----
-: 0
+: 2
plot_suite_maps_0003_003.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0004_003.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0001_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0002_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0003_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0004_004.ipynb (480 images)
----
-: 0
+: 0
plot_suite_maps_0005_004.ipynb (480 images)
----
-: 0
+: 10
plot_suite_003.ipynb (184 images)
----
-: 0
+: 0
trend_maps.003.ipynb (60 images)
----
-: 0
+: 0
plot_suite_004.ipynb (224 images)
----
-: 0
+: 0
trend_maps.004.ipynb (60 images)
----
-: 0
+: 0

I upgraded because I thought that pydata/xarray#4409 looked promising (although it's not clear to me when coords were getting shuffled, given I was using open_mfdataset to read a single time series file and not doing any merging or concatenating) and as of 53c1f9f the diffs of the .ipynb files are meaningful.

@klindsay28
Copy link
Contributor

@mnlevy1981 , statements of the form ds_subset = ds[varnamelist] can trigger the non-determinism described in pydata/xarray#4409. See my comment at link.

I think that you are doing this with code like xr.open_mfdataset(...)[[varname] + _vars_to_keep].

@mnlevy1981
Copy link
Contributor Author

@klindsay28 Thanks for pointing that out! I thought I had read the entire PR conversation, but I must have gone through one of the issue tickets addressed by the PR instead. I'm a little confused about why your plot titles seem to be consistent despite the comment you linked to - did you end up using pip to install xarray from github after this PR was merged? Maybe conda wasn't smart enough to realize that and dumped the old version when you ran conda export. Not a pressing matter, I think it's just the last step in figuring out how our environments differed.

@dcherian
Copy link

dcherian commented Oct 9, 2020

glad this is fixed.

* run_notebook.sh launches a single notebook via slurm, and run_all.sh is now
  run_all.py (python-script) that calls run_notebook.sh
* _find_timeseries_files() now includes pop.h.nyear1 files
* gen_dataset():
  1. no longer assume time_bound is the name of the variable (look at bounds
     property of ds["time"]
  2. if bounds are not decoded, still update start_year before looking for
     history files (via decoding time, which may not be best solution)
* use list.extend() instead of += operator
@mnlevy1981 mnlevy1981 merged commit a58369e into marbl-ecosys:master Oct 9, 2020
@mnlevy1981 mnlevy1981 deleted the time-series-API branch November 20, 2020 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants