Update API for CaseClass #30
This first commit is done in anticipation of supporting time series files in addition to history files, and also using intake-esm catalogs behind the scenes.

1. start_date & end_date are not part of the class constructor -- basically all the __init__ routine should do is set up a catalog -- _find_hist_files() also doesn't rely on these variables either; this routine is essentially building the catalog, and we don't restrict the files listed until we are ready to call open_mfdataset()
2. _open_history_files() -> gen_dataset(); this routine is where start_date and end_date are specified, and it returns a dataset rather than updating a class member variable. Logic to apply start_date and end_date at this stage is much simpler than doing it in _find_hist_files() (or I figured out a much easier way to do it). This routine also now expects a list of variable names to include in the dataset.
3. compare_fields_at_lat_lon expects array of DataArrays, not array of cases
4. Added get_varnames_from_metadata_list() to utils/utils.py because I need this function to generate the list of variable names for gen_dataset() in several notebooks.

I also created a script that submits jobs to the slurm queue to re-run notebooks, though it does not work for the trend_maps or plot_suite_00[34] notebooks (I think because of the reliance on dask and NCAR_jobqueue).
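The split described in items 1 and 2 (build the full catalog up front, apply the date range only when a dataset is requested) can be illustrated with a minimal sketch. The helper names `build_catalog` and `select_files` are hypothetical and are not the functions in this PR. The file-name pattern is assumed from the history-file path shown later in this thread.

```python
import re

# Assumed pattern: "<case>.pop.h.YYYY-MM.nc" (taken from the history-file
# path quoted later in this conversation). Helper names are hypothetical.
_DATE_RE = re.compile(r"\.(\d{4})-(\d{2})\.nc$")


def build_catalog(filenames):
    """Index every history file by (year, month); no date filtering here."""
    catalog = {}
    for name in filenames:
        match = _DATE_RE.search(name)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            catalog[key] = name
    return catalog


def select_files(catalog, start_date, end_date):
    """Return files dated within [start_date, end_date], both 'YYYY-MM'."""
    start = tuple(int(part) for part in start_date.split("-"))
    end = tuple(int(part) for part in end_date.split("-"))
    # Tuples compare lexicographically, so (year, month) ordering is correct.
    return [catalog[key] for key in sorted(catalog) if start <= key <= end]
```

The list returned by `select_files` is the kind of input that would then be handed to `open_mfdataset()`; keeping the filter out of the constructor means one catalog can serve many date ranges.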
Side note: I don't know why github thinks the plots in the notebook have changed. I suspect it's minor version differences in packages like |
gen_dataset() first reads time series from campaign, then reads any additional data from history files (from archive or run directory). There's an API change where, instead of expecting start_date and end_date (strings formatted as 'YYYY-MM'), the function wants integers start_year and end_year. Figuring out better default values will be necessary to extend this to other CESM runs but it's a good start. I split plot_suite_004.ipynb into 4-year segments, as it was choking on 8 years of output. I also added plot_suite_map notebooks for 004 for years 0006 and 0007 (we're one month shy of 0008).
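As a rough illustration of the "time series first, history files for the rest" behavior described above, the sketch below splits a requested year range into the two sources. The function name `split_years` and the idea of passing in the list of years covered by time series are assumptions made for this example; the real gen_dataset() may decide this differently.

```python
def split_years(start_year, end_year, timeseries_years):
    """Split [start_year, end_year] into years read from time series
    and years that still need history files.

    timeseries_years: iterable of integer years available as time series
    (a hypothetical input for this sketch).
    """
    available = set(timeseries_years)
    requested = range(start_year, end_year + 1)
    from_timeseries = [year for year in requested if year in available]
    from_history = [year for year in requested if year not in available]
    return from_timeseries, from_history
```

For example, `split_years(1, 5, [1, 2, 3])` assigns years 1 through 3 to time series and years 4 and 5 to history files.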
As of 4c43880 there's a simple interface for reading time series data off of
I ended up splitting I wish it was clearer why git is claiming there are diffs in some of the plots, but a quick eyeball check didn't pick up on differences (outside of the change to just 48 months of data in |
I had left the "import yaml" statement in from when I toyed with the idea of passing the metadata yaml file to __init__
Concatenating time series dataset with history file dataset now works as expected
Also reran plot_suite notebooks in an environment that will match previous master (note that this changes plots in Sanity Check). Next commit will be updated trend_maps, and the plan is to squash-merge this PR to remove the commits that changed the plots and then changed them back.
Re-ran in an environment that matches the previous master to minimize diffs in the PR.
I didn't realize this notebook hadn't been run in previous commits
@klindsay28 and I compared notes, and I thought we came up with an environment that would let me recreate the plots on
From this, it looks like the plots in the I'm at a loss to explain this behavior, since all the
>>> import xarray as xr ; import numpy as np
>>> casename = 'g.e22.G1850ECO_JRA_HR.TL319_t13.004'
>>> year = '0003'
>>> varname = 'spChl'
>>> ds_hist = xr.open_dataset(f'/glade/scratch/mlevy/archive/{casename}/ocn/hist/{casename}.pop.h.{year}-01.nc')
>>> ds_ts = xr.open_dataset(f'/glade/campaign/cesm/development/bgcwg/projects/hi-res_JRA/cases/{casename}/output/ocn/proc/tseries/month_1/{casename}.pop.h.{varname}.{year}01-{year}12.nc')
>>> da_hist = ds_hist[varname].isel(time=0)
>>> da_ts = ds_ts[varname].isel(time=0)
>>> da_diff = da_hist - da_ts
>>> (da_diff.min().data, da_diff.max().data) # min and max both 0 if values match
(array(0.), array(0.))
>>> np.sum(np.isnan(da_diff) != np.isnan(da_ts)).data # if masks are identical, this will be 0
array(0)
Lastly, I re-ran the notebook, forcing it to use history output with the line
To add to the strangeness, it looks like running with history files instead of time series modified 96 plots in my local sandbox (including January
So 384 plots are identical when my branch reads them from either time series or history files, but all 384 are different than what is on I think the path forward might involve
|
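The value-and-mask check from the interactive session above can be wrapped in a small reusable helper. This is a sketch that uses plain NumPy arrays instead of the xarray objects in the session, and `fields_match` is a hypothetical name, not something in this repository.

```python
import numpy as np


def fields_match(a, b):
    """True if a and b have the same shape, the same NaN mask,
    and identical values wherever both are valid."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    mask_a = np.isnan(a)
    mask_b = np.isnan(b)
    if not np.array_equal(mask_a, mask_b):
        return False  # land/ocean (or missing-value) masks differ
    valid = ~mask_a
    return bool(np.array_equal(a[valid], b[valid]))
```

With xarray, the same function could be called on `da_hist.values` and `da_ts.values`.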
For |
Are the images visibly different? https://app.reviewnb.com/ could be useful in this repo... (I know that matplotlib commonly runs into tiny pixel-level differences when the font is rendered very slightly differently.) |
When I first started making these notebooks, the plots looked the same in the eyeball norm but
After re-running all the notebooks and see how the images compare in the eyeball norm. |
Re-running the script from before, it looks like getting the year correct makes a pretty big difference:
It's curious to me that the notebooks showing differences in plots all seem to have 324 plots that match and 156 that don't ( In the meantime, I'll push notebooks that use |
Some of these were plotting the wrong year; I refactored all of them to be more clear about what is being plotted.
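Counts like the ones quoted above (how many embedded plots match between two versions of a notebook) can be gathered with a short script. The sketch below is not the script used in this thread. It assumes notebooks in the standard JSON layout, where each plot is stored as a base64 string under the `image/png` key of a cell output, and it pairs plots by position. The function names are hypothetical.

```python
import hashlib
import json


def load_notebook(path):
    """Read a .ipynb file (which is plain JSON) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def png_hashes(notebook):
    """Return one SHA-256 digest per embedded PNG output, in cell order."""
    digests = []
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            png = output.get("data", {}).get("image/png")
            if png is None:
                continue
            if isinstance(png, list):  # multi-line strings may be stored as lists
                png = "".join(png)
            digests.append(hashlib.sha256(png.encode("ascii")).hexdigest())
    return digests


def count_matches(notebook_a, notebook_b):
    """Return (matching, differing) plot counts, pairing plots by position."""
    a = png_hashes(notebook_a)
    b = png_hashes(notebook_b)
    matching = sum(x == y for x, y in zip(a, b))
    differing = max(len(a), len(b)) - matching
    return matching, differing
```

Because this compares the encoded bytes, any rendering difference counts as a mismatch, including a one-pixel shift or reordered title text; it says nothing about whether two plots look the same to the eye.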
Okay, I think the mystery might be solved... compare these two images; the first is a plot from (hint: look at the title) I have no idea what would cause these strings to flip position, and the weird part is that other plots from 3D fields (e.g. So I'm not going to compare 562 plots that have changed, but I suspect they only vary in the order that time and depth appear in the title. |
For now the comparisons are done based on variables in diag_metadata.yaml. Required a flag to suppress some CaseClass output.
@dcherian -- any idea why
import utils
casename = "g.e22.G1850ECO_JRA_HR.TL319_t13.003"
case = utils.CaseClass(casename)
ds = case.gen_dataset('NH4', 'pop.h', start_year=1, end_year=1)
one_dims = []
for dim, coord in ds['NH4'].isel(z_t=0, time=0).coords.items():
    if coord.size == 1:
        one_dims.append(
            "{dim} = {v}".format(dim=dim, v=coord.values)
        )
print(one_dims)
sometimes prints |
All plot_suite_maps now match master
Well, I updated
I upgraded because I thought that pydata/xarray#4409 looked promising (although it's not clear to me when |
@mnlevy1981 , statements of the form I think that you are doing this with code like |
@klindsay28 Thanks for pointing that out! I thought I had read the entire PR conversation, but I must have gone through one of the issue tickets addressed by the PR instead. I'm a little confused about why your plot titles seem to be consistent despite the comment you linked to - did you end up using |
glad this is fixed. |
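For reference, one general way to make title strings independent of the order in which coordinates are iterated is to sort the coordinate names before building the labels. This is only a sketch of that idea using a plain dict of name-to-values lists; it is not the fix applied in this PR, and `scalar_coord_labels` is a hypothetical name.

```python
def scalar_coord_labels(coords):
    """Build 'name = value' labels for length-1 coordinates.

    coords: mapping of coordinate name -> list of values (a stand-in
    for the xarray coordinates used in the snippet above).
    Names are visited in sorted order, so the result does not depend
    on the mapping's iteration order.
    """
    labels = []
    for name in sorted(coords):
        values = coords[name]
        if len(values) == 1:
            labels.append("{name} = {v}".format(name=name, v=values[0]))
    return labels
```

With this approach, `time` always appears before `z_t` in the title regardless of how the dataset was constructed.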
* run_notebook.sh launches a single notebook via slurm, and run_all.sh is now run_all.py (python-script) that calls run_notebook.sh
* _find_timeseries_files() now includes pop.h.nyear1 files
* gen_dataset():
  1. no longer assume time_bound is the name of the variable (look at bounds property of ds["time"])
  2. if bounds are not decoded, still update start_year before looking for history files (via decoding time, which may not be best solution)
* use list.extend() instead of += operator
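A driver in the spirit of run_all.py could look roughly like the sketch below. It assumes run_notebook.sh takes the notebook path as its only argument, which is not confirmed in this thread, and the function names and `dry_run` flag are hypothetical.

```python
import glob
import os
import subprocess


def build_commands(notebooks, launcher="./run_notebook.sh"):
    """Return one command (as an argument list) per notebook, sorted by name.

    Assumption: the launcher accepts the notebook path as its sole argument.
    """
    return [[launcher, notebook] for notebook in sorted(notebooks)]


def run_all(notebook_dir=".", dry_run=True):
    """Launch every notebook in notebook_dir, or just print the commands."""
    notebooks = glob.glob(os.path.join(notebook_dir, "*.ipynb"))
    commands = build_commands(notebooks)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if the launcher fails
            subprocess.run(command, check=True)
    return commands
```

Defaulting to `dry_run=True` lets the command list be inspected before anything is submitted to the queue.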
The goal for this PR is to expand the API of CaseClass to take advantage of the fact that time-series data is now available on campaign. It would be great if this included support for intake-esm catalogs, but the first pass might not be that sophisticated.
I hope the first commit (361cd79) contains all the actual API changes; I think I'll be able to add the rest of the required functionality deeper in CaseClass.