
Add test of remove_duplicate_coords; simplify open_cubed_sphere #13

Merged

Conversation

spencerkclark
Member

@spencerkclark commented Oct 10, 2019

In v0.12 of xarray, combine_by_coords behaves differently with DataArrays and Datasets: it broadcasts all the variables to a common set of dimensions, dramatically increasing the size of the dataset.

This behavior caught me a bit by surprise too. It turns out a clean way to avoid it is by specifying data_vars='minimal' in combine_by_coords (and by extension open_mfdataset). This PR adds a test for remove_duplicate_coords and simplifies open_cubed_sphere by taking advantage of data_vars='minimal' when combining.
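For reference, a minimal sketch of the kind of call this relies on (the filename glob here is just a placeholder):

import xarray as xr

# Hypothetical subtile files for a single tile; the glob is illustrative only.
ds = xr.open_mfdataset(
    'fv_core_coarse.res.tile1.nc.*',
    combine='by_coords',
    data_vars='minimal',  # only concatenate variables that actually vary along the combined dims
)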

I'll try and think about what the best way of testing open_cubed_sphere might be, considering that it depends on the existence of files to read. Would we maybe want to include some minimal dummy data in the repo?

Anyway, I hope I'm not stepping on work you've already done or planned to do here; I realize this goes a bit beyond the scope of what you suggested on Trello. I'll get to writing functions to re-coarse-grain data now.

@nbren12
Contributor

nbren12 commented Oct 10, 2019

I'll try and think about what the best way of testing open_cubed_sphere might be, considering that it depends on the existence of files to read.

I think we should avoid adding data to the repo unless necessary. We will probably want to write a function at some point that sub-tiles the faces of the cubed sphere and writes them to disk. We could then generate some fake data using numpy commands and save it to disk in the same format.
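Something along these lines could be a starting point (a rough sketch; the variable, dimension names, and sizes are made up, and the filenames follow the pattern this PR already uses):

import numpy as np
import xarray as xr

def write_fake_tile_data(prefix, num_tiles=6, num_subtiles=16, nx=8, ny=8):
    """Write small random subtile files in the same layout the reader expects."""
    for tile in range(1, num_tiles + 1):
        for proc in range(num_subtiles):
            ds = xr.Dataset(
                {'air_temperature': (('y', 'x'), np.random.rand(ny, nx))},
                coords={'x': np.arange(nx), 'y': np.arange(ny)},
            )
            ds.to_netcdf(f'{prefix}.tile{tile:d}.nc.{proc:04d}')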

tile_index = pd.Index(range(1, NUM_TILES + 1), name='tiles')
tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
combined = xr.concat(tiles, dim=tile_index)
return remove_duplicate_coords(combined)
Contributor

I would be in favor of calling remove_duplicate_coords in read_tile since that is where the problem arises.

Member Author

Sure thing -- I'll change this in my next commit.
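Roughly like this, I think (just a sketch; the open_mfdataset arguments are the same ones already used in this PR):

def read_tile(prefix, tile, num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in subtiles]
    combined = xr.open_mfdataset(filenames, data_vars='minimal', combine='by_coords')
    # Drop the duplicated coordinate values here, where they first show up.
    return remove_duplicate_coords(combined)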

@nbren12
Contributor

nbren12 commented Oct 10, 2019

Also, if your code works, I think we can delay adding more comprehensive tests until a future PR.

# TODO(Spencer): write a test of this function
def read_tile(prefix, tile, num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in subtiles]
Contributor

Could we make this filename pattern a keyword argument? I get a little worried about having hardcoded strings too far down the call stack.

Member Author

Are you suggesting something like this?

def read_tile(prefix, tile, pattern='{prefix}.tile{tile:d}.nc.{proc:04d}', num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [pattern.format(prefix=prefix, tile=tile, proc=proc) for proc in subtiles]
    ...

Contributor

Exactly. It's not perfect, since I would prefer the filename-handling code be completely separate from the combining code, but it is okay for now.
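A call could then look something like this (the subdirectory is made up, just to show the override):

ds = read_tile('grid_spec', 1, pattern='restarts/{prefix}.tile{tile:d}.nc.{proc:04d}')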

Member Author

@spencerkclark left a comment

Sounds good regarding waiting until later to write comprehensive tests for open_cubed_sphere. Yes, this code should work the same way as your old implementation.


data = xr.concat(combined_tiles, dim='tiles').sortby('tiles')
return remove_duplicate_coords(data)
tile_index = pd.Index(range(1, NUM_TILES + 1), name='tiles')
Member Author

I have a slight preference for a singular dimension name here, i.e. 'tile' instead of 'tiles'. Do you mind if I change that?

Contributor

Yes please! Although there are some downstream dependencies in the Snakefile, src.fv3.docker, and src.fv3.coarsen.

Member Author

Great, thanks for the heads up -- hopefully I got to everything in 0c03f91.

@spencerkclark
Member Author

Hmm...I'm having second thoughts about my approach here. I was testing on the fv_core_coarse.res restart files, and things were working well, but testing on the fine-resolution grid_spec files fails (your implementation works; it is also faster by a factor of 30, I think because open_mfdataset does some equality checking under the hood on the variables it combines):

---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
<ipython-input-117-d00ca2651d89> in <module>
----> 1 grid_spec = open_cubed_sphere('grid_spec')

<ipython-input-116-1eedeaf547e4> in open_cubed_sphere(prefix, **kwargs)
     22 def open_cubed_sphere(prefix, **kwargs):
     23     tile_index = pd.Index(range(1, 7), name='tile')
---> 24     tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
     25     combined = xr.concat(tiles, dim=tile_index)
     26     return remove_duplicate_coords(combined)

<ipython-input-116-1eedeaf547e4> in <listcomp>(.0)
     22 def open_cubed_sphere(prefix, **kwargs):
     23     tile_index = pd.Index(range(1, 7), name='tile')
---> 24     tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
     25     combined = xr.concat(tiles, dim=tile_index)
     26     return remove_duplicate_coords(combined)

<ipython-input-116-1eedeaf547e4> in read_tile(prefix, tile, num_subtiles)
     17         filenames,
     18         data_vars='minimal',
---> 19         combine='by_coords')
     20 
     21 

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, combine, autoclose, parallel, join, **kwargs)
    950             # previously
    951             combined = combine_by_coords(
--> 952                 datasets, compat=compat, data_vars=data_vars, coords=coords, join=join
    953             )
    954         else:

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in combine_by_coords(datasets, compat, data_vars, coords, fill_value, join)
    604             compat=compat,
    605             fill_value=fill_value,
--> 606             join=join,
    607         )
    608 

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join)
    196             compat=compat,
    197             fill_value=fill_value,
--> 198             join=join,
    199         )
    200     (combined_ds,) = combined_ids.values()

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join)
    218         datasets = combined_ids.values()
    219         new_combined_ids[new_id] = _combine_1d(
--> 220             datasets, dim, compat, data_vars, coords, fill_value, join
    221         )
    222     return new_combined_ids

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
    246                 compat=compat,
    247                 fill_value=fill_value,
--> 248                 join=join,
    249             )
    250         except ValueError as err:

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    131             "objects, got %s" % type(first_obj)
    132         )
--> 133     return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    134 
    135 

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
    329         for var in variables_to_merge:
    330             result_vars[var] = unique_variable(
--> 331                 var, to_merge[var], compat=compat, equals=equals.get(var, None)
    332             )
    333     else:

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/merge.py in unique_variable(name, variables, compat, equals)
    123         raise MergeError(
    124             "conflicting values for variable %r on objects to be combined. You can skip this check by specifying compat='override'."
--> 125             % (name)
    126         )
    127 

MergeError: conflicting values for variable 'dy' on objects to be combined. You can skip this check by specifying compat='override'.

I think this is related to the fact that dy currently has missing values in it; it really shouldn't, but that's a separate issue (see my comment on Trello here). I now know how to do this appropriately with FMS, so going forward it won't be a problem.

So, ultimately it would be nice if we could use as simple a solution as I proposed, but perhaps some complexity here is unavoidable if we want good performance and robustness to missing values in variables to be combined.
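For completeness, the stopgap the error message itself suggests would look roughly like this (a sketch; compat='override' just takes dy from the first dataset instead of checking equality, so it papers over the missing values rather than fixing them):

combined = xr.open_mfdataset(
    filenames,
    data_vars='minimal',
    coords='minimal',
    compat='override',  # skip the per-variable equality checks
    combine='by_coords',
)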

@nbren12
Contributor

nbren12 commented Oct 10, 2019

Okay, that's too bad. Honestly, I have a big aversion to open_mfdataset. It never seems to work that well, and its options are very confusing.

@spencerkclark
Member Author

Honestly, I have a big aversion to open_mfdataset. It never seems to work that well, and its options are very confusing.

Yeah, I was excited to try out the new multidimensional combine option here (this felt like the ideal use-case), but it seems like that functionality is still a bit raw, indicated particularly by the multiple workarounds needed to get combine_by_coords to work as desired. At least there is an open xarray PR for improving the situation regarding combine_by_coords and DataArrays: pydata/xarray#3312. It also feels like there could be room for another data_vars option that concatenates all variables without doing equality checks or broadcasting. At the very least, I think the documentation could be improved regarding those options. They are indeed confusing.

I reverted back to not using open_mfdataset and refactored things a little. Essentially I switched to using a method to generate the subtile filenames for a single tile, rather than all the tiles at once; this avoids the need for grouping the filenames by tile after the fact. Up to you whether you want to go with that; no worries if you prefer the original implementation.
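For context, the per-tile filename helper I mean is roughly this shape (the name is illustrative, not necessarily what is in the commit):

def subtile_filenames(prefix, tile, num_subtiles=16):
    """Generate the subtile filenames belonging to a single tile."""
    return [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in range(num_subtiles)]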

@nbren12
Contributor

nbren12 commented Oct 12, 2019

Okay. This looks pretty good. I will try to run the workflow and merge it if it works.

begin rant:

At the very least, I think the documentation could be improved regarding those options.

I think the problem with open_mfdataset goes a little deeper. It just does way too much for one function (e.g. matching the filenames with a glob, opening the datasets one by one, merging them), and as a result its argument space is heading toward combinatorial complexity trying to handle all these cases.

That said, I think there are some special cases that should be supported, such as a directory of files with one timestep per file and identical dimensions, where the filename contains the date/time.
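For instance, that case could be handled by hand with something like the following (a sketch, assuming a hypothetical filename layout like data.20160801.000000.nc):

import glob
import re
from datetime import datetime

import xarray as xr

def open_timestep_files(directory):
    """Open one-timestep-per-file output, taking the time from the filename."""
    paths = sorted(glob.glob(f'{directory}/data.*.nc'))
    datasets = []
    for path in paths:
        # Pull the date/time out of the filename rather than trusting the metadata.
        stamp = re.search(r'(\d{8})\.(\d{6})', path)
        time = datetime.strptime(''.join(stamp.groups()), '%Y%m%d%H%M%S')
        datasets.append(xr.open_dataset(path).expand_dims(time=[time]))
    # All files share identical dimensions, so a plain concat along time is enough.
    return xr.concat(datasets, dim='time')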

@nbren12
Contributor

nbren12 commented Oct 16, 2019

I am going to merge this. Thanks, Spencer!

@nbren12 nbren12 merged commit f1ad046 into ai2cm:feature/post-process-timesteps Oct 16, 2019