
Add test of remove_duplicate_coords; simplify open_cubed_sphere #13

Merged

Conversation

spencerkclark
Member

@spencerkclark commented Oct 10, 2019

In v0.12 of xarray, combine_by_coords behaves differently with DataArrays and Datasets: it broadcasts all the variables to a common set of dimensions, dramatically increasing the size of the dataset.

This behavior caught me a bit by surprise too. It turns out a clean way to avoid it is by specifying data_vars='minimal' in combine_by_coords (and by extension open_mfdataset). This PR adds a test for remove_duplicate_coords and simplifies open_cubed_sphere by taking advantage of data_vars='minimal' when combining.
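For reference, a minimal sketch of the kind of call this relies on (the filename glob here is just a placeholder):

import xarray as xr

# Hypothetical subtile files for a single tile; the glob is illustrative only.
ds = xr.open_mfdataset(
    'fv_core_coarse.res.tile1.nc.*',
    combine='by_coords',
    data_vars='minimal',  # only concatenate variables that actually vary along the combined dims
)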

I'll try and think about what the best way of testing open_cubed_sphere might be, considering that it depends on the existence of files to read. Would we maybe want to include some minimal dummy data in the repo?

Anyway, I hope I'm not stepping on work you've already done or planned to do here; I realize this goes a bit beyond the scope of what you suggested on Trello. I'll get to writing functions to re-coarse-grain data now.

@nbren12
Contributor

nbren12 commented Oct 10, 2019

I'll try and think about what the best way of testing open_cubed_sphere might be, considering that it depends on the existence of files to read.

I think we should avoid adding data to the repo unless necessary. We will probably want to write a function at some point that sub-tiles the faces of the cubed sphere and writes them to disk. We could then generate some fake data using numpy commands and save it to disk in the same format.
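Something along these lines could be a starting point (a rough sketch; the variable, dimension names, and sizes are made up, and the filenames follow the pattern this PR already uses):

import numpy as np
import xarray as xr

def write_fake_tile_data(prefix, num_tiles=6, num_subtiles=16, nx=8, ny=8):
    """Write small random subtile files in the same layout the reader expects."""
    for tile in range(1, num_tiles + 1):
        for proc in range(num_subtiles):
            ds = xr.Dataset(
                {'air_temperature': (('y', 'x'), np.random.rand(ny, nx))},
                coords={'x': np.arange(nx), 'y': np.arange(ny)},
            )
            ds.to_netcdf(f'{prefix}.tile{tile:d}.nc.{proc:04d}')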

tile_index = pd.Index(range(1, NUM_TILES + 1), name='tiles')
tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
combined = xr.concat(tiles, dim=tile_index)
return remove_duplicate_coords(combined)
Contributor

I would be in favor of calling remove_duplicate_coords in read_tile since that is where the problem arises.

Member Author

Sure thing -- I'll change this in my next commit.
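Roughly like this, I think (just a sketch; the open_mfdataset arguments are the same ones already used in this PR):

def read_tile(prefix, tile, num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in subtiles]
    combined = xr.open_mfdataset(filenames, data_vars='minimal', combine='by_coords')
    # Drop the duplicated coordinate values here, where they first show up.
    return remove_duplicate_coords(combined)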

@nbren12
Contributor

nbren12 commented Oct 10, 2019

Also, if your code works, I think we can delay adding more comprehensive tests until a future PR.

# TODO(Spencer): write a test of this function
def read_tile(prefix, tile, num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in subtiles]
Contributor

Could we make this filename pattern a keyword argument? I get a little worried about having hardcoded strings too far down the call stack.

Member Author

Are you suggesting something like this?

def read_tile(prefix, tile, pattern='{prefix}.tile{tile:d}.nc.{proc:04d}', num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [pattern.format(prefix=prefix, tile=tile, proc=proc) for proc in subtiles]
    ...

Contributor

Exactly. It's not perfect, since I would prefer the filename-handling code be completely separate from the combining code, but it is okay for now.
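A call could then look something like this (the subdirectory is made up, just to show the override):

ds = read_tile('grid_spec', 1, pattern='restarts/{prefix}.tile{tile:d}.nc.{proc:04d}')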

Member Author

@spencerkclark left a comment

Sounds good regarding waiting until later to write comprehensive tests for open_cubed_sphere. Yes, this code should work the same way as your old implementation.


data = xr.concat(combined_tiles, dim='tiles').sortby('tiles')
return remove_duplicate_coords(data)
tile_index = pd.Index(range(1, NUM_TILES + 1), name='tiles')
Member Author

I have a slight preference for a singular dimension name here, i.e. 'tile' instead of 'tiles'. Do you mind if I change that?

Contributor

Yes please! Although there are some downstream dependencies in the Snakefile, src.fv3.docker, and src.fv3.coarsen.

Member Author

Great, thanks for the heads up -- hopefully I got to everything in 0c03f91.

@spencerkclark
Member Author

Hmm...I'm having second thoughts about my approach here. I was testing on the fv_core_coarse.res restart files, and things were working well, but testing on the fine-resolution grid_spec files fails (your implementation works; it is also faster by a factor of 30, I think because open_mfdataset does some equality checking under the hood on the variables it combines):

---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
<ipython-input-117-d00ca2651d89> in <module>
----> 1 grid_spec = open_cubed_sphere('grid_spec')

<ipython-input-116-1eedeaf547e4> in open_cubed_sphere(prefix, **kwargs)
     22 def open_cubed_sphere(prefix, **kwargs):
     23     tile_index = pd.Index(range(1, 7), name='tile')
---> 24     tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
     25     combined = xr.concat(tiles, dim=tile_index)
     26     return remove_duplicate_coords(combined)

<ipython-input-116-1eedeaf547e4> in <listcomp>(.0)
     22 def open_cubed_sphere(prefix, **kwargs):
     23     tile_index = pd.Index(range(1, 7), name='tile')
---> 24     tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
     25     combined = xr.concat(tiles, dim=tile_index)
     26     return remove_duplicate_coords(combined)

<ipython-input-116-1eedeaf547e4> in read_tile(prefix, tile, num_subtiles)
     17         filenames,
     18         data_vars='minimal',
---> 19         combine='by_coords')
     20 
     21 

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, combine, autoclose, parallel, join, **kwargs)
    950             # previously
    951             combined = combine_by_coords(
--> 952                 datasets, compat=compat, data_vars=data_vars, coords=coords, join=join
    953             )
    954         else:

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in combine_by_coords(datasets, compat, data_vars, coords, fill_value, join)
    604             compat=compat,
    605             fill_value=fill_value,
--> 606             join=join,
    607         )
    608 

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join)
    196             compat=compat,
    197             fill_value=fill_value,
--> 198             join=join,
    199         )
    200     (combined_ds,) = combined_ids.values()

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join)
    218         datasets = combined_ids.values()
    219         new_combined_ids[new_id] = _combine_1d(
--> 220             datasets, dim, compat, data_vars, coords, fill_value, join
    221         )
    222     return new_combined_ids

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/combine.py in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
    246                 compat=compat,
    247                 fill_value=fill_value,
--> 248                 join=join,
    249             )
    250         except ValueError as err:

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    131             "objects, got %s" % type(first_obj)
    132         )
--> 133     return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    134 
    135 

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
    329         for var in variables_to_merge:
    330             result_vars[var] = unique_variable(
--> 331                 var, to_merge[var], compat=compat, equals=equals.get(var, None)
    332             )
    333     else:

~/Software/miniconda3/envs/fv3net/lib/python3.7/site-packages/xarray/core/merge.py in unique_variable(name, variables, compat, equals)
    123         raise MergeError(
    124             "conflicting values for variable %r on objects to be combined. You can skip this check by specifying compat='override'."
--> 125             % (name)
    126         )
    127 

MergeError: conflicting values for variable 'dy' on objects to be combined. You can skip this check by specifying compat='override'.

I think this is related to the fact that dy currently has missing values in it; it really shouldn't, but that's a separate issue (see my comment on Trello here). I now know how to do this appropriately with FMS, so going forward it won't be a problem.

So, ultimately it would be nice if we could use as simple a solution as I proposed, but perhaps some complexity here is unavoidable if we want good performance and robustness to missing values in variables to be combined.
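For completeness, the stopgap the error message itself suggests would look roughly like this (a sketch; compat='override' just takes dy from the first dataset instead of checking equality, so it papers over the missing values rather than fixing them):

combined = xr.open_mfdataset(
    filenames,
    data_vars='minimal',
    coords='minimal',
    compat='override',  # skip the per-variable equality checks
    combine='by_coords',
)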

@nbren12
Contributor

nbren12 commented Oct 10, 2019

Okay, that's too bad. Honestly, I have a big aversion to open_mfdataset. It never seems to work that well, and its options are very confusing.

@spencerkclark
Member Author

Honestly, I have a big aversion to open_mfdataset. It never seems to work that well, and its options are very confusing.

Yeah, I was excited to try out the new multidimensional combine option here (this felt like the ideal use-case), but it seems like that functionality is still a bit raw, indicated particularly by the multiple workarounds needed to get combine_by_coords to work as desired. At least there is an open xarray PR for improving the situation regarding combine_by_coords and DataArrays: pydata/xarray#3312. It also feels like there could be room for another data_vars option that concatenates all variables without doing equality checks or broadcasting. At the very least, I think the documentation could be improved regarding those options. They are indeed confusing.

I reverted back to not using open_mfdataset and refactored things a little. Essentially I switched to using a method to generate the subtile filenames for a single tile, rather than all the tiles at once; this avoids the need for grouping the filenames by tile after the fact. Up to you whether you want to go with that; no worries if you prefer the original implementation.
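For context, the per-tile filename helper I mean is roughly this shape (the name is illustrative, not necessarily what is in the commit):

def subtile_filenames(prefix, tile, num_subtiles=16):
    """Generate the subtile filenames belonging to a single tile."""
    return [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in range(num_subtiles)]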

@nbren12
Contributor

nbren12 commented Oct 12, 2019

Okay. This looks pretty good. I will try to run the workflow and merge it if it works.

begin rant:

At the very least, I think the documentation could be improved regarding those options.

I think the problem with open_mfdataset goes a little deeper. It just does way too much for one function (e.g. matching the filenames with a glob, opening the datasets one by one, merging them), and as a result its argument space is heading toward combinatorial complexity trying to handle all these cases.

That said, I think there are some special cases that should be supported, such as a directory of files with one timestep per file and identical dimensions, where the filename contains the date/time.
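For instance, that case could be handled by hand with something like the following (a sketch, assuming a hypothetical filename layout like data.20160801.000000.nc):

import glob
import re
from datetime import datetime

import xarray as xr

def open_timestep_files(directory):
    """Open one-timestep-per-file output, taking the time from the filename."""
    paths = sorted(glob.glob(f'{directory}/data.*.nc'))
    datasets = []
    for path in paths:
        # Pull the date/time out of the filename rather than trusting the metadata.
        stamp = re.search(r'(\d{8})\.(\d{6})', path)
        time = datetime.strptime(''.join(stamp.groups()), '%Y%m%d%H%M%S')
        datasets.append(xr.open_dataset(path).expand_dims(time=[time]))
    # All files share identical dimensions, so a plain concat along time is enough.
    return xr.concat(datasets, dim='time')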

@nbren12
Contributor

nbren12 commented Oct 16, 2019

I am going to merge this. Thanks, Spencer!

@nbren12 nbren12 merged commit f1ad046 into ai2cm:feature/post-process-timesteps Oct 16, 2019