Add test of remove_duplicate_coords; simplify open_cubed_sphere #13
Conversation
I think we should avoid adding data to the repo unless necessary. We will probably want to write a function at some point that sub-tiles the faces of the cubed sphere and writes them to disk. We could then generate some fake data using numpy commands and save it to disk in the same format.
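Something along these lines might do it (a rough sketch: the function name, variable names, and dimension names are made up, the one-dimensional subtile layout is a simplification, and only the filename pattern is taken from the `read_tile` code below):

```python
import numpy as np
import xarray as xr


def write_fake_subtiles(prefix, tile, num_subtiles=16, nx=4, ny=4):
    """Write small random NetCDF subtile files mimicking the real layout.

    For simplicity this sketch splits the face along a single axis; the
    real decomposition is two-dimensional.
    """
    for proc in range(num_subtiles):
        ds = xr.Dataset(
            {"t": (("grid_yt", "grid_xt"), np.random.rand(ny, nx))},
            coords={
                "grid_xt": np.arange(proc * nx, (proc + 1) * nx),
                "grid_yt": np.arange(ny),
            },
        )
        ds.to_netcdf(f"{prefix}.tile{tile:d}.nc.{proc:04d}")
```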
src/data/cubedsphere.py (outdated):

```python
tile_index = pd.Index(range(1, NUM_TILES + 1), name='tiles')
tiles = [read_tile(prefix, tile, **kwargs) for tile in tile_index]
combined = xr.concat(tiles, dim=tile_index)
return remove_duplicate_coords(combined)
```
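For context, the deduplication step could be implemented roughly like this (a sketch of the idea only; the repository's actual `remove_duplicate_coords` may differ):

```python
import xarray as xr


def remove_duplicate_coords_sketch(ds: xr.Dataset) -> xr.Dataset:
    """Drop repeated values along each dimension coordinate, keeping the
    first occurrence. Subtile boundaries shared between files otherwise
    show up twice after concatenation."""
    for dim in ds.dims:
        if dim in ds.coords:
            index = ds.get_index(dim)
            ds = ds.isel({dim: ~index.duplicated(keep="first")})
    return ds
```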
I would be in favor of calling `remove_duplicate_coords` in `read_tile`, since that is where the problem arises.
Sure thing -- I'll change this in my next commit.
Also, if your code works, I think we can delay adding more comprehensive tests until a future PR.
src/data/cubedsphere.py (outdated):

```python
# TODO(Spencer): write a test of this function
def read_tile(prefix, tile, num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [f'{prefix}.tile{tile:d}.nc.{proc:04d}' for proc in subtiles]
```
Could we make this filename pattern a keyword argument? I get a little worried about having hardcoded strings too far down the call stack.
Are you suggesting something like this?
```python
def read_tile(prefix, tile, pattern='{prefix}.tile{tile:d}.nc.{proc:04d}', num_subtiles=16):
    subtiles = range(num_subtiles)
    filenames = [pattern.format(prefix=prefix, tile=tile, proc=proc) for proc in subtiles]
    ...
```
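For what it's worth, the elided body might expand to something like the following sketch, under the assumptions discussed in this thread: `combine_by_coords` with `data_vars='minimal'` for the subtile merge, and the module's existing `remove_duplicate_coords` helper called here, where the duplicates arise. This is illustrative, not the actual implementation:

```python
import xarray as xr


def read_tile(prefix, tile, pattern='{prefix}.tile{tile:d}.nc.{proc:04d}',
              num_subtiles=16):
    """Open all subtile files of one cubed-sphere face and combine them."""
    subtiles = range(num_subtiles)
    filenames = [pattern.format(prefix=prefix, tile=tile, proc=proc)
                 for proc in subtiles]
    datasets = [xr.open_dataset(name) for name in filenames]
    combined = xr.combine_by_coords(datasets, data_vars='minimal')
    return remove_duplicate_coords(combined)
```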
Exactly. It's not perfect, since I would prefer the filename-handling code be completely separate from the combining code, but it is okay for now.
Sounds good regarding waiting until later to write comprehensive tests for `open_cubed_sphere`. Yes, this code should work the same way as your old implementation.
src/data/cubedsphere.py (outdated):

```diff
-data = xr.concat(combined_tiles, dim='tiles').sortby('tiles')
-return remove_duplicate_coords(data)
+tile_index = pd.Index(range(1, NUM_TILES + 1), name='tiles')
```
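For reference, the reason the `sortby` call can go away: passing a named `pandas` Index as `dim` makes `xr.concat` create the new dimension with those coordinate values already in order. A minimal, self-contained illustration (toy data):

```python
import pandas as pd
import xarray as xr

# Six single-value datasets standing in for the cubed-sphere tiles.
datasets = [xr.Dataset({"t": ((), float(i))}) for i in range(6)]

# The named index supplies both the dimension name and its coordinate
# values, so the result needs no separate sortby step.
tile_index = pd.Index(range(1, 7), name="tiles")
combined = xr.concat(datasets, dim=tile_index)
# combined.tiles is [1, 2, 3, 4, 5, 6]
```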
I have a slight preference for a singular dimension name here, i.e. `'tile'` instead of `'tiles'`. Do you mind if I change that?
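(For reference, assuming the combined result is a plain `xarray` object, the rename itself is a one-liner:)

```python
import xarray as xr

ds = xr.Dataset({"t": (("tiles",), [1.0] * 6)}, coords={"tiles": range(1, 7)})

# rename moves both the dimension and its index coordinate in one call.
ds = ds.rename({"tiles": "tile"})
```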
Yes, please! Although there are some downstream dependencies in the Snakefile, `src.fv3.docker`, and `src.fv3.coarsen`.
Great, thanks for the heads up -- hopefully I got to everything in 0c03f91.
Hmm... I'm having second thoughts about my approach here. I was testing on the

I think this is related to the fact that

So, ultimately it would be nice if we could use as simple a solution as I proposed, but perhaps some complexity here is unavoidable if we want good performance and robustness to missing values in variables to be combined.
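For concreteness, here is a toy sketch (all names made up) of the kind of combine this discussion is about. Per the xarray docs, `data_vars='minimal'` restricts concatenation to variables that already contain the combined dimension:

```python
import numpy as np
import xarray as xr

# Two subtiles covering adjacent x ranges; 'ak' is a vertical-coordinate
# variable that is identical in every file and does not depend on x.
sub0 = xr.Dataset(
    {"t": (("x",), np.array([1.0, 2.0])), "ak": (("z",), np.array([0.1, 0.2]))},
    coords={"x": [0, 1], "z": [0, 1]},
)
sub1 = xr.Dataset(
    {"t": (("x",), np.array([3.0, 4.0])), "ak": (("z",), np.array([0.1, 0.2]))},
    coords={"x": [2, 3], "z": [0, 1]},
)

# With data_vars='minimal', only variables already containing the combined
# dimension ('x') are concatenated; 'ak' passes through unchanged instead
# of picking up an 'x' dimension.
ds = xr.combine_by_coords([sub0, sub1], data_vars="minimal")
```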
Okay, that's too bad. Honestly, I have a big aversion to `open_mfdataset`.
Yeah, I was excited to try out the new multidimensional combine option here (this felt like the ideal use-case), but it seems like that functionality is still a bit raw, indicated particularly by the multiple workarounds needed to get it working. I reverted back to not using it.
Okay. This looks pretty good. I will try to run the workflow and merge it if it works. Begin rant:
I think the problem with `open_mfdataset`

That said, I think there are some special cases that should be supported, such as a directory of files, one timestep per file, identical dimensions, where the name of the file contains the date/time.
I am going to merge this. Thanks, Spencer!
This behavior caught me a bit by surprise too. It turns out a clean way to avoid it is by specifying `data_vars='minimal'` in `combine_by_coords` (and by extension `open_mfdataset`). This PR adds a test for `remove_duplicate_coords` and simplifies `open_cubed_sphere` by taking advantage of `data_vars='minimal'` when combining.

I'll try and think about what the best way of testing
`open_cubed_sphere` might be, considering that it depends on the existence of files to read. Would we maybe want to include some minimal dummy data in the repo?

Anyway, I hope I'm not stepping on work you've already done or planned to do here; I realize I'm stepping a bit out of scope of what you suggested on Trello. I'll get to writing functions to re-coarse-grain data now.
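As a sketch of what the added test might look like (the import path follows the file shown above, and the keep-first deduplication behavior is an assumption about `remove_duplicate_coords`, not something confirmed here):

```python
import xarray as xr

from src.data.cubedsphere import remove_duplicate_coords


def test_remove_duplicate_coords():
    # An 'x' index with a repeated boundary value, as produced by
    # concatenating subtiles that share an edge.
    ds = xr.Dataset(
        {"u": (("x",), [0.0, 1.0, 1.0, 2.0])},
        coords={"x": [0, 1, 1, 2]},
    )
    expected = xr.Dataset(
        {"u": (("x",), [0.0, 1.0, 2.0])},
        coords={"x": [0, 1, 2]},
    )
    result = remove_duplicate_coords(ds)
    xr.testing.assert_identical(result, expected)
```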