-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweak to to_dask_dataframe() #1667
Conversation
- Add a `dim_order` argument - Always write columns for each dimension - Docstring to NumPy format
|
||
# ensure all variables have the same chunking structure | ||
if v.chunks != chunks: | ||
v = v.chunk(chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmunroe was there was reason why you didn't just chunk everything here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the assumption (probably mistaken) that there was a cost to calling .chunk(chunks) on a variable that already had that chunking structure. If that assumption was not correct, then, yes, everything could just be chunked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rechunk is nearly free if chunks are unchanged -- it actually returns the same dask array object.
@jhamman can you please take a look here when you have the chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor comments
if isinstance(v, xr.IndexVariable): | ||
v = v.to_base_variable() | ||
if dim_order is None: | ||
dim_order = list(self.dims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't seem to remember but is this always a sorted tuple/dict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Dataset, it's a SortedKeysDict (i.e., the dimensions in alphabetical order).
xarray/core/dataset.py
Outdated
var = self.variables[name] | ||
except KeyError: | ||
# dimension without a matching coordinate | ||
values = np.arange(self.dims[name], dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we initialize this as a dask array to avoid creating the array when it will not be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea, will do
ds['y'] = ('y', list('abc')) | ||
|
||
expected = ds.compute().to_dataframe() | ||
actual = ds.to_dask_dataframe(set_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than using xfail
above, use raises_regex
to make sure we raise an error in the correct line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning on using xfail was that that makes this test more robust. If/when dask implementing MultiIndex, we'll just get an unexpected xfail rather than a failing test. NotImplementedError
seems specific enough (unlike, e.g., ValueError) that I'm not concerned about grepping for the exact error message.
Follow on to #1489.
Add a
dim_order
argumentAlways write columns for each dimension
Docstring to NumPy format
Tests added / passed
Passes
git diff upstream/master **/*py | flake8 --diff
Fully documented, including
whats-new.rst
for all changes andapi.rst
for new APICC @jmunroe @jhamman