Tweak to to_dask_dataframe() #1667

Merged: 6 commits, Oct 31, 2017

Conversation

shoyer
Member

@shoyer shoyer commented Oct 28, 2017

Follow on to #1489.

  • Add a dim_order argument

  • Always write columns for each dimension

  • Docstring to NumPy format

  • Tests added / passed

  • Passes git diff upstream/master **/*py | flake8 --diff

  • Fully documented, including whats-new.rst for all changes and api.rst for new API

CC @jmunroe @jhamman
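
For illustration only (this is not code from the PR): a minimal NumPy sketch of what "one column per dimension" combined with a `dim_order` could look like once the data is flattened. The helper name `dimension_columns` and the use of integer positions as column values are assumptions made for the example; the real method works on dask arrays and uses coordinate values where they exist.

```python
import numpy as np


def dimension_columns(sizes, dim_order):
    """Return one flat integer column per dimension, laid out in ``dim_order``.

    Hypothetical helper: ``sizes`` maps dimension names to lengths.
    """
    shape = [sizes[name] for name in dim_order]
    # np.indices gives one index grid per axis; ravel() flattens in C order,
    # so the last dimension in dim_order varies fastest.
    grids = np.indices(shape)
    return {name: grid.ravel() for name, grid in zip(dim_order, grids)}


sizes = {"x": 2, "y": 3}
xy = dimension_columns(sizes, ["x", "y"])
yx = dimension_columns(sizes, ["y", "x"])

print(xy["x"].tolist(), xy["y"].tolist())
print(yx["y"].tolist(), yx["x"].tolist())
```

With `["x", "y"]` the `y` column cycles fastest; swapping the order makes `x` cycle fastest instead, which is the kind of control a `dim_order` argument gives over row layout.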

@shoyer shoyer requested a review from jhamman October 28, 2017 17:35

# ensure all variables have the same chunking structure
if v.chunks != chunks:
    v = v.chunk(chunks)
Member Author

@jmunroe was there a reason why you didn't just chunk everything here?

Contributor

On the assumption (probably mistaken) that there was a cost to calling .chunk(chunks) on a variable that already had that chunking structure. If that assumption was not correct, then, yes, everything could just be chunked.

Member Author

Rechunk is nearly free if chunks are unchanged -- it actually returns the same dask array object.
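
A quick way to check this claim at the dask level is sketched below. The array shape and chunk sizes are arbitrary values chosen for the example; the sketch only exercises `dask.array.Array.rechunk`, not xarray's `.chunk()` wrapper.

```python
import dask.array as da

# Arbitrary example array: 6 x 4, split into 3 x 2 blocks.
x = da.zeros((6, 4), chunks=(3, 2))

# Rechunking to the chunks the array already has.
same = x.rechunk(x.chunks)

# Rechunking to a genuinely different layout builds a new array.
different = x.rechunk((2, 2))

print(same is x)
print(different is x, different.chunks)
```

If the identity check holds, the `if v.chunks != chunks` guard saves essentially nothing and every variable can simply be chunked unconditionally.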

@shoyer shoyer mentioned this pull request Oct 29, 2017
13 tasks
@shoyer
Member Author

shoyer commented Oct 30, 2017

@jhamman can you please take a look here when you have the chance?

Member

@jhamman jhamman left a comment

a few minor comments

if isinstance(v, xr.IndexVariable):
    v = v.to_base_variable()
if dim_order is None:
    dim_order = list(self.dims)
Member

I can't seem to remember, but is this always a sorted tuple/dict?

Member Author

For Dataset, it's a SortedKeysDict (i.e., the dimensions in alphabetical order).
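
To make the default concrete, here is a small pure-Python sketch of the fallback described in this thread. `resolve_dim_order` is a hypothetical helper, a plain dict stands in for the dataset's dimension mapping, and the validation step is an assumption added for the example rather than something quoted from the diff.

```python
def resolve_dim_order(dims, dim_order=None):
    """Pick the dimension order used to lay out the flattened columns.

    ``dims`` maps dimension names to sizes (a stand-in for ``Dataset.dims``).
    """
    if dim_order is None:
        # Mirrors the alphabetical ordering described in this thread.
        return sorted(dims)
    if set(dim_order) != set(dims):
        # Assumed check: the caller must name every dimension exactly once.
        raise ValueError(
            "dim_order {} does not match dimensions {}".format(
                list(dim_order), sorted(dims)
            )
        )
    return list(dim_order)


dims = {"y": 3, "x": 2, "time": 4}
default_order = resolve_dim_order(dims)
custom_order = resolve_dim_order(dims, ["time", "y", "x"])
print(default_order, custom_order)
```

Because the default is alphabetical rather than tied to how any variable is stored, passing `dim_order` explicitly is the only way to choose which dimension ends up contiguous.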

    var = self.variables[name]
except KeyError:
    # dimension without a matching coordinate
    values = np.arange(self.dims[name], dtype=np.int64)
Member

Can we initialize this as a dask array, to avoid creating the array when it will not be used?

Member Author

Yes, good idea, will do
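
A sketch of the difference being discussed, using `dask.array.arange` as the lazy counterpart of `np.arange`. The length and chunk size here are made-up values; the code that actually landed in the PR may size its chunks differently.

```python
import numpy as np
import dask.array as da

n = 1_000_000  # hypothetical dimension length

# Eager: allocates all n integers immediately.
eager = np.arange(n, dtype=np.int64)

# Lazy: only records how to build each chunk; nothing is allocated yet.
lazy = da.arange(n, chunks=250_000, dtype=np.int64)

# Values are produced only for the pieces that are actually computed.
first = lazy[:5].compute()
print(type(lazy).__name__, lazy.chunks, first.tolist())
```

The lazy version costs a few graph entries up front, and the index values for a dimension are only materialized if the resulting column is actually used.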

ds['y'] = ('y', list('abc'))

expected = ds.compute().to_dataframe()
actual = ds.to_dask_dataframe(set_index=True)
Member

Rather than using xfail above, use raises_regex to make sure we raise the error on the correct line.

Member Author

My reasoning for using xfail was that it makes this test more robust. If/when dask implements MultiIndex, we'll just get an unexpectedly passing xfail rather than a failing test. NotImplementedError seems specific enough (unlike, e.g., ValueError) that I'm not concerned about grepping for the exact error message.
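
The two styles under discussion can be compared side by side. This is a sketch in plain pytest: `set_multiindex` is a hypothetical stand-in for the unsupported operation, and `pytest.raises(..., match=...)` is substituted for xarray's `raises_regex` test helper.

```python
import pytest


def set_multiindex(columns):
    """Hypothetical stand-in for an operation that is not supported yet."""
    if len(columns) > 1:
        raise NotImplementedError("multi-column index is not supported")
    return columns[0]


# Style 1: mark the whole test as an expected failure. If support is added
# later, the test is reported as an unexpected pass instead of failing.
@pytest.mark.xfail(raises=NotImplementedError)
def test_multiindex_xfail():
    assert set_multiindex(["x", "y"]) == ("x", "y")


# Style 2: assert that this exact call raises, and check the message.
# If support is added later, this test fails and must be rewritten.
def test_multiindex_raises():
    with pytest.raises(NotImplementedError, match="not supported"):
        set_multiindex(["x", "y"])
```

Style 2 pins the error to one line, which is what the reviewer asks for. Style 1 tolerates the upstream feature arriving, at the cost of not checking where the error comes from.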

@jhamman jhamman merged commit 7e9193c into pydata:master Oct 31, 2017
@shoyer shoyer deleted the dask_dataframe_tweak branch October 31, 2017 18:21