Tweak to to_dask_dataframe() #1667

shoyer · 2017-10-28T17:35:29Z

Follow on to #1489.

- Add a `dim_order` argument
- Always write columns for each dimension
- Docstring to NumPy format
- Tests added / passed
- Passes `git diff upstream/master **/*py | flake8 --diff`
- Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

shoyer · 2017-10-28T17:36:59Z

xarray/core/dataset.py


-            # ensure all variables have the same chunking structure
-            if v.chunks != chunks:
-                v = v.chunk(chunks)


@jmunroe was there a reason why you didn't just chunk everything here?

jmunroe · 2017-10-29

On the assumption (probably mistaken) that there was a cost to calling .chunk(chunks) on a variable that already had that chunking structure. If that assumption was not correct, then, yes, everything could just be chunked.

shoyer · 2017-10-29

Rechunk is nearly free if chunks are unchanged -- it actually returns the same dask array object.
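
The point about rechunking to an unchanged layout can be checked directly against dask. This is only an illustrative sketch: the array shape and chunk sizes are made up, and whether `rechunk` hands back the very same object may depend on the installed dask version, so the identity check is printed rather than assumed.

```python
import dask.array as da

# Hypothetical array; the shape and chunk sizes are arbitrary.
x = da.ones((6, 4), chunks=(3, 2))

# Rechunk to the chunk structure the array already has.
same = x.rechunk(x.chunks)

# Rechunk to a genuinely different layout, for contrast.
different = x.rechunk((2, 2))

print("chunks unchanged:", same.chunks == x.chunks)
print("same object returned:", same is x)
print("new layout:", different.chunks)
```

If the second line prints `True`, calling `.chunk(chunks)` unconditionally adds no tasks to the graph, which is the reason the `if v.chunks != chunks` guard could be dropped.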

shoyer · 2017-10-30T16:11:15Z

@jhamman can you please take a look here when you have the chance?

jhamman

a few minor comments

jhamman · 2017-10-30T21:54:17Z

xarray/core/dataset.py

-            if isinstance(v, xr.IndexVariable):
-                v = v.to_base_variable()
+        if dim_order is None:
+            dim_order = list(self.dims)


I can't seem to remember, but is this always a sorted tuple/dict?

shoyer · 2017-10-30

For Dataset, it's a SortedKeysDict (i.e., the dimensions in alphabetical order).
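
To make the default concrete, here is a minimal pure-Python sketch of how a `dim_order` argument could be resolved. `resolve_dim_order` is a hypothetical helper, not xarray's code; a plain `dict` plus `sorted()` stands in for the SortedKeysDict mentioned above, and the validation branch is an assumption that does not appear in the quoted diff.

```python
def resolve_dim_order(dims, dim_order=None):
    """Return the dimension order to use (hypothetical helper).

    dims is a mapping of dimension name -> size.
    """
    if dim_order is None:
        # Mimic a sorted-keys mapping: alphabetical by dimension name.
        return sorted(dims)
    if set(dim_order) != set(dims):
        # Assumed behaviour: reject orders that omit or invent dimensions.
        raise ValueError(
            "dim_order {} does not match the dimensions {}".format(
                list(dim_order), sorted(dims)
            )
        )
    return list(dim_order)


dims = {"y": 3, "x": 2, "time": 4}
default_order = resolve_dim_order(dims)
custom_order = resolve_dim_order(dims, ["time", "y", "x"])
print(default_order)
print(custom_order)
```

With an alphabetical default, callers who care about row ordering in the resulting dataframe would pass `dim_order` explicitly.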

jhamman · 2017-10-30T21:58:16Z

xarray/core/dataset.py

+                var = self.variables[name]
+            except KeyError:
+                # dimension without a matching coordinate
+                values = np.arange(self.dims[name], dtype=np.int64)


Can we initialize this as a dask array, to avoid creating the array when it will not be used?

shoyer · 2017-10-30

Yes, good idea, will do
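
The difference between the two approaches can be sketched as follows. The dimension size and chunk size are arbitrary assumptions chosen for illustration; the actual chunking used inside `to_dask_dataframe` is not shown in this thread.

```python
import numpy as np
import dask.array as da

n = 10  # hypothetical dimension size

# Eager: all n values are allocated immediately.
eager = np.arange(n, dtype=np.int64)

# Lazy: only a task graph describing the values is built here.
lazy = da.arange(n, chunks=4, dtype=np.int64)

print(type(eager).__name__, type(lazy).__name__)
print(lazy.chunks)

# The values are materialized only when explicitly computed.
values = lazy.compute()
```

For a dimension without a matching coordinate, the lazy form means the integer range is never materialized unless the corresponding column is actually computed.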

jhamman · 2017-10-30T22:07:39Z

xarray/tests/test_dask.py

+        ds['y'] = ('y', list('abc'))
+
+        expected = ds.compute().to_dataframe()
+        actual = ds.to_dask_dataframe(set_index=True)


Rather than using xfail above, use raises_regex to make sure the error is raised on the correct line.

shoyer · 2017-10-30

My reasoning for using xfail was that it makes this test more robust. If/when dask implements MultiIndex, we'll just get an unexpectedly passing xfail (XPASS) rather than a failing test. NotImplementedError seems specific enough (unlike, e.g., ValueError) that I'm not concerned about grepping for the exact error message.
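
The two testing styles under discussion can be sketched with plain pytest. `convert_with_multiindex` is a hypothetical stand-in for the unsupported call, and `pytest.raises(..., match=...)` is used here in place of xarray's `raises_regex` test helper, on the assumption that the two serve the same purpose.

```python
import pytest


def convert_with_multiindex():
    """Hypothetical stand-in for a call that is not yet supported."""
    raise NotImplementedError("multi-index conversion is not supported")


@pytest.mark.xfail(raises=NotImplementedError)
def test_with_xfail():
    # Reported as XFAIL while the error is raised anywhere in the test body.
    convert_with_multiindex()


def test_with_raises():
    # Passes only if this specific call raises a matching error.
    with pytest.raises(NotImplementedError, match="not supported"):
        convert_with_multiindex()
```

Once the underlying feature is implemented, the first test is reported as XPASS (or fails if `strict=True` is set), while the second fails outright and has to be rewritten. The second style pins the error to one statement, which is the trade-off raised in the review.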

Tweak to to_dask_dataframe()

ed0bf08

- Add a `dim_order` argument
- Always write columns for each dimension
- Docstring to NumPy format

shoyer requested a review from jhamman October 28, 2017 17:35

shoyer added 2 commits October 28, 2017 13:35

Fix windows test failure

1b723fb

More windows failure

8ee4023

shoyer mentioned this pull request Oct 29, 2017

v0.10 Release #1535

Closed

13 tasks

shoyer added 2 commits October 29, 2017 10:41

Merge branch 'master' into dask_dataframe_tweak

69dde7b

Fix failing test

922cff5

jhamman reviewed Oct 30, 2017

Use da.arange() inside to_dask_dataframe

c4f166b

jhamman approved these changes Oct 31, 2017

jhamman merged commit 7e9193c into pydata:master Oct 31, 2017

shoyer deleted the dask_dataframe_tweak branch October 31, 2017 18:21
