Stop loading tutorial data by default #2538

jhamman · 2018-11-03T17:24:26Z

Tests added
Fully documented, including whats-new.rst for all changes and api.rst for new API

In working on an xarray/dask tutorial, I've come to realize we eagerly load the tutorial datasets in xarray.tutorial.load_dataset. I don't think we should do that, but I could be missing some rationale. I didn't open an issue, so please feel free to share thoughts here.

One option would be to create a new function (xr.tutorial.open_dataset) that does what I'm suggesting and then slowly deprecate tutorial.load_dataset. Thoughts?

xref: dask/dask-examples#51

pep8speaks · 2018-11-03T17:24:29Z

Hello @jhamman! Thanks for updating the PR.

There are no PEP8 issues in the file xarray/tests/test_tutorial.py !
There are no PEP8 issues in the file xarray/tutorial.py !

Comment last updated on November 05, 2018 at 14:17 Hours UTC

shoyer · 2018-11-03T21:17:02Z

Our current tutorial datasets are 8MB and 17MB, which is pretty small. You'll definitely get better performance loading datasets of this size into NumPy arrays.

jhamman · 2018-11-04T17:19:15Z

@shoyer - absolutely we'll get better performance with numpy arrays in this case. So I'm trying to use our tutorial datasets for some examples with dask (dask/dask-examples#51). The docstring for the load_dataset function states that we can pass kwargs on to the open_dataset function but if we pass chunks to the load_dataset call currently, we still get data back as numpy arrays. We have some other options here:

1. if chunks is a kwargs, return a dataset with data as persisted dask arrays
2. provide a second function to handle returning datasets using the same logic as open_dataset (caching, dask arrays, lazy loading, etc.)
3. tell people (like me) to rechunk the dataset after the fact

(3) won't require any changes but makes it a little harder to connect the typical use pattern of open_dataset with tutorial.load_dataset.

shoyer · 2018-11-04T17:29:11Z

OK, that seems reasonable. The default behavior should cache the arrays loaded with NumPy anyways. I would not be opposed to renaming this to open_dataset, either.

shoyer · 2018-11-05T01:59:34Z

> The default behavior should cache the arrays loaded with NumPy anyways.

Sorry, to be clear, what I meant here is that by default arrays loaded with NumPy get cached after the first access/operation, not that we need to preserve the existing behavior of load_dataset().
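To make the "cached after the first access" point concrete, here is a toy model in plain Python. It is only a sketch of the idea, not xarray's implementation: the LazyArray class, its read callable, and the reads counter are all hypothetical names made up for illustration.

```python
class LazyArray:
    """Toy stand-in for a lazily opened variable (not xarray's actual code)."""

    def __init__(self, read):
        self._read = read    # zero-argument callable standing in for a disk read
        self._cache = None   # filled in on first access
        self.reads = 0       # number of times the backing store was hit

    @property
    def values(self):
        # Hit the backing store only on first access, then reuse the cache.
        if self._cache is None:
            self._cache = self._read()
            self.reads += 1
        return self._cache


arr = LazyArray(lambda: [1.0, 2.0, 3.0])
reads_before = arr.reads   # nothing has been read at "open" time
first = arr.values         # first access triggers the read
second = arr.values        # second access is served from the cache
```

Under this model, opening is cheap and the data is read exactly once, on first use, which is the behavior described above for the NumPy-backed default.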

jhamman · 2018-11-05T02:39:50Z

@shoyer - I think I was tracking with you. I've gone ahead and deprecated the current load_dataset in favor of the open_dataset name. The switch is accompanied by a change in behavior as well.
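For readers unfamiliar with the pattern, one common way to deprecate a function in favor of a renamed one looks roughly like the sketch below. This is not the code from this PR: both functions here are hypothetical stand-ins that return plain dicts, the "loaded" flag merely stands in for reading data into memory, and the dataset name "example" is a placeholder.

```python
import warnings


def open_dataset(name, **kwargs):
    """Hypothetical stand-in: return a lazily opened dataset record."""
    return {"name": name, "loaded": False, "kwargs": kwargs}


def load_dataset(name, **kwargs):
    """Hypothetical deprecated entry point that still loads eagerly."""
    warnings.warn(
        "load_dataset is deprecated; use open_dataset instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    ds = open_dataset(name, **kwargs)
    ds["loaded"] = True  # stands in for reading everything into memory
    return ds


# Record warnings so the deprecation can be inspected rather than printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    eager = load_dataset("example", chunks={"time": 10})

lazy = open_dataset("example", chunks={"time": 10})
```

In this sketch the old name keeps working but warns, while the new name forwards kwargs such as chunks untouched and leaves the data unloaded.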

putting up for discussion: stop loading tutorial data by default

5c3a73f

jhamman mentioned this pull request Nov 4, 2018

Xarray and Dask Example dask/dask-examples#51

Merged

Joseph Hamman added 2 commits November 4, 2018 13:47

add tutorial.open_dataset

abcaf51

fix typo

75de32b

shoyer reviewed Nov 5, 2018

add test for cached tutoreial data and minor doc fixes

8d7c25b

shoyer approved these changes Nov 5, 2018

jhamman merged commit 55f21de into pydata:master Nov 5, 2018
