Stop loading tutorial data by default #2538

Merged
merged 4 commits into from
Nov 5, 2018

Conversation

jhamman
Member

@jhamman jhamman commented Nov 3, 2018

  • Tests added
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

While working on an xarray/dask tutorial, I realized that we eagerly load the tutorial datasets in xarray.tutorial.load_dataset. I don't think we should do that, but I could be missing some rationale. I didn't open an issue, so please feel free to share thoughts here.

One option would be to create a new function (xr.tutorial.open_dataset) that does what I'm suggesting and then slowly deprecate tutorial.load_dataset. Thoughts?

xref: dask/dask-examples#51

@pep8speaks

pep8speaks commented Nov 3, 2018

Hello @jhamman! Thanks for updating the PR.

Comment last updated on November 05, 2018 at 14:17 Hours UTC

@shoyer
Member

shoyer commented Nov 3, 2018

Our current tutorial datasets are 8MB and 17MB, which is pretty small. You'll definitely get better performance loading datasets of this size into NumPy arrays.

@jhamman
Member Author

jhamman commented Nov 4, 2018

@shoyer - absolutely, we'll get better performance with numpy arrays in this case. My motivation is that I'm trying to use our tutorial datasets for some examples with dask (dask/dask-examples#51). The docstring for the load_dataset function states that we can pass kwargs on to the open_dataset function, but if we pass chunks to load_dataset, we currently still get data back as numpy arrays. We have some other options here:

  1. if chunks is passed as a kwarg, return a dataset with data as persisted dask arrays
  2. provide a second function to handle returning datasets using the same logic as open_dataset (caching, dask arrays, lazy loading, etc.)
  3. tell people (like me) to rechunk the dataset after the fact

(3) won't require any changes but makes it a little harder to connect the typical use pattern of open_dataset with tutorial.load_dataset.
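
To illustrate why chunks ends up being discarded, here is a toy model. It is only a sketch: FakeDataset and the two functions below are hypothetical stand-ins, not xarray's classes or its actual implementation, and it assumes only that the eager function opens the dataset and then loads everything into memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDataset:
    """Hypothetical stand-in for a dataset; not an xarray class."""
    name: str
    chunks: Optional[dict] = None  # set when the data is lazy and chunked
    in_memory: bool = False

    def load(self) -> "FakeDataset":
        # Loading materialises the data, so any chunking is dropped.
        return FakeDataset(self.name, chunks=None, in_memory=True)

    def chunk(self, chunks: dict) -> "FakeDataset":
        # Option 3: rechunk an already loaded dataset after the fact.
        return FakeDataset(self.name, chunks=dict(chunks), in_memory=False)


def open_dataset(name: str, chunks: Optional[dict] = None) -> FakeDataset:
    """Lazy open: honours `chunks` and reads nothing yet."""
    return FakeDataset(name, chunks=chunks)


def load_dataset(name: str, **kwargs) -> FakeDataset:
    """Eager variant: forwards kwargs, then loads, so `chunks` is lost."""
    return open_dataset(name, **kwargs).load()
```

In this model, load_dataset("air_temperature", chunks={"time": 10}) comes back with no chunks, while open_dataset with the same arguments keeps them. Option (3) corresponds to calling .chunk(...) on the loaded result.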

@shoyer
Member

shoyer commented Nov 4, 2018 via email

@shoyer
Member

shoyer commented Nov 5, 2018

The default behavior should cache the arrays loaded with NumPy anyways.

Sorry, to be clear, what I meant here is that by default arrays loaded with NumPy get cached after the first access/operation, not that we need to preserve the existing behavior of load_dataset().
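
A minimal sketch of "cached after the first access", using a hypothetical CachedLazyArray class. This is a simplified model of the behavior described above, not xarray's implementation:

```python
class CachedLazyArray:
    """Hypothetical lazily read array that caches after the first access."""

    def __init__(self, reader):
        self._reader = reader  # callable that performs the expensive read
        self._cache = None
        self.reads = 0         # counts how often the backing store is hit

    @property
    def values(self):
        # Read from the backing store only on the first access.
        if self._cache is None:
            self._cache = self._reader()
            self.reads += 1
        return self._cache


arr = CachedLazyArray(lambda: [1.0, 2.0, 3.0])
```

Nothing is read when arr is created; the first arr.values triggers the read, and later accesses reuse the cached result.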

@jhamman
Member Author

jhamman commented Nov 5, 2018

@shoyer - I think I was tracking with you. I've gone ahead and deprecated the current load_dataset in favor of the open_dataset name. The switch is accompanied by a change in behavior as well.
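
For readers unfamiliar with the pattern, a deprecation of this kind can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the open_dataset body here is a hypothetical placeholder that returns a plain dict, and the warning text is made up for the example.

```python
import warnings


def open_dataset(name, **kwargs):
    """Hypothetical placeholder: describe a lazily opened dataset."""
    return {"name": name, "loaded": False, "kwargs": kwargs}


def load_dataset(*args, **kwargs):
    """Deprecated alias that keeps the old eager behavior (illustrative)."""
    warnings.warn(
        "load_dataset() is deprecated; use open_dataset() instead and "
        "load the result explicitly if you need the data in memory.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    ds = open_dataset(*args, **kwargs)
    ds["loaded"] = True  # mimic the old eager load in this toy model
    return ds
```

Existing callers of the old name keep working but see a DeprecationWarning, while new code calls open_dataset and gets the lazy behavior.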

@jhamman jhamman merged commit 55f21de into pydata:master Nov 5, 2018
dcherian pushed a commit to yohai/xarray that referenced this pull request Dec 16, 2018
* upstream/master: (122 commits)
  add missing , and article in error message (pydata#2557)
  Add libnetcdf, libhdf5, pydap and cfgrib to xarray.show_versions() (pydata#2555)
  revert to dev version for 0.11.1
  Release xarray v0.11
  DOC: update whatsnew for xarray 0.11 release (pydata#2548)
  Drop the hack needed to use CachingFileManager as we don't use it anymore. (pydata#2544)
  add full test env for py37 ci env (pydata#2545)
  Remove old-style resample example in documentation (pydata#2543)
  Stop loading tutorial data by default (pydata#2538)
  Remove the old syntax for resample. (pydata#2541)
  Remove use of deprecated, unused keyword. (pydata#2540)
  Deprecate inplace (pydata#2524)
  Zarr chunking (GH2300) (pydata#2487)
  Include multidimensional stacking groupby in docs (pydata#2493) (pydata#2536)
  Switch enable_cftimeindex to True by default (pydata#2516)
  Raise more informative error when converting tuples to Variable. (pydata#2523)
  Global option to always keep/discard attrs on operations (pydata#2482)
  Remove tests where answers change in cftime 1.0.2.1 (pydata#2522)
  Finish deprecation cycle for DataArray.__contains__ checking array values (pydata#2520)
  Fix bug where OverflowError is not being raised (pydata#2519)
  ...