xarray.backends refactor #2261
This is intended to replace both PickleByReconstructionWrapper and DataStorePickleMixin with something more compartmentalized. xref GH2121
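For readers unfamiliar with the pattern, pickling by reconstruction can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this PR: `ReopeningManager` and its methods are hypothetical names. The object pickles the recipe for opening a file (the opener and its arguments) rather than the open handle, and reopens lazily after unpickling.

```python
import os
import pickle
import tempfile


class ReopeningManager:
    """Hypothetical stand-in for the file manager discussed in this PR."""

    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        self._file = None

    def acquire(self):
        # Open lazily so that an unpickled copy reopens on first use.
        if self._file is None:
            self._file = self._opener(*self._args, **self._kwargs)
        return self._file

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __getstate__(self):
        # Pickle only the recipe for opening the file, never the handle.
        return (self._opener, self._args, self._kwargs)

    def __setstate__(self, state):
        self._opener, self._args, self._kwargs = state
        self._file = None


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")
    with open(path, "w") as f:
        f.write("hello")

    manager = ReopeningManager(open, path, mode="r")
    restored = pickle.loads(pickle.dumps(manager))
    contents = restored.acquire().read()
    restored.close()
    manager.close()
```

Because the state contains no file handle, the same approach should work whether or not the original manager had already opened the file.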
xarray/backends/file_manager.py
Outdated
Callable that opens a given file when called, returning a file
object.
mode : str, optional
If provided, passed to opener as a keyword argument.
W291 trailing whitespace
ExplicitFileManager, LazyFileManager, AutoclosingFileManager,
]

@pytest.mark.parametrize('manager_type', FILE_MANAGERS)
E302 expected 2 blank lines, found 1
I like what I'm seeing here. Mostly API questions for now, I did not review the tests yet.
xarray/backends/file_manager.py
Outdated
class FileManager(object):
"""Base class for context managers for managing file objects.
Unlike files, FileManager objects should be safely. They must be explicitly
I think this description is missing a few words.
xarray/backends/file_manager.py
Outdated
manager.close()
"""

def __init__(self, opener, mode=None):
Can we also support **kwargs here? Or maybe that's all we should support here. Or, perhaps you are thinking opener would be a partial function?
I was thinking of opener as a partial, but I agree that it would probably be easier to understand if args and kwargs are passed directly.
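The two calling conventions discussed here can be compared in a small sketch. `Manager` is a hypothetical class used only for illustration, and `io.StringIO` stands in for a real file opener.

```python
import functools
import io


class Manager:
    """Hypothetical manager that stores the opener and its arguments."""

    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs

    def open(self):
        return self._opener(*self._args, **self._kwargs)


# Style 1: the caller pre-binds everything with functools.partial.
via_partial = Manager(functools.partial(io.StringIO, "data"))

# Style 2: arguments are passed through and stay visible to the manager,
# which could later inspect or override them (for example, the mode).
via_args = Manager(io.StringIO, "data")

partial_text = via_partial.open().read()
args_text = via_args.open().read()
```

Both styles open the same object; the difference is that in the second style the manager holds the arguments itself instead of an opaque callable.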
xarray/backends/file_manager.py
Outdated
class ExplicitFileManager(FileManager):
"""A file manager that holds a file open until explicitly closed.
This is mostly a reference implementation: must real use cases should use
must->most
xarray/backends/file_manager.py
Outdated
def __init__(self, opener, mode=_DEFAULT_MODE):
self._opener = opener
# file has already been created, don't override when restoring
can you expand on this a bit? How do we KNOW that the file has already been created? I'm wondering if the mode switch should go after the file open line.
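One way to read this suggestion is sketched below, assuming the intent is to avoid truncating a file when it is reopened: the switch from 'w' to 'a' happens only after the open call has succeeded, at which point the file is known to exist. `ModeSwitchingManager` is a hypothetical name, not a class from this PR.

```python
import os
import tempfile


class ModeSwitchingManager:
    """Hypothetical sketch: reopen in append mode after the first write-open."""

    def __init__(self, path, mode="r"):
        self._path = path
        self._mode = mode
        self._file = None

    def acquire(self):
        if self._file is None:
            self._file = open(self._path, self._mode)
            if self._mode == "w":
                # Only now do we know the file exists on disk, so later
                # reopens must not truncate it.
                self._mode = "a"
        return self._file

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "out.txt")
    manager = ModeSwitchingManager(target, mode="w")
    manager.acquire().write("first ")
    manager.close()
    manager.acquire().write("second")  # reopened with mode "a"
    manager.close()
    with open(target) as f:
        result = f.read()
```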
xarray/backends/file_manager.py
Outdated
def __init__(self, opener, mode=_DEFAULT_MODE):
self._opener = opener
self._mode = mode
self._lock = threading.Lock()
Thoughts on allowing other locks to be passed in here? Do we need to support the CombinedLock concept as well?
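As a rough illustration of what passing in a lock could look like, the sketch below injects a lock through the constructor and shows a simple combined lock that acquires several locks in a fixed order. `CombinedLockSketch` and `LockedManager` are hypothetical names and are not the implementations in xarray.

```python
import threading


class CombinedLockSketch:
    """Hypothetical lock that acquires several locks in a fixed order."""

    def __init__(self, locks):
        self._locks = tuple(locks)

    def __enter__(self):
        for lock in self._locks:
            lock.acquire()
        return self

    def __exit__(self, *exc_info):
        # Release in reverse order of acquisition.
        for lock in reversed(self._locks):
            lock.release()


class LockedManager:
    """Hypothetical manager that accepts an injected lock."""

    def __init__(self, opener, lock=None):
        self._opener = opener
        # Fall back to a private lock when the caller supplies none.
        self._lock = threading.Lock() if lock is None else lock

    def acquire(self):
        with self._lock:
            return self._opener()


file_lock = threading.Lock()
library_lock = threading.Lock()
manager = LockedManager(
    lambda: "handle",
    lock=CombinedLockSketch([file_lock, library_lock]),
)
handle = manager.acquire()
```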
@jhamman thanks for taking a look. I'm going to push another iteration of this shortly (OK, a major rewrite) where there is only a single FileManager object which uses an LRU cache.
xarray/backends/rasterio_.py
Outdated
@@ -1,5 +1,6 @@
import os
from collections import OrderedDict
import functools
F401 'functools' imported but unused
@@ -0,0 +1,60 @@
import pickle

from xarray.backends.file_manager import FileManager, FILE_CACHE
F401 'xarray.backends.file_manager.FILE_CACHE' imported but unused
xarray/backends/file_manager.py
Outdated
manager.close()
"""

def __init__(self, opener, *args, **kwargs):
I think I want to use dependency injection for the cache (e.g., cache=FILE_CACHE), which unfortunately means that we'll need to change the signature here from using **kwargs.
Any opinions on what this should look like? I'm thinking maybe:
_DEFAULT = object()
def __init__(self, opener, *args, mode=_DEFAULT, kwargs=None, cache=FILE_CACHE)
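A runnable sketch of the proposed signature is shown below, with the cache injected so that tests can supply their own. Everything beyond the signature itself is an assumption for illustration: `CachingManagerSketch`, `fake_opener`, the plain dict used as `FILE_CACHE`, and the way the cache key is built. Note that keyword-only arguments after `*args` are Python 3 syntax.

```python
_DEFAULT = object()  # sentinel meaning "do not pass a mode to the opener"
FILE_CACHE = {}  # stand-in for a module-level cache (assumption: a dict)


class CachingManagerSketch:
    """Hypothetical manager whose cache is injected via the constructor."""

    def __init__(self, opener, *args, mode=_DEFAULT, kwargs=None,
                 cache=FILE_CACHE):
        self._opener = opener
        self._args = args
        self._mode = mode
        self._kwargs = {} if kwargs is None else dict(kwargs)
        self._cache = cache
        # The key must be hashable, so keyword arguments become a tuple.
        self._key = (opener, args, mode, tuple(sorted(self._kwargs.items())))

    def acquire(self):
        if self._key not in self._cache:
            kwargs = dict(self._kwargs)
            if self._mode is not _DEFAULT:
                kwargs["mode"] = self._mode
            self._cache[self._key] = self._opener(*self._args, **kwargs)
        return self._cache[self._key]


def fake_opener(name, mode="r", **options):
    # Returns a tuple instead of a real file so the example has no side effects.
    return (name, mode, options)


test_cache = {}
manager = CachingManagerSketch(
    fake_opener, "a.nc", mode="w", kwargs={"format": "NETCDF3"},
    cache=test_cache,
)
opened = manager.acquire()
```

Passing `kwargs` as an explicit dict frees the keyword namespace of `__init__` for options such as `mode` and `cache`, which is the trade-off raised in the comment above.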
xarray/backends/file_manager.py
Outdated
""" | ||
def __init__(self, opener, *args,
mode=_DEFAULT_MODE,
E999 SyntaxError: invalid syntax
OK, this is ready for review.
xarray/tests/test_backends.py
Outdated
assert_identical(expected, actual)
with self.roundtrip(expected,
save_kwargs=fmtkw,
open_kwargs={'backend_kwargs': fmtkw}) as actual:
E501 line too long (81 > 79 characters)
As an experiment, I rewrote the SciPy netCDF backend to use FileManager:
@shoyer - nice work on this. I was skeptical on this one until I saw how it cleaned up the backend implementations. I'm sold!
I haven't looked at the tests just yet but will get to them this week.
self._mode = 'a'
self._key = self._make_key()
self._cache[self._key] = file
return file
I would prefer not to override the builtin file function here. Perhaps we can use fh or something.
file is only a builtin on Python 2... are we still concerned about overriding it?
I didn't know this and I'm happy to hear it. (I can't wait to be done with Python 2)
xarray/backends/lru_cache.py
Outdated
value = self._cache[key]
# On Python 3, could just use: self._cache.move_to_end(key)
del self._cache[key]
self._cache[key] = value
Thoughts on using move_to_end here and catching the exception for Python 2 only?
or maybe a little helper function in pycompat that we can clean up when Python 2 is dropped.
def move_to_end(cache, key):
    try:
        cache.move_to_end(key)
    except AttributeError:
        # Python 2's OrderedDict lacks move_to_end, so re-insert the key.
        cache[key] = cache.pop(key)
Sure, easy enough. Done.
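For context, an LRU cache built on OrderedDict can be sketched as follows on Python 3. This is an illustrative sketch, not the contents of lru_cache.py: the class name `LRUCacheSketch` and the `on_evict` callback are assumptions, with the callback standing in for whatever cleanup (such as closing a file) should happen when an entry is dropped.

```python
from collections import OrderedDict


class LRUCacheSketch:
    """Hypothetical least-recently-used cache with an eviction callback."""

    def __init__(self, maxsize, on_evict=None):
        self._maxsize = maxsize
        self._on_evict = on_evict
        self._cache = OrderedDict()

    def __getitem__(self, key):
        value = self._cache[key]
        # Mark the key as most recently used.
        self._cache.move_to_end(key)
        return value

    def __setitem__(self, key, value):
        if key in self._cache:
            # Re-inserting moves the key to the most recently used end.
            del self._cache[key]
        self._cache[key] = value
        while len(self._cache) > self._maxsize:
            # popitem(last=False) removes the least recently used entry.
            old_key, old_value = self._cache.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(old_key, old_value)

    def __contains__(self, key):
        return key in self._cache

    def __len__(self):
        return len(self._cache)


evicted = []
cache = LRUCacheSketch(2, on_evict=lambda key, value: evicted.append(key))
cache["a"] = 1
cache["b"] = 2
cache["a"]  # touching "a" makes "b" the least recently used entry
cache["c"] = 3  # exceeds maxsize, so "b" is evicted
```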
""" | ||
if maxsize < 0:
raise ValueError('maxsize must be non-negative')
self._maxsize = maxsize
Should we enforce that maxsize is an integer? I'm thinking that it may be easy to see None/False as valid values. I think that case is going to break things downstream.
done
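The kind of check being requested might look like the sketch below. It is not necessarily what was committed: `validate_maxsize` is a hypothetical helper, and rejecting booleans explicitly is an assumption made here because `bool` is a subclass of `int` in Python, so `False` would otherwise pass an `isinstance(maxsize, int)` test.

```python
def validate_maxsize(maxsize):
    """Hypothetical validation for an LRU cache size."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(maxsize, bool) or not isinstance(maxsize, int):
        raise TypeError("maxsize must be an integer")
    if maxsize < 0:
        raise ValueError("maxsize must be non-negative")
    return maxsize


def raises(exc_type, value):
    # Small helper for the demonstration below.
    try:
        validate_maxsize(value)
    except exc_type:
        return True
    return False


none_rejected = raises(TypeError, None)
false_rejected = raises(TypeError, False)
negative_rejected = raises(ValueError, -1)
zero_accepted = validate_maxsize(0) == 0
```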
xarray/backends/common.py
Outdated
import logging
import multiprocessing
import threading
import time
import traceback
import warnings
import weakref
F401 'weakref' imported but unused
xarray/backends/scipy_.py
Outdated
self._opener = opener
self._mode = mode
if (lock is None and mode != 'r'
and isinstance(filename_or_obj, basestring)):
W503 line break before binary operator
I'd love to move this forward. I think it will fix some serious usability and performance issues with distributed reads/writes of netCDF files.
I'd also be happy to see this go in. We could use a review from someone other than me.
At some point soon I'm just going to merge this, more review or not! Hopefully a release candidate will catch any major issues.
Hello @shoyer! Thanks for updating the PR.
Comment last updated on October 09, 2018 at 02:31 Hours UTC
Based on the arrival of #2476 (!), I suggest we merge this. I think we've had enough review to justify this being put into a release candidate in the relatively near future.
Yep, that's my plan. I just read through the code again and identified a few unreachable lines, which I removed. I'll merge when CI passes.
Nice work on this @shoyer. Really excited to set this free.
* master: (51 commits) xarray.backends refactor (pydata#2261) Fix indexing error for data loaded with open_rasterio (pydata#2456) Properly support user-provided norm. (pydata#2443) pep8speaks (pydata#2462) isort (pydata#2469) tests shoudn't need to pass for a PR (pydata#2471) Replace the last of unittest with pytest (pydata#2467) Add python_requires to setup.py (pydata#2465) Update whats-new.rst (pydata#2466) Clean up _parse_array_of_cftime_strings (pydata#2464) plot.contour: Don't make cmap if colors is a single color. (pydata#2453) np.AxisError was added in numpy 1.13 (pydata#2455) Add CFTimeIndex.shift (pydata#2431) Fix FutureWarning in CFTimeIndex.date_type (pydata#2448) fix:2445 (pydata#2446) Enable use of cftime.datetime coordinates with differentiate and interp (pydata#2434) restore ddof support in std (pydata#2447) Future warning for default reduction dimension of groupby (pydata#2366) Remove incorrect statement about "drop" in the text docs (pydata#2439) Use profile mechanism, not no-op mutation (pydata#2442) ...
* master: (21 commits) xarray.backends refactor (pydata#2261) Fix indexing error for data loaded with open_rasterio (pydata#2456) Properly support user-provided norm. (pydata#2443) pep8speaks (pydata#2462) isort (pydata#2469) tests shoudn't need to pass for a PR (pydata#2471) Replace the last of unittest with pytest (pydata#2467) Add python_requires to setup.py (pydata#2465) Update whats-new.rst (pydata#2466) Clean up _parse_array_of_cftime_strings (pydata#2464) plot.contour: Don't make cmap if colors is a single color. (pydata#2453) np.AxisError was added in numpy 1.13 (pydata#2455) Add CFTimeIndex.shift (pydata#2431) Fix FutureWarning in CFTimeIndex.date_type (pydata#2448) fix:2445 (pydata#2446) Enable use of cftime.datetime coordinates with differentiate and interp (pydata#2434) restore ddof support in std (pydata#2447) Future warning for default reduction dimension of groupby (pydata#2366) Remove incorrect statement about "drop" in the text docs (pydata#2439) Use profile mechanism, not no-op mutation (pydata#2442) ...
The xarray.backends.api.to_netcdf function has been changed in pydata/xarray#2261
A major refactor of xarray backend classes:
PickleByReconstructionWrapper and DataStorePickleMixin have been eliminated in favor of CachingFileManager.
to_netcdf/open_dataset.
.xref #2121
fixes #1738
fixes #2376
Benchmark numbers: