Cache files for different CachingFileManager objects separately #4879
Conversation
This means that explicitly opening a file multiple times with ``open_dataset`` (e.g., after modifying it on disk) now reopens the file from scratch, rather than reusing a cached version. If users want to reuse the cached file, they can reuse the same xarray object. We don't need this for handling many files in Dask (the original motivation for caching), because in those cases only a single CachingFileManager is created. I think this should fix some long-standing usability issues: pydata#4240, pydata#4862. Conveniently, this also obviates the need for some messy reference counting logic.
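Roughly, the mechanism (a minimal sketch, not the actual implementation; the name `_manager_id` mirrors the diff discussed below):

```python
import uuid


class CachingFileManager:
    """Sketch: per-instance cache keys for an LRU cache of open files."""

    def __init__(self, opener, *args, mode=None, kwargs=None, lock=None):
        self._opener = opener
        self._args = args
        self._mode = mode
        self._kwargs = {} if kwargs is None else dict(kwargs)
        self._lock = lock
        # Unique per instance: two managers opening the same path no
        # longer share a cache entry, so re-opening a modified file
        # actually reopens it.
        self._manager_id = uuid.uuid4()

    def _make_key(self):
        # Within one manager (e.g., shared across many Dask tasks) the
        # key is stable, so the original caching benefit is preserved.
        return (
            self._opener,
            self._args,
            self._mode,
            tuple(sorted(self._kwargs.items())),
            self._manager_id,
        )
```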
Thank you for the feedback! I quickly tested your suggested fix against the script I referred to in my original issue, and it's still behaving the same, if I'm not mistaken. I only looked very quickly, so perhaps I'm wrong, but what I seem to understand is that your fix is similar to an idea my colleague @huard had, which was to make the cached item more granular by adding a call to `xarray/backends/file_manager.py` (line 222 at a5f53e2).
Sorry in advance if something in my analysis is wrong, which is very likely!
As my colleague @huard suggested, I have written an additional test which demonstrates the problem (essentially the same idea I proposed in my initial issue): master...cjauvin:add-netcdf-refresh-test. As I explained in the issue, I have a potential fix for the problem: master...cjauvin:netcdf-caching-bug, but it feels a bit weird to have to do that, so I suspect that there's a better way to solve it.
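The idea of such a refresh test, as a hedged sketch (the real test is in the branch linked above; names here are illustrative):

```python
import numpy as np
import xarray as xr


def test_refresh_from_disk(tmp_path):
    path = tmp_path / "test.nc"
    xr.Dataset({"x": ("t", np.zeros(3))}).to_netcdf(path)
    with xr.open_dataset(path) as ds_old:
        assert (ds_old["x"].values == 0).all()

    # Overwrite the file on disk with new contents.
    xr.Dataset({"x": ("t", np.ones(3))}).to_netcdf(path, mode="w")

    # Without the fix, open_dataset can serve a stale cached handle and
    # this assertion fails; with per-manager caching it reopens the file.
    with xr.open_dataset(path) as ds_new:
        assert (ds_new["x"].values == 1).all()
```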
Thanks for sharing the test case, which I'll add to this PR. I ran it locally and it seems to pass with this PR for the netCDF4 and SciPy netCDF backends (but not h5netcdf). So I'm not entirely sure what to make of that.
I added @cjauvin's integration test, and verified that the fix works for the scipy and h5netcdf backends. Unfortunately, it doesn't work yet for the netCDF4 backend. I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library: Unidata/netcdf4-python#1195
I think we should document this and merge. Though the test failures are real (having trouble cleaning up on Windows when deleting the temp file), and the diff includes some zarr files right now.
OK, after a bit more futzing, tests are passing and I think this is actually ready to go in. I ended up leaving in the reference counting after all -- I couldn't figure out another way to keep files open after a pickle round-trip.
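For context, a sketch of why the reference counting is hard to drop (usage is illustrative, and `CachingFileManager` is internal API; the path and opener are placeholders):

```python
import pickle

from xarray.backends.file_manager import CachingFileManager

# Assumes "example.txt" exists on disk.
manager = CachingFileManager(open, "example.txt", mode="r")
copy = pickle.loads(pickle.dumps(manager))

# The unpickled copy restores the original's cache key (manager id and
# all), so both objects resolve to the same cached file handle ...
assert manager.acquire() is copy.acquire()

# ... which means neither object alone can know, when it is deallocated,
# whether the file is still in use -- hence a per-file reference count.
```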
Never mind, this didn't work -- it still results in failing tests with dask-distributed on Windows.
Anyone want to review here? I think this should be ready to go in.
Some typing suggestions while we're at it.
"deallocating {}, but file is not already closed. " | ||
"This may indicate a bug.".format(self), | ||
f"deallocating {self}, but file is not already closed. " | ||
"This may indicate a bug.", | ||
RuntimeWarning, | ||
stacklevel=2, | ||
) | ||
```diff
-    def __getstate__(self):
+    def __getstate__(self) -> tuple[Any, Any, Any, Any, Any, Any]:
```
The `Any`s can be replaced with narrower versions; I couldn't figure them out on a quick glance.
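For illustration, a narrower version might look roughly like this (a fragment; element types are guessed from how the state tuple is built and unpacked elsewhere in the diff, not verified against mypy):

```python
from collections.abc import Hashable
from typing import Any, Callable


def __getstate__(
    self,
) -> tuple[
    Callable[..., Any],  # opener
    tuple[Any, ...],     # args
    Any,                 # mode (a str, or a sentinel default)
    dict[str, Any],      # kwargs
    Any,                 # lock (None when the default lock is in use)
    Hashable,            # manager id
]:
    ...
```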
mypy error for:

```python
def __setstate__(self, state):
    """Restore from a pickle."""
    opener, args, mode, kwargs, lock = state
    self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)
```

Seems like we can just ignore?
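If ignoring, that would presumably look something like this (a sketch; the exact mypy error code is an assumption):

```python
def __setstate__(self, state):
    """Restore from a pickle."""
    opener, args, mode, kwargs, lock = state
    # mypy flags calling __init__ on an instance; a targeted ignore keeps
    # the pickle protocol implementation unchanged.
    self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)  # type: ignore[misc]
```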
I think we could fix this by marking CachingFileManager with
LGTM
```diff
             self._mode,
             self._kwargs,
             lock,
+            self._manager_id,
```
I don't know exactly what this is used for, but make sure that you don't need to do a similar thing as for `lock` (replace with `None` in case it is the default).
But ignore this comment if this is totally intentional :)
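For reference, the `lock` pattern being alluded to looks roughly like this (a sketch; attribute names assumed):

```python
def __getstate__(self):
    """State for pickling."""
    # A default (global) lock is replaced with None so it is re-created
    # on unpickling instead of serializing a shared threading primitive;
    # an explicitly provided lock is kept.
    lock = None if self._use_default_lock else self._lock
    return (self._opener, self._args, self._mode, self._kwargs, lock, self._manager_id)
```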