
Cache files for different CachingFileManager objects separately #4879

Merged: 36 commits into pydata:main on Oct 18, 2022

Conversation

shoyer (Member) commented on Feb 7, 2021

This means that explicitly opening a file multiple times with
open_dataset (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference
counting logic.
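
In code, the changed behavior looks roughly like this (a minimal sketch; ``example.nc`` is a placeholder path):

    import xarray as xr

    # Each explicit open_dataset call now creates its own CachingFileManager,
    # so the second open re-reads the file from disk instead of reusing a
    # cached handle from the first call.
    ds1 = xr.open_dataset("example.nc")
    # ... "example.nc" is rewritten on disk by some other process ...
    ds2 = xr.open_dataset("example.nc")  # reflects the on-disk changes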

cjauvin (Contributor) commented on Feb 7, 2021

Thank you for the feedback! I quickly tested your suggested fix against the script I referred to in my original issue, and it's still behaving the same, if I'm not mistaken. I only looked very quickly, so perhaps I'm wrong, but what I seem to understand is that your fix is similar to an idea my colleague @huard had, which was to make the cached item more granular by adding a call to Path(..).stat() to the cache key tuple (the idea being that if the file has changed on disk between the two open calls, this will detect it). It doesn't work because (I think) it doesn't change the fact that the underlying netCDF file is never explicitly closed, that is, the line that actually closes it is never called.

Sorry in advance if something in my analysis is wrong, which is very likely!
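
For reference, the stat-based key idea would look something like this (a hypothetical sketch, not actual xarray code; names are illustrative):

    import os

    def make_cache_key(path: str, mode: str) -> tuple:
        # Fold the file's size and mtime into the cache key, so that a file
        # modified on disk between two open calls produces a different key
        # and misses the cache.
        st = os.stat(path)
        return (path, mode, st.st_size, st.st_mtime)

As noted above, though, a cache miss alone does not close the stale handle, which is why this is not sufficient on its own.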

cjauvin (Contributor) commented on Feb 8, 2021

As my colleague @huard suggested, I have written an additional test which demonstrates the problem (essentially the same idea I proposed in my initial issue):

master...cjauvin:add-netcdf-refresh-test
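
In spirit, the test looks something like the following (a rough sketch of the idea, not the linked test itself):

    import xarray as xr

    def test_refresh_from_disk(tmp_path) -> None:
        # Write a file, open it, overwrite it on disk, then check that a
        # fresh open_dataset call sees the new values rather than a stale
        # cached handle.
        path = tmp_path / "test.nc"
        xr.Dataset({"x": 1.0}).to_netcdf(path)
        with xr.open_dataset(path) as ds:
            assert ds["x"] == 1.0
        xr.Dataset({"x": 2.0}).to_netcdf(path, mode="w")
        with xr.open_dataset(path) as ds:
            assert ds["x"] == 2.0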

As I explained in the issue I have a potential fix for the problem:

master...cjauvin:netcdf-caching-bug

but the problem is that it feels a bit weird to have to do that, so I suspect that there's a better way to solve it.

shoyer (Member, Author) commented on Feb 9, 2021 via email

dcherian mentioned this pull request on Feb 11, 2021
shoyer (Member, Author) commented on Sep 27, 2022

I added @cjauvin's integration test, and verified that the fix works for the scipy and h5netcdf backends.

Unfortunately, it doesn't work yet for the netCDF4 backend. I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library: Unidata/netcdf4-python#1195

dcherian (Contributor) commented

> I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library.

I think we should document this and merge. Though the test failures are real (having trouble cleaning up on Windows when deleting the temp file), and the diff includes some zarr files right now.

shoyer (Member, Author) commented on Oct 5, 2022

OK, after a bit more futzing, tests are passing and I think this is actually ready to go in. I ended up leaving in the reference counting after all -- I couldn't figure out another way to keep files open after a pickle round-trip.
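
The pickle problem, roughly (a simplified toy sketch with made-up names, not xarray's actual implementation):

    import pickle

    REF_COUNTS: dict[str, int] = {}  # shared counts, keyed by file path

    class Manager:
        """Toy stand-in for CachingFileManager's pickling behavior."""

        def __init__(self, path: str) -> None:
            self.path = path
            REF_COUNTS[path] = REF_COUNTS.get(path, 0) + 1

        def close(self) -> None:
            # Only truly close the file when the last copy is closed.
            REF_COUNTS[self.path] -= 1
            if REF_COUNTS[self.path] == 0:
                pass  # close the underlying file handle here

        def __getstate__(self) -> str:
            return self.path

        def __setstate__(self, state: str) -> None:
            self.__init__(state)  # re-registers, bumping the count

    m1 = Manager("example.nc")
    m2 = pickle.loads(pickle.dumps(m1))  # __setstate__ bumps the count
    m1.close()  # the file stays open, because m2 still holds a reference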

shoyer (Member, Author) commented on Oct 5, 2022

Actually maybe we don't need to keep files open after pickling... let me give this one more try.

Nevermind, this didn't work -- it still results in failing tests with dask-distributed on Windows.

shoyer (Member, Author) commented on Oct 5, 2022

Anyone want to review here? I think this should be ready to go in.

Illviljan (Contributor) left a comment


Some typing suggestions while we're at it.

Resolved review threads on xarray/backends/file_manager.py and xarray/tests/test_backends.py (outdated).
"deallocating {}, but file is not already closed. "
"This may indicate a bug.".format(self),
f"deallocating {self}, but file is not already closed. "
"This may indicate a bug.",
RuntimeWarning,
stacklevel=2,
)

def __getstate__(self):
Illviljan (Contributor) commented:

Suggested change:

    -def __getstate__(self):
    +def __getstate__(self) -> tuple[Any, Any, Any, Any, Any, Any]:

The Any's can be replaced with narrower versions; I couldn't figure them out on a quick glance.
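
Judging from the state tuple built in ``__getstate__`` (opener, args, mode, kwargs, lock, manager id), one speculative narrowing, not verified against the actual attribute types, might be:

    from collections.abc import Callable, Hashable
    from typing import Any

    class CachingFileManager:
        # Speculative narrowing of the six Any's; the actual element types
        # would need checking against the real attributes.
        def __getstate__(
            self,
        ) -> tuple[Callable, tuple[Any, ...], Any, dict[str, Any], Any, Hashable]:
            ...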

Resolved review thread on xarray/backends/file_manager.py (outdated).
dcherian added the 'plan to merge' (final call for comments) label on Oct 12, 2022
dcherian enabled auto-merge (squash) on October 13, 2022 at 17:09
dcherian (Contributor) commented
mypy error:

xarray/backends/file_manager.py:277: error: Accessing "__init__" on an instance is unsound, since instance.__init__ could be from an incompatible subclass [misc]

for

    def __setstate__(self, state):
        """Restore from a pickle."""
        opener, args, mode, kwargs, lock = state
        self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)

Seems like we can just ignore?
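
If ignoring, it is a one-line comment on the offending call (and the "ignore __init__ type" entry in the merged commit log below suggests this is what was ultimately done):

    def __setstate__(self, state):
        """Restore from a pickle."""
        opener, args, mode, kwargs, lock = state
        self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)  # type: ignore[misc]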

dcherian disabled auto-merge on October 13, 2022 at 17:20
shoyer (Member, Author) commented on Oct 13, 2022

I think we could fix this by marking CachingFileManager with typing.final
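
A minimal sketch of that alternative, with a simplified ``__init__`` (mypy drops the unsoundness complaint when the class cannot be subclassed):

    from typing import Any, final

    @final  # no subclass can replace __init__, so self.__init__ below is sound
    class CachingFileManager:
        def __init__(self, opener: Any, *args: Any, mode: Any = None,
                     kwargs: Any = None, lock: Any = None) -> None:
            self._opener, self._args = opener, args
            self._mode, self._kwargs, self._lock = mode, kwargs or {}, lock

        def __setstate__(self, state: tuple[Any, ...]) -> None:
            """Restore from a pickle."""
            opener, args, mode, kwargs, lock = state
            self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)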

headtr1ck (Collaborator) left a comment


LGTM

Resolved review thread on doc/whats-new.rst (outdated).
self._mode,
self._kwargs,
lock,
self._manager_id,
headtr1ck (Collaborator) commented on Oct 16, 2022

I don't know exactly what this is used for, but make sure you don't need to do the same thing as for lock (replace it with None when it is the default).
But ignore this comment if this is totally intentional :)
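
For context, the lock handling referred to looks roughly like this (a toy sketch; attribute names are guesses, not verified xarray code):

    import threading
    from typing import Any

    class Manager:
        def __init__(self, lock: Any = None) -> None:
            self._use_default_lock = lock is None
            self._lock = threading.Lock() if lock is None else lock

        def __getstate__(self) -> tuple[Any, ...]:
            # A default threading.Lock is process-specific and not picklable,
            # so pickle None instead and let __setstate__ recreate a fresh one.
            lock = None if self._use_default_lock else self._lock
            return (lock,)

        def __setstate__(self, state: tuple[Any, ...]) -> None:
            (lock,) = state
            self.__init__(lock)

The question above is whether ``self._manager_id`` needs the same normalization before pickling.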

* main:
  Add import ASV benchmark (pydata#7176)
  Rework docs about scatter plots (pydata#7169)
  Fix some scatter plot issues (pydata#7167)
  Fix doctest warnings, enable errors in CI (pydata#7166)
  fix broken test (pydata#7168)
  Add typing to plot methods (pydata#7052)
  Fix warning in doctest (pydata#7165)
  dev whats-new (pydata#7161)
  v2022.10.0 whats-new (pydata#7160)
dcherian merged commit 2687536 into pydata:main on Oct 18, 2022
keewis pushed a commit to keewis/xarray that referenced this pull request Oct 19, 2022
Cache files for different CachingFileManager objects separately (pydata#4879)

* Cache files for different CachingFileManager objects separately

This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: pydata#4240, pydata#4862

Conveniently, this also obviates the need for some messy reference
counting logic.

* Fix whats-new message location

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add id to CachingFileManager

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restrict new test to only netCDF files

* fix whats-new message

* skip test on windows

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit e637165.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Fix whats-new message location"

This reverts commit 6bc80e7.

* fixups

* fix syntax

* tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix types for mypy

* add uuid

* restore ref_counts

* doc tweaks

* close files inside test_open_mfdataset_list_attr

* remove unused itertools

* don't use refcounts

* re-enable ref counting

* cleanup

* Apply typing suggestions from code review

Co-authored-by: Illviljan <[email protected]>

* fix import of Hashable

* ignore __init__ type

* fix whats-new

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Illviljan <[email protected]>
Co-authored-by: dcherian <[email protected]>
Successfully merging this pull request may close these issues.

jupyter repr caching deleted netcdf file
5 participants