
Cache files for different CachingFileManager objects separately #4879

Merged: 36 commits into pydata:main on Oct 18, 2022

Conversation

shoyer (Member) commented on Feb 7, 2021

This means that explicitly opening a file multiple times with
open_dataset (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference
counting logic.
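
In code, the changed behavior looks roughly like this (a minimal sketch; ``example.nc`` is a placeholder path):

    import xarray as xr

    # Each explicit open_dataset call now creates its own CachingFileManager,
    # so the second open re-reads the file from disk instead of reusing a
    # cached handle from the first call.
    ds1 = xr.open_dataset("example.nc")
    # ... "example.nc" is rewritten on disk by some other process ...
    ds2 = xr.open_dataset("example.nc")  # reflects the on-disk changes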

cjauvin (Contributor) commented on Feb 7, 2021

Thank you for the feedback! I quickly tested your suggested fix against the script I referred to in my original issue, and it's still behaving the same, if I'm not mistaken. I only looked very quickly, so perhaps I'm wrong, but what I seem to understand is that your fix is similar to an idea my colleague @huard had, which was to make the cached item more granular by adding a call to Path(..).stat() to the cache key tuple (the idea being that if the file has changed on disk between the two open calls, this will detect it). It doesn't work because (I think) it doesn't change the fact that the underlying netCDF file is never explicitly closed, that is, the line that actually closes it is never called.

Sorry in advance if something in my analysis is wrong, which is very likely!
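
For reference, the stat-based key idea would look something like this (a hypothetical sketch, not actual xarray code; names are illustrative):

    import os

    def make_cache_key(path: str, mode: str) -> tuple:
        # Fold the file's size and mtime into the cache key, so that a file
        # modified on disk between two open calls produces a different key
        # and misses the cache.
        st = os.stat(path)
        return (path, mode, st.st_size, st.st_mtime)

As noted above, though, a cache miss alone does not close the stale handle, which is why this is not sufficient on its own.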

cjauvin (Contributor) commented on Feb 8, 2021

As my colleague @huard suggested, I have written an additional test which demonstrates the problem (essentially the same idea I proposed in my initial issue):

master...cjauvin:add-netcdf-refresh-test
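
In spirit, the test looks something like the following (a rough sketch of the idea, not the linked test itself):

    import xarray as xr

    def test_refresh_from_disk(tmp_path) -> None:
        # Write a file, open it, overwrite it on disk, then check that a
        # fresh open_dataset call sees the new values rather than a stale
        # cached handle.
        path = tmp_path / "test.nc"
        xr.Dataset({"x": 1.0}).to_netcdf(path)
        with xr.open_dataset(path) as ds:
            assert ds["x"] == 1.0
        xr.Dataset({"x": 2.0}).to_netcdf(path, mode="w")
        with xr.open_dataset(path) as ds:
            assert ds["x"] == 2.0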

As I explained in the issue I have a potential fix for the problem:

master...cjauvin:netcdf-caching-bug

but the problem is that it feels a bit weird to have to do that, so I suspect that there's a better way to solve it.

shoyer (Member, Author) commented on Feb 9, 2021 via email

dcherian mentioned this pull request on Feb 11, 2021
shoyer (Member, Author) commented on Sep 27, 2022

I added @cjauvin's integration test, and verified that the fix works for the scipy and h5netcdf backends.

Unfortunately, it doesn't work yet for the netCDF4 backend. I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library: Unidata/netcdf4-python#1195

dcherian (Contributor) commented

> I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library.

I think we should document this and merge. Though the test failures are real (having trouble cleaning up on Windows when deleting the temp file), and the diff includes some zarr files right now.

shoyer (Member, Author) commented on Oct 5, 2022

OK, after a bit more futzing, tests are passing and I think this is actually ready to go in. I ended up leaving in the reference counting after all -- I couldn't figure out another way to keep files open after a pickle round-trip.
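
The pickle problem, roughly (a simplified toy sketch with made-up names, not xarray's actual implementation):

    import pickle

    REF_COUNTS: dict[str, int] = {}  # shared counts, keyed by file path

    class Manager:
        """Toy stand-in for CachingFileManager's pickling behavior."""

        def __init__(self, path: str) -> None:
            self.path = path
            REF_COUNTS[path] = REF_COUNTS.get(path, 0) + 1

        def close(self) -> None:
            # Only truly close the file when the last copy is closed.
            REF_COUNTS[self.path] -= 1
            if REF_COUNTS[self.path] == 0:
                pass  # close the underlying file handle here

        def __getstate__(self) -> str:
            return self.path

        def __setstate__(self, state: str) -> None:
            self.__init__(state)  # re-registers, bumping the count

    m1 = Manager("example.nc")
    m2 = pickle.loads(pickle.dumps(m1))  # __setstate__ bumps the count
    m1.close()  # the file stays open, because m2 still holds a reference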

shoyer (Member, Author) commented on Oct 5, 2022

Actually maybe we don't need to keep files open after pickling... let me give this one more try.

Nevermind, this didn't work -- it still results in failing tests with dask-distributed on Windows.

shoyer (Member, Author) commented on Oct 5, 2022

Anyone want to review here? I think this should be ready to go in.

Illviljan (Contributor) left a comment


Some typing suggestions while we're at it.

Resolved review threads on xarray/backends/file_manager.py and xarray/tests/test_backends.py (outdated).
"deallocating {}, but file is not already closed. "
"This may indicate a bug.".format(self),
f"deallocating {self}, but file is not already closed. "
"This may indicate a bug.",
RuntimeWarning,
stacklevel=2,
)

def __getstate__(self):
Illviljan (Contributor) commented:

Suggested change:

    -def __getstate__(self):
    +def __getstate__(self) -> tuple[Any, Any, Any, Any, Any, Any]:

The Any's can be replaced with narrower versions; I couldn't figure them out on a quick glance.
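
Judging from the state tuple built in ``__getstate__`` (opener, args, mode, kwargs, lock, manager id), one speculative narrowing, not verified against the actual attribute types, might be:

    from collections.abc import Callable, Hashable
    from typing import Any

    class CachingFileManager:
        # Speculative narrowing of the six Any's; the actual element types
        # would need checking against the real attributes.
        def __getstate__(
            self,
        ) -> tuple[Callable, tuple[Any, ...], Any, dict[str, Any], Any, Hashable]:
            ...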

Resolved review thread on xarray/backends/file_manager.py (outdated).
dcherian added the 'plan to merge' (final call for comments) label on Oct 12, 2022
dcherian enabled auto-merge (squash) on October 13, 2022 at 17:09
dcherian (Contributor) commented
mypy error:

xarray/backends/file_manager.py:277: error: Accessing "__init__" on an instance is unsound, since instance.__init__ could be from an incompatible subclass [misc]

for

    def __setstate__(self, state):
        """Restore from a pickle."""
        opener, args, mode, kwargs, lock = state
        self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)

Seems like we can just ignore?
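
If ignoring, it is a one-line comment on the offending call (and the "ignore __init__ type" entry in the merged commit log below suggests this is what was ultimately done):

    def __setstate__(self, state):
        """Restore from a pickle."""
        opener, args, mode, kwargs, lock = state
        self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)  # type: ignore[misc]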

dcherian disabled auto-merge on October 13, 2022 at 17:20
shoyer (Member, Author) commented on Oct 13, 2022

I think we could fix this by marking CachingFileManager with typing.final
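
A minimal sketch of that alternative, with a simplified ``__init__`` (mypy drops the unsoundness complaint when the class cannot be subclassed):

    from typing import Any, final

    @final  # no subclass can replace __init__, so self.__init__ below is sound
    class CachingFileManager:
        def __init__(self, opener: Any, *args: Any, mode: Any = None,
                     kwargs: Any = None, lock: Any = None) -> None:
            self._opener, self._args = opener, args
            self._mode, self._kwargs, self._lock = mode, kwargs or {}, lock

        def __setstate__(self, state: tuple[Any, ...]) -> None:
            """Restore from a pickle."""
            opener, args, mode, kwargs, lock = state
            self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)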

headtr1ck (Collaborator) left a comment


LGTM

Resolved review thread on doc/whats-new.rst (outdated).
self._mode,
self._kwargs,
lock,
self._manager_id,
headtr1ck (Collaborator) commented on Oct 16, 2022

I don't know exactly what this is used for, but make sure you don't need to do the same thing as for lock (replace it with None when it is the default).
But ignore this comment if this is totally intentional :)
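
For context, the lock handling referred to looks roughly like this (a toy sketch; attribute names are guesses, not verified xarray code):

    import threading
    from typing import Any

    class Manager:
        def __init__(self, lock: Any = None) -> None:
            self._use_default_lock = lock is None
            self._lock = threading.Lock() if lock is None else lock

        def __getstate__(self) -> tuple[Any, ...]:
            # A default threading.Lock is process-specific and not picklable,
            # so pickle None instead and let __setstate__ recreate a fresh one.
            lock = None if self._use_default_lock else self._lock
            return (lock,)

        def __setstate__(self, state: tuple[Any, ...]) -> None:
            (lock,) = state
            self.__init__(lock)

The question above is whether ``self._manager_id`` needs the same normalization before pickling.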

* main:
  Add import ASV benchmark (pydata#7176)
  Rework docs about scatter plots (pydata#7169)
  Fix some scatter plot issues (pydata#7167)
  Fix doctest warnings, enable errors in CI (pydata#7166)
  fix broken test (pydata#7168)
  Add typing to plot methods (pydata#7052)
  Fix warning in doctest (pydata#7165)
  dev whats-new (pydata#7161)
  v2022.10.0 whats-new (pydata#7160)
dcherian merged commit 2687536 into pydata:main on Oct 18, 2022
keewis pushed a commit to keewis/xarray that referenced this pull request Oct 19, 2022
Cache files for different CachingFileManager objects separately (pydata#4879)

* Cache files for different CachingFileManager objects separately

This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: pydata#4240, pydata#4862

Conveniently, this also obviates the need for some messy reference
counting logic.

* Fix whats-new message location

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add id to CachingFileManager

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restrict new test to only netCDF files

* fix whats-new message

* skip test on windows

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit e637165.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Fix whats-new message location"

This reverts commit 6bc80e7.

* fixups

* fix syntax

* tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix types for mypy

* add uuid

* restore ref_counts

* doc tweaks

* close files inside test_open_mfdataset_list_attr

* remove unused itertools

* don't use refcounts

* re-enable ref counting

* cleanup

* Apply typing suggestions from code review

Co-authored-by: Illviljan <[email protected]>

* fix import of Hashable

* ignore __init__ type

* fix whats-new

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Illviljan <[email protected]>
Co-authored-by: dcherian <[email protected]>
Successfully merging this pull request may close these issues.

jupyter repr caching deleted netcdf file
5 participants