Expose AllBytes cache #462

Merged
merged 2 commits into from
Nov 2, 2020


martindurant
Member

No description provided.

@martindurant
Member Author

Note: I am not putting this into the API docs for the time being; let's see how useful it is.

@martindurant martindurant merged commit dd9a8b4 into fsspec:master Nov 2, 2020
@martindurant martindurant deleted the all_cache branch November 2, 2020 14:24
@scottyhq

Thanks for this, @martindurant. I'm still having trouble implementing things over in intake-xarray, and I'm realizing that I don't really understand the interplay of the various caching mechanisms documented at https://filesystem-spec.readthedocs.io/en/latest/features.html. Below are some simple pieces of test code and usage questions that I'd greatly appreciate your guidance on:

First of all, naively opening a single group in this 64MB HDF file on S3 and computing the mean of one variable takes ~2-4 min of wall time on my local wifi :(

import fsspec
import xarray as xr
uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
with fsspec.open(uri, anon=True) as f: 
    da = xr.open_dataset(f, group='gt1l/land_ice_segments', engine='h5netcdf')
    print(da.h_li.mean())

The new keyword default_cache_type='all' is a vast improvement: the same computation now takes 1 to 9 s.

import fsspec
import xarray as xr
uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
with fsspec.open(uri, anon=True, default_cache_type='all') as f: 
    da = xr.open_dataset(f, group='gt1l/land_ice_segments', engine='h5netcdf')
    print(da.h_li.mean())
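
As a rough mental model of what an "all bytes" cache does, here is a minimal sketch. It is not fsspec's actual implementation; the class name `WholeFileCache` and the `fake_fetch` helper are hypothetical and only stand in for a remote range request. The whole file is fetched once on the first read, and every later read is a slice of the in-memory copy, which tends to help when a reader issues many small, scattered reads.

```python
class WholeFileCache:
    """Toy cache: fetch the entire file on first read, then serve slices from memory."""

    def __init__(self, fetcher, size):
        self.fetcher = fetcher  # hypothetical callable: (start, end) -> bytes
        self.size = size        # total size of the remote file in bytes
        self.data = None        # filled lazily on the first read

    def read(self, start, end):
        if self.data is None:
            # One request for the whole file instead of one per small read.
            self.data = self.fetcher(0, self.size)
        return self.data[start:end]


# Stand-in for a remote object; records every range request it receives.
payload = bytes(range(256)) * 4
requests = []


def fake_fetch(start, end):
    requests.append((start, end))
    return payload[start:end]


cache = WholeFileCache(fake_fetch, len(payload))
first = cache.read(10, 20)
second = cache.read(500, 520)
print(len(requests))  # number of remote requests actually made
```

The trade-off in this sketch is memory: the full file is held in RAM for as long as the cache object lives, so it fits a 64MB file better than a multi-gigabyte one.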

But there seems to be a bug with file closing: if I re-run the cell, I get this traceback:

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/fsspec/spec.py in readinto(self, b)
   1418         """
   1419         out = memoryview(b).cast("B")
-> 1420         data = self.read(out.nbytes)
   1421         out[: len(data)] = data
   1422         return len(data)

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/fsspec/spec.py in read(self, length)
   1403             length = self.size - self.loc
   1404         if self.closed:
-> 1405             raise ValueError("I/O operation on closed file.")
   1406         logger.debug("%s read: %i - %i" % (self, self.loc, self.loc + length))
   1407         if length == 0:

ValueError: I/O operation on closed file.

Finally, I'm wondering how simplecache changes things, because I notice a few differences when using the simplecache:: prefix:

  1. The first computation takes a bit longer (20-40 s). On the second run the cache is clearly noticeable, with 200 ms wall times. It's not clear to me whether simplecache is actually writing a temporary file to local disk or keeping things in memory.
  2. I had to change anon=True to s3={'anon': True}; otherwise I get a ListObjectsV2 operation: Access Denied error.

%%time
import fsspec
import xarray as xr
uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
with fsspec.open('simplecache::'+uri, s3={'anon': True}, default_cache_type='all') as f:  # was: anon=True
    da = xr.open_dataset(f, group='gt1l/land_ice_segments', engine='h5netcdf')
    print(da.h_li.mean())
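
On the disk-versus-memory question in item 1, the timings are consistent with a whole-file cache that downloads the file to local storage on first open and reuses that local copy afterwards. A minimal sketch of that pattern follows. It is an illustration only, not fsspec's code: `cached_open`, `fake_download`, the example URL, and the hashed file naming are all assumptions made for this example.

```python
import hashlib
import os
import tempfile


def cached_open(url, download, cache_dir):
    """Open a local copy of `url`, downloading it only if it is not cached yet."""
    # Hypothetical naming scheme: one local file per URL, named by its hash.
    name = hashlib.sha256(url.encode()).hexdigest()
    local_path = os.path.join(cache_dir, name)
    if not os.path.exists(local_path):
        # Slow path: pull the whole file once and write it to local disk.
        with open(local_path, "wb") as out:
            out.write(download(url))
    # Fast path on later calls: plain local file access.
    return open(local_path, "rb")


# Stand-in for the remote store; records how often a download happens.
downloads = []


def fake_download(url):
    downloads.append(url)
    return b"example file contents"


cache_dir = tempfile.mkdtemp()
url = "s3://example-bucket/example.h5"  # hypothetical URL, never contacted

with cached_open(url, fake_download, cache_dir) as f:
    data_first = f.read()
with cached_open(url, fake_download, cache_dir) as f:
    data_second = f.read()

print(len(downloads), os.listdir(cache_dir))
```

In this pattern the first open pays for a full download plus a disk write, which would explain a slower first run, while later opens never touch the network.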
