Expose AllBytes cache #462

martindurant · 2020-10-30T13:12:32Z

No description provided.

martindurant · 2020-11-02T14:24:10Z

Note: I am not putting this into the API docs for the time being, let's see how useful it is.

scottyhq · 2020-11-20T05:59:23Z

Thanks for this, @martindurant. I'm still having some trouble implementing things over in intake-xarray, and I'm realizing I don't really understand the interplay of the various caching mechanisms documented here: https://filesystem-spec.readthedocs.io/en/latest/features.html. Below are some simple pieces of test code and usage questions that I'd greatly appreciate your guidance on:

First of all, naively opening and finding the mean of a single group in this 64MB HDF file on S3 takes ~2-4min wall time on my local wifi :(

import fsspec
import xarray as xr
uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
with fsspec.open(uri, anon=True) as f: 
    da = xr.open_dataset(f, group='gt1l/land_ice_segments', engine='h5netcdf')
    print(da.h_li.mean())

This new keyword default_cache_type='all' is a vast improvement! (1 to 9s).

import fsspec
import xarray as xr
uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
with fsspec.open(uri, anon=True, default_cache_type='all') as f: 
    da = xr.open_dataset(f, group='gt1l/land_ice_segments', engine='h5netcdf')
    print(da.h_li.mean())
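
For reference, a minimal offline sketch of why the 'all' cache cuts down on round trips, assuming the keyword selects the AllBytes class in fsspec.caching (as the PR title suggests). The payload and the fetcher function are hypothetical stand-ins for a remote file and its ranged-read callback, and _fetch is a private method called directly here only for illustration.

```python
from fsspec.caching import AllBytes

# Hypothetical stand-in for a remote file: 1024 bytes of known content.
payload = bytes(range(256)) * 4
calls = []  # records every (start, end) range requested from the "remote"

def fetcher(start, end):
    # Stand-in for a ranged GET against remote storage.
    calls.append((start, end))
    return payload[start:end]

# AllBytes pulls the whole file through the fetcher once, up front.
cache = AllBytes(blocksize=len(payload), fetcher=fetcher, size=len(payload))

# Later reads are served from memory without calling the fetcher again.
first = cache._fetch(0, 16)
second = cache._fetch(512, 528)

print(len(calls), "fetcher call(s) for two reads")
```

With many small scattered reads, as HDF5 libraries tend to issue, a single up-front transfer like this would plausibly explain the drop from minutes to seconds, at the cost of holding the whole file in memory.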

But there seems to be a bug with file closing: if I re-run the cell, I get this traceback:

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/fsspec/spec.py in readinto(self, b)
   1418         """
   1419         out = memoryview(b).cast("B")
-> 1420         data = self.read(out.nbytes)
   1421         out[: len(data)] = data
   1422         return len(data)

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/fsspec/spec.py in read(self, length)
   1403             length = self.size - self.loc
   1404         if self.closed:
-> 1405             raise ValueError("I/O operation on closed file.")
   1406         logger.debug("%s read: %i - %i" % (self, self.loc, self.loc + length))
   1407         if length == 0:

ValueError: I/O operation on closed file.

Finally, I'm wondering how using simplecache changes things. I notice a few things when using the simplecache:: prefix:

- The first computation takes a bit longer (20-40s). On the second run the cache is clearly noticeable, with 200ms wall times. It's not clear to me whether simplecache is actually writing a temporary file to local disk or keeping things in memory.
- I had to change anon=True to s3={'anon': True}, or else I get a "ListObjectsV2 operation: Access Denied" error.

%%time 
import fsspec
import xarray as xr
uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
with fsspec.open('simplecache::'+uri, s3={'anon': True}, default_cache_type='all') as f: #anon=True 
    da = xr.open_dataset(f, group='gt1l/land_ice_segments', engine='h5netcdf')
    print(da.h_li.mean())
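
On the disk-versus-memory question, a small offline sketch may help. It assumes simplecache copies the whole target file into a local directory (a temporary one by default, or the one given by its cache_storage option) and that, in a chained URL, keyword arguments are passed per protocol, which is also why s3={'anon': True} is needed above. The memory:// target and the file name demo_data.bin are hypothetical stand-ins for the S3 object.

```python
import os
import tempfile

import fsspec

# Hypothetical stand-in for the remote object: a 1024-byte in-memory file.
payload = b"x" * 1024
with fsspec.open("memory://demo_data.bin", "wb") as f:
    f.write(payload)

# Point simplecache at a directory we can inspect afterwards.
cache_dir = tempfile.mkdtemp()

# Arguments for each layer of a chained URL are keyed by protocol name.
with fsspec.open(
    "simplecache::memory://demo_data.bin",
    simplecache={"cache_storage": cache_dir},
) as f:
    data = f.read()

# The local copy lives on disk under a hashed file name.
cached_sizes = [
    os.path.getsize(os.path.join(cache_dir, name))
    for name in os.listdir(cache_dir)
]
print(cached_sizes)
```

If this assumption holds for the S3 case as well, the slower first run would be the full download to local disk, and the fast second run would be a read of that local copy.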

Expose AllBytes cache

ca679d5

martindurant mentioned this pull request Oct 30, 2020

How to force intake.open_netcdf(uri) to read all bytes up front with storage options intake/intake-xarray#88

Closed

Fix args ordering

464ba9d

martindurant merged commit dd9a8b4 into fsspec:master Nov 2, 2020

martindurant deleted the all_cache branch November 2, 2020 14:24
