How to force intake.open_netcdf(uri) to read all bytes up front with storage options #88
Hi @scottyhq! Not sure if this is exactly what you want, but here's some code to get you started. It uses simplecache, so the HDF5 file will be 'downloaded' and persisted on the filesystem once (and that might be slow), but subsequent access should be fast and not require any internet connection. That's probably best if you'll need to loop through all 6 laser beam groups.

```python
import intake
import intake_xarray
import xarray as xr

# Setup parameters
urlpath: str = "simplecache::s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5"
storage_options: dict = {
    "simplecache": dict(cache_storage="/tmp/ATL06", same_names=True)
}
xarray_kwargs: dict = dict(group="gt1l/land_ice_segments", engine="h5netcdf")

# Read data using intake
source: intake_xarray.netcdf.NetCDFSource = intake.open_netcdf(
    urlpath=urlpath, storage_options=storage_options, xarray_kwargs=xarray_kwargs
)
dataset: xr.Dataset = source.to_dask()
print(dataset.h_li.mean())
```

The final `print` outputs the mean `h_li` value.
I also didn't realize that ITS Live has an ICESat-2 S3 bucket. Is this documented somewhere, or is it still an internal thing? Just curious, because this opens up a whole lot of possibilities for IcePyx!
There is a caching type that does this, but it's not exposed: fsspec.caching.AllBytes. With fsspec/filesystem_spec#462, it should be possible to give `default_cache_type="all"` in the storage options.
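For illustration, here is a minimal sketch of what that could look like at the fsspec level once that PR is in; it assumes `default_cache_type` is accepted as a filesystem option and that `"all"` maps to the `AllBytes` cache (the S3 path is the one used elsewhere in this thread):

```python
import fsspec

# Sketch, assuming fsspec/filesystem_spec#462 is merged:
# default_cache_type="all" selects fsspec.caching.AllBytes, which
# fetches the entire file into memory on the first open.
uri = "s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5"
with fsspec.open(uri, mode="rb", anon=True, default_cache_type="all") as f:
    magic = f.read(8)  # every subsequent read is served from RAM
print(magic)
```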
Thanks @weiji14, that snippet is certainly helpful. I don't think the ITS Live ICESat-2 data (just ATL06) is documented, so I wouldn't depend on it. But I do know NSIDC is in the process of moving to S3 hosting this year, so it's good to explore various ways to access the data. @martindurant's PR keeps everything in RAM without writing to disk, so it will be interesting to test. I imagine there will be use cases for both approaches!
I'm not sure this is quite working. I expect the following to work:

```python
import intake

uri = 's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
da = intake.open_netcdf(uri,
                        xarray_kwargs=dict(group='gt1l/land_ice_segments', engine='h5netcdf'),
                        storage_options=dict(anon=True, default_cache_type='all'),
                        ).to_dask()
print(da.h_li.mean())
```

But this leads to the following traceback:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-f7f55e6ac5c2> in <module>
      2 da = intake.open_netcdf(uri,
      3                         xarray_kwargs=dict(group='gt1l/land_ice_segments', engine='h5netcdf'),
----> 4                         storage_options=dict(anon=True, default_cache_type='all'),
      5                         ).to_dask()
      6 print(da.h_li.mean())

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70
     71     def close(self):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16
     17         if self._ds is None:
---> 18             self._open_dataset()
     19
     20         metadata = {

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     77         else:
     78             _open_dataset = xr.open_dataset
---> 79             url = fsspec.open_local(url, **self.storage_options)
     80
     81         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/fsspec/core.py in open_local(url, mode, **storage_options)
    459     if not getattr(of[0].fs, "local_file", False):
    460         raise ValueError(
--> 461             "open_local can only be used on a filesystem which"
    462             " has attribute local_file=True"
    463         )

ValueError: open_local can only be used on a filesystem which has attribute local_file=True
```
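As a side note, the failing check can be reproduced directly at the fsspec level. A small sketch (using the same S3 path) showing that a bare `s3://` URL, without a caching protocol chained in front, yields a filesystem without `local_file=True`:

```python
import fsspec

# A plain S3 OpenFile: its filesystem does not place files on local
# disk, so fsspec.open_local() refuses it with the ValueError above.
of = fsspec.open(
    "s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5",
    anon=True,
)
print(getattr(of.fs, "local_file", False))  # False
```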
Also, changing to a `simplecache::` URI does work:

```python
uri = 'simplecache::s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
da = intake.open_netcdf(uri,
                        xarray_kwargs=dict(group='gt1l/land_ice_segments', engine='h5netcdf'),
                        storage_options=dict(anon=True, default_cache_type='all',
                                             simplecache=dict(cache_storage="/tmp/atl06", same_names=True),
                                             )
                        ).to_dask()
print(da.h_li.mean())
```

But I'd prefer to avoid writing to disk if possible and just stream all the bytes into memory, as described in the first comment of this issue.
Right, you need the S3-specific options nested under the `s3` key (e.g. `s3={"anon": True}`) when chaining protocols.

Didn't #82 allow the use of file-likes for netcdf? Or does that logic need to be extended?
Thanks for the clarification @martindurant, I can confirm the following works. My remaining question is whether the cached bytes stay in memory or get written to disk:

```python
import intake

uri = 'simplecache::s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5'
da = intake.open_netcdf(uri,
                        xarray_kwargs=dict(group='gt1l/land_ice_segments', engine='h5netcdf'),
                        storage_options=dict(s3={'anon': True}, default_cache_type='all',
                                             #simplecache=dict(cache_storage="/tmp/atl06", same_names=True),
                                             )
                        ).to_dask()
print(da.h_li.mean())
```
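For comparison, here is a minimal sketch of the same all-bytes-in-memory read done directly with fsspec and xarray rather than intake, under the same assumptions (that `default_cache_type="all"` is supported per fsspec/filesystem_spec#462, and that the h5netcdf engine accepts file-like objects):

```python
import fsspec
import xarray as xr

# Sketch: open the S3 object with the AllBytes cache so the whole file
# is pulled into RAM once, then hand the file-like object to xarray.
uri = "s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5"
with fsspec.open(uri, mode="rb", anon=True, default_cache_type="all") as f:
    ds = xr.open_dataset(f, group="gt1l/land_ice_segments", engine="h5netcdf")
    print(ds.h_li.mean())
```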
That PR only dealt with raster.py; it didn't touch netcdf.py.
It is written as a file to your temporary file store location, i.e., whatever tempfile.mkdtemp returns. This is usually on disk, but it might be held in memory; that is down to the OS configuration.
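A quick way to check where that location is on a given machine:

```python
import tempfile

# simplecache's default cache_storage resolves to a fresh temporary
# directory; on many Linux systems this lives under /tmp, which may or
# may not be RAM-backed (tmpfs), hence "down to the OS configuration".
print(tempfile.mkdtemp())
```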
While it's possible to read HDF5 files as file-like objects, it's not always a great idea, because random access can end up being extremely slow (orders of magnitude slower in some cases). It's often better to pull all the bytes of the file into memory or a temporary file up front. I'm trying to figure out an easy way to do this with `intake.open_netcdf()` via `storage_options`, but failing. There are some nice suggestions for HTTP in #56; what about s3fs? Issue #86 is also related, but here I think it's useful to ignore dask for a minute and just deal with a single file. The goal is to do the following efficiently (what goes into `storage_options`?). @martindurant @weiji14, perhaps you know how to accomplish this?
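A sketch of the call in question; the empty `storage_options` dict is a hypothetical placeholder, since figuring out its contents is exactly what this issue asks:

```python
import intake

# The open question: what goes into storage_options so that all bytes
# are read up front? (The empty dict below is a placeholder; the
# working answer appears earlier in this thread.)
uri = "s3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5"
da = intake.open_netcdf(
    uri,
    xarray_kwargs=dict(group="gt1l/land_ice_segments", engine="h5netcdf"),
    storage_options={},  # <- what goes here?
).to_dask()
print(da.h_li.mean())
```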
Here are some timing details on different access patterns:

- pull all the bytes (1 second)
- lazy initial read (16 seconds!)