Replies: 3 comments 5 replies
-
I like the idea. Until we have chunking info available as metadata, we should allow users to override our percentage-based calculation with any value they choose, in case they want to tune it for their particular dataset.
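A minimal sketch of what such an override could look like. The function name `resolve_block_size`, the `percent` default of 5, and the `block_size` parameter are all hypothetical and are not part of earthaccess; the only idea taken from the thread is that an explicit user value should win over the percentage-based default.

```python
from typing import Optional


def resolve_block_size(
    granule_size: int,
    percent: int = 5,                  # hypothetical default, not an earthaccess value
    block_size: Optional[int] = None,  # explicit user override, in bytes
) -> int:
    """Return the cache block size in bytes.

    An explicit ``block_size`` always wins; otherwise the size is derived
    as a percentage of the granule size (integer math, at least 1 byte).
    """
    if block_size is not None:
        if block_size <= 0:
            raise ValueError("block_size must be a positive number of bytes")
        return block_size
    return max(1, granule_size * percent // 100)
```

With this shape, a user who knows the chunk layout of their dataset can pass `block_size` directly, and everyone else gets the derived default.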
-
A dev note: I'm running into an interesting behavior with xarray and dask distributed, recently a PR got merged into
import earthaccess
import xarray as xr
auth = earthaccess.login()
results = earthaccess.search_data(short_name="MUR25-JPL-L4-GLOB-v04.2", count=2)
# I'm testing it with fileset = earthaccess.open(results, smart_open=True) but this code is not in earthaccess yet
fileset = earthaccess.open(results)
ds = xr.open_dataset(fileset[0], engine="h5netcdf")
# now we can inspect fileset[0] to track the IO and caching stats
fileset[0].cache
We will get an output like:
<BlockCache:
block size : 102400
block count : 19
file size : 1878361
cache hits : 75
cache misses: 1
total requested bytes: 102400>
However, if we run
# this should use a Dask cluster if we have one.
ds = xr.open_mfdataset(fileset,
                       engine="h5netcdf",
                       compat="override",
                       coords="minimal",
                       parallel=True)
and now if we inspect the cache again, we get:
<BlockCache:
block size : 102400
block count : 19
file size : 1895805
cache hits : 0
cache misses: 0
total requested bytes: 0>
meaning our actual file-like object hasn't been used internally by xarray/dask. I'll keep debugging this and open an issue/discussion in xarray when I get more information on what's happening. Maybe @dcherian has a better idea of what may be happening here.
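To compare the two outputs above programmatically instead of reading the repr, a small helper along these lines may be useful. It is only a sketch: `cache_summary` is a hypothetical name, and the attribute names `hit_count` and `miss_count` are an assumption about how the cache object exposes the counters shown in the repr, so adjust them if your fsspec version names them differently.

```python
def cache_summary(cache) -> dict:
    """Summarize cache counters from any object exposing hit/miss counts.

    Assumes (not confirmed in this thread) that the counters are available
    as ``hit_count`` and ``miss_count``; missing attributes count as zero.
    """
    hits = getattr(cache, "hit_count", 0)
    misses = getattr(cache, "miss_count", 0)
    total = hits + misses
    return {
        "hits": hits,
        "misses": misses,
        # False means no read ever went through this cache object.
        "used": total > 0,
        "hit_ratio": hits / total if total else None,
    }
```

For the first output above (75 hits, 1 miss) this reports the cache as used; for the second (0 hits, 0 misses) `used` is `False`, which is the symptom described here.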
-
cc @kmuehlbauer, it might be interesting to revive the
-
One of the core features in earthaccess is accessing remote files without having to download them, with earthaccess.open(). Under the hood we are using fsspec. The default cache is called read-ahead; it works fine for text files (e.g. when we read contiguous lines of text), but it is very inefficient for scientific data (HDF/NetCDF).
What are the alternatives? Fortunately, fsspec has different caching strategies, and 2 of them seem to be better options for us in the short term: blockcache and first. A third caching implementation (KnownPartsOfAFile) could improve this even further (down the road), but it needs to know the file layout in advance, either from .dmrpp sidecar files if they are available or by opening a file and inspecting its structure (very inefficient). If we write a parser for .dmrpp or have this information available at the metadata level, earthaccess could improve what it caches in a very efficient way.
I've tested the first 2 implementations with data from several missions and the improvements in access times and data transfers are very promising, in some cases an order of magnitude.
Proposal: earthaccess should use one of these 2 caching strategies depending on the file type, and we should adjust the cache size to a percentage of the granule size. Down the road we can work on further optimizations if relevant information about the chunking of a given dataset is available to us via CMR/STAC.
To avoid complications with changing or breaking the API, we can start prototyping this in a top-level smart_open() method, as we talked about with @itcarroll and @chuckwondo.
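As a rough illustration of the proposal, the sketch below picks one of the two cache strategies from the file extension and sizes the block as a percentage of the granule size. Everything specific here is an assumption made for the example: the helper name `pick_cache_options`, the extension list, the mapping of extensions to strategies, the 5 percent default, and the 64 KiB floor. The thread does not say which file types should get which strategy.

```python
from pathlib import PurePosixPath

# Assumption: extensions treated as chunked scientific formats (HDF5/NetCDF4).
CHUNKED_SUFFIXES = {".nc", ".nc4", ".h5", ".hdf5", ".he5"}


def pick_cache_options(
    filename: str,
    granule_size: int,
    percent: int = 5,            # hypothetical default percentage
    min_block: int = 64 * 1024,  # hypothetical lower bound, in bytes
) -> dict:
    """Return cache keyword arguments for an fsspec-style open() call."""
    # Block size as a percentage of the granule, never below the floor.
    block_size = max(min_block, granule_size * percent // 100)
    suffix = PurePosixPath(filename).suffix.lower()
    # Assumed mapping: block cache for chunked formats, "first" otherwise.
    cache_type = "blockcache" if suffix in CHUNKED_SUFFIXES else "first"
    return {"cache_type": cache_type, "block_size": block_size}
```

The returned dictionary is meant to be unpacked into a file system's `open()` call, for example `fs.open(url, **pick_cache_options(url, size))`, assuming the backend accepts `cache_type` and `block_size` for buffered files. Keeping the decision in a pure function makes it easy to test without network access.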