Can xpublish serve Datasets dynamically? #75
It is not really possible right now, but it shouldn't be hard to support that. Unfortunately, the collection of Datasets is currently converted to a dictionary when creating a new `Rest` instance.
Thanks @benbovy, this could be a useful feature some time in the future.
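To see why a dict-based collection is static, here is a minimal illustration in plain Python (no xpublish involved; the dataset names are made up): once the served collection is snapshotted into a dict, later changes to the source are not reflected.

```python
# A mock "source" of datasets, keyed by dataset ID (stand-ins for xr.Dataset).
source = {"sst": "dataset-a", "chl": "dataset-b"}

# Snapshotting at server-creation time freezes the set of served IDs.
served = dict(source)

# A dataset added after the snapshot never becomes visible to the snapshot.
source["new_ds"] = "dataset-c"
print(sorted(served))  # ['chl', 'sst']
```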
Edit: See @jr3cermak's response below on how to create and use a dataset provider plugin to override how xpublish loads datasets.

This isn't super useful in and of itself right now, since the source data is Zarr, but I mocked up a dynamic subclass of `xpublish.Rest`. By overriding the dataset accessor function and preloading the dataset IDs in the `__init__`:

```python
import fsspec
import requests
import xarray as xr
import xpublish
from xpublish import rest

# recipe_runs_url is assumed to be defined elsewhere, pointing at the
# Pangeo Forge recipe-runs API.


def pangeo_forge_datasets():
    res = requests.get(recipe_runs_url)
    return res.json()


def pangeo_forge_with_data():
    datasets = pangeo_forge_datasets()
    return [r for r in datasets if r["dataset_public_url"]]


def pangeo_forge_dataset_map():
    datasets = pangeo_forge_with_data()
    return {r["recipe_id"]: r["dataset_public_url"] for r in datasets}


def get_pangeo_forge_dataset(dataset_id: str) -> xr.Dataset:
    dataset_map = pangeo_forge_dataset_map()
    zarr_url = dataset_map[dataset_id]
    mapper = fsspec.get_mapper(zarr_url)
    ds = xr.open_zarr(mapper, consolidated=True)
    return ds


class DynamicRest(xpublish.Rest):
    def __init__(self, routers=None, cache_kws=None, app_kws=None):
        # Swap in the dynamic accessor and preload the known dataset IDs.
        self._get_dataset_func = get_pangeo_forge_dataset
        self._datasets = list(pangeo_forge_dataset_map().keys())

        dataset_route_prefix = "/datasets/{dataset_id}"
        self._app_routers = rest._set_app_routers(routers, dataset_route_prefix)

        self._app = None
        self._app_kws = {}
        if app_kws is not None:
            self._app_kws.update(app_kws)

        self._cache = None
        self._cache_kws = {"available_bytes": 1e6}
        if cache_kws is not None:
            self._cache_kws.update(cache_kws)
```

It looks like if you also overrode …
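To illustrate the filter-and-map logic without hitting the network, here is the same transformation run on a hypothetical sample of the recipe-runs JSON (the field names match the code above; the recipe IDs and URLs are made up):

```python
# Hypothetical recipe-runs payload; field names follow the functions above.
runs = [
    {"recipe_id": "noaa-oisst", "dataset_public_url": "https://example.org/oisst.zarr"},
    {"recipe_id": "in-progress", "dataset_public_url": None},
]

# As in pangeo_forge_with_data(): keep only runs with published data.
with_data = [r for r in runs if r["dataset_public_url"]]

# As in pangeo_forge_dataset_map(): dataset ID -> Zarr URL.
dataset_map = {r["recipe_id"]: r["dataset_public_url"] for r in with_data}
print(dataset_map)  # {'noaa-oisst': 'https://example.org/oisst.zarr'}
```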
Thanks @abkfenris, that looks like it could be really useful. I am buried in other activities at the moment but hopefully I'll get a chance to come back to this. 👍
I've gone further down the dynamic xpublish rabbit hole, in this case exposing any gridded data from the awesome-erddap list: https://github.com/abkfenris/xpublish-erddap
Using plugins as described by @abkfenris in #155 and turning the internal cache off, you can dynamically serve a directory of files using this as an example (server.py):
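The original server.py isn't reproduced above, but the core idea of such a dataset provider is to re-scan the directory on every request rather than caching a snapshot. A minimal stdlib sketch of that scan (the plugin wiring itself would use xpublish's `Plugin`/`hookimpl` machinery from #155; the directory layout and ID scheme here are assumptions):

```python
import tempfile
from pathlib import Path


def list_dataset_ids(data_dir: str) -> list[str]:
    """Re-scan the directory on each call so new or changed files show up."""
    return sorted(p.stem for p in Path(data_dir).glob("*.nc"))


# Demo: empty placeholder files stand in for real NetCDF data.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "ocean2.nc").touch()
Path(demo_dir, "ocean3.nc").touch()
print(list_dataset_ids(demo_dir))  # ['ocean2', 'ocean3']

# A file added between "requests" is picked up by the next scan.
Path(demo_dir, "ocean4.nc").touch()
print(list_dataset_ids(demo_dir))  # ['ocean2', 'ocean3', 'ocean4']
```

In a dataset provider plugin, a `get_datasets`-style hook would call something like `list_dataset_ids()`, and the per-dataset hook would open the matching file with `xr.open_dataset`.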
Here is the client.py:
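The original client.py isn't reproduced either; the gist of a client is simply hitting the server's dataset endpoints. A sketch of the URL scheme, assuming xpublish's default `/datasets/{dataset_id}` route prefix (the host and port are placeholders):

```python
BASE = "http://localhost:9000"  # assumed host/port for the running server


def dataset_url(dataset_id: str) -> str:
    # xpublish's default per-dataset route prefix is /datasets/{dataset_id}
    return f"{BASE}/datasets/{dataset_id}"


print(dataset_url("ocean2"))  # http://localhost:9000/datasets/ocean2

# Against a live server you might then do, e.g.:
#   import requests
#   ids = requests.get(f"{BASE}/datasets").json()
```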
NOTE: If you dynamically update and change datasets in place, don't use the cache. Disabling it incurs a performance penalty, but you do gain a very lightweight dynamic service. The following was done with the server running in another terminal; the server was not restarted between client runs or file operations. Starting with two files:
I will copy these files.
And now see:
Now I will copy ocean2.nc over ocean3.nc in place.
And we obtain the desired result:
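The file operations described above amount to something like the following (ocean2.nc and ocean3.nc are named in the original; the scratch directory and the extra filename are assumptions, and the directory listings from the original comment are not reproduced):

```shell
# A scratch directory standing in for the served data directory.
mkdir -p xpub_demo
touch xpub_demo/ocean2.nc xpub_demo/ocean3.nc   # the two starting files

# Copying a file to a new name makes a new dataset appear on the next request.
cp xpub_demo/ocean2.nc xpub_demo/ocean4.nc

# Copying ocean2.nc over ocean3.nc in place changes that dataset's contents
# without restarting the server (cache disabled).
cp xpub_demo/ocean2.nc xpub_demo/ocean3.nc

ls xpub_demo
```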
So I have some code that does this. Basically, you can dynamically serve cataloged (STAC, Intake, or another catalog if you write a plugin) Zarr + NetCDF datasets: https://github.com/LimnoTech/Catalog-To-Xpublish. I may move the organization, but searching "Catalog-To-Xpublish" should find it. My approach was to mount an Xpublish server to different endpoints representing a catalog hierarchy. If you don't care about catalog hierarchy, look at my …
Hi @jhamman, xpublish looks really neat.

Does it provide a way to serve data holdings dynamically, so that you could potentially serve millions of files? This would allow users to navigate an endpoint that would dynamically read and serve an xarray `Dataset` on request (rather than in advance).