Consider using the intake-esm library #31
Comments
I started sniffing around |
Interesting project, but it doesn't support iris and custom DRS at the moment, https://intake-esm.readthedocs.io/en/latest/index.html |
Example using intake_iris:

    import intake
    cat = intake.Catalog("catalog.yaml")
    source = cat['CMIP5'].get(short_name='ta', mip='Amon', dataset='MPI-ESM-MR', exp='historical', ensemble='r1i1p1')
    cubes = source.to_dask()

where catalog.yaml looks like:

    metadata:
      version: 1
    sources:
      CMIP5:
        description: ''
        driver: intake_iris.NetCDFSource
        args:
          urlpath: '/home/bandela/esmvaltool_input/{{short_name}}_{{mip}}_{{dataset}}_{{exp}}_{{ensemble}}*.nc'
        parameters:
          dataset:
            description: Dataset
            type: str
          short_name:
            description: Short name
            type: str
          mip:
            description: MIP table
            type: str
          exp:
            description: Experiment
            type: str
          ensemble:
            description: Ensemble
            type: str

I'm not sure if this is the right way to use it though. Structured file paths also look useful, but are not supported by the intake_iris plugin. |
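To make the templated urlpath in the catalog above a bit more concrete, here is a minimal sketch of how the `{{...}}` placeholders could turn into a glob pattern once the user parameters are filled in. intake does this rendering itself; the `render_urlpath` helper below is hypothetical, uses only the standard library, and is not how intake or intake_iris implement it.

```python
import re

# Hypothetical helper, not part of intake or intake_iris: fill the
# {{name}} placeholders of a catalog urlpath with user parameters.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_urlpath(template, **params):
    """Return the template with every {{name}} replaced by params[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing catalog parameter: {name}")
        return params[name]

    return _PLACEHOLDER.sub(substitute, template)


template = (
    "/home/bandela/esmvaltool_input/"
    "{{short_name}}_{{mip}}_{{dataset}}_{{exp}}_{{ensemble}}*.nc"
)
pattern = render_urlpath(
    template,
    short_name="ta",
    mip="Amon",
    dataset="MPI-ESM-MR",
    exp="historical",
    ensemble="r1i1p1",
)
print(pattern)
```

The resulting string still contains the trailing `*`, so it is a glob pattern that can match several files (for example one per time range) rather than a single path.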
Hello, since this issue is still open, I just wanted to let you know (maybe as an update to this issue) that there is also the

    import intake
    # local catalog at DKRZ (updated daily)
    url = '/work/ik1017/Catalogs/mistral-cmip6.json'
    dataframe = intake.open_esm_datastore(url)
    models = dataframe.search(experiment_id='historical',
                              table_id='6hrLev',
                              variable_id='ta',
                              source_id='MPI-ESM1-2-HR',
                              institution_id='MPI-M',
                              member_id='r10i1p1f1')
    filepathes = list(models.df.sort_values(by=['time_range'])['path'])

So it's built on top of

Another thing (probably more complicated) would be to use intake-esm catalogs for remote access to CMIP data in the Google Cloud (see the PANGEO catalog). However, that only supports xarray datasets, so I am not sure if that's feasible, but accessing data remotely (with OPeNDAP) is really well advanced within the PANGEO community and you can always convert xarray datasets to iris cubes... |
Thanks! It looks like Found the following publicly available intake-esm catalogs: |
I just wanted to dump some more thoughts here, since I now use intake a lot. We once again have common diagnostics in the CORDEX community, which are always a pain to share even though the data is CMORized. Implementing them in ESMValTool as diagnostics would be quite straightforward; however, for me the showstopper is data access. Not all datasets are always available at all ESGF nodes, but OPeNDAP access now works quite nicely for me. DKRZ also now provides OPeNDAP URLs in its intake catalog, besides the python-esgf client. So in short, is there any work ongoing on a more abstract data_finder that could also find filenames (or OPeNDAP URLs) from sources other than the DRS, e.g. an intake catalog, the python esgf client or the Pangeo cloud? That would be low-hanging fruit, I guess, and I could run diagnostics a little more independently of my local data access, without downloading... |
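A rough sketch of what such a more abstract data finder could look like: every source, whether a local DRS tree or a catalog of OPeNDAP URLs, exposes the same `find(facets)` method, and the caller tries them in order. All names here (`DataSource`, `LocalDRSSource`, `CatalogSource`, `find_data`, the `opendap_url` column and the example URL) are hypothetical and are not taken from ESMValTool, intake or any ESGF client.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Dict, List, Protocol

Facets = Dict[str, str]


class DataSource(Protocol):
    """Anything that can turn facets into file paths or OPeNDAP URLs."""

    def find(self, facets: Facets) -> List[str]: ...


@dataclass
class LocalDRSSource:
    """Matches a DRS-style filename template against known local files."""

    template: str
    files: List[str]

    def find(self, facets: Facets) -> List[str]:
        # Extra facets that the template does not use are simply ignored.
        pattern = self.template.format(**facets)
        return sorted(f for f in self.files if fnmatch(f, pattern))


@dataclass
class CatalogSource:
    """Looks up rows of an in-memory catalog, e.g. one listing OPeNDAP URLs."""

    rows: List[Dict[str, str]]
    location_key: str = "opendap_url"

    def find(self, facets: Facets) -> List[str]:
        return [
            row[self.location_key]
            for row in self.rows
            if all(row.get(key) == value for key, value in facets.items())
        ]


def find_data(facets: Facets, sources: List[DataSource]) -> List[str]:
    """Return the results of the first source that has any match."""
    for source in sources:
        found = source.find(facets)
        if found:
            return found
    return []


facets = {"short_name": "ta", "mip": "Amon", "dataset": "MPI-ESM-MR"}
local = LocalDRSSource(
    template="{short_name}_{mip}_{dataset}_*.nc",
    files=["pr_Amon_MPI-ESM-MR_historical.nc"],  # no 'ta' file locally
)
remote = CatalogSource(
    rows=[
        {
            "short_name": "ta",
            "mip": "Amon",
            "dataset": "MPI-ESM-MR",
            "opendap_url": "https://example.org/dodsC/ta_Amon_MPI-ESM-MR.nc",
        }
    ]
)
result = find_data(facets, [local, remote])
print(result)
```

Because the local source has no matching file, the lookup falls through to the catalog and returns the remote URL; a diagnostic would not need to know which source answered.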
Hey @larsbuntemeyer thanks for the update! I recently started looking into this, and I really like the idea. If you ask me it's just a matter of time before we will get to it... so your input is very valuable. |
@larsbuntemeyer Do you know if there is an intake-esm catalog with OpenDAP URIs from one or more ESGF nodes available somewhere? |
Yes @bouweandela, that's a good point. I only know about DKRZ; they provide intake-esm catalogs in their pool at In general, I like the idea of having the OPeNDAP URLs from ESGF publicly available from an intake catalog instead of using the Maybe @agstephens or @wachsylon might have an idea...? |
@bouweandela did you not mention that Intake will be retired or am I being goldfish-memoried again? |
awesome, cheers, man! Lemme know if you need any help with that! |
In a recent discussion it emerged that intake will probably be supported for longer by the ESGF community, but likely not via intake-esm, which is seen as a bit too ad-hoc/experimental, but rather via intake-stac, which should mesh nicely with the new STAC-based search API on ESGF. As such, I would propose to postpone this feature (and #1218 with it) to a later point. We can move it to 2.5.0 for the next round of evaluation, though that might still be too early. @bouweandela, what do you think? |
I have no time to finish this before v2.4, so I'm fine with moving this to 2.5. Are there already any STAC intake catalogs available for CMIP data? I think it's important to support whatever catalog type is currently available online and on ESGF nodes. |
Just a +1 from me on this: making it possible to use ESMValTool with catalogues defined in arbitrary places (e.g. AWSCloud) would be a huge win |
The issue with intake-esm, as @zklaus already mentioned, is that it is not based on community standards. Therefore it looks like ESGF might be using a STAC-based search API in the future. A demo of that should be available soon and I'm planning to try that out with the ESMValCore once it's available. Another issue I encountered when testing intake-esm in #1218 is that the entire catalog is stored in a CSV file and this file grows pretty big if you have many files. For example, just loading the CMIP6 catalog for the data available on Mistral at DKRZ used to take about 2 minutes if I remember correctly. If you have many files, the catalog could even become too big to fit into memory. 2 minutes might be acceptable if you're running a large recipe, but if you just need to look up a few files this might put users off. Maybe @zklaus is right and it would be best to see in what direction ESGF is going before investing time in intake-esm. On the other hand, it should not be too difficult to integrate support for it. It would however require some thought on how we structure our input data sources in the config-user/config-developer file and in the code, while maintaining backward compatibility. The current rootpath/drs setup has its share of problems and tacking on yet another data source that needs configuration might complicate that even more, especially for new users. Unfortunately, I do not have time/funding to look into this at the moment. |
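One way the memory concern could be sidestepped for small lookups is to stream the catalog CSV row by row instead of loading it completely. The sketch below uses only the standard library; `iter_matching_paths`, the column names and the example rows are made up for illustration, and this is not how intake-esm reads its catalogs.

```python
import csv
import io


def iter_matching_paths(csv_file, path_column="path", **facets):
    """Yield the path of every catalog row whose columns equal the facets.

    Rows are read one at a time, so memory use stays roughly constant
    regardless of how many rows the catalog has.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        if all(row.get(key) == value for key, value in facets.items()):
            yield row[path_column]


# Tiny in-memory stand-in for a large catalog CSV; the rows are made up.
catalog_csv = io.StringIO(
    "variable_id,table_id,source_id,path\n"
    "ta,6hrLev,MPI-ESM1-2-HR,/data/ta_6hrLev_a.nc\n"
    "pr,Amon,MPI-ESM1-2-HR,/data/pr_Amon_a.nc\n"
    "ta,6hrLev,MPI-ESM1-2-HR,/data/ta_6hrLev_b.nc\n"
)
paths = list(iter_matching_paths(catalog_csv, variable_id="ta", table_id="6hrLev"))
print(paths)
```

This only addresses memory: a full scan still reads every row, so it would not by itself remove the start-up time mentioned above. An indexed store or a search API would be needed for that.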
These are all good points. I guess any steps which would allow more cloud-based, zarr formats to be supported in future would be great as on disk files won't be the only access pattern for long. We might have time/motivation to do some experimentation on our side later in the year, will come back to this discussion (or open a new one) if/when that happens. |
Awesome! Zarr support needs to start with iris, but we will probably have some discussions with the iris developers around the efficient use of dask soon. It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join? |
Ok great. If there's a mailing list or whatever we can keep an eye on that would be great. Would be interested to join, but time zones can be tricky of course (when we're in Aus) |
I'd love to chat more on Zarr and others, but that seems slightly off-topic here, so I opened #1621. |
Might it be worth considering https://xarray.pydata.org/en/stable/generated/xarray.DataArray.to_iris.html |
Thanks for the suggestion @agstephens! I will check with the iris developers what their plans are regarding support for zarr. The iris developers were at some point planning to start building iris on top of xarray instead of using their own code for loading data etc, but did not do this yet because they were encountering performance issues with xarray. I think we may hit the same performance issues if we do this, but did not investigate in-depth due to lack of time so far.
@znicholls You should have received an email with an invitation to join a meeting with the iris developers on this topic. Did you get it? If you cannot make it this time, we can try scheduling earlier next time. |
Yep I got it but will be halfway back to Australia at that time unfortunately |
+1 on this. We are indexing our experiments using an intake-esm catalog. @dougiesquire. Happy to join meetings. |
For reference: here is a prototype ESGF STAC API and Jupyter notebook that shows how it could be used. |
Related projects intake-stac and intake-esgf |
It could be interesting to replace our home grown esmvaltool/_data_finder.py module by the much more advanced intake library and the associated intake-iris plugin.