Consider using the intake-esm library #31

Open
bouweandela opened this issue Apr 25, 2019 · 26 comments · May be fixed by #1218
Labels: enhancement (New feature or request)

@bouweandela
Member

It could be interesting to replace our home-grown esmvaltool/_data_finder.py module with the much more advanced intake library and the associated intake-iris plugin.

@valeriupredoi
Contributor

valeriupredoi commented May 22, 2019

I started sniffing around intake: it's easy to install via conda and looks easy to use (I ran the little example from the docs), but it looks to me like something accountants or auditors would rather use than us... maybe I'll find something cooler the more I look at it.

@bouweandela
Member Author

Interesting project, but it doesn't support iris or a custom DRS at the moment: https://intake-esm.readthedocs.io/en/latest/index.html

@bouweandela
Member Author

Example using intake_iris:

import intake
cat = intake.Catalog("catalog.yaml")
source = cat['CMIP5'].get(short_name='ta', mip='Amon', dataset='MPI-ESM-MR', exp='historical', ensemble='r1i1p1')
cubes = source.to_dask()

where catalog.yaml looks like

metadata:
  version: 1

sources:

  CMIP5:
    description: ''
    driver: intake_iris.NetCDFSource
    args:
      urlpath: '/home/bandela/esmvaltool_input/{{short_name}}_{{mip}}_{{dataset}}_{{exp}}_{{ensemble}}*.nc'
    parameters:
      dataset:
        description: Dataset
        type: str
      short_name:
        description: Short name
        type: str
      mip:
        description: MIP table
        type: str
      exp:
        description: Experiment
        type: str
      ensemble:
        description: Ensemble
        type: str

I'm not sure if this is the right way to use it though. Structured file paths also look useful, but are not supported by the intake_iris plugin.
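
For readers unfamiliar with the templating: the `urlpath` above is essentially a glob pattern with the facets filled in, and "structured file paths" are roughly the reverse operation (recovering facets from a file name). Below is a minimal plain-Python sketch of both ideas. It is an illustration only, not how intake or intake_iris implement this; the helper names `find_files` and `parse_filename` and the placeholder files are made up for the example.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical template mirroring the urlpath in catalog.yaml above.
TEMPLATE = "{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc"


def find_files(root, **facets):
    """Fill the facets into the template and glob for matching files."""
    pattern = TEMPLATE.format(**facets)
    return sorted(Path(root).glob(pattern))


def parse_filename(name):
    """Reverse direction: recover the facets from a file name."""
    regex = (
        r"(?P<short_name>[^_]+)_(?P<mip>[^_]+)_(?P<dataset>[^_]+)"
        r"_(?P<exp>[^_]+)_(?P<ensemble>[^_]+)_.*\.nc"
    )
    match = re.fullmatch(regex, name)
    return match.groupdict() if match else None


# Demo on empty placeholder files in a temporary directory.
tmp = tempfile.mkdtemp()
for fname in (
    "ta_Amon_MPI-ESM-MR_historical_r1i1p1_185001-185912.nc",
    "ta_Amon_MPI-ESM-MR_historical_r1i1p1_186001-186912.nc",
    "pr_Amon_MPI-ESM-MR_historical_r1i1p1_185001-185912.nc",
):
    (Path(tmp) / fname).touch()

found = find_files(
    tmp, short_name="ta", mip="Amon", dataset="MPI-ESM-MR",
    exp="historical", ensemble="r1i1p1",
)
facets = parse_filename(found[0].name)
```

Here `found` holds the two `ta` files and `facets` is a dict of the five facets parsed back from the first file name.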

@larsbuntemeyer

larsbuntemeyer commented Oct 5, 2020

Hello, since this issue is still open, I just wanted to let you know (maybe as an update to this issue) that there is also the intake-esm plugin, built on top of intake, which gives you access to file paths. I have had good experience using it at DKRZ, where you can get file paths like this:

import intake

# local catalog at DKRZ (updated daily)
url = '/work/ik1017/Catalogs/mistral-cmip6.json'

dataframe = intake.open_esm_datastore(url)

models = dataframe.search(experiment_id='historical',
                          table_id='6hrLev',
                          variable_id='ta',
                          source_id='MPI-ESM1-2-HR',
                          institution_id='MPI-M',
                          member_id='r10i1p1f1')

filepathes = list(models.df.sort_values(by=['time_range'])['path'])

So it's built on top of pandas dataframes (models.df), which basically hold a row for each file and a column for each CF attribute (the attributes that determine the DRS). You can therefore use it to find file paths quite easily, independent of the actual DRS (as long as the data center provides the catalog).
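
The row-per-file layout can be illustrated with plain pandas. This is only a sketch of the idea, not intake-esm's implementation; the `search` helper and all paths below are invented for the example.

```python
import pandas as pd

# Toy stand-in for a catalog table: one row per file,
# one column per attribute. All paths here are invented.
df = pd.DataFrame(
    {
        "variable_id": ["ta", "ta", "pr"],
        "experiment_id": ["historical", "historical", "historical"],
        "member_id": ["r10i1p1f1", "r10i1p1f1", "r10i1p1f1"],
        "time_range": ["185501-185912", "185001-185412", "185001-185412"],
        "path": ["/data/ta_1855.nc", "/data/ta_1850.nc", "/data/pr_1850.nc"],
    }
)


def search(table, **facets):
    """Keep the rows whose columns equal the requested facet values."""
    mask = pd.Series(True, index=table.index)
    for column, value in facets.items():
        mask &= table[column] == value
    return table[mask]


subset = search(df, variable_id="ta", experiment_id="historical")
# Sort by time range so the files come out in chronological order.
paths = list(subset.sort_values(by=["time_range"])["path"])
```

Because the lookup only touches columns, it does not care how the files are laid out on disk, which is the DRS independence mentioned above.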

Another option (probably more complicated) is to use intake-esm catalogs for remote access to CMIP data in the Google Cloud (see the PANGEO catalog). However, that only supports xarray datasets, so I am not sure if that's feasible. Still, accessing data remotely (with opendap) is really well advanced within the PANGEO community, and you can always convert xarray datasets to iris cubes...

@bouweandela
Member Author

bouweandela commented Oct 5, 2020

Thanks! It looks like intake-esm has evolved a lot since I looked at it a year ago (#31 (comment))!

Found the following publicly available intake-esm catalogs:

@larsbuntemeyer

larsbuntemeyer commented Nov 18, 2020

I just wanted to dump some more thoughts here, since I now use intake a lot. We once again have common diagnostics in the CORDEX community, which are always a pain to share even though the data is CMORized. Implementing them in ESMValTool as diagnostics would be quite straightforward; for me, however, the showstopper is data access. Not all datasets are available at all ESGF nodes, but opendap access now works quite nicely for me. DKRZ now also provides open_dap URLs in its intake catalog, in addition to the python-esgf client. So in short: is there any work ongoing on a more abstract data_finder that could also find file names (or opendap URLs) from sources other than the DRS, e.g. an intake catalog, the python ESGF client, or the Pangeo cloud? That would be low-hanging fruit, I guess, and I could run diagnostics a little more independently of my local data access, without downloading...
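
To make the "more abstract data_finder" idea concrete, here is one possible shape for it. This is a hypothetical sketch, not existing ESMValTool/ESMValCore code: the names `DataSource`, `LocalDRS`, `CatalogSource` and `find_data`, the facet names, and the example URL are all assumptions made for illustration.

```python
from __future__ import annotations

import glob
import os
from dataclasses import dataclass
from typing import Protocol


class DataSource(Protocol):
    """Anything that can turn facets into file paths or URLs."""

    def find(self, **facets: str) -> list[str]: ...


@dataclass
class LocalDRS:
    """Globs the local file system using a DRS-like file name template."""
    rootpath: str
    template: str

    def find(self, **facets: str) -> list[str]:
        pattern = os.path.join(self.rootpath, self.template.format(**facets))
        return sorted(glob.glob(pattern))


@dataclass
class CatalogSource:
    """Looks up locations in an in-memory list of catalog records."""
    records: list[dict]
    location_key: str = "open_dap"

    def find(self, **facets: str) -> list[str]:
        return [
            rec[self.location_key]
            for rec in self.records
            if all(rec.get(key) == value for key, value in facets.items())
        ]


def find_data(sources: list[DataSource], **facets: str) -> list[str]:
    """Return the results of the first source that finds anything."""
    for source in sources:
        found = source.find(**facets)
        if found:
            return found
    return []


sources = [
    LocalDRS(rootpath="/nonexistent/rootpath", template="{variable_id}_{table_id}_*.nc"),
    CatalogSource(records=[
        {"variable_id": "tas", "table_id": "mon",
         "open_dap": "https://example.org/dodsC/tas_mon.nc"},
    ]),
]
# Nothing is found locally, so the lookup falls through to the catalog.
urls = find_data(sources, variable_id="tas", table_id="mon")
```

The point of the design is that callers only depend on `find`, so a local DRS, an intake catalog, or an ESGF client could be tried in order without the rest of the code knowing which one answered.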

@Peter9192
Contributor

Hey @larsbuntemeyer thanks for the update! I recently started looking into this, and I really like the idea. If you ask me it's just a matter of time before we will get to it... so your input is very valuable.

@bouweandela
Member Author

accessing data remotely (with opendap)

@larsbuntemeyer Do you know if there is an intake-esm catalog with OpenDAP URIs from one or more ESGF nodes available somewhere?

@larsbuntemeyer

larsbuntemeyer commented May 14, 2021

accessing data remotely (with opendap)

@larsbuntemeyer Do you know if there is an intake-esm catalog with OpenDAP URIs from one or more ESGF nodes available somewhere?

Yes @bouweandela, that's a good point. I only know about DKRZ: they provide intake-esm catalogs in their pool at /pool/data/Catalogs/, which are updated regularly. The CORDEX catalog also provides open_dap URLs (we explicitly asked for this some time ago). The CMIP catalogs also contain the open_dap column, but it holds no data so far. I think you can, at least for DKRZ, simply construct the open_dap URL from the path column by replacing the root path with http://esgf1.dkrz.de/thredds/dodsC, but that's probably just a hacky solution. Also, these catalogs are not publicly available outside DKRZ, although making them public would probably make sense.
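
The root-path replacement could look roughly like the sketch below. The base URL is the one quoted above; the `path_to_opendap` helper, the root directory and the file path are hypothetical, and whether the resulting URL actually resolves depends on how the THREDDS server maps its directories.

```python
from pathlib import PurePosixPath

# Base URL quoted in the comment above.
THREDDS_BASE = "http://esgf1.dkrz.de/thredds/dodsC"


def path_to_opendap(path, root, base=THREDDS_BASE):
    """Swap the local root directory of `path` for an OPeNDAP base URL."""
    # relative_to raises ValueError if `path` is not located under `root`.
    relative = PurePosixPath(path).relative_to(root)
    return f"{base.rstrip('/')}/{relative}"


# Hypothetical root and file path, for illustration only.
url = path_to_opendap(
    "/example/root/cordex/output/tas_day.nc", root="/example/root"
)
```

Applied to the `path` column of a catalog dataframe, this would be something like `df["path"].map(lambda p: path_to_opendap(p, root))`, again assuming a known root.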

In general, I like the idea of having the open_dap URLs from ESGF publicly available from an intake catalog instead of using the esgf-pyclient, which is rather tricky to use. CMIP data is accessible without restriction anyway; with CORDEX, I hope it will also become accessible more easily soon...

Maybe @agstephens or @wachsylon might have an idea...?

bouweandela added this to the v2.4.0 milestone on May 19, 2021
bouweandela changed the title from "Consider using the intake library" to "Consider using the intake-esm library" on Jun 21, 2021
bouweandela linked a pull request on Jul 6, 2021 that will close this issue
@valeriupredoi
Contributor

@bouweandela did you not mention that Intake will be retired or am I being goldfish-memoried again?

@bouweandela
Member Author

I believe @zklaus mentioned something like that, but if I remember correctly, that applied only to a very specific use case. If I have the time, I plan to finish #1218 so esmvaltool supports it.

@valeriupredoi
Contributor

awesome, cheers, man! Lemme know if you need any help with that!

@zklaus

zklaus commented Sep 14, 2021

In a recent discussion it emerged that intake will probably be supported for longer by the ESGF community, though likely not via intake-esm, which is seen as a bit too ad hoc/experimental, but rather via intake-stac, which should mesh nicely with the new STAC-based search API on ESGF. As such, I would propose postponing this feature (and #1218 with it) to a later point. We can move it to 2.5.0 for the next round of evaluation, though that might still be too early.

@bouweandela, what do you think?

bouweandela changed the milestone from v2.4.0 to v2.5.0 on Sep 14, 2021
@bouweandela
Member Author

I have no time to finish this before v2.4, so I'm fine with moving this to 2.5. Are there already any STAC intake catalogs available for CMIP data? I think it's important to support whatever catalog type is currently available online and on ESGF nodes.

bouweandela removed this from the v2.5.0 milestone on Feb 3, 2022
@znicholls

Just a +1 from me on this: making it possible to use ESMValTool with catalogues defined in arbitrary places (e.g. AWSCloud) would be a huge win

@bouweandela
Member Author

The issue with intake-esm, as @zklaus already mentioned, is that it is not based on community standards. Therefore it looks like ESGF might be using a STAC-based search API in the future. A demo of that should be available soon and I'm planning to try that out with the ESMValCore once it's available.

Another issue I encountered when testing intake-esm in #1218 is that the entire catalog is stored in a CSV file and this file grows pretty big if you have many files. For example, just loading the CMIP6 catalog for the data available on Mistral at DKRZ used to take about 2 minutes if I remember correctly. If you have many files, the catalog could even become too big to fit into memory. 2 minutes might be acceptable if you're running a large recipe, but if you just need to look up a few files this might put users off.
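
One way to sidestep the memory problem for small lookups is to scan the CSV in chunks and keep only the matching rows. The sketch below assumes a plain CSV with one column per facet; the `lookup` helper, column names and paths are made up, and this is not how intake-esm loads catalogs. Note that it bounds memory use but still reads the whole file, so it would not necessarily make the lookup much faster.

```python
import io

import pandas as pd


def lookup(csv_file, chunksize=100_000, **facets):
    """Scan a catalog CSV in chunks and return only the matching rows."""
    matches = []
    for chunk in pd.read_csv(csv_file, chunksize=chunksize, dtype=str):
        mask = pd.Series(True, index=chunk.index)
        for column, value in facets.items():
            mask &= chunk[column] == value
        matches.append(chunk[mask])
    if not matches:
        # No rows at all in the file.
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)


# Tiny in-memory stand-in for a large catalog file; paths are invented.
catalog_csv = io.StringIO(
    "variable_id,experiment_id,path\n"
    "ta,historical,/data/ta_hist.nc\n"
    "pr,historical,/data/pr_hist.nc\n"
    "ta,ssp585,/data/ta_ssp585.nc\n"
)
hits = lookup(catalog_csv, chunksize=2, variable_id="ta", experiment_id="historical")
```

Only the rows that match are ever kept, so peak memory is roughly one chunk plus the result rather than the full catalog.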

Maybe @zklaus is right and it would be best to see in what direction ESGF is going before investing time in intake-esm. On the other hand, it should not be too difficult to integrate support for it. It would however require some thought on how we structure our input data sources in the config-user/config-developer file and in the code, while maintaining backward compatibility. The current rootpath/drs setup has its share of problems and tacking on yet another data source that needs configuration might complicate that even more, especially for new users. Unfortunately, I do not have time/funding to look into this at the moment.

@znicholls

These are all good points. I guess any steps that would allow more cloud-based formats such as zarr to be supported in future would be great, as on-disk files won't be the only access pattern for long. We might have time/motivation to do some experimentation on our side later in the year, and will come back to this discussion (or open a new one) if/when that happens.

@bouweandela
Member Author

Awesome! Zarr support needs to start with iris, but we will probably have some discussions with the iris developers around the efficient use of dask soon. It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join?

@znicholls

It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join?

Ok great. If there's a mailing list or whatever we can keep an eye on that would be great. Would be interested to join, but time zones can be tricky of course (when we're in Aus)

@zklaus

zklaus commented Jun 7, 2022

I'd love to chat more on Zarr and others, but that seems slightly off-topic here, so I opened #1621.

@agstephens

Awesome! Zarr support needs to start with iris, but we will probably have some discussions with the iris developers around the efficient use of dask soon. It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join?

Might it be worth considering xarray to harness the zarr backend, with the ability to export to iris? For example:

https://xarray.pydata.org/en/stable/generated/xarray.DataArray.to_iris.html

@bouweandela
Member Author

bouweandela commented Jun 27, 2022

Thanks for the suggestion @agstephens! I will check with the iris developers what their plans are regarding support for zarr. The iris developers were at some point planning to start building iris on top of xarray instead of using their own code for loading data etc., but have not done this yet because they were encountering performance issues with xarray. I think we may hit the same performance issues if we do this, but I have not investigated in depth so far due to lack of time.

Would be interested to join, but time zones can be tricky of course (when we're in Aus)

@znicholls You should have received an email with an invitation to join a meeting with the iris developers on this topic. Did you get it? If you cannot make it this time, we can try scheduling earlier next time.

@znicholls

@znicholls You should have received an email with an invitation to join a meeting with the iris developers on this topic. Did you get it? If you cannot make it this time, we can try scheduling earlier next time.

Yep I got it but will be halfway back to Australia at that time unfortunately

Peter9192 removed their assignment on Jun 28, 2022
@rbeucher
Contributor

rbeucher commented Jul 7, 2023

+1 on this. We are indexing our experiments using an intake-esm catalog. @dougiesquire.
I think having that support in ESMValTool would be a huge win.

Happy to join meetings.

@bouweandela
Member Author

For reference: here is a prototype ESGF STAC API and Jupyter notebook that shows how it could be used.

@bouweandela
Member Author

Related projects intake-stac and intake-esgf
