Consider using the intake-esm library #31

bouweandela · 2019-04-25T13:17:50Z

It could be interesting to replace our home grown esmvaltool/_data_finder.py module by the much more advanced intake library and the associated intake-iris plugin.

valeriupredoi · 2019-05-22T16:21:15Z

I started sniffing around intake: it is easy to install via conda and looks easy to use (I ran the little example from the docs), but it looks to me like something accountants or auditors would rather use than us... maybe I'll find something cooler the more I look at it.

bouweandela · 2019-08-30T11:14:21Z

intake-esm is an interesting project, but it doesn't support iris or a custom DRS at the moment: https://intake-esm.readthedocs.io/en/latest/index.html

bouweandela · 2019-08-30T11:44:06Z

Example using intake_iris:

import intake
cat = intake.Catalog("catalog.yaml")
source = cat['CMIP5'].get(short_name='ta', mip='Amon', dataset='MPI-ESM-MR', exp='historical', ensemble='r1i1p1')
cubes = source.to_dask()

where catalog.yaml looks like

metadata:
  version: 1

sources:

  CMIP5:
    description: ''
    driver: intake_iris.NetCDFSource
    args:
      urlpath: '/home/bandela/esmvaltool_input/{{short_name}}_{{mip}}_{{dataset}}_{{exp}}_{{ensemble}}*.nc'
    parameters:
      dataset:
        description: Dataset
        type: str
      short_name:
        description: Short name
        type: str
      mip:
        description: MIP table
        type: str
      exp:
        description: Experiment
        type: str
      ensemble:
        description: Ensemble
        type: str

I'm not sure if this is the right way to use it though. Structured file paths also look useful, but are not supported by the intake_iris plugin.
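To make the catalog entry above more concrete, here is a minimal sketch of how the templated `urlpath` turns into a glob pattern once the user parameters are filled in. As far as I understand, intake renders these `{{name}}` placeholders with Jinja2; the sketch below only imitates that with plain string replacement, and the helper `expand_urlpath` and the example file name are hypothetical.

```python
import fnmatch

# Template copied from the catalog.yaml above.
URLPATH = (
    "/home/bandela/esmvaltool_input/"
    "{{short_name}}_{{mip}}_{{dataset}}_{{exp}}_{{ensemble}}*.nc"
)


def expand_urlpath(template, **facets):
    """Replace each {{name}} placeholder with the matching facet value."""
    pattern = template
    for name, value in facets.items():
        pattern = pattern.replace("{{" + name + "}}", value)
    return pattern


pattern = expand_urlpath(
    URLPATH,
    short_name="ta",
    mip="Amon",
    dataset="MPI-ESM-MR",
    exp="historical",
    ensemble="r1i1p1",
)

# Hypothetical file name, used only to show that the pattern matches it.
example = (
    "/home/bandela/esmvaltool_input/"
    "ta_Amon_MPI-ESM-MR_historical_r1i1p1_185001-200512.nc"
)
matches = fnmatch.fnmatchcase(example, pattern)
```

The trailing `*` in the template is what absorbs the time range part of the file name, so every facet that should select files has to appear explicitly in the template.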

larsbuntemeyer · 2020-10-05T18:23:38Z

Hello, since this issue is still open, I just wanted to let you know (maybe as an update to this issue) that there is also the intake-esm plugin, built on top of intake, that gives you access to file paths. I have had a good experience using it at DKRZ, where you can get file paths like this:

import intake

# local catalog at DKRZ (updated daily)
url = '/work/ik1017/Catalogs/mistral-cmip6.json'

dataframe = intake.open_esm_datastore(url)

models = dataframe.search(experiment_id='historical',
                          table_id='6hrLev',
                          variable_id='ta',
                          source_id='MPI-ESM1-2-HR',
                          institution_id='MPI-M',
                          member_id='r10i1p1f1')

filepathes = list(models.df.sort_values(by=['time_range'])['path'])

So it's built on top of pandas dataframes (models.df), basically holding a row for each file and a column for each CF attribute (the attributes that determine the DRS). You can use it to find file paths quite easily, independent of the actual DRS (as long as the data center provides the catalog).
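The "one row per file, one column per attribute" idea can be sketched without intake-esm at all. The example below is only an illustration of the data layout, not intake-esm's implementation: the catalog rows, the paths, and the `search` helper are made up, and it uses the standard library instead of pandas.

```python
import csv
import io

# Made-up catalog: one row per file, one column per attribute.
CATALOG_CSV = """\
variable_id,table_id,source_id,experiment_id,member_id,time_range,path
ta,6hrLev,MPI-ESM1-2-HR,historical,r10i1p1f1,198001010600-198501010000,/hypothetical/root/ta_1980.nc
ta,6hrLev,MPI-ESM1-2-HR,historical,r10i1p1f1,197501010600-198001010000,/hypothetical/root/ta_1975.nc
tas,Amon,MPI-ESM1-2-HR,historical,r10i1p1f1,185001-201412,/hypothetical/root/tas_1850.nc
"""


def search(rows, **facets):
    """Keep the rows whose columns equal every requested facet value."""
    return [
        row for row in rows
        if all(row[key] == value for key, value in facets.items())
    ]


rows = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
hits = search(rows, variable_id="ta", table_id="6hrLev", experiment_id="historical")

# Sort by time range, as in the snippet above, then keep only the paths.
paths = [row["path"] for row in sorted(hits, key=lambda row: row["time_range"])]
```

Because the lookup only depends on the columns, the directory layout behind `path` can be anything, which is the DRS independence described above.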

Another option (probably more complicated) is to use intake-esm catalogs for remote access to CMIP data in the Google Cloud (see the PANGEO catalog). However, that only supports xarray datasets, so I am not sure if that's feasible, but accessing data remotely (with opendap) is really well advanced within the PANGEO community and you can always convert xarray datasets to iris cubes...

bouweandela · 2020-10-05T19:28:25Z

Thanks! It looks like intake-esm has evolved a lot since I looked at it a year ago #31 (comment)!

Found the following publicly available intake-esm catalogs:

larsbuntemeyer · 2020-11-18T12:11:35Z

I just wanted to dump some more thoughts here, since I now use intake a lot. We now again have common diagnostics in the Cordex community, which are always a pain to share although the data is cmorized. Implementing them in esmvaltool as diagnostics would be quite straightforward; however, for me the showstopper is data access. Not all datasets are always available at all ESGF nodes, but opendap access is now working quite nicely for me. DKRZ now also provides open_dap URLs in its intake catalog, besides the python-esgf client. So in short, is there some work ongoing on a more abstract data_finder that might also allow finding filenames (or opendap URLs) from sources other than the DRS, e.g. an intake catalog, the python esgf client, or the Pangeo cloud? That would be low-hanging fruit, I guess, and I could run diagnostics a little more independently of my local data access, without downloading...

Peter9192 · 2020-11-18T16:08:23Z

Hey @larsbuntemeyer thanks for the update! I recently started looking into this, and I really like the idea. If you ask me it's just a matter of time before we will get to it... so your input is very valuable.

bouweandela · 2021-05-14T16:43:06Z

accessing data remotely (with opendap)

@larsbuntemeyer Do you know if there is an intake-esm catalog with OpenDAP URIs from one or more ESGF nodes available somewhere?

larsbuntemeyer · 2021-05-14T19:46:20Z

accessing data remotely (with opendap)

@larsbuntemeyer Do you know if there is an intake-esm catalog with OpenDAP URIs from one or more ESGF nodes available somewhere?

Yes @bouweandela, that's a good point. I only know about DKRZ: they provide intake-esm catalogs in their pool at /pool/data/Catalogs/, which are updated regularly. The Cordex catalog also provides open_dap URLs (we asked for this explicitly some time ago). The cmip catalogs also contain the open_dap column, but it holds no data so far. I think you can, at least for DKRZ, simply construct the open_dap URL from the path column by replacing the root path with http://esgf1.dkrz.de/thredds/dodsC. But that's probably just a hacky solution. Also, these catalogs are not publicly available outside DKRZ, although making them public would probably make sense.
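The "hacky solution" of swapping the root path for the THREDDS base URL could look roughly like the sketch below. The base URL is the one given above; the local root directory is not stated in this thread, so `LOCAL_ROOT` is a placeholder, and `to_opendap_url` is a hypothetical helper name.

```python
THREDDS_BASE = "http://esgf1.dkrz.de/thredds/dodsC"  # from the comment above
LOCAL_ROOT = "/hypothetical/dkrz/root"  # placeholder, not the real root path


def to_opendap_url(path, local_root=LOCAL_ROOT, base=THREDDS_BASE):
    """Swap the local root directory of `path` for the THREDDS base URL."""
    root = local_root.rstrip("/")
    if not path.startswith(root + "/"):
        # Refuse paths outside the root instead of building a wrong URL.
        raise ValueError(f"{path!r} is not below {root!r}")
    return base.rstrip("/") + path[len(root):]


url = to_opendap_url("/hypothetical/dkrz/root/cmip6/CMIP/ta.nc")
```

Whether the resulting URL actually resolves depends on the server publishing the same directory layout below the base URL, which is an assumption here.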

In general, I like the idea of having the open_dap URLs from ESGF publicly available from an intake catalog instead of using the esgf-pyclient, which is rather tricky to use. CMIP data is accessible without restriction anyway; with Cordex I hope it will also become accessible more easily soon...

Maybe @agstephens or @wachsylon might have an idea...?

valeriupredoi · 2021-08-24T11:58:04Z

@bouweandela did you not mention that Intake will be retired or am I being goldfish-memoried again?

bouweandela · 2021-08-24T12:29:19Z

I believe @zklaus mentioned something like that, but if I remember correctly, that applied only to a very specific use case. If I have the time, I plan to finish #1218 so esmvaltool supports it.

valeriupredoi · 2021-08-24T13:58:40Z

awesome, cheers, man! Lemme know if you need any help with that!

zklaus · 2021-09-14T12:24:43Z

In a recent discussion it emerged that intake will probably be supported longer by the ESGF community, but likely not with intake-esm, which is seen as a bit too ad-hoc/experimental, but rather via intake-stac, which should mesh nicely with the new STAC-based search API on ESGF. As such, I would propose to postpone this feature (and #1218 with it) to a later point. We can move it to 2.5.0 for the next round of evaluation, though that might still be too early.

@bouweandela, what do you think?

bouweandela · 2021-09-14T12:48:20Z

I have no time to finish this before v2.4, so I'm fine with moving this to 2.5. Are there already any STAC intake catalogs available for CMIP data? I think it's important to support whatever catalog type is currently available online and on ESGF nodes.

znicholls · 2022-06-06T14:43:02Z

Just a +1 from me on this: making it possible to use ESMValTool with catalogues defined in arbitrary places (e.g. AWSCloud) would be a huge win

bouweandela · 2022-06-07T09:45:11Z

The issue with intake-esm, as @zklaus already mentioned, is that it is not based on community standards. Therefore it looks like ESGF might be using a STAC-based search API in the future. A demo of that should be available soon and I'm planning to try that out with the ESMValCore once it's available.

Another issue I encountered when testing intake-esm in #1218 is that the entire catalog is stored in a CSV file and this file grows pretty big if you have many files. For example, just loading the CMIP6 catalog for the data available on Mistral at DKRZ used to take about 2 minutes if I remember correctly. If you have many files, the catalog could even become too big to fit into memory. 2 minutes might be acceptable if you're running a large recipe, but if you just need to look up a few files this might put users off.
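For the "just look up a few files" case, one conceivable workaround is to stream through the CSV and keep only the matching rows, so the full table never has to sit in memory. The sketch below is an assumption about how that could be done with the standard library, not something intake-esm offers as far as I know; `find_paths` and the catalog contents are hypothetical, and I have not measured whether this would be faster on a real catalog.

```python
import csv
import io


def find_paths(lines, **facets):
    """Yield the path of each catalog row that matches all facets.

    `lines` is any iterable of CSV lines (for example an open file), so
    rows are read one at a time instead of loading the whole table.
    """
    for row in csv.DictReader(lines):
        if all(row[key] == value for key, value in facets.items()):
            yield row["path"]


# Made-up stand-in for a large catalog file.
catalog = io.StringIO(
    "variable_id,source_id,path\n"
    "ta,MPI-ESM1-2-HR,/hypothetical/ta.nc\n"
    "tas,MPI-ESM1-2-HR,/hypothetical/tas.nc\n"
)
found = list(find_paths(catalog, variable_id="tas"))
```

This keeps memory use flat but still scans the whole file for every lookup, so it addresses the memory concern more than the loading time.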

Maybe @zklaus is right and it would be best to see in what direction ESGF is going before investing time in intake-esm. On the other hand, it should not be too difficult to integrate support for it. It would however require some thought on how we structure our input data sources in the config-user/config-developer file and in the code, while maintaining backward compatibility. The current rootpath/drs setup has its share of problems and tacking on yet another data source that needs configuration might complicate that even more, especially for new users. Unfortunately, I do not have time/funding to look into this at the moment.

znicholls · 2022-06-07T09:51:16Z

These are all good points. I guess any steps which would allow more cloud-based, zarr formats to be supported in future would be great as on disk files won't be the only access pattern for long. We might have time/motivation to do some experimentation on our side later in the year, will come back to this discussion (or open a new one) if/when that happens.

bouweandela · 2022-06-07T10:10:31Z

Awesome! Zarr support needs to start with iris, but we will probably have some discussions with the iris developers around the efficient use of dask soon. It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join?

znicholls · 2022-06-07T10:11:49Z

It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join?

Ok great. If there's a mailing list or whatever we can keep an eye on that would be great. Would be interested to join, but time zones can be tricky of course (when we're in Aus)

zklaus · 2022-06-07T10:28:30Z

I'd love to chat more on Zarr and others, but that seems slightly off-topic here, so I opened #1621.

agstephens · 2022-06-07T10:40:44Z

Awesome! Zarr support needs to start with iris, but we will probably have some discussions with the iris developers around the efficient use of dask soon. It seems likely that support for the zarr format will come up then too. If you're interested you would be welcome to join?

Might it be worth considering xarray to harness the zarr backend - with the ability to export to iris, e.g.:

https://xarray.pydata.org/en/stable/generated/xarray.DataArray.to_iris.html

bouweandela · 2022-06-27T08:05:15Z

Thanks for the suggestion @agstephens! I will check with the iris developers what their plans are regarding support for zarr. The iris developers were at some point planning to start building iris on top of xarray instead of using their own code for loading data etc, but did not do this yet because they were encountering performance issues with xarray. I think we may hit the same performance issues if we do this, but did not investigate in-depth due to lack of time so far.

Would be interested to join, but time zones can be tricky of course (when we're in Aus)

@znicholls You should have received an email with an invitation to join a meeting with the iris developers on this topic. Did you get it? If you cannot make it this time, we can try scheduling earlier next time.

znicholls · 2022-06-27T12:04:34Z

@znicholls You should have received an email with an invitation to join a meeting with the iris developers on this topic. Did you get it? If you cannot make it this time, we can try scheduling earlier next time.

Yep I got it but will be halfway back to Australia at that time unfortunately

rbeucher · 2023-07-07T02:26:52Z

+1 on this. We are indexing our experiments using an intake-esm catalog. @dougiesquire.
I think having that support in ESMValTool would be a huge win.

Happy to join meetings.

bouweandela · 2024-02-05T10:44:39Z

For reference: here is a prototype ESGF STAC API and Jupyter notebook that shows how it could be used.

bouweandela · 2024-02-08T07:31:35Z

Related projects intake-stac and intake-esgf

mattiarighi assigned valeriupredoi Jun 11, 2019

mattiarighi transferred this issue from ESMValGroup/ESMValTool Jun 11, 2019

mattiarighi added the enhancement New feature or request label Jun 11, 2019

bouweandela mentioned this issue Jul 3, 2019

Improve user friendliness of data finding (missing data) #138

Closed

bouweandela mentioned this issue Aug 23, 2019

Enhance error when input files are not found #98

Merged

bouweandela assigned stefsmeets, nielsdrost, Peter9192 and bouweandela Oct 22, 2020

This was referenced Jan 14, 2021

Add warning about usability of run command line flags #936

Open

How do we treat reanalysis with members? #945

Open

bouweandela mentioned this issue May 14, 2021

Distributed ESMValTool #1128

Closed

bouweandela added this to the v2.4.0 milestone May 19, 2021

bouweandela mentioned this issue Jun 9, 2021

Support OpenDAP access to ESGF #1131

Closed

bouweandela changed the title ~~Consider using the intake library~~ Consider using the intake-esm library Jun 21, 2021

bouweandela linked a pull request Jul 6, 2021 that will close this issue

Add support for intake-esm #1218

Draft

10 tasks

bouweandela modified the milestones: v2.4.0, v2.5.0 Sep 14, 2021

bouweandela removed this from the v2.5.0 milestone Feb 3, 2022

Peter9192 removed their assignment Jun 28, 2022

bouweandela added this to ESiWACE3 ESMValTool service Feb 21, 2024

bouweandela mentioned this issue Apr 3, 2024

New configuration file plan #2371

Open

bouweandela moved this to Todo in ESiWACE3 ESMValTool service Jun 10, 2024

stefsmeets removed their assignment Aug 6, 2024

rbeucher mentioned this issue Sep 22, 2024

Add Support for Intake ESM Catalogues in ESMValCore / ESMValTool ACCESS-NRI/ESMValTool-workflow#193

Open
