Add THREDDSMergedSource implementation #3

Merged
merged 5 commits into from
Nov 30, 2020

Conversation

martindurant

xref intake/intake-xarray#29 @rabernat

This example takes a THREDDS URL and a path to descend down, and calls the combine function on all of the datasets found, e.g.,

import intake
import intake_thredds
s = intake_thredds.source.THREDDSMergedSource('http://dap.nci.org.au/thredds/catalog.xml', ['eMAST TERN', 'eMAST TERN - files', 'ASCAT', 'ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011', '00000000', '*.nc'])
s.to_dask()

results in (and this takes a while)

<xarray.Dataset>
Dimensions:                                 (lat: 681, lon: 841, time: 1826)
Coordinates:
  * lat                                     (lat) float64 -10.0 -10.05 ... -44.0
  * lon                                     (lon) float64 112.0 112.0 ... 154.0
  * time                                    (time) datetime64[ns] 2007-01-01 ... 2011-12-31
Data variables:
    crs                                     (time) uint8 129 129 129 ... 129 129
    lwe_thickness_of_soil_moisture_content  (time, lat, lon) float32 0.0 ... 0.0

(a randomly-chosen .nc file has 31 timepoints - days of a month, I think)
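For readers skimming the thread, the "path to descend down" idea can be sketched roughly as below. This is only an illustration: the catalog is a hypothetical nested dict standing in for a real THREDDS catalog, the helper name `match` is made up for the example, and the leaves are plain strings rather than intake sources.

```python
import fnmatch


def match(cat, patterns):
    """Collect leaves of a nested mapping whose path matches each glob pattern in turn."""
    out = []
    for name in cat:
        if fnmatch.fnmatch(name, patterns[0]):
            if len(patterns) == 1:
                # Last pattern: this entry is a dataset to collect.
                out.append(cat[name])
            else:
                # Otherwise descend one level with the remaining patterns.
                out.extend(match(cat[name], patterns[1:]))
    return out


# Hypothetical stand-in for a catalog tree (assumption: nested dicts, strings as leaves).
catalog = {
    "ASCAT": {
        "00000000": {
            "a_200701.nc": "ds-2007-01",
            "a_200702.nc": "ds-2007-02",
            "readme.txt": "notes",
        },
    },
    "OTHER": {"00000000": {"b.nc": "ds-other"}},
}

found = match(catalog, ["ASCAT", "00000000", "*.nc"])
print(found)
```

Each element of the path list is treated as a glob, so the final `'*.nc'` selects every NetCDF entry at the last level; the matched datasets are what would then be handed to the combine step.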

for name in cat:
    if fnmatch.fnmatch(name, patterns[0]):
        if len(patterns) == 1:
            out.append(cat[name](chunks={}))

@andersy005 andersy005 Apr 17, 2020

This example takes a THREDDS URL and a path to descend down, and calls the combine function on all of the datasets found, e.g.,

import intake
import intake_thredds
s = intake_thredds.source.THREDDSMergedSource('http://dap.nci.org.au/thredds/catalog.xml', ['eMAST TERN', 'eMAST TERN - files', 'ASCAT', 'ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011', '00000000', '*.nc'])
s.to_dask()

results in (and this takes a while)

Using chunks={} appears to speed up xarray's combine step:

In [1]: import intake                                                                                              

In [2]: url = 'http://dap.nci.org.au/thredds/catalog.xml'                                                          

In [3]: paths = ['eMAST TERN', 'eMAST TERN - files', 'ASCAT', 'ASCAT_v1-0_soil-moisture_daily_0-05deg_2007-2011', '00000000', '*.nc']

In [4]: s = intake.open_thredds_merged(url, paths)                                                                 

In [5]: %%time 
   ...: ds = s.to_dask() 
   ...:  
   ...:                                                                                                            
Dataset(s): 100%|██████████████████████████████| 60/60 [02:36<00:00,  2.60s/it]
CPU times: user 875 ms, sys: 235 ms, total: 1.11 s
Wall time: 2min 41s


@andersy005 andersy005 requested a review from jhamman April 17, 2020 22:13
@andersy005 andersy005 changed the title Example of using xr.auto_combine Add THREDDSMergedSource implementation Apr 17, 2020
@andersy005

@martindurant, I added a few changes to this PR. When you get a moment, can you take a look and let me know what you think?

Currently, data loading is dispatched to the netcdf source. Should we use the opendap source instead?

    data = [ds.to_dask() for ds in tqdm(_match(cat, path), desc='Dataset(s)', ncols=79)]
else:
    data = [ds.to_dask() for ds in _match(cat, path)]
self._ds = xr.combine_by_coords(data)
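As an aside, the two branches above differ only in whether the iterable is wrapped in tqdm. One possible way to fold that into a helper is sketched below; `with_progress`, `load_all`, and the dummy loaders are hypothetical names for illustration and are not part of this PR.

```python
def with_progress(items, enabled=True, desc="Dataset(s)"):
    """Wrap an iterable in tqdm when requested and installed, else return it unchanged."""
    if not enabled:
        return items
    try:
        from tqdm import tqdm  # optional dependency
    except ImportError:
        return items
    return tqdm(items, desc=desc, ncols=79)


def load_all(loaders, progress=True):
    """Call every loader, optionally showing a progress bar while doing so."""
    return [load() for load in with_progress(loaders, enabled=progress)]


# Dummy loaders standing in for ds.to_dask() calls (assumption for the example).
loaders = [lambda i=i: {"id": i} for i in range(3)]
data = load_all(loaders, progress=False)
print(data)
```

The result is the same list either way; the progress bar only changes what is written to the terminal.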

The auto_combine() function was deprecated in the most recent versions of xarray.
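For context, combine_by_coords orders the pieces by their coordinate values rather than by the order in which they were found. The toy sketch below illustrates only that ordering idea with plain dicts; `combine_by_time` is a hypothetical helper, not xarray's implementation, and real code would pass xarray Datasets to `xr.combine_by_coords`.

```python
def combine_by_time(pieces):
    """Concatenate pieces along 'time', ordered by their coordinate values."""
    ordered = sorted(pieces, key=lambda p: p["time"][0])
    time, values = [], []
    for piece in ordered:
        # Refuse pieces whose coordinates overlap what has already been combined.
        if time and piece["time"][0] <= time[-1]:
            raise ValueError("time coordinates overlap or are not monotonic")
        time.extend(piece["time"])
        values.extend(piece["values"])
    return {"time": time, "values": values}


# Two monthly pieces supplied out of order (made-up numbers for illustration).
feb = {"time": [32, 33], "values": [0.3, 0.4]}
jan = {"time": [1, 2], "values": [0.1, 0.2]}

combined = combine_by_time([feb, jan])
print(combined)
```

Because the ordering comes from the coordinates themselves, the order in which the files are listed in the catalog should not matter.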

@martindurant

should we use the opendap source instead

Because of the auth options? I don't know the typical use for THREDDS, so I'll leave it up to you. The metadata describes the endpoint as DAP, if I remember correctly.

@aaronspring

I would be interested in seeing this PR merged. Any progress here? @andersy005 @martindurant @larsbuntemeyer

@martindurant

I honestly don't remember where this was up to.
@aaronspring , you may be in a fine place to have an opinion.
cc @scottyhq @larsbuntemeyer for some similar discussion in intake-xarray

@martindurant

Alternatively, happy to merge if there are no more comments. @aaronspring , sound good?

@andersy005

I honestly don't remember where this was up to.
Alternatively, happy to merge if there are no more comments. @aaronspring , sound good?

👍🏽 for merging this as is, and addressing issues + adding new features in separate PRs....

@martindurant martindurant merged commit 0d4728e into intake:master Nov 30, 2020
@martindurant martindurant deleted the merger branch November 30, 2020 19:10