Suggestion: moving from NetCDF4 to xarray #51

Closed
ks905383 opened this issue Nov 13, 2024 · 2 comments

@ks905383

Having data available in NetCDF4 format is incredibly useful, both for cross-compatibility with much of climate/weather analysis and for the tools built around the format. These days, however, a lot of this analysis is more easily (and more commonly) done in xarray (docs), which builds on netCDF4 + numpy + pandas, rather than with the netCDF4 package itself + numpy.

I could see a benefit to the user if api.get_data(..., use_opendap=True) returned an xarray.Dataset by default rather than a netCDF4 object.

Without changing anything else, the easiest thing to do would be to use xarray's built-in converter:

import xarray as xr
nc = api.get_data(station_ids=["tplm2",'44025'], modes=["stdmet"], start_time="2019-06-01", end_time="2024-06-01", use_opendap=True)
ds = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))

But the real benefit would be something that transfers some of the metadata from the stations info into the xarray object (and therefore, to any netcdf file saved from it). Something like this:

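import re

# Assumes stdmet_df is a pandas DataFrame with 'station_id' and 'timestamp' as index levels or columns
ds = stdmet_df.reset_index().set_index(['station_id','timestamp']).to_xarray()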

metas = []
for sid in ds.station_id.values:
    # Get station metadata
    smeta = api.station(sid)

    # Create empty Dataset with dimension of station_id
    ds_meta = xr.Dataset(coords = {'station_id':[sid]})

    for attr in smeta:
        if attr == 'Location':
            for dim,dirc in zip(['lat','lon'],['NS','EW']):
                # Find lat / lon in string
                dim_set = re.search(r'[0-9]*\.[0-9]*\ ['+dirc+']',smeta['Location']).group(0)
                # Turn into float (multiplied by +1 / -1 depending on whether it's N/E or S/W)
                dim_set = float(re.search(r'[0-9]*\.[0-9]*',dim_set).group(0))*(1 if dirc[0] in dim_set else -1)
                # Add to dataset
                ds_meta[dim] = (['station_id'],[dim_set])
        elif re.search(r'^[0-9]+\.{0,1}[0-9]*',smeta[attr]) is not None:
            # Assume everything of the form '##.## lorem ipsum' is split
            # into a value and units
            # Get units as the bit after the number
            units = re.split(r'^[0-9]+\.{0,1}[0-9]*\ ',smeta[attr])[-1]
            # Get the value
            value = float(re.search(r'^[0-9]+\.{0,1}[0-9]*',smeta[attr]).group(0))
            ds_meta[re.sub(r'\ ','_',attr).lower()] = (['station_id'],[value])
            ds_meta[re.sub(r'\ ','_',attr).lower()].attrs['units'] = units
        else:
            # If doesn't fit either paradigm, just copy in as string
            ds_meta[attr] = (['station_id'],[smeta[attr]])

    metas.append(ds_meta)

# Concatenate across station ids
metas = xr.concat(metas,dim='station_id')

# Merge with original dataset, make coordinates
ds = xr.merge([ds,metas]).set_coords([k for k in metas])
print(ds)
[Image: Screenshot 2024-11-13 at 3:46:47 PM]

Then it can either be saved as a NetCDF file using ds.to_netcdf(fn) or further analyzed using xarray's tools.
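The string parsing in the loop above can also be pulled out into two small standalone functions, which makes it easier to test without a live API call. This is only a sketch: the function names are hypothetical, and the example strings are illustrative guesses at the 'value direction' and 'value units' shapes the regexes above assume, not verified API output.

```python
import re


def parse_location(location):
    """Return (lat, lon) in signed decimal degrees from a string like '38.898 N 76.436 W'."""
    coords = {}
    for name, positive, negative in (("lat", "N", "S"), ("lon", "E", "W")):
        # A decimal number followed by one of the two direction letters
        match = re.search(
            rf"([0-9]+(?:\.[0-9]+)?)\s*([{positive}{negative}])\b", location
        )
        if match is None:
            raise ValueError(f"no {name} found in {location!r}")
        value = float(match.group(1))
        # N/E are positive, S/W are negative
        coords[name] = value if match.group(2) == positive else -value
    return coords["lat"], coords["lon"]


def split_value_units(text):
    """Split '2.7 m above site elevation' into (2.7, 'm above site elevation').

    Returns None when the text does not start with a number followed by units.
    """
    match = re.match(r"([0-9]+(?:\.[0-9]+)?)\s+(.+)", text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).strip()


lat, lon = parse_location("38.898 N 76.436 W")
value_units = split_value_units("2.7 m above site elevation")
```

Returning None for strings that don't fit the 'value units' shape matches the fallback branch above, where the attribute is just copied in as a string.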

This is totally a suggestion and I clearly got a little sidetracked with it, so feel free to take it or leave it (but I think this will make a big difference in terms of usability in workflows using array data).

This is related to JOSS review openjournals/joss-reviews#7406.

@CDJellen
Owner

Thank you for the suggestion, @ks905383; this looks excellent. At a minimum, switching to xarray as a wrapper for the netCDF4 datasets is worth implementing. I initially attempted this directly through xarray, but was unable to load the data from a URL without the extra step of reading it into a local temp file. As for including the station metadata, the snippet you posted speaks for itself. I'll work on implementing this, possibly with an include_metadata flag or similar. I'll also try to make sure the behavior maps well to xarray.concat.

The latter might take some time to implement, but the former will be included in the next release.
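To make the flag idea concrete, here is a rough sketch of what such a helper could look like. Everything in it is an assumption for discussion, not the shipped implementation: the function name, the `include_metadata` parameter, the shape of the `station_meta` mapping, and the coordinate values are all hypothetical.

```python
import xarray as xr


def attach_station_metadata(ds, station_meta, include_metadata=True):
    """Hypothetical helper: add per-station metadata to `ds` as coordinates.

    `station_meta` maps station_id -> {field name: value} (assumed shape).
    """
    if not include_metadata:
        return ds

    per_station = []
    for sid in ds["station_id"].values:
        fields = station_meta[str(sid)]
        # One single-station Dataset per id, so they concatenate cleanly
        per_station.append(
            xr.Dataset(
                {name: (["station_id"], [value]) for name, value in fields.items()},
                coords={"station_id": [sid]},
            )
        )

    meta = xr.concat(per_station, dim="station_id")
    # Merge, then promote the metadata variables to coordinates
    return xr.merge([ds, meta]).set_coords(list(meta.data_vars))


# Toy data; the values below are illustrative, not real observations.
ds = xr.Dataset(
    {"wspd": (["station_id", "time"], [[4.0, 5.5], [6.0, 7.5]])},
    coords={"station_id": ["44025", "tplm2"], "time": [0, 1]},
)
station_meta = {
    "44025": {"lat": 40.258, "lon": -73.175},
    "tplm2": {"lat": 38.898, "lon": -76.436},
}
with_meta = attach_station_metadata(ds, station_meta)
without_meta = attach_station_metadata(ds, station_meta, include_metadata=False)
```

Building one single-station Dataset at a time keeps every piece indexed by station_id, so the pieces line up along the same dimension that xarray.concat would use to combine results from separate requests.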

CDJellen self-assigned this Nov 15, 2024
CDJellen added the enhancement (New feature or request) label Nov 15, 2024
@CDJellen
Owner

Thank you again for these recommendations! The changes are included in #52, and the release to PyPI and Conda was made in #53.
