Structured numpy arrays, xarray and netCDF(4) #1626

Open
tfurf opened this issue Oct 11, 2017 · 6 comments

@tfurf

tfurf commented Oct 11, 2017

I'm trying to use xarray as the underlying container for some data processing tasks. Part of the pipeline converts data from non-standard or not easily readable formats (e.g. ROS messages) to standard formats such as netCDF(4). The data I tend to work with is structured time series data, which maps pretty well onto structured numpy arrays via dtype manipulation. xarray lightly wraps numpy and provides netCDF as a backend. However, the xarray implementation doesn't really expose this capability, which netCDF supports as 'compound data types', and in fact it fails when you try to write such a DataArray/Dataset to file (at _nc4_values_and_dtype).

So the question is, is this a reasonable feature/expectation from xarray (and thus you're receptive to contributions), or is this outside the goal/purpose (I should roll my own/use pandas/etc)?
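For concreteness, here is a minimal sketch of the kind of structured time series described above. The field names (stamp, position, status) are invented for illustration and do not come from any actual ROS message definition.

```python
import numpy as np

# Hypothetical record layout: one timestamp, a 3-vector and a status byte.
record = np.dtype([
    ("stamp", "f8"),
    ("position", "f4", (3,)),
    ("status", "u1"),
])

# Four samples, all stored in a single structured array.
samples = np.zeros(4, dtype=record)
samples["stamp"] = np.arange(4) * 0.1
samples["position"][:, 0] = [1.0, 2.0, 3.0, 4.0]
```

Each field is addressable by name (samples["position"] has shape (4, 3)), and the array as a whole has dtype kind 'V'. It is an array of this shape that, per the report above, fails when written to netCDF.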

@shoyer
Member

shoyer commented Oct 11, 2017

It is a little challenging to make structured arrays work with all of xarray's computational tools. For example, we don't have a good way to handle missing values.

Also, in my experience, non-structured arrays are nicer to work with in most cases, and a tool like xarray makes it pretty easy to unpack non-structured arrays into multiple arrays in a Dataset, possibly with different dimensions.

That said, we've added some workarounds in the past to ensure that structured arrays work in xarray, and I would be happy to accept contributions to write them to netCDF files. I'm sure there are others who would also find this useful.
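One way to read the unpacking suggestion is to split a structured array into one plain array per field. The sketch below assumes a 1-D structured array; the helper name unpack_fields and the generated dimension names are hypothetical, not part of xarray.

```python
import numpy as np


def unpack_fields(structured, dim="time"):
    """Split a 1-D structured array into one plain array per field.

    Returns a mapping {field_name: (dims, data)}.
    """
    variables = {}
    for name in structured.dtype.names:
        column = np.ascontiguousarray(structured[name])
        # Sub-array fields carry trailing axes; give each one a made-up name.
        extra = tuple(f"{name}_{axis}" for axis in range(column.ndim - 1))
        variables[name] = ((dim,) + extra, column)
    return variables
```

The returned mapping has the {name: (dims, data)} form that xarray.Dataset accepts for its data variables, so xr.Dataset(unpack_fields(samples)) should yield one ordinary variable per field, each of which can then be written to netCDF in the usual way.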

@equaeghe

equaeghe commented Feb 5, 2018

I'd also like to see better support for compound types, writing them for starters. I'll collect some information here:

  • In the code @tfurf linked to (_nc4_values_and_dtype), an elif needs to be added to catch structured dtypes. I think they have kind == 'V'.

  • dtype.isbuiltin can be used to detect whether we are indeed dealing with a structured type: it must be 0.

  • The structured type must first be added to the netCDF4.Dataset using its method createCompoundType. This must be done recursively, with the deepest levels first.

  • The netCDF variable is created in prepare_variable, which calls _nc4_values_and_dtype. There, via self.ds, we also have access to the netCDF4 Dataset to be used for the creation of the compound type mentioned above. However, is self.ds really the Dataset, or some netCDF4.Group? In any case, _nc4_values_and_dtype and its use in prepare_variable need to be refactored, because we need access to the underlying netCDF4 Dataset.

Is there anything I've missed? Can someone shed light on whether self.ds in prepare_variable can be assumed to be the underlying netCDF4 Dataset?
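The detection and "deepest levels first" steps listed above can be sketched with NumPy alone. This is not xarray's implementation: the helper names and the type-naming scheme are hypothetical, and the actual registration call is left as a comment. Note that NumPy spells the attribute dtype.isbuiltin, and that kind == 'V' also matches plain void and sub-array dtypes, so checking dtype.fields is probably the safer test.

```python
import numpy as np


def is_structured(dtype):
    """True if the dtype has named fields (isbuiltin is 0 in that case)."""
    return np.dtype(dtype).fields is not None


def compound_types_deepest_first(dtype, name="root"):
    """Yield (type_name, dtype) pairs, nested compound types before parents."""
    dtype = np.dtype(dtype)
    for field_name in dtype.names:
        # .base strips a sub-array shape such as ("f4", (3,)) down to "f4".
        field_dtype = dtype.fields[field_name][0].base
        if field_dtype.fields is not None:
            yield from compound_types_deepest_first(
                field_dtype, f"{name}_{field_name}"
            )
    yield name, dtype


# Assumed usage with an open netCDF4.Dataset called nc (not run here):
# for type_name, dt in compound_types_deepest_first(arr.dtype, "sample"):
#     nc.createCompoundType(dt, type_name)
```

For a dtype with one nested record field called pos, the generator yields the inner type (named sample_pos here) before the outer one, which matches the order in which the compound types would need to be created.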

@lamorton

lamorton commented Oct 4, 2018

I just got bit with this as well. I was basically using tuples of indices as coordinates in order to implement a multidimensional sparse array.

My workaround is to use a plain dimension index_dim to index the points in the N-dimensional space that I actually populate, and to have several coordinates (say X,Y) that all have index_dim as their only dimension. It's easy enough to see what the coordinates are once you select a value along index_dim, but I have to go outside xarray to locate a populated point based on its X,Y-coordinates, because I can't slice along those arrays as (A) they aren't aliased to a dimension and (B) they have non-unique values.

I've come up with an ugly method for selecting by tuples of X,Y-coordinates:

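    pairs = list(zip(x_wanted, y_wanted))

    # Map each (x, y) pair to its position along index_dim.
    pair2index = {(dataset.x[i].item(), dataset.y[i].item()): i
                  for i in dataset.index_dim.data}

    try:
        found_indices = [pair2index[p] for p in pairs]
        found = dataset.isel(index_dim=found_indices)
    except KeyError as err:
        # err.args[0] is the (x, y) pair that was missing from the lookup.
        print("Coordinate {} not found in dataset.".format(err.args[0]))
        raise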

@aldanor

aldanor commented Sep 4, 2020

This is an ancient issue, but still - wondering if anyone here managed to hack together some workarounds?

@stale

stale bot commented Apr 27, 2022

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Apr 27, 2022
@equaeghe

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

Still relevant.

@stale stale bot removed the stale label Apr 28, 2022