Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse arrays #165

Open
jvail opened this issue Jan 27, 2021 · 13 comments
Open

Sparse arrays #165

jvail opened this issue Jan 27, 2021 · 13 comments
Labels

Comments

@jvail
Copy link

jvail commented Jan 27, 2021

👍 for the new release.

I was wondering if it would be laborious to support sparse (https://github.com/pydata/sparse) arrays. I have an issue with this line

https://github.com/benbovy/xarray-simlab/blob/master/xsimlab/stores.py#L193
dtype = getattr(value, "dtype", np.asarray(value).dtype)

I guess in case of sparse it would just be:
dtype = getattr(value, "dtype", value.dtype)

But maybe after that line there are other issues as well. I just wanted to ask asap if you think sparse could be easily supported with a few tweaks. Because if not then I might just have to skip the idea of using sparse arrays at all.

Thank you,
Jan

@benbovy
Copy link
Member

benbovy commented Feb 8, 2021

Sorry for the late reply @jvail !

Yes it would be nice if xarray-simlab could support sparse arrays (I haven't tried yet). Xarray-simlab uses zarr to store the model outputs and uses xarray.open_zarr to load back the zarr store as a xarray Dataset. So the question is whether zarr and xarray.open_zarr would easily support that. There seems to be interest in that, but I don't think it could be supported with a few tweaks: zarr-developers/zarr-python#152, pydata/sparse#222, zarr-developers/zarr-python#424.

@jvail
Copy link
Author

jvail commented Feb 8, 2021

Sorry for the late reply @jvail !

No need to be sorry!

Alright - Thank you. I'll try to read the zarr related stuff and try a dev install of simlab to see how far I can get.

@jvail
Copy link
Author

jvail commented Feb 13, 2021

Hi @benbovy ,

here is a test for some minimal sparse support: jvail@107d2e0#diff-932f643c875a75e89dbf205d2764a1ac25b260f5c12503543242f833e8359a62

It turns out it is impossible to use sparse with intent='inout' because zarr wont accept it and if you auto densify you are back at square one because you would have to "re-sparse" after reading the input.

It works well internally - meaning inside a process without having the variable either in input_vars or output_vars - without any changes in the source code.

If the sparse variable is a model output (in output_vars) you would have to call todense() on it (see my commit). Not sure if you want that. But at least it would provide a bit of support.

Another way to handle sparse in outputs could be to just drop all sparse when writing to zarr and issue a warning.

@benbovy
Copy link
Member

benbovy commented Feb 13, 2021

Hi @jvail,

Thanks for your example.

Calling todense() before writing to zarr (and issue a warning) seems reasonable to me. It's better than no support at all.

Hopefully, better support for sparse arrays (and maybe other array backends) will be fixed (in zarr and xarray) upstream in the future.

In the meantime, maybe we could implement some ad-hoc conversion in xarray-simlab, i.e., for sparse arrays having coords and data saved as two zarr arrays, and then reconstruct the sparse COO array after loading the zarr dataset and before returning the xarray Dataset from Dataset.xsimlab.run().

@jvail
Copy link
Author

jvail commented Feb 13, 2021

In the meantime, maybe we could implement some ad-hoc conversion in xarray-simlab, i.e., for sparse arrays having coords and data saved as two zarr arrays, and then reconstruct the sparse COO array after loading the zarr dataset and before returning the xarray Dataset from Dataset.xsimlab.run().

Thank you Benoit. That sounds like a pretty good idea. Should work with 'inout' as well - I'll give it a try.

@jvail
Copy link
Author

jvail commented Mar 2, 2021

Salut @benbovy,

if you find time please take a look at this commit/branch: jvail@cfa37a6

Instead of adding two zarr items - as you suggested - I tried to use the DOK format which nicely translates into a structured numpy array. That in turn is digestible by zarr.

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

There is a little notebook as well.

@benbovy
Copy link
Member

benbovy commented Mar 2, 2021

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

Without having looked at your branch yet, my first idea would be to fix that in Xarray :-) !

@jvail
Copy link
Author

jvail commented Mar 3, 2021

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

Without having looked at your branch yet, my first idea would be to fix that in Xarray :-) !

hmpf - that is beyond my reach. But maybe we can can get around it by injecting some encoding/decoding options - I hope.

@benbovy
Copy link
Member

benbovy commented Mar 8, 2021

I just had a look at your jvail@cfa37a6 branch. That's nice! I think that such support for sparse arrays would be a nice addition in xarray-simlab before proper support for sparse is available in zarr (and zarr->xarray).

A comment about your implementation: the approach that you use won't scale well if later we want to support other special cases. The VariableCoder approach used in Xarray would nicely fit here, even for one special case. Unfortunately it is not part (yet?) of the public API, but we could reuse the same approach here.

@jvail
Copy link
Author

jvail commented Mar 9, 2021

A comment about your implementation: the approach that you use won't scale well if later we want to support other special cases. The VariableCoder approach used in Xarray would nicely fit here, even for one special case. Unfortunately it is not part (yet?) of the public API, but we could reuse the same approach here.

Yes, that's true. I'll try to propose something more general. But it seems you are d'accord with this approach - generally speaking?
I'd be interested in moving this forward quickly so it might have a chance to be part of the next release.

Do you have an idea how to fix the mask_and_scale: True issue? Could it be that now xarray does complain about structured arrays coming from zarr? If so, I might be back at square one :\

@benbovy
Copy link
Member

benbovy commented Mar 9, 2021

Yes generally I'm d'accord !

Do you have an idea how to fix the mask_and_scale: True issue? Could it be that now xarray does complain about structured arrays coming from zarr? If so, I might be back at square one :\

I'm afraid there's currently no workaround other than mask_and_scale: False. I don't think Xarray's CFMaskCoder supports fill values defined for structured dtypes, but I guess it wouldn't be hard to support it.

@jvail
Copy link
Author

jvail commented Mar 9, 2021

I'm afraid there's currently no workaround other than mask_and_scale: False.

Apparently setting the zarr fill_value to None solves it.

@jvail
Copy link
Author

jvail commented Apr 5, 2021

It turned out that we do not desperately need sparse arrays right now. With our current model/data we stay below a "critical" size (wherever that is) with our indices. But if they grow much larger in future and therefore arrays get sparser and sparser support for sparse would be very welcome.
Also I'd like to wait until the dust has settled a bit around all the refactoring and new features in open PRs and a new version has been released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants