Converting sparse matrices directly to persistent zarr arrays #152
Thank you for reporting. I'm sure there will be a way to do this without
converting the entire sparse array to dense format in one go. I'm away from
office at the moment but happy to look when back.
On Wed, 2 Aug 2017 at 16:10, Gökçen Eraslan wrote:
I'm trying to convert a huge scipy csc_matrix (27998 x 1306127, int32) to
a persistent zarr array. Here is the code I'm using:
import tables
import scipy.sparse as sp_sparse
import zarr
f = tables.open_file('1M_neurons_filtered_gene_bc_matrices_h5.h5')
matrix = sp_sparse.csc_matrix((f.root.mm10.data[:],
f.root.mm10.indices[:],
f.root.mm10.indptr[:]),
shape=f.root.mm10.shape[:])
matrix = matrix.todense()
zarr.array(matrix, store='1m.zarr', overwrite=True)
However, this fails because scipy cannot convert sparse matrices with more than 2**31 non-zero elements to dense (scipy/scipy#7230).
So I was wondering whether it would be possible to convert a sparse matrix to zarr without first converting it to a dense array.
If I remove the dense conversion line (matrix = matrix.todense()), it
fails with the following exception:
Traceback (most recent call last):
File "convert.py", line 23, in <module>
zarr.array(matrix, store='1m.zarr', overwrite=True)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/creation.py", line 311, in array
z[:] = data
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 619, in __setitem__
value[value_selection])
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 693, in _chunk_setitem
self._chunk_setitem_nosync(cidx, item, value)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 735, in _chunk_setitem_nosync
chunk = np.ascontiguousarray(value, dtype=self._dtype)
File "/home/g/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 620, in ascontiguousarray
return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Closing remaining open files: 1M_neurons_filtered_gene_bc_matrices_h5.h5... done
Here is the URL to the HDF file for full reproducibility:
https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
This issue can also be reproduced easily via the following code:
import numpy as np
import scipy.sparse as sp
import zarr
x = sp.csc_matrix((3, 4), dtype=np.int8)
y = zarr.array(x)
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: [email protected]
Web: http://alimanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
In [1]: import numpy as np
In [2]: import scipy.sparse as sparse
In [3]: import zarr
In [4]: row = np.array([0, 3, 1, 0])
In [5]: col = np.array([0, 3, 1, 2])
In [6]: data = np.array([4, 5, 7, 9])
In [7]: a = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
In [8]: z = zarr.zeros(shape=a.shape, chunks=4, dtype=a.dtype)
In [9]: z.set_coordinate_selection((a.row, a.col), a.data)
In [10]: z[...]
Out[10]:
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])

Edit: Here are the docs on coordinate indexing.
Nice application of coordinate indexing!
On Wed, 14 Feb 2018 at 05:17, jakirkham wrote:
Going to close as Zarr 2.2.0rc3 supports the strategy outlined in the
comment above. Guessing this solves your issue @gokceneraslan
<https://github.com/gokceneraslan>, so will close this out. However if
that is not the case, please feel free to let us know and we can reopen and
discuss further. Thanks for the report.
Just as an additional comment here, users can customize this approach as much as needed. First, users are able to disable compression. Second, chunks can be set so that each includes a single value. For instance, based on the example above, one can do the following.

In [1]: import zarr
In [2]: z = zarr.zeros(shape=(4, 4), chunks=(1, 1), dtype=int, compressor=None)
In [3]: z.set_coordinate_selection(([0, 3, 1, 0], [0, 3, 1, 2]), [4, 5, 7, 9])
In [4]: z.chunk_store
Out[4]:
{'.zarray': b'{\n "chunks": [\n 1,\n 1\n ],\n "compressor": null,\n "dtype": "<i8",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n 4,\n 4\n ],\n "zarr_format": 2\n}',
'0.0': array([[4]]),
'0.2': array([[9]]),
'1.1': array([[7]]),
'3.3': array([[5]])}

This amounts to an in-memory representation of a DOK sparse array and can easily be used with other stores. Thus it can trivially be written to disk, stored in databases, and/or written to a cloud-based key-value store. Sorry if this is all already obvious, just figured I'd point it out in case it wasn't. Happy to add an example like this to the docs if it helps. :)

Minor note: Currently the in-memory representation of the
Should add this is also possible with the newer Sparse library, which is gaining traction. Here's the same example as before, but with Sparse instead. Have skipped inspecting the Zarr Array as it is no different from the previous examples.

In [1]: import numpy as np
In [2]: import sparse
In [3]: import zarr
In [4]: coords = [[0, 3, 1, 0], [0, 3, 1, 2]]
In [5]: data = np.array([4, 5, 7, 9])
In [6]: shape = (4, 4)
In [7]: a = sparse.COO(coords, data, shape=shape)
In [8]: z = zarr.zeros(shape=a.shape, chunks=a.ndim*(1,), dtype=a.dtype, compressor=None)
In [9]: z.set_coordinate_selection(tuple(a.coords), a.data)

FWIW I'm not sure there is much value in actually working with an in-memory Zarr array in this case, as Sparse already has an in-memory representation and provides many computation operations that would be useful here. That said, I do think leveraging other Zarr stores to persist Sparse arrays to disk could be interesting. It also may be interesting to take a look at what storage options make sense for these Sparse arrays. Perhaps single file storage like

Edit: Have raised issue ( pydata/sparse#222 ) to see if there is some interest in supporting serialization of Sparse arrays to Zarr Arrays to streamline things a bit. Not that performing the above task is difficult, but it could be nice for users if this was a trivial line of code.
cc @ryan-williams (in case this and/or the xrefs are of interest)
👍 following. I am currently using the sparse package for ocean model transport matrix diagnostics. I would love to be able to efficiently serialize these to zarr. |