
Converting sparse matrices directly to persistent zarr arrays #152

Closed
gokceneraslan opened this issue Aug 2, 2017 · 8 comments
Labels: enhancement New features or improvements


gokceneraslan commented Aug 2, 2017

I'm trying to convert a huge scipy csc_matrix (27998 x 1306127, int32) to a persistent zarr array. Here is the code I'm using:

import tables
import scipy.sparse as sp_sparse
import zarr

f = tables.open_file('1M_neurons_filtered_gene_bc_matrices_h5.h5')
matrix = sp_sparse.csc_matrix((f.root.mm10.data[:],
                               f.root.mm10.indices[:],
                               f.root.mm10.indptr[:]),
                              shape=f.root.mm10.shape[:])
matrix = matrix.todense()
zarr.array(matrix, store='1m.zarr', overwrite=True)

However, this fails because scipy cannot convert sparse matrices with more than 2**31 non-zero elements to dense (scipy/scipy#7230).

So I was wondering whether it would be possible to convert a sparse matrix to zarr without first converting it to a dense array.

If I remove the dense conversion line (matrix = matrix.todense()), it fails with the following exception:

Traceback (most recent call last):
  File "convert.py", line 23, in <module>
    zarr.array(matrix, store='1m.zarr', overwrite=True)
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/creation.py", line 311, in array
    z[:] = data
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 619, in __setitem__
    value[value_selection])
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 693, in _chunk_setitem
    self._chunk_setitem_nosync(cidx, item, value)
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 735, in _chunk_setitem_nosync
    chunk = np.ascontiguousarray(value, dtype=self._dtype)
  File "/home/g/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 620, in ascontiguousarray
    return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Closing remaining open files:1M_neurons_filtered_gene_bc_matrices_h5.h5...done

Here is the URL to the HDF file for full reproducibility: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5

This issue can also be reproduced easily via the following code:

import numpy as np
import scipy.sparse as sp
import zarr

x = sp.csc_matrix((3, 4), dtype=np.int8)
y = zarr.array(x)
@alimanfoo
Member

alimanfoo commented Aug 2, 2017 via email

@alimanfoo alimanfoo added the enhancement New features or improvements label Nov 21, 2017
@jakirkham
Member

jakirkham commented Dec 9, 2017

In master, there are some new features that should help with this. Have included a simple example below to demonstrate. It uses a coo_matrix; that said, a csc_matrix can be converted to a coo_matrix with the tocoo method.

In [1]: import numpy as np

In [2]: import scipy.sparse as sparse

In [3]: import zarr

In [4]: row = np.array([0, 3, 1, 0])

In [5]: col = np.array([0, 3, 1, 2])

In [6]: data = np.array([4, 5, 7, 9])

In [7]: a = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

In [8]: z = zarr.zeros(shape=a.shape, chunks=4, dtype=a.dtype)

In [9]: z.set_coordinate_selection((a.row, a.col), a.data)

In [10]: z[...]
Out[10]: 
array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])

Edit: Here are the docs on coordinate indexing.

@jakirkham
Member

Going to close as Zarr 2.2.0rc3 supports the strategy outlined in the comment above. Guessing this solves your issue @gokceneraslan. If that is not the case, please feel free to let us know and we can reopen and discuss further. Thanks for the report.

@alimanfoo
Member

alimanfoo commented Feb 19, 2018 via email

@jakirkham
Member

jakirkham commented Dec 15, 2018

Just as an additional comment here, users can customize this approach as much as is needed. First, compression can be disabled. Second, chunks can be set so that each includes a single value. For instance, based on the example above, one can do the following.

In [1]: import zarr                                                             

In [2]: z = zarr.zeros(shape=(4, 4), chunks=(1, 1), dtype=int, compressor=None) 

In [3]: z.set_coordinate_selection(([0, 3, 1, 0], [0, 3, 1, 2]), [4, 5, 7, 9])  

In [4]: z.chunk_store                                                           
Out[4]: 
{'.zarray': b'{\n    "chunks": [\n        1,\n        1\n    ],\n    "compressor": null,\n    "dtype": "<i8",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        4,\n        4\n    ],\n    "zarr_format": 2\n}',
 '0.0': array([[4]]),
 '0.2': array([[9]]),
 '1.1': array([[7]]),
 '3.3': array([[5]])}

This amounts to an in-memory representation of a DOK sparse array and can easily be used with other stores. Thus it can trivially be written to disk, stored in databases, and/or written to a cloud-based key-value store.

Sorry if this is all already obvious, just figured I'd point it out in case it wasn't. Happy to add an example like this to the docs if it helps. :)


Minor note: the in-memory representation of the chunk_store may change a little in the next release (e.g. using bytes objects for values ( #359 ) and possibly defaulting to a different store than dict ( #351 )), but this won't affect the availability of this functionality or how any of the other stores work.

@jakirkham
Member

jakirkham commented Dec 15, 2018

Should add this is also possible with the newer Sparse library, which is gaining traction. Here's the same example as before, but with Sparse instead. Have skipped inspecting the Zarr Array as it is no different from the previous examples.

In [1]: import numpy as np                                                      

In [2]: import sparse                                                           

In [3]: import zarr                                                             

In [4]: coords = [[0, 3, 1, 0], [0, 3, 1, 2]]                                   

In [5]: data = np.array([4, 5, 7, 9])                                           

In [6]: shape = (4, 4)                                                          

In [7]: a = sparse.COO(coords, data, shape=shape)                               

In [8]: z = zarr.zeros(shape=a.shape, chunks=a.ndim*(1,), dtype=a.dtype, compressor=None)

In [9]: z.set_coordinate_selection(tuple(a.coords), a.data)                     

FWIW I'm not sure there is much value in actually working with an in-memory Zarr array in this case as Sparse already has an in-memory representation and provides many computation operations that would be useful here. That said, I do think leveraging other Zarr stores to persist Sparse arrays to disk could be interesting.

It also may be interesting to take a look at what storage options make sense for these Sparse arrays. Perhaps single file storage like DBMStore, LMDBStore or ZipStore make more sense. Alternatively there may be other storage formats that people like to use. Would be interesting to hear a bit more about real world use cases.

Edit: Have raised issue ( pydata/sparse#222 ) to see if there is some interest in supporting serialization of Sparse arrays to Zarr Arrays to streamline things a bit. Not that performing the above task is difficult, but it could be nice for users if this was a trivial line of code.

@jakirkham
Member

cc @ryan-williams (in case this and/or the xrefs are of interest)

@rabernat
Contributor

👍 following. I am currently using the sparse package for ocean model transport matrix diagnostics. I would love to be able to efficiently serialize these to zarr.
