Converting sparse matrices directly to persistent zarr arrays #152
Thank you for reporting. I'm sure there will be a way to do this without
converting the entire sparse array to dense format in one go. I'm away from
office at the moment but happy to look when back.
On Wed, 2 Aug 2017 at 16:10, Gökçen Eraslan wrote:
I'm trying to convert a huge scipy csc_matrix (27998 x 1306127, int32) to
a persistent zarr array. Here is the code I'm using:
import tables
import scipy.sparse as sp_sparse
import zarr
f = tables.open_file('1M_neurons_filtered_gene_bc_matrices_h5.h5')
matrix = sp_sparse.csc_matrix((f.root.mm10.data[:],
f.root.mm10.indices[:],
f.root.mm10.indptr[:]),
shape=f.root.mm10.shape[:])
matrix = matrix.todense()
zarr.array(matrix, store='1m.zarr', overwrite=True)
However, this fails because scipy cannot convert sparse matrices with more than 2**31 non-zero elements to dense (scipy/scipy#7230).
So I was wondering whether it would be possible to convert a sparse matrix to zarr without first converting it to a dense array.
If I remove the dense conversion line (matrix = matrix.todense()), it
fails with the following exception:
Traceback (most recent call last):
File "convert.py", line 23, in <module>
zarr.array(matrix, store='1m.zarr', overwrite=True)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/creation.py", line 311, in array
z[:] = data
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 619, in __setitem__
value[value_selection])
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 693, in _chunk_setitem
self._chunk_setitem_nosync(cidx, item, value)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 735, in _chunk_setitem_nosync
chunk = np.ascontiguousarray(value, dtype=self._dtype)
File "/home/g/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 620, in ascontiguousarray
return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Closing remaining open files: 1M_neurons_filtered_gene_bc_matrices_h5.h5... done
Here is the URL to the HDF file for full reproducibility:
https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
This issue can also be reproduced easily via the following code:
import numpy as np
import scipy.sparse as sp
import zarr
x = sp.csc_matrix((3, 4), dtype=np.int8)
y = zarr.array(x)
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: [email protected]
Web: http://alimanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
In [1]: import numpy as np
In [2]: import scipy.sparse as sparse
In [3]: import zarr
In [4]: row = np.array([0, 3, 1, 0])
In [5]: col = np.array([0, 3, 1, 2])
In [6]: data = np.array([4, 5, 7, 9])
In [7]: a = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
In [8]: z = zarr.zeros(shape=a.shape, chunks=4, dtype=a.dtype)
In [9]: z.set_coordinate_selection((a.row, a.col), a.data)
In [10]: z[...]
Out[10]:
array([[4, 0, 9, 0],
[0, 7, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 5]])

Edit: Here are the docs on coordinate indexing.
Nice application of coordinate indexing!
On Wed, 14 Feb 2018 at 05:17, jakirkham wrote:
Going to close as Zarr 2.2.0rc3 supports the strategy outlined in the
comment above. Guessing this solves your issue @gokceneraslan
<https://github.com/gokceneraslan>, so will close this out. However if
that is not the case, please feel free to let us know and we can reopen and
discuss further. Thanks for the report.
Just as an additional comment here, users can customize this approach as much as needed. First, users are able to disable compression. Second, chunks can be set so that each includes a single value. For instance, based on the example above, one can do the following.

In [1]: import zarr
In [2]: z = zarr.zeros(shape=(4, 4), chunks=(1, 1), dtype=int, compressor=None)
In [3]: z.set_coordinate_selection(([0, 3, 1, 0], [0, 3, 1, 2]), [4, 5, 7, 9])
In [4]: z.chunk_store
Out[4]:
{'.zarray': b'{\n "chunks": [\n 1,\n 1\n ],\n "compressor": null,\n "dtype": "<i8",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n 4,\n 4\n ],\n "zarr_format": 2\n}',
'0.0': array([[4]]),
'0.2': array([[9]]),
'1.1': array([[7]]),
'3.3': array([[5]])}

This amounts to an in-memory representation of a DOK sparse array and can easily be used with other stores. Thus it can trivially be written to disk, stored in databases, and/or written to a cloud-based key-value store. Sorry if this is all already obvious, just figured I'd point it out in case it wasn't. Happy to add an example like this to the docs if it helps. :)

Minor note: Currently the in-memory representation of the
Should add this is also possible with the newer Sparse library, which is gaining traction. Here's the same example as before, but with Sparse instead. Have skipped inspecting the Zarr Array as it is no different from the previous examples.

In [1]: import numpy as np
In [2]: import sparse
In [3]: import zarr
In [4]: coords = [[0, 3, 1, 0], [0, 3, 1, 2]]
In [5]: data = np.array([4, 5, 7, 9])
In [6]: shape = (4, 4)
In [7]: a = sparse.COO(coords, data, shape=shape)
In [8]: z = zarr.zeros(shape=a.shape, chunks=a.ndim*(1,), dtype=a.dtype, compressor=None)
In [9]: z.set_coordinate_selection(tuple(a.coords), a.data)

FWIW I'm not sure there is much value in actually working with an in-memory Zarr array in this case, as Sparse already has an in-memory representation and provides many computation operations that would be useful here. That said, I do think leveraging other Zarr stores to persist Sparse arrays to disk could be interesting. It also may be interesting to take a look at what storage options make sense for these Sparse arrays. Perhaps single file storage like

Edit: Have raised issue ( pydata/sparse#222 ) to see if there is some interest in supporting serialization of Sparse arrays to Zarr Arrays to streamline things a bit. Not that performing the above task is difficult, but it could be nice for users if this was a trivial line of code.
cc @ryan-williams (in case this and/or the xrefs are of interest)
👍 following. I am currently using the sparse package for ocean model transport matrix diagnostics. I would love to be able to efficiently serialize these to zarr. |