Is your feature request related to a problem? Please describe.
The read() methods for SOMA sparse arrays produce iterators. On each iteration, an iterator returns a COO-formatted chunk of the array whose size is set in bytes by soma.init.buffer; in the future, chunk size may instead be set as a number of COO rows by batch_size (#1527).
One limitation is that with either soma.init.buffer or batch_size, there is no guarantee that a full vector along an axis (a row or column for 2D arrays) is returned within a single iterator chunk. In other words, the data for a given axis vector (e.g. a row or column) is likely to be split across two different chunks.
This behavior makes it hard to implement accumulative operations that depend on full row/column data.
Describe the solution you'd like
An option somewhere to specify N, meaning: "on each iteration, always return at most N full vectors along an axis (e.g. columns or rows for 2D arrays)".
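To make the request concrete, here is a minimal client-side sketch of the desired behavior, not an existing SOMA API. It assumes each chunk is a plain list of (row, col, value) triples and that the stream arrives sorted by row; `rechunk_full_rows` and `Triple` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical COO record: (row, col, value). Not a SOMA type.
Triple = Tuple[int, int, float]


def rechunk_full_rows(
    chunks: Iterable[List[Triple]], n_rows: int
) -> Iterator[List[Triple]]:
    """Regroup COO chunks so each output holds at most n_rows complete rows.

    Assumes the incoming triples are sorted by row, so a row is complete
    as soon as a triple with a different row index shows up.
    """
    if n_rows < 1:
        raise ValueError("n_rows must be >= 1")
    pending: List[Triple] = []       # triples waiting to be emitted
    rows_in_pending = 0              # distinct rows currently in pending
    current_row: Optional[int] = None
    for chunk in chunks:
        for triple in chunk:
            row = triple[0]
            if row != current_row:
                # A new row starts, so every row in pending is complete.
                if rows_in_pending == n_rows:
                    yield pending
                    pending = []
                    rows_in_pending = 0
                current_row = row
                rows_in_pending += 1
            pending.append(triple)
    if pending:
        yield pending  # final, possibly smaller, group of rows


# Row 1 is split across the first two input chunks.
example_chunks = [
    [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)],
    [(1, 2, 4.0), (2, 1, 5.0)],
    [(3, 0, 6.0)],
]
full_row_chunks = list(rechunk_full_rows(example_chunks, n_rows=2))
```

With this input, row 1 ends up whole in the first output chunk. Doing this in the reader itself would avoid the extra buffering on the client side.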
Another twist: support a max X nnz count per chunk, while keeping the "full obs row" guarantee. This would return chunks of similar size in terms of memory usage, rather than in terms of rows.
Motivation: if the X nnz/row ratio varies across an experiment's obs axis, memory usage can vary widely across chunks, which makes it hard to ensure predictable memory usage for a SOMA reader process. This can occur, for example, when the Experiment aggregates data from multiple sources, as is done in the CELLxGENE census.
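The nnz-bounded variant could look roughly like the sketch below. It is an illustration under stated assumptions rather than a proposed implementation: chunks are plain lists of (row, col, value) triples, the stream is sorted by row, and `rechunk_by_nnz` is a hypothetical name. A single row with more than `max_nnz` entries is still emitted whole, so the budget is a target rather than a hard cap.

```python
from itertools import chain, groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical COO record: (row, col, value). Not a SOMA type.
Triple = Tuple[int, int, float]


def rechunk_by_nnz(
    chunks: Iterable[List[Triple]], max_nnz: int
) -> Iterator[List[Triple]]:
    """Regroup COO chunks into full-row chunks of roughly max_nnz entries.

    Assumes triples are sorted by row, so groupby on the row index
    reassembles rows that were split across input chunks.
    """
    if max_nnz < 1:
        raise ValueError("max_nnz must be >= 1")
    pending: List[Triple] = []
    stream = chain.from_iterable(chunks)
    for _, group in groupby(stream, key=itemgetter(0)):
        row = list(group)  # all entries of one complete row
        # Emit what we have if adding this row would exceed the budget.
        if pending and len(pending) + len(row) > max_nnz:
            yield pending
            pending = []
        pending.extend(row)
    if pending:
        yield pending


# Rows have 2, 2, 1 and 4 non-zero entries respectively.
example_chunks = [
    [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)],
    [(1, 2, 4.0), (2, 1, 5.0)],
    [(3, 0, 6.0), (3, 1, 7.0), (3, 2, 8.0), (3, 3, 9.0)],
]
nnz_chunks = list(rechunk_by_nnz(example_chunks, max_nnz=3))
chunk_sizes = [len(c) for c in nnz_chunks]
```

Here the last row alone exceeds the budget of 3 and is returned as its own chunk, which is the price of the full-row guarantee.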
Additional context
see https://cziscience.slack.com/archives/C04LMG88VKJ/p1689102998962389