
[Feature request] Sparse ND Array iterators to guarantee full axis vectors (e.g. column/row) per iteration #1528

Closed
pablo-gar opened this issue Jul 12, 2023 · 4 comments


@pablo-gar
Member

pablo-gar commented Jul 12, 2023

Is your feature request related to a problem? Please describe.
The read() methods for SOMA sparse arrays are meant to produce iterators. On each iteration these return a COO-formatted chunk of the object, sized in bytes by soma.init.buffer; in the future, chunk size may instead be defined as a number of COO rows via batch_size (#1527).

One limitation is that, for both soma.init.buffer and batch_size, it is never guaranteed that a full vector along an axis (a row or column for 2D arrays) is returned within a single iterator chunk. In other words, the data from a given axis vector (e.g. a row/column) is likely to be split across two different chunks.

This behavior makes it challenging to implement cumulative operations that depend on full row/column data.

Describe the solution you'd like
An option somewhere to specify N: “for each iteration, always return at most N full vectors along an axis (e.g. column/row for 2D arrays)”.
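As an illustrative sketch only (this is not the SOMA API; `rechunk_full_rows`, the `(rows, cols, data)` batch tuples, and the row-major-ordering assumption are all hypothetical), a wrapper that regroups an existing COO batch iterator so that each yielded chunk contains at most N complete rows could look like:

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# A COO chunk as parallel lists: (row indices, col indices, values).
Chunk = Tuple[List[int], List[int], List[float]]

def rechunk_full_rows(batches: Iterator[Chunk], n_rows: int) -> Iterator[Chunk]:
    """Regroup row-major COO batches so each yielded chunk holds at most
    `n_rows` complete rows; a row is never split across output chunks.
    Assumes the input batches are sorted by row (row-major order)."""
    pending: Dict[int, Tuple[List[int], List[float]]] = defaultdict(lambda: ([], []))
    order: List[int] = []  # row indices in first-seen (row-major) order
    for rows, cols, data in batches:
        for r, c, v in zip(rows, cols, data):
            if r not in pending:
                order.append(r)
            pending[r][0].append(c)
            pending[r][1].append(v)
        # All buffered rows except the most recent one are complete,
        # because the input is row-major; hold the last row back.
        while len(order) - 1 >= n_rows:
            yield _emit(pending, order[:n_rows])
            order = order[n_rows:]
    while order:  # flush the tail, including the final (now complete) row
        yield _emit(pending, order[:n_rows])
        order = order[n_rows:]

def _emit(pending: Dict[int, Tuple[List[int], List[float]]],
          rows: List[int]) -> Chunk:
    """Assemble one COO chunk from the buffered rows and drop them."""
    out_r: List[int] = []
    out_c: List[int] = []
    out_d: List[float] = []
    for r in rows:
        cols, data = pending.pop(r)
        out_r += [r] * len(cols)
        out_c += cols
        out_d += data
    return out_r, out_c, out_d
```

The buffering cost is bounded: at most N complete rows plus one partial row are held in memory at a time.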

Additional context
see https://cziscience.slack.com/archives/C04LMG88VKJ/p1689102998962389

@atolopko-czi
Member

Pablo will provide more requirements/proposal prior to technical discussion

@atolopko-czi
Member

Another twist: supporting a max X nnz count per chunk, while still keeping the "full obs row" guarantee. This would yield chunks of similar size in terms of memory usage, rather than row count.

Motivation: if the X nnz/row ratio varies across an experiment's obs axis, memory usage can vary widely across chunks, making it hard to ensure predictable memory usage for a SOMA reader process. This can occur, for one, when the Experiment is aggregating data from multiple sources, as is done in the CELLxGENE census.
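To make the nnz-cap idea concrete, here is a minimal hypothetical sketch (not SOMA code; `chunk_by_nnz` and the per-row tuples are invented for illustration) that groups complete rows into COO chunks of at most `max_nnz` non-zeros, never splitting a row:

```python
from typing import Iterable, Iterator, List, Tuple

# One complete row of a sparse matrix: (row index, col indices, values).
Row = Tuple[int, List[int], List[float]]
# A COO chunk as parallel lists: (row indices, col indices, values).
Chunk = Tuple[List[int], List[int], List[float]]

def chunk_by_nnz(rows: Iterable[Row], max_nnz: int) -> Iterator[Chunk]:
    """Group complete rows into COO chunks holding at most `max_nnz`
    non-zeros each. A row is never split; a single row that alone
    exceeds the cap is emitted as its own (oversized) chunk."""
    cur_r: List[int] = []
    cur_c: List[int] = []
    cur_d: List[float] = []
    for r, cols, vals in rows:
        # Flush the current chunk if adding this row would exceed the cap.
        if cur_r and len(cur_d) + len(vals) > max_nnz:
            yield cur_r, cur_c, cur_d
            cur_r, cur_c, cur_d = [], [], []
        cur_r += [r] * len(cols)
        cur_c += list(cols)
        cur_d += list(vals)
    if cur_r:  # flush the final partial chunk
        yield cur_r, cur_c, cur_d
```

Because the cap is on non-zeros rather than rows, each chunk's memory footprint stays roughly constant even when nnz/row varies widely across the obs axis.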

@bkmartinjr
Member

@johnkerl - can this be closed given #1792 ?

@johnkerl johnkerl closed this as completed Nov 7, 2023
@johnkerl
Member

johnkerl commented Nov 7, 2023

@bkmartinjr yup -- I am doing some post-1.5.0 issue closes right now 😎
