
[Feature request] Sparse ND Array iterators to guarantee full axis vectors (e.g. column/row) per iteration #1528

Closed
pablo-gar opened this issue Jul 12, 2023 · 4 comments


@pablo-gar
Member

pablo-gar commented Jul 12, 2023

Is your feature request related to a problem? Please describe.
The read() methods for SOMA sparse arrays are meant to produce iterators. On each iteration these return a COO-formatted chunk of the object, sized in bytes by soma.init.buffer; in the future, chunk size may instead be defined as a number of COO rows via batch_size (#1527).

One limitation is that, for both soma.init.buffer and batch_size, it is never guaranteed that a full vector along an axis (a row or column for 2D arrays) is returned within a single iterator chunk. In other words, the data from a given axis vector (e.g. a row/column) is likely to be split across two different chunks.

This behavior makes it challenging to implement cumulative operations that depend on full row/column data.

Describe the solution you'd like
An option somewhere to specify N: “for each iteration, always return at most N full vectors along an axis (e.g. column/row for 2D arrays)”.
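As an illustrative sketch only (this is not the SOMA API; `rechunk_full_rows`, the `(rows, cols, data)` batch tuples, and the row-major-ordering assumption are all hypothetical), a wrapper that regroups an existing COO batch iterator so that each yielded chunk contains at most N complete rows could look like:

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# A COO chunk as parallel lists: (row indices, col indices, values).
Chunk = Tuple[List[int], List[int], List[float]]

def rechunk_full_rows(batches: Iterator[Chunk], n_rows: int) -> Iterator[Chunk]:
    """Regroup row-major COO batches so each yielded chunk holds at most
    `n_rows` complete rows; a row is never split across output chunks.
    Assumes the input batches are sorted by row (row-major order)."""
    pending: Dict[int, Tuple[List[int], List[float]]] = defaultdict(lambda: ([], []))
    order: List[int] = []  # row indices in first-seen (row-major) order
    for rows, cols, data in batches:
        for r, c, v in zip(rows, cols, data):
            if r not in pending:
                order.append(r)
            pending[r][0].append(c)
            pending[r][1].append(v)
        # All buffered rows except the most recent one are complete,
        # because the input is row-major; hold the last row back.
        while len(order) - 1 >= n_rows:
            yield _emit(pending, order[:n_rows])
            order = order[n_rows:]
    while order:  # flush the tail, including the final (now complete) row
        yield _emit(pending, order[:n_rows])
        order = order[n_rows:]

def _emit(pending: Dict[int, Tuple[List[int], List[float]]],
          rows: List[int]) -> Chunk:
    """Assemble one COO chunk from the buffered rows and drop them."""
    out_r: List[int] = []
    out_c: List[int] = []
    out_d: List[float] = []
    for r in rows:
        cols, data = pending.pop(r)
        out_r += [r] * len(cols)
        out_c += cols
        out_d += data
    return out_r, out_c, out_d
```

The buffering cost is bounded: at most N complete rows plus one partial row are held in memory at a time.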

Additional context
see https://cziscience.slack.com/archives/C04LMG88VKJ/p1689102998962389

@atolopko-czi
Member

Pablo will provide more requirements/proposal prior to technical discussion

@atolopko-czi
Member

Another twist: supporting a max X nnz count per chunk, while still keeping the "full obs row" guarantee. This would yield chunks of similar size in terms of memory usage, rather than row count.

Motivation: if the X nnz/row ratio varies across an experiment's obs axis, memory usage can vary widely across chunks, making it hard to ensure predictable memory usage for a SOMA reader process. This can occur, for one, when the Experiment is aggregating data from multiple sources, as is done in the CELLxGENE census.
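To make the nnz-cap idea concrete, here is a minimal hypothetical sketch (not SOMA code; `chunk_by_nnz` and the per-row tuples are invented for illustration) that groups complete rows into COO chunks of at most `max_nnz` non-zeros, never splitting a row:

```python
from typing import Iterable, Iterator, List, Tuple

# One complete row of a sparse matrix: (row index, col indices, values).
Row = Tuple[int, List[int], List[float]]
# A COO chunk as parallel lists: (row indices, col indices, values).
Chunk = Tuple[List[int], List[int], List[float]]

def chunk_by_nnz(rows: Iterable[Row], max_nnz: int) -> Iterator[Chunk]:
    """Group complete rows into COO chunks holding at most `max_nnz`
    non-zeros each. A row is never split; a single row that alone
    exceeds the cap is emitted as its own (oversized) chunk."""
    cur_r: List[int] = []
    cur_c: List[int] = []
    cur_d: List[float] = []
    for r, cols, vals in rows:
        # Flush the current chunk if adding this row would exceed the cap.
        if cur_r and len(cur_d) + len(vals) > max_nnz:
            yield cur_r, cur_c, cur_d
            cur_r, cur_c, cur_d = [], [], []
        cur_r += [r] * len(cols)
        cur_c += list(cols)
        cur_d += list(vals)
    if cur_r:  # flush the final partial chunk
        yield cur_r, cur_c, cur_d
```

Because the cap is on non-zeros rather than rows, each chunk's memory footprint stays roughly constant even when nnz/row varies widely across the obs axis.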

@bkmartinjr
Member

@johnkerl - can this be closed given #1792 ?

@johnkerl johnkerl closed this as completed Nov 7, 2023
@johnkerl
Member

johnkerl commented Nov 7, 2023

@bkmartinjr yup -- I am doing some post-1.5.0 issue closes right now 😎
