[Feature request] CSR/CSC iterator over ExperimentAxisQuery.X #1503

Closed
bkmartinjr opened this issue Jun 23, 2023 · 2 comments · Fixed by #1792

Member

bkmartinjr commented Jun 23, 2023

It is a very common pattern to iterate over an X layer slice by rows, where the slice is specified by an ExperimentAxisQuery and the consumer wants each slice in CSR/CSC format for computation (e.g., scipy.sparse.csr_matrix). The current API only provides COO iteration and does not directly support reindexing of joinids.

Propose an iterator API that:

  • reads slices of a given size
  • reindexes the joinids
  • returns for each step the joinids and reindexed sparse matrix
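
To make the three steps above concrete, here is a minimal sketch that works on plain in-memory COO arrays rather than a real SOMA read. The function name `iter_reindexed_chunks` and all of its parameters are hypothetical, and it assumes every column joinid in the data already appears in `var_joinids`. It is not the prototype's implementation.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def iter_reindexed_chunks(obs_joinids, var_joinids, coo_rows, coo_cols, coo_data,
                          row_stride=2, fmt="csr"):
    """Yield ((obs_chunk, var_joinids), matrix) for successive blocks of rows.

    coo_rows / coo_cols hold the joinid coordinates of the nonzero values.
    Hypothetical helper: assumes every value in coo_cols is in var_joinids.
    """
    var_index = pd.Index(var_joinids)
    for start in range(0, len(obs_joinids), row_stride):
        # Step 1: take the next slice of at most row_stride rows.
        obs_chunk = obs_joinids[start:start + row_stride]
        obs_index = pd.Index(obs_chunk)
        mask = np.isin(coo_rows, obs_chunk)
        # Step 2: reindex joinids to 0-based positions within the chunk.
        i = obs_index.get_indexer(coo_rows[mask])
        j = var_index.get_indexer(coo_cols[mask])
        shape = (len(obs_chunk), len(var_joinids))
        chunk = sparse.coo_matrix((coo_data[mask], (i, j)), shape=shape)
        # Step 3: return the joinids together with the reindexed matrix.
        yield (obs_chunk, var_joinids), chunk.asformat(fmt)


obs = np.array([10, 20, 30], dtype=np.int64)
var = np.array([5, 7], dtype=np.int64)
rows = np.array([10, 30, 20, 30], dtype=np.int64)
cols = np.array([7, 5, 5, 7], dtype=np.int64)
data = np.array([1.0, 2.0, 3.0, 4.0])

chunks = list(iter_reindexed_chunks(obs, var, rows, cols, data, row_stride=2))
```

The mask over all nonzeros is only there to keep the sketch self-contained; a real implementation would presumably read just the rows of each slice from storage.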

In Python type sigs, a row-based iterator method on ExperimentAxisQuery might look like:

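from typing import Iterator, Literal, Tuple

import numpy as np
import numpy.typing as npt
import tiledbsoma as soma  # assumed import alias for the `soma` name below
from scipy import sparse

# Each step yields ((obs_joinids, var_joinids), reindexed sparse matrix)
_RT = Tuple[Tuple[npt.NDArray[np.int64], npt.NDArray[np.int64]], sparse.spmatrix]

def X_sparse_iter(
    self: soma.ExperimentAxisQuery,
    X_name: str = "raw",  # the X layer to read
    row_stride: int = 2**16,  # row stride for each step
    fmt: Literal["csr", "csc"] = "csr",  # the resulting sparse format
) -> Iterator[_RT]: ...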

Example usage:

with experiment.axis_query(...) as query:
    for (obs_joinids, var_joinids), X_chunk in query.X_sparse_iter(X_name="raw"):
        ...

In this case, obs_joinids[i] and var_joinids[j] correspond to element [i, j] of the returned sparse matrix.
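
As an illustration of that correspondence, the sketch below maps the nonzero entries of one chunk back to their joinids. The arrays are made-up stand-ins for what a single iteration step would yield, not output from a real query.

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins for one step of the proposed iterator.
obs_joinids = np.array([101, 205, 309], dtype=np.int64)
var_joinids = np.array([7, 42], dtype=np.int64)
X_chunk = sparse.csr_matrix(np.array([[0.0, 1.5],
                                      [0.0, 0.0],
                                      [2.0, 0.0]]))

# Row i of X_chunk belongs to obs_joinids[i]; column j to var_joinids[j].
coo = X_chunk.tocoo()
nonzero_obs = obs_joinids[coo.row]  # obs joinid of each stored value
nonzero_var = var_joinids[coo.col]  # var joinid of each stored value
```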

I have a working prototype implementation available here. Important caveats:

  • The prototype does not use the ExperimentAxisQuery fast CSR conversion; a "real" implementation would need that work.
  • The prototype lazily pipelines the steps across multiple threads, which is essential for performance on the queries we typically run on the Census.
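
For readers unfamiliar with the pipelining idea, one generic way to overlap steps is a bounded, order-preserving prefetch on a thread pool. The helper `pipelined_map` below is hypothetical and is not taken from the prototype; it only sketches the general technique.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def pipelined_map(fn: Callable[[T], R], items: Iterable[T],
                  max_workers: int = 4, prefetch: int = 2) -> Iterator[R]:
    """Lazily apply fn to items on worker threads, yielding results in order.

    At most prefetch + 1 steps are in flight, so work on later steps
    overlaps with the consumer's processing of earlier ones.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(fn, item))
            if len(pending) > prefetch:
                # Oldest step first, so output order matches input order.
                yield pending.popleft().result()
        # Drain whatever is still in flight once the input is exhausted.
        while pending:
            yield pending.popleft().result()


squares = list(pipelined_map(lambda x: x * x, range(5)))
```

In an iterator like the one proposed, `fn` could be the read, reindex, or format-conversion stage for one row slice, with stages chained by feeding one `pipelined_map` into the next.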

There is a notebook in the same directory that shows example usage.

@johnkerl johnkerl changed the title [Feature request] CSR/CSC iterator over ExperimentAxisQuery.X [Feature request] CSR/CSC iterator over ExperimentAxisQuery.X Jun 26, 2023
@johnkerl johnkerl self-assigned this Jun 29, 2023
Member

johnkerl commented Jun 29, 2023

Self-assigning to track collaboration as linked here: #718 (comment)

Member Author

bkmartinjr commented Jun 29, 2023

A couple more comments:

  • the API should also work over obsm/obsp/varm/varp
  • there is an updated prototype here
