get_anndata is slow for large gene lists #36
Comments
@bkmartinjr: This issue is now just waiting for the TileDB fix for the underlying query condition, right? Unless we want to add performance tests it seems this isn't actionable. Close? Or maybe we have the API issue a warning or error on a large value set for the var query.
I don't think we should add any work-arounds. I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?
Added a (new)
@ebezzi to follow up with TileDB to understand if this is unblocked. In the future, we will request that TileDB open a corresponding github issue that can block issues like this one.
No need to ask - I already did two weeks ago :-). The enhancement is slated for TileDB core v 2.16, which is imminent. After that, it simply requires incorporation into tiledbsoma.
Update: verified that this is resolved by the tiledbsoma 1.5RC. Awaiting the actual 1.5 release, after which the cellxgene-census package will release with an updated dependency pin.
Due to an issue with TileDB query conditions, var/obs queries that use a very large number of values in an `in` expression will be very slow (time is roughly linear in the number of items in the list).
For example, a `get_anndata` query that filters var on a very large list of `lung_genes` will be very slow.
Such a query expands to a `value_filter` on the `var` DataFrame that looks like `feature_id in ["gene1", "gene2", ..., "geneN"]`, which currently has performance roughly O(N), where N is the number of possible matches (the right side of the `in` operator).
A MUCH faster alternative is to directly use the `soma_joinids` (coordinates) and skip the table scan.
The filter issue has been reported to TileDB. Still considering an appropriate work-around for the `get_anndata` API. It may be helpful to expose the coords that the underlying experiment query supports.
Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310.
Update [2023-07-11] this work did not make the TileDB embedded 2.16 release train, and is now slated for 2.17 (ETA Q3 2023).
Update [2023-09-29] this is now slated for early Q4 2023 in a 1.5.X release.
Update [2023-10-12] this has been fixed via single-cell-data/TileDB-SOMA#1756. Should be in the next release.
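The two approaches described in this issue can be sketched in plain Python. This is only an illustration, not the issue's original test case: `build_var_value_filter` and `joinids_for_features` are hypothetical helpers, and `var_rows` is made-up toy data standing in for the `var` DataFrame after it has been read into memory once.

```python
import json


def build_var_value_filter(feature_ids):
    """Build the `feature_id in [...]` filter string for a gene list.

    Hypothetical helper for illustration; not part of cellxgene-census.
    """
    return f"feature_id in {json.dumps(list(feature_ids))}"


def joinids_for_features(var_rows, feature_ids):
    """Resolve feature ids to soma_joinids with a single pass over var.

    `var_rows` is an iterable of (soma_joinid, feature_id) pairs, an assumed
    in-memory stand-in for the var DataFrame.
    """
    wanted = set(feature_ids)  # set membership is O(1) per row
    return sorted(joinid for joinid, fid in var_rows if fid in wanted)


# Toy stand-in for the var DataFrame (made-up ids, not real Census data).
var_rows = [(0, "gene1"), (1, "gene2"), (2, "gene3"), (3, "gene4")]
lung_genes = ["gene4", "gene2"]

# Slow path: the long list is embedded in the value_filter expression.
slow_filter = build_var_value_filter(lung_genes)

# Fast path: look up the join ids client-side and query by coordinates.
fast_coords = joinids_for_features(var_rows, lung_genes)

print(slow_filter)  # feature_id in ["gene4", "gene2"]
print(fast_coords)  # [1, 3]
```

With this approach the resulting join ids would be passed as coords to the underlying experiment query, rather than sending the long `in` list through `value_filter`.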
performance test case:
Results of above test case as of 20230927, on main branch and previous release: