Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_anndata is slow for large gene lists #36

Closed
bkmartinjr opened this issue Dec 10, 2022 · 6 comments · Fixed by #852
Closed

get_anndata is slow for large gene lists #36

bkmartinjr opened this issue Dec 10, 2022 · 6 comments · Fixed by #852
Assignees
Labels
blocked Issue cannot be completed due to an external dependency. P1 Priority 1 - Improvement with wide impact, fix within 1 week

Comments

@bkmartinjr
Copy link
Contributor

bkmartinjr commented Dec 10, 2022

Due to an issue with TileDB query conditions, var/obs queries with very large number of values used in an in expression will be very slow (time is roughly linear to the number of items in the list).

For example, if this query has a very large list of lung_genes, it will be very slow.

lung_genes = list(lung_var["feature_id"])
lung_adata = cell_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_query={"tissue_general": "lung", "is_primary_data": True},
    var_query={"feature_id": lung_genes},
)

This expands to a value_filter on the var DataFrame that looks like feature_id in ["gene1", "gene2", ..., "geneN"], which currently has performance roughly O(N), where N is the number of possible matches (right side of the in operator).

a MUCH faster alternative is to directly use the soma_joinids (coordinates) and skip the table scan.

The filter issue has been reported to TileDB. Still considering an appropriate work-around for the get_anndata API. It may be helpful to expose the coords that the underlying experiment query supports.


Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310

Update [2023-07-11] this work did not make the TileDB embedded 2.16 release train, and is now slated for 2.17 (ETA Q3 2023)

Update [2023-09-29] this is now slated for early Q4 2023 in a 1.5.X release.

Update [2023-10-12] this has been fixed via single-cell-data/TileDB-SOMA#1756 Should be in the next release.


performance test case:

"""
Test case for cellxgene-census#36
"""

import sys
import time

import cellxgene_census
import tiledbsoma


def main():
    print(tiledbsoma.show_package_versions())
    organism = "homo_sapiens"
    with cellxgene_census.open_soma(census_version="latest") as census:
        exp = census["census_data"][organism]
        var_df = exp.ms["RNA"].var.read().concat().to_pandas()
        test_cases = [
            slice(0, 10),
            slice(0, 100),
            slice(0, 1000, 10),
            slice(0, 250),
            slice(0, 500),
            slice(0, 1000),
            slice(0, 2000),
            slice(0, 4000),
        ]

        print("test,                   coords,    filter")
        for test_slc in test_cases:
            genes = var_df[test_slc]

            # Coordinates test
            var_coords = genes.soma_joinid.to_numpy()
            t_start = time.perf_counter()
            exp.ms['RNA'].var.read(coords=(var_coords,)).concat()
            t_coords = time.perf_counter() - t_start

            # Value filter test
            var_value_filter = f"feature_id in {genes.feature_id.to_list()}"
            t_start = time.perf_counter()
            exp.ms['RNA'].var.read(value_filter=var_value_filter).concat()
            t_value_filter = time.perf_counter() - t_start

            print(f"{str(test_slc):>20}, {t_coords:8.2f}, {t_value_filter:8.2f}")


if __name__ == "__main__":
    sys.exit(main())

Results of above test case as of 20230927, on main branch and previous release:

tiledbsoma.__version__        1.4.3
TileDB-Py tiledb.version()    (0, 22, 3)
TileDB core version           2.16.3
libtiledbsoma version()       libtiledb=2.16.2
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1012-aws
None
The "latest" release is currently 2023-09-25. Specify 'census_version="2023-09-25"' in future calls to open_soma() to ensure data consistency.
test,                   coords,    filter
  slice(0, 10, None),     0.29,     0.33
 slice(0, 100, None),     0.30,     0.70
  slice(0, 1000, 10),     0.25,     0.62
 slice(0, 250, None),     0.36,     1.01
 slice(0, 500, None),     0.24,     1.71
slice(0, 1000, None),     0.24,     3.25
slice(0, 2000, None),     0.29,     6.28
slice(0, 4000, None),     0.29,    12.81


tiledbsoma.__version__        1.5.0rc0.post9.dev3876950273
TileDB-Py tiledb.version()    (0, 23, 0)
TileDB core version           2.17.0
libtiledbsoma version()       libtiledb=2.17.0
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1012-aws
None
The "latest" release is currently 2023-09-25. Specify 'census_version="2023-09-25"' in future calls to open_soma() to ensure data consistency.
test,                   coords,    filter
  slice(0, 10, None),     0.28,     0.44
 slice(0, 100, None),     0.29,     0.68
  slice(0, 1000, 10),     0.24,     0.67
 slice(0, 250, None),     0.34,     1.09
 slice(0, 500, None),     0.26,     2.42
slice(0, 1000, None),     0.28,     3.23
slice(0, 2000, None),     0.25,     6.32
slice(0, 4000, None),     0.31,    13.51
@bkmartinjr bkmartinjr self-assigned this Dec 13, 2022
@bkmartinjr bkmartinjr changed the title get_anndata and experiment_query are slow for large gene lists get_anndata is slow for large gene lists Dec 20, 2022
@atolopko-czi
Copy link
Collaborator

atolopko-czi commented Jan 4, 2023

@bkmartinjr: This issue is now just waiting for the TileDB fix for the underlying query condition, right? Unless we want to add performance tests it seems this isn't actionable. Close?

Or maybe we have the API issue a warning or error on a large value set for the var query.

@bkmartinjr
Copy link
Contributor Author

I don't think we should add any work-arounds.

I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?

@atolopko-czi atolopko-czi added the blocked Issue cannot be completed due to an external dependency. label Jan 4, 2023
@atolopko-czi
Copy link
Collaborator

I don't think we should add any work-arounds.

I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?

Added a (new) blocked label for now.

@pablo-gar pablo-gar added the P2 Priority 2 - Improvement with narrower impact, fix within a month label Feb 13, 2023
@pablo-gar pablo-gar added P1 Priority 1 - Improvement with wide impact, fix within 1 week and removed P2 Priority 2 - Improvement with narrower impact, fix within a month labels Mar 12, 2023
@brianraymor
Copy link

Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310

@ebezzi to follow up with TileDB to understand if this is unblocked. In the future, we will request that TileDB open a corresponding github issue that can block issues like this one.

@bkmartinjr
Copy link
Contributor Author

No need to ask - I already did two weeks ago :-).

The enhancement is slated for TileDB core v 2.16, which is imminent. After that, it simply requires incorporation into tiledbsoma.

@bkmartinjr
Copy link
Contributor Author

Update: verified that this is resolved by the tiledbsoma 1.5RC. Awaiting the actual 1.5 release, after which the cellxgene-census package will release with an updated dependency pin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Issue cannot be completed due to an external dependency. P1 Priority 1 - Improvement with wide impact, fix within 1 week
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants