get_anndata is slow for large gene lists #36

bkmartinjr · 2022-12-10T21:27:41Z

Due to an issue with TileDB query conditions, var/obs queries with very large number of values used in an in expression will be very slow (time is roughly linear to the number of items in the list).

For example, if this query has a very large list of lung_genes, it will be very slow.

lung_genes = list(lung_var["feature_id"])
lung_adata = cell_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_query={"tissue_general": "lung", "is_primary_data": True},
    var_query={"feature_id": lung_genes},
)

This expands to a value_filter on the var DataFrame that looks like feature_id in ["gene1", "gene2", ..., "geneN"], which currently has performance roughly O(N), where N is the number of possible matches (right side of the in operator).

a MUCH faster alternative is to directly use the soma_joinids (coordinates) and skip the table scan.

The filter issue has been reported to TileDB. Still considering an appropriate work-around for the get_anndata API. It may be helpful to expose the coords that the underlying experiment query supports.

~~Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310~~

~~Update [2023-07-11] this work did not make the TileDB embedded 2.16 release train, and is now slated for 2.17 (ETA Q3 2023)~~

Update [2023-09-29] this is now slated for early Q4 2023 in a 1.5.X release.

Update [2023-10-12] this has been fixed via single-cell-data/TileDB-SOMA#1756 Should be in the next release.

performance test case:

"""
Test case for cellxgene-census#36
"""

import sys
import time

import cellxgene_census
import tiledbsoma


def main():
    print(tiledbsoma.show_package_versions())
    organism = "homo_sapiens"
    with cellxgene_census.open_soma(census_version="latest") as census:
        exp = census["census_data"][organism]
        var_df = exp.ms["RNA"].var.read().concat().to_pandas()
        test_cases = [
            slice(0, 10),
            slice(0, 100),
            slice(0, 1000, 10),
            slice(0, 250),
            slice(0, 500),
            slice(0, 1000),
            slice(0, 2000),
            slice(0, 4000),
        ]

        print("test,                   coords,    filter")
        for test_slc in test_cases:
            genes = var_df[test_slc]

            # Coordinates test
            var_coords = genes.soma_joinid.to_numpy()
            t_start = time.perf_counter()
            exp.ms['RNA'].var.read(coords=(var_coords,)).concat()
            t_coords = time.perf_counter() - t_start

            # Value filter test
            var_value_filter = f"feature_id in {genes.feature_id.to_list()}"
            t_start = time.perf_counter()
            exp.ms['RNA'].var.read(value_filter=var_value_filter).concat()
            t_value_filter = time.perf_counter() - t_start

            print(f"{str(test_slc):>20}, {t_coords:8.2f}, {t_value_filter:8.2f}")


if __name__ == "__main__":
    sys.exit(main())

Results of above test case as of 20230927, on main branch and previous release:

tiledbsoma.__version__        1.4.3
TileDB-Py tiledb.version()    (0, 22, 3)
TileDB core version           2.16.3
libtiledbsoma version()       libtiledb=2.16.2
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1012-aws
None
The "latest" release is currently 2023-09-25. Specify 'census_version="2023-09-25"' in future calls to open_soma() to ensure data consistency.
test,                   coords,    filter
  slice(0, 10, None),     0.29,     0.33
 slice(0, 100, None),     0.30,     0.70
  slice(0, 1000, 10),     0.25,     0.62
 slice(0, 250, None),     0.36,     1.01
 slice(0, 500, None),     0.24,     1.71
slice(0, 1000, None),     0.24,     3.25
slice(0, 2000, None),     0.29,     6.28
slice(0, 4000, None),     0.29,    12.81


tiledbsoma.__version__        1.5.0rc0.post9.dev3876950273
TileDB-Py tiledb.version()    (0, 23, 0)
TileDB core version           2.17.0
libtiledbsoma version()       libtiledb=2.17.0
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1012-aws
None
The "latest" release is currently 2023-09-25. Specify 'census_version="2023-09-25"' in future calls to open_soma() to ensure data consistency.
test,                   coords,    filter
  slice(0, 10, None),     0.28,     0.44
 slice(0, 100, None),     0.29,     0.68
  slice(0, 1000, 10),     0.24,     0.67
 slice(0, 250, None),     0.34,     1.09
 slice(0, 500, None),     0.26,     2.42
slice(0, 1000, None),     0.28,     3.23
slice(0, 2000, None),     0.25,     6.32
slice(0, 4000, None),     0.31,    13.51

The text was updated successfully, but these errors were encountered:

atolopko-czi · 2023-01-04T18:57:46Z

@bkmartinjr: This issue is now just waiting for the TileDB fix for the underlying query condition, right? Unless we want to add performance tests it seems this isn't actionable. Close?

Or maybe we have the API issue a warning or error on a large value set for the var query.

bkmartinjr · 2023-01-04T19:02:30Z

I don't think we should add any work-arounds.

I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?

atolopko-czi · 2023-01-04T19:09:38Z

I don't think we should add any work-arounds.

I was leaving it open to track the issue on our side, as we don't have access to their tracking system. I didn't want to lose sight of our need for this to be fixed soon. Do you have an alternative preference for tracking these types of dependencies?

Added a (new) blocked label for now.

brianraymor · 2023-05-24T17:22:29Z

Update: ETA from TileDB for a fix to the underlying query condition is late Q1. Tracking id 24310

@ebezzi to follow up with TileDB to understand if this is unblocked. In the future, we will request that TileDB open a corresponding github issue that can block issues like this one.

bkmartinjr · 2023-05-24T17:35:50Z

No need to ask - I already did two weeks ago :-).

The enhancement is slated for TileDB core v 2.16, which is imminent. After that, it simply requires incorporation into tiledbsoma.

bkmartinjr · 2023-10-19T19:48:22Z

Update: verified that this is resolved by the tiledbsoma 1.5RC. Awaiting the actual 1.5 release, after which the cellxgene-census package will release with an updated dependency pin.

bkmartinjr self-assigned this Dec 13, 2022

bkmartinjr mentioned this issue Dec 14, 2022

docs: Add comp bio notebooks #45

Merged

bkmartinjr changed the title ~~get_anndata and experiment_query are slow for large gene lists~~ get_anndata is slow for large gene lists Dec 20, 2022

bkmartinjr mentioned this issue Dec 20, 2022

simplify get_anndata API #52

Merged

atolopko-czi added the blocked Issue cannot be completed due to an external dependency. label Jan 4, 2023

pablo-gar added the P2 Priority 2 - Improvement with narrower impact, fix within a month label Feb 13, 2023

atolopko-czi mentioned this issue Mar 8, 2023

Add validation data checks #243

Merged

pablo-gar added P1 Priority 1 - Improvement with wide impact, fix within 1 week and removed P2 Priority 2 - Improvement with narrower impact, fix within a month labels Mar 12, 2023

johnkerl mentioned this issue Sep 29, 2023

[r/python] Improve query performance for in conditions having many terms single-cell-data/TileDB-SOMA#1742

Closed

bkmartinjr mentioned this issue Nov 7, 2023

prep for 1.8 release #852

Merged

ebezzi closed this as completed in #852 Nov 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_anndata is slow for large gene lists #36

get_anndata is slow for large gene lists #36

bkmartinjr commented Dec 10, 2022 •

edited

Loading

atolopko-czi commented Jan 4, 2023 •

edited

Loading

bkmartinjr commented Jan 4, 2023

atolopko-czi commented Jan 4, 2023

brianraymor commented May 24, 2023

bkmartinjr commented May 24, 2023

bkmartinjr commented Oct 19, 2023

get_anndata is slow for large gene lists #36

get_anndata is slow for large gene lists #36

Comments

bkmartinjr commented Dec 10, 2022 • edited Loading

atolopko-czi commented Jan 4, 2023 • edited Loading

bkmartinjr commented Jan 4, 2023

atolopko-czi commented Jan 4, 2023

brianraymor commented May 24, 2023

bkmartinjr commented May 24, 2023

bkmartinjr commented Oct 19, 2023

bkmartinjr commented Dec 10, 2022 •

edited

Loading

atolopko-czi commented Jan 4, 2023 •

edited

Loading