[Feature request] `ExperimentAxisQuery.to_anndata()` (and R analogs) should drop unused categories on axis data frames #2765

pablo-gar · 2024-07-02T00:55:43Z

Is your feature request related to a problem? Please describe.
Some analytical pipelines, especially those related to visualization, rely on the categories of pandas.Categorical. With large SOMAExperiments, a query will often leave unused categories in potentially many columns of obs or var, so the user always has to iterate over all columns and call cat.remove_unused_categories() on each.

See for example this reproducible example

import cellxgene_census
import scanpy as sc
import tiledbsoma  # needed for tiledbsoma.AxisQuery below
census = cellxgene_census.open_soma(census_version="2024-05-20")

human = census["census_data"]["homo_sapiens"]
query = human.axis_query(
    measurement_name = "RNA",
    obs_query = tiledbsoma.AxisQuery(
        value_filter = "tissue == 'tongue' and is_primary_data == True"
    )
)

adata = query.to_anndata(column_names={"obs": ["tissue"]}, X_name = "normalized")
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="tissue")

Only one "tissue" was selected, but all of the hundreds of tissue categories are drawn in the UMAP plot.
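The per-column workaround described above can be sketched with pandas alone. This is a minimal illustration, not tiledbsoma code: `drop_unused_categories` is a hypothetical helper name, and the toy `obs` frame stands in for a real `adata.obs`.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from every categorical column."""
    out = df.copy()
    for name in out.columns:
        # Only categorical columns carry a category list that can go stale.
        if isinstance(out[name].dtype, pd.CategoricalDtype):
            out[name] = out[name].cat.remove_unused_categories()
    return out


# Toy stand-in for a large obs data frame (hypothetical data).
obs = pd.DataFrame({"tissue": pd.Categorical(["tongue", "lung", "tongue", "blood"])})

# Filtering rows does not shrink the category list.
subset = obs[obs["tissue"] == "tongue"]
print(list(subset["tissue"].cat.categories))

# After the cleanup only the categories actually present remain.
cleaned = drop_unused_categories(subset)
print(list(cleaned["tissue"].cat.categories))
```

With a real query result, the same helper could presumably be applied as `adata.obs = drop_unused_categories(adata.obs)` before plotting.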

Describe the solution you'd like
ExperimentAxisQuery.to_anndata() returns an anndata with unused categories already removed in the axis data frames


johnkerl · 2024-07-15T16:20:02Z

Needs Python & R both. We can subtask.

eddelbuettel · 2024-07-15T16:32:00Z

For R it is a pretty common task and a base R function

> example(droplevels)

drplvl> aq <- transform(airquality, Month = factor(Month, labels = month.abb[5:9]))

drplvl> aq <- subset(aq, Month != "Jul")

drplvl> table(           aq $Month)

May Jun Jul Aug Sep 
 31  30   0  31  30 

drplvl> table(droplevels(aq)$Month)

May Jun Aug Sep 
 31  30  31  30 
>

mojaveazure · 2024-07-24T14:05:46Z

I get why this is asked for, but I would not want to do it by default in the R API. R explicitly keeps unused factor levels, and while this is frustrating, it's a practice I adhere to even in Seurat. We make this easier in Seurat by exposing drop arguments to drop unused factor levels when needed (set to FALSE by default to adhere to R) and building methods for droplevels() to automagically prune unused factor levels. However, we require explicit user input to drop unused levels in order to fit in with the R way of doing things.

eddelbuettel · 2024-07-24T14:25:37Z

It's a topic that has generated heated online discussions in other places. @pablo-gar raises a good point with respect to the plot and its 'inflated' legend -- but as @mojaveazure noted there is also a consensus that retaining factor levels as default is preferable. One can compare this to enums in C/C++ where I find the analogy of the picky compiler warning asking us to supply all levels of an enum when we write a switch appropriate: doing otherwise may lead to gnarly silent bugs. This is a tricky question. My preference would be to do what Seurat does and offer an option to drop if requested, and I lean towards a default of 'off'.

johnkerl · 2024-07-24T14:50:31Z

Let's make this opt-in for Python and R both (default: keep all levels; drop levels only when requested) -- @pablo-gar will this meet your needs?
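The opt-in behavior proposed here can be sketched as follows. This is only an illustration of the default-off design under stated assumptions: `filter_frame` is a hypothetical function, and its `drop_levels` flag borrows the name used on the R side of this issue rather than quoting the actual Python signature.

```python
import pandas as pd


def filter_frame(df: pd.DataFrame, mask: pd.Series, drop_levels: bool = False) -> pd.DataFrame:
    """Filter rows; prune unused categories only when the caller opts in."""
    result = df.loc[mask].copy()
    if drop_levels:
        # select_dtypes(include="category") picks out the categorical columns.
        for name in result.select_dtypes(include="category").columns:
            result[name] = result[name].cat.remove_unused_categories()
    return result


# Hypothetical toy data standing in for an obs data frame.
obs = pd.DataFrame({"tissue": pd.Categorical(["tongue", "lung", "blood", "tongue"])})
is_tongue = obs["tissue"] == "tongue"

kept = filter_frame(obs, is_tongue)                      # default: keep all levels
pruned = filter_frame(obs, is_tongue, drop_levels=True)  # opt in: drop unused levels

print(len(kept["tissue"].cat.categories))
print(len(pruned["tissue"].cat.categories))
```

Keeping the default at `False` means existing callers see no change, and only callers who ask for pruning get the shorter category list.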

johnkerl · 2024-07-25T12:09:43Z

Needs R PR for Seurat/SCE export

eddelbuettel · 2024-07-25T12:22:01Z

While I am assigned may I suggest this is pivoted over to @mojaveazure instead?

ivirshup · 2024-07-25T15:55:15Z

I agree with opt in, we're going to move anndata over to this.

We initially went with opt out because categoricals were new in pandas and a lot of code threw errors when there were unused categories. It's generally much better now, and there are cases where you want to retain the categories.
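One case where retaining categories may be useful, offered as an assumption rather than something stated in this thread, is tabulation: pandas reports zero counts for unused categories, much like the `table()` output in the R example earlier. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical toy column standing in for obs["tissue"].
tissue = pd.Series(pd.Categorical(["tongue", "lung", "blood", "tongue"]))
subset = tissue[tissue == "tongue"]

# With categories retained, absent tissues still appear with a count of 0.
counts_kept = subset.value_counts()
print(counts_kept)

# After pruning, the information that other tissues exist is gone.
counts_dropped = subset.cat.remove_unused_categories().value_counts()
print(counts_dropped)
```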

…stors

R analog of #2811 and single-cell-data/SOMA#204; add a `drop_levels` parameter to the ecosystem outgestors to drop unused factor levels from resulting data frames

Modified SOMA methods:
- `SOMAExperimentAxisQuery$to_seurat()`: add `drop_levels` to drop unused levels from `obs` and `var` data frames
- `SOMAExperimentAxisQuery$to_seurat_assay()`: add `drop_levels` to drop unused levels from `var` data frame
- `SOMAExperimentAxisQuery$to_single_cell_experiment()`: add `drop_levels` to drop unused levels from `obs` and `var` data frames

Also shifts `SOMAExperimentAxisQuery$to_seurat()` and `SOMAExperimentAxisQuery$to_seurat_assay()` to use `SOMAExperimentAxisQuery$private$.load_df()` for loading `obs` and `var`, removing standalone code and increasing sharing with the SCE outgestor

resolves #2765

[SC-51945](https://app.shortcut.com/tiledb-inc/story/51945)

pablo-gar added the python-api label Jul 2, 2024

johnkerl self-assigned this Jul 5, 2024

johnkerl assigned nguyenv and unassigned johnkerl Jul 15, 2024

johnkerl assigned eddelbuettel Jul 15, 2024

johnkerl changed the title ~~[Feature request] ExperimentAxisQuery.to_anndata() should drop unused categories on axis data frames~~ [Feature request] ExperimentAxisQuery.to_anndata() (and R analogs) should drop unused categories on axis data frames Jul 15, 2024

nguyenv mentioned this issue Jul 24, 2024

[python] Optionally drop unused categories in ExperimentAxisQuery.to_anndata #2811

Merged

nguyenv linked a pull request Jul 24, 2024 that will close this issue

[python] Optionally drop unused categories in ExperimentAxisQuery.to_anndata #2811

Merged

nguyenv mentioned this issue Jul 24, 2024

Drop unused categories in ExperimentAxisQuery.to_anndata single-cell-data/SOMA#204

Merged

nguyenv closed this as completed in #2811 Jul 25, 2024

johnkerl reopened this Jul 25, 2024

johnkerl assigned mojaveazure and unassigned eddelbuettel and nguyenv Jul 25, 2024

mojaveazure mentioned this issue Aug 2, 2024

[r] Add drop_levels to SOMAExperimentAxisQuery -> ecosystem outgestors #2825

Merged

mojaveazure closed this as completed in #2825 Aug 2, 2024

mojaveazure closed this as completed in d6e0719 Aug 2, 2024
