Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] ExperimentAxisQuery.to_anndata() (and R analogs) should drop unused categories on axis data frames #2765

Closed
pablo-gar opened this issue Jul 2, 2024 · 8 comments · Fixed by #2811 or #2825
Assignees

Comments

@pablo-gar
Copy link
Member

pablo-gar commented Jul 2, 2024

Is your feature request related to a problem? Please describe.
Some analytical pipelines, specially those that relate to visualizations rely on the categories of pandas.Categorical. In the case of large SOMAExperiments, many times a query will result on unused categories for potentially many columns of obs or var, thus the user needs to always iterate on all columns and perform a cat.remove_unused_categories() operation.

See for example this reproducible example

import cellxgene_census
import scanpy as sc
census = cellxgene_census.open_soma(census_version="2024-05-20")

human = census["census_data"]["homo_sapiens"]
query = human.axis_query(
    measurement_name = "RNA",
    obs_query = tiledbsoma.AxisQuery(
        value_filter = "tissue == 'tongue' and is_primary_data == True"
    )
)

adata = query.to_anndata(column_names={"obs": ["tissue"]}, X_name = "normalized")
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="tissue")

Only one "tissue" was selected but all hundreds of tissues are drawn in the umap

image

Describe the solution you'd like
ExperimentAxisQuery.to_anndata() returns an anndata with unused categories already removed in the axis data frames

@johnkerl johnkerl self-assigned this Jul 5, 2024
@johnkerl
Copy link
Member

Needs Python & R both. We can subtask.

@johnkerl johnkerl assigned nguyenv and unassigned johnkerl Jul 15, 2024
@eddelbuettel
Copy link
Contributor

For R it is a pretty common task and a base R function

> example(droplevels)

drplvl> aq <- transform(airquality, Month = factor(Month, labels = month.abb[5:9]))

drplvl> aq <- subset(aq, Month != "Jul")

drplvl> table(           aq $Month)

May Jun Jul Aug Sep 
 31  30   0  31  30 

drplvl> table(droplevels(aq)$Month)

May Jun Aug Sep 
 31  30  31  30 
> 

@johnkerl johnkerl changed the title [Feature request] ExperimentAxisQuery.to_anndata() should drop unused categories on axis data frames [Feature request] ExperimentAxisQuery.to_anndata() (and R analogs) should drop unused categories on axis data frames Jul 15, 2024
@mojaveazure
Copy link
Member

I get why this is asked for, but I would not want to do it by default in the R API. R explicitly keeps unused factor levels, and while this is frustrating, it's a practice I adhere to even in Seurat. We make this easier in Seurat by exposing drop arguments to drop unused factor levels when needed (set to FALSE by default to adhere to R) and building methods to droplevels() to automagically prune unused factor levels. However, we require explicit user input to drop unused levels in order to fit in with the R way of doing things

@eddelbuettel
Copy link
Contributor

It's a topic that has generated heated online discussions in other places. @pablo-gar raises a good point with respect to the plot and its 'inflated' legend -- but as @mojaveazure noted there is also a consensus that retaining factor levels as default is preferable. One can compare this to enums in C/C++ where I find the analogy of the picky compiler warning asking us to supply all levels of an enum when we write a switch appropriate: doing otherwise may lead to gnarly silent bugs. This is a tricky question. My preference would be to do what Seurat does and offer an option to drop if requested, and I lean towards a default of 'off'.

@johnkerl
Copy link
Member

johnkerl commented Jul 24, 2024

Let's make this opt-in for Python and R both (default: keep all levels; drop levels only when requested) -- @pablo-gar will this meet your needs?

@johnkerl
Copy link
Member

johnkerl commented Jul 25, 2024

Needs R PR for Seurat/SCE export

@johnkerl johnkerl reopened this Jul 25, 2024
@eddelbuettel
Copy link
Contributor

While I am assigned may I suggest this is pivoted over to @mojaveazure instead?

@johnkerl johnkerl assigned mojaveazure and unassigned eddelbuettel and nguyenv Jul 25, 2024
@ivirshup
Copy link
Collaborator

I agree with opt in, we're going to move anndata over to this.

We initially went with opt out because categoricals were new in pandas and a lot of code threw errors when there were unused categories. It's generally much better now, and there are cases where you want to retain the categories.

mojaveazure added a commit that referenced this issue Aug 2, 2024
…stors

R analog of #2811 and single-cell-data/SOMA#204; add a `drop_levels`
paramter to the ecosystem outgestors to drop unused factor levels from
resulting data frames

Modified SOMA methods:
 - `SOMAExperimentAxisQuery$to_seurat()`: add `drop_levels` to drop
   drop unused levels from `obs` and `var` data frames
 - `SOMAExperimentAxisQuery$to_seurat_assay()`: add `drop_levels` to
   drop unused levels from `var` data frame
 - `SOMAExperimentAxisQuery$to_single_cell_experiment()`: add
   `drop_levels` to drop unused levels from `obs` and `var` data frames

Also shifts `SOMAExperimentAxisQuery$to_seurat()` and
`SOMAExperimentAxisQuery$to_seurat_assay()` to use
`SOMAExperimentAxisQuery$private$.load_df()` for loading `obs` and
`var`; removing standalone code and increase sharing with the SCE
outgestor

resolves #2765

[SC-51945](https://app.shortcut.com/tiledb-inc/story/51945)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment