[Feature request] ExperimentAxisQuery.to_anndata() (and R analogs) should drop unused categories on axis data frames
#2765
Comments
Needs Python & R both. We can subtask.
For R it is a pretty common task and a base R function:
> example(droplevels)
drplvl> aq <- transform(airquality, Month = factor(Month, labels = month.abb[5:9]))
drplvl> aq <- subset(aq, Month != "Jul")
drplvl> table(aq$Month)
May Jun Jul Aug Sep
 31  30   0  31  30
drplvl> table(droplevels(aq)$Month)
May Jun Aug Sep
 31  30  31  30
I get why this is asked for, but I would not want to do it by default in the R API. R explicitly keeps unused factor levels, and while this is frustrating, it's a practice I adhere to even in Seurat. We make this easier in Seurat by exposing |
It's a topic that has generated heated online discussions in other places. @pablo-gar raises a good point with respect to the plot and its 'inflated' legend -- but as @mojaveazure noted there is also a consensus that retaining factor levels as default is preferable. One can compare this to |
Let's make this opt-in for Python and R both (default: keep all levels; drop levels only when requested) -- @pablo-gar will this meet your needs?
Needs R PR for Seurat/SCE export.
While I am assigned, may I suggest this is pivoted over to @mojaveazure instead?
I agree with opt-in; we're going to move anndata over to this. We initially went with opt-out because categoricals were new in pandas and a lot of code threw errors when there were unused categories. It's generally much better now, and there are cases where you want to retain the categories.
…stors
R analog of #2811 and single-cell-data/SOMA#204; add a `drop_levels` parameter to the ecosystem outgestors to drop unused factor levels from resulting data frames

Modified SOMA methods:
- `SOMAExperimentAxisQuery$to_seurat()`: add `drop_levels` to drop unused levels from `obs` and `var` data frames
- `SOMAExperimentAxisQuery$to_seurat_assay()`: add `drop_levels` to drop unused levels from `var` data frame
- `SOMAExperimentAxisQuery$to_single_cell_experiment()`: add `drop_levels` to drop unused levels from `obs` and `var` data frames

Also shifts `SOMAExperimentAxisQuery$to_seurat()` and `SOMAExperimentAxisQuery$to_seurat_assay()` to use `SOMAExperimentAxisQuery$private$.load_df()` for loading `obs` and `var`, removing standalone code and increasing sharing with the SCE outgestor

resolves #2765

[SC-51945](https://app.shortcut.com/tiledb-inc/story/51945)
Is your feature request related to a problem? Please describe.
Some analytical pipelines, especially those related to visualization, rely on the categories of `pandas.Categorical`. In the case of large SOMAExperiments, a query will often result in unused categories for potentially many columns of `obs` or `var`, so the user always needs to iterate over all columns and perform a `cat.remove_unused_categories()` operation.
See for example this reproducible example.
Only one "tissue" was selected, but all of the hundreds of tissues are drawn in the UMAP.
Describe the solution you'd like
`ExperimentAxisQuery.to_anndata()` returns an anndata with unused categories already removed in the axis data frames.
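
For reference, the per-column workaround described in this issue could look roughly like the sketch below. It uses plain pandas only. The helper name `drop_unused_categories` and the toy `obs` frame are made up for illustration and are not part of the SOMA or anndata APIs.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from every categorical column.

    Hypothetical helper for illustration; not part of any SOMA/anndata API.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Only categorical columns carry a category list that can go stale.
        if isinstance(col.dtype, pd.CategoricalDtype):
            out[name] = col.cat.remove_unused_categories()
    return out


# Toy stand-in for an `obs` data frame returned by a query.
obs = pd.DataFrame(
    {
        "tissue": pd.Categorical(["lung", "liver", "lung", "brain"]),
        "n_genes": [1200, 950, 1100, 1300],
    }
)

# Filtering rows keeps the full category list on the column ...
lung_only = obs[obs["tissue"] == "lung"]
stale_categories = list(lung_only["tissue"].cat.categories)

# ... until the unused categories are dropped explicitly.
cleaned = drop_unused_categories(lung_only)
remaining_categories = list(cleaned["tissue"].cat.categories)
```

With an AnnData object, the same helper would presumably be applied to `adata.obs` and `adata.var` after the query; that step is an assumption and is not shown here.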