[python] Handling of column_names is inconsistent in ExperimentAxisQuery.to_anndata #755

Closed
bkmartinjr opened this issue Jan 19, 2023 · 0 comments · Fixed by single-cell-data/SOMA#89
Labels
bug Something isn't working python-api

bkmartinjr commented Jan 19, 2023

In most of the SOMA API, column_names handling is consistent. The result of a DataFrame read includes the union of:

  • columns requested via column_names
  • any columns referenced in the value_filter string

This is true for DataFrame.read(), ExperimentAxisQuery.obs(), and ExperimentAxisQuery.var().
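
As a rough illustration of the union rule described above, here is a minimal sketch in plain Python. The helper name expected_columns is hypothetical and not part of the SOMA API, and it assumes the value_filter string parses as a Python expression in which bare identifiers are column names. It only models which columns a caller would expect back; it is not how the library implements reads.

```python
import ast


def expected_columns(column_names, value_filter=None):
    """Return requested columns plus any columns named in value_filter.

    Hypothetical helper for illustration only; not a SOMA API call.
    """
    result = list(column_names)
    if value_filter:
        # Assumption: the filter parses as a Python expression, so bare
        # identifiers such as tissue_general are the referenced columns.
        tree = ast.parse(value_filter, mode="eval")
        names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
        # Keep filter columns in the order they appear in the string.
        names.sort(key=lambda n: (n.lineno, n.col_offset))
        for n in names:
            if n.id not in result:
                result.append(n.id)
    return result


print(expected_columns(["soma_joinid"], 'tissue_general=="lung"'))
# ['soma_joinid', 'tissue_general']
```

By this rule, a read that requests only soma_joinid with a filter on tissue_general would be expected to return both columns, whichever interface performs the read.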

There is a bug in ExperimentAxisQuery.to_anndata(), which does not include the columns referenced in the value_filter. It should work the same as the other interfaces.

The bug is in ExperimentAxisQuery._read_axis_dataframe(), which is overly aggressive in throwing away columns.

Example:

In [15]: next(mouse.obs.read(column_names=['soma_joinid'], value_filter='tissue_general=="lung"'))
Out[15]: 
pyarrow.Table
soma_joinid: int64
tissue_general: large_string
----
soma_joinid: [[1919575,1919576,1919577,1919578,1919579,...,2658971,2658972,2658973,2658974,2658975]]
tissue_general: [["lung","lung","lung","lung","lung",...,"lung","lung","lung","lung","lung"]]

In [16]: with mouse.axis_query("RNA", obs_query=soma.AxisQuery(value_filter='tissue_general == "lung"')) as query:
    ...:     print(next(query.obs(column_names=['soma_joinid'])).to_pandas())
    ...: 
        soma_joinid tissue_general
0           1919575           lung
1           1919576           lung
2           1919577           lung
3           1919578           lung
4           1919579           lung
...             ...            ...
127305      2658971           lung
127306      2658972           lung
127307      2658973           lung
127308      2658974           lung
127309      2658975           lung

[127310 rows x 2 columns]

In [17]: with mouse.axis_query("RNA", obs_query=soma.AxisQuery(value_filter='tissue_general == "lung"')) as query:
    ...:     print(query.to_anndata("raw", column_names={'obs': ['soma_joinid']}))
    ...: 
AnnData object with n_obs × n_vars = 127310 × 52392
    obs: 'soma_joinid'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'