SOMA-component dataframe/schema accessors #104

johnkerl · 2022-05-20T16:03:06Z

Summary

These are schema-introspection accessors. In an upcoming PR, we'll have an examples/ directory, along with some vignette-style documentation.

Notes

These are largely TileDB-agnostic -- returning pandas.DataFrame etc. There is, however, a foo.tiledb_array_schema() method.
At present the single X layers are soma.X.data and soma.raw.X.data. In a future PR these will be soma.X["data"] and soma.raw.X["data"] along with an add-layer functionality.
At present (on this PR) there are three ways to get schema-related info:
- soma.foo.df().keys() and soma.foo.df().dtypes -- loads a dataframe and returns pandas/numpy types
  - Plus: not storage-specific/tiledb-specific
  - Minus: requires loading the df. We can do soma.foo.df(["nonesuch"]).dtypes to run a query that matches no rows, but this is a little clumsy -- maybe I can hide it within a method
  - Note we're not really doing work here -- the underlying TileDB-Py engine returns a Pandas dataframe, and we simply ask the Pandas dataframe what its types are.
- soma.foo.tiledb_array_schema()
  - Plusses: works right now & is authoritative & doesn't require dataframe load
  - Minus: tiledb-specific
- Solution: more storage-independent accessors
  - On this PR we have soma.foo.dim_names_to_types() and soma.foo.attr_names_to_types() -- these load the tiledb array schema (without loading the df) & then walk through that & pull out types in a storage-independent way.
  - See also Arrow type-system for SOMA #105 regarding a possible future Arrow type-system for SOMA.
The key driving use-case at present is @ebezzi's first tranche of work -- I'm eager to get feedback on this PR & make it what is most needed.
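
The storage-independent approach in the notes (load the array schema, then walk it and pull out types without loading the dataframe) can be sketched roughly as below. This is only an illustration, not the code in this PR: `FakeDim`, `FakeAttr`, and `FakeSchema` are hypothetical stand-ins for whatever the storage engine's schema object exposes, and types are plain strings rather than numpy dtypes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical stand-ins for a storage engine's schema objects; these are
# NOT TileDB-Py classes, just enough structure to show the walk.
@dataclass
class FakeDim:
    name: str
    dtype: str


@dataclass
class FakeAttr:
    name: str
    dtype: str


@dataclass
class FakeSchema:
    dims: List[FakeDim] = field(default_factory=list)
    attrs: List[FakeAttr] = field(default_factory=list)


def dim_names_to_types(schema: FakeSchema) -> Dict[str, str]:
    """Map each dimension name to its type, reading only the schema."""
    return {dim.name: dim.dtype for dim in schema.dims}


def attr_names_to_types(schema: FakeSchema) -> Dict[str, str]:
    """Map each attribute name to its type, reading only the schema."""
    return {attr.name: attr.dtype for attr in schema.attrs}


# Example schema loosely modeled on the soma.var output shown in this PR.
var_schema = FakeSchema(
    dims=[FakeDim("var_id", "bytes")],
    attrs=[FakeAttr("means", "float64"), FakeAttr("highly_variable", "uint8")],
)

print(dim_names_to_types(var_schema))
print(attr_names_to_types(var_schema))
```

The point of the design is that callers see only name-to-type mappings, so the storage-specific schema object never leaks into user code.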

Examples:

>>> import tiledbsc as t
>>> soma = t.SOMA('tabula-sapiens-epithelial')

soma.obs

>>> soma.obs. <--- tab-complete
soma.obs.attribute_filter(    soma.obs.from_dataframe(      soma.obs.keys(                soma.obs.tiledb_array_schema(
soma.obs.ctx                  soma.obs.get_attr_names(      soma.obs.name                 soma.obs.to_dataframe(
soma.obs.df(                  soma.obs.get_dim_names(       soma.obs.object_type(         soma.obs.uri
soma.obs.dim_name             soma.obs.has_attr_name(       soma.obs.open(                soma.obs.verbose
soma.obs.dim_select(          soma.obs.ids(                 soma.obs.shape(
soma.obs.exists(              soma.obs.indent               soma.obs.soma_options

>>> soma.obs.dim_name
'obs_id'

>>> soma.obs.keys()
['tissue_in_publication', 'assay_ontology_term_id', 'donor', 'anatomical_information', 'n_counts_UMIs', 'n_genes', 'cell_ontology_class', 'free_annotation', 'manually_annotated', 'compartment', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'disease_ontology_term_id', 'ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'ethnicity', 'development_stage']
>>>

>>> soma.obs.dim_names_to_types()
{'obs_id': dtype('S')}

>>> soma.obs.attr_names_to_types()
{'orig.ident': dtype('int32'), 'nCount_RNA': dtype('float64'), 'nFeature_RNA': dtype('int32'), 'RNA_snn_res.0.8': dtype('int32'), 'letter.idents': dtype('int32'), 'groups': dtype('<U'), 'RNA_snn_res.1': dtype('int32')}

Loaded dataframe:

>>> soma.obs.df()
                                                   tissue_in_publication assay_ontology_term_id  ...                           ethnicity        development_stage
obs_id                                                                                           ...
AAACCCAAGAACTCCT_TSP14_Lung_Distal_10X_1_1                          Lung            EFO:0009922  ...                            European  59-year-old human stage
AAACCCAAGAGGGTAA_TSP8_Prostate_NA_10X_1_1                       Prostate            EFO:0009922  ...          Hispanic or Latin American  56-year-old human stage
AAACCCAAGCCACTCG_TSP14_Prostate_NA_10X_1_2                      Prostate            EFO:0009922  ...                            European  59-year-old human stage
AAACCCAAGCCGGAAT_TSP14_Liver_NA_10X_1_1                            Liver            EFO:0009922  ...                            European  59-year-old human stage
AAACCCAAGCCTTGAT_TSP7_Tongue_Posterior_10X_1_1                    Tongue            EFO:0009922  ...                            European  69-year-old human stage
...                                                                  ...                    ...  ...                                 ...                      ...
TTTGTTGTCTACGGTA_TSP5_Eye_NA_10X_1_2                                 Eye            EFO:0009922  ...                            European  40-year-old human stage
TTTGTTGTCTATCGGA_TSP2_Lung_proxmedialdistal_10X...                  Lung            EFO:0009922  ...  African American or Afro-Caribbean  61-year-old human stage
TTTGTTGTCTCTCAAT_TSP2_Kidney_NA_10X_1_2                           Kidney            EFO:0009922  ...  African American or Afro-Caribbean  61-year-old human stage
TTTGTTGTCTGCCTGT_TSP4_Mammary_NA_10X_1_2                         Mammary            EFO:0009922  ...  African American or Afro-Caribbean  38-year-old human stage
TTTGTTGTCTGTAACG_TSP14_Prostate_NA_10X_1_1                      Prostate            EFO:0009922  ...                            European  59-year-old human stage

[104148 rows x 26 columns]

>>> soma.obs.df(['TTTGTTGTCTACGGTA_TSP5_Eye_NA_10X_1_2'])
                                     tissue_in_publication assay_ontology_term_id donor anatomical_information  ...     sex  tissue ethnicity        development_stage
obs_id                                                                                                          ...
TTTGTTGTCTACGGTA_TSP5_Eye_NA_10X_1_2                   Eye            EFO:0009922  TSP5                    nan  ...  female  b'eye'  European  40-year-old human stage

[1 rows x 26 columns]

>>> soma.obs.df().dtypes
tissue_in_publication                  object
assay_ontology_term_id                 object
donor                                  object
anatomical_information                 object
n_counts_UMIs                         float32
n_genes                                 int64
cell_ontology_class                    object
free_annotation                        object
manually_annotated                      uint8
compartment                            object
sex_ontology_term_id                   object
is_primary_data                         uint8
organism_ontology_term_id              object
disease_ontology_term_id               object
ethnicity_ontology_term_id             object
development_stage_ontology_term_id     object
cell_type_ontology_term_id             object
tissue_ontology_term_id                object
cell_type                              object
assay                                  object
disease                                object
organism                               object
sex                                    object
tissue                                 object
ethnicity                              object
development_stage                      object
dtype: object

soma.var

>>> soma.var.dim_name
'var_id'

>>> soma.var.keys()
['feature_type', 'ensemblid', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std', 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference']

>>> soma.var.dim_names_to_types()
{'var_id': dtype('S')}

>>> soma.var.attr_names_to_types()
{'vst.mean': dtype('float64'), 'vst.variance': dtype('float64'), 'vst.variance.expected': dtype('float64'), 'vst.variance.standardized': dtype('float64'), 'vst.variable': dtype('int32')}

Loaded dataframe:

>>> soma.var.df()
                    feature_type           ensemblid  highly_variable     means  dispersions  ...       std  feature_biotype  feature_is_filtered        feature_name  feature_reference
var_id                                                                                        ...
ENSG00000000003  Gene Expression  ENSG00000000003.14                0  0.137661     1.124491  ...  0.267483             gene                    0           b'TSPAN6'     NCBITaxon:9606
ENSG00000000005  Gene Expression   ENSG00000000005.6                1  0.018164     3.680616  ...  0.091566             gene                    0             b'TNMD'     NCBITaxon:9606
ENSG00000000419  Gene Expression  ENSG00000000419.12                0  0.273620     1.330586  ...  0.370115             gene                    0             b'DPM1'     NCBITaxon:9606
ENSG00000000457  Gene Expression  ENSG00000000457.14                0  0.085826     0.832207  ...  0.213709             gene                    0            b'SCYL3'     NCBITaxon:9606
ENSG00000000460  Gene Expression  ENSG00000000460.17                0  0.029752     1.076580  ...  0.124257             gene                    0         b'C1orf112'     NCBITaxon:9606
...                          ...                 ...              ...       ...          ...  ...       ...              ...                  ...                 ...                ...
ENSG00000286268  Gene Expression   ENSG00000286268.1                0  0.000961     1.718428  ...  0.022551             gene                    0  b'LL0XNC01-30I4.1'     NCBITaxon:9606
ENSG00000286269  Gene Expression   ENSG00000286269.1                0  0.001942     4.303867  ...  0.028257             gene                    0     b'RP11-510D4.1'     NCBITaxon:9606
ENSG00000286270  Gene Expression   ENSG00000286270.1                0  0.000055     0.746089  ...  0.005323             gene                    0  b'XXyac-YX60D10.3'     NCBITaxon:9606
ENSG00000286271  Gene Expression   ENSG00000286271.1                0  0.000602     2.585975  ...  0.016921             gene                    0    b'CTD-2201E18.6'     NCBITaxon:9606
ENSG00000286272  Gene Expression   ENSG00000286272.1                0  0.001834     0.572325  ...  0.031135             gene                    0     b'RP11-444B5.1'     NCBITaxon:9606

[58559 rows x 12 columns]

>>> soma.var.df().dtypes
feature_type            object
ensemblid               object
highly_variable          uint8
means                  float64
dispersions            float64
dispersions_norm       float32
mean                   float64
std                    float64
feature_biotype         object
feature_is_filtered      uint8
feature_name            object
feature_reference       object
dtype: object

>>> soma.var.tiledb_array_schema()
ArraySchema(
  domain=Domain(*[
    Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='feature_type', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='ensemblid', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='highly_variable', dtype='uint8', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='means', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='dispersions', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='dispersions_norm', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='mean', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='std', dtype='float64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='feature_biotype', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='feature_is_filtered', dtype='uint8', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='feature_name', dtype='|S0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='feature_reference', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

soma.X.data

>>> soma.X.data.dim_names()
('obs_id', 'var_id')

>>> soma.X.data.df(['nonesuch','nonesuch']).dtypes
obs_id     object
var_id     object
value     float32
dtype: object
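
The notes mention possibly hiding this "query for IDs that don't exist" trick inside a method. One way that might look is sketched below. `dtypes_via_empty_query` and `FakeComponent` are hypothetical names, not part of tiledbsc, and the helper assumes `df()` accepts one ID list per dimension, as in the two-argument `soma.X.data.df(...)` call elsewhere in this PR.

```python
from types import SimpleNamespace
from typing import Any, List


def dtypes_via_empty_query(component: Any, num_dims: int = 1) -> Any:
    """Query for IDs assumed not to exist and return the empty result's dtypes."""
    # Assumption: component.df() takes one ID list per dimension.
    no_such_ids: List[List[str]] = [["nonesuch"] for _ in range(num_dims)]
    return component.df(*no_such_ids).dtypes


class FakeComponent:
    """Hypothetical stand-in for a SOMA component, for demonstration only."""

    def df(self, *id_lists: List[str]) -> SimpleNamespace:
        # A real component would return a pandas DataFrame; this stand-in only
        # carries a dtypes mapping so the sketch runs without any storage.
        return SimpleNamespace(
            dtypes={"obs_id": "object", "var_id": "object", "value": "float32"}
        )


print(dtypes_via_empty_query(FakeComponent(), num_dims=2))
```

This still goes through the query path, so it shares the downside noted above: the types are whatever the dataframe reports, not what the array schema declares.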

>>> soma.X.data.dim_names_to_types()
{'obs_id': dtype('S'), 'var_id': dtype('S')}

>>> soma.X.data.attr_names_to_types()
{'value': dtype('float64')}

Loaded dataframe:

>>> soma.X.data.df(['AAACCCAAGAGGGTAA_TSP8_Prostate_NA_10X_1_1'],['ENSG00000286269'])
Empty DataFrame
Columns: [obs_id, var_id, value]
Index: []
>>> soma.X.data.df(['AAACCCAAGAGGGTAA_TSP8_Prostate_NA_10X_1_1'],['ENSG00000286269']).dtypes
obs_id     object
var_id     object
value     float32
dtype: object
>>>

soma.obsm

>>> soma.obsm.keys()
['X_tsne', 'X_pca']

>>> soma.obsm['X_pca'].dim_names_to_types()
{'obs_id': dtype('S')}

>>> soma.obsm['X_pca'].attr_names_to_types()
{'X_pca_1': dtype('float64'), 'X_pca_2': dtype('float64'), 'X_pca_3': dtype('float64'), 'X_pca_4': dtype('float64'), 'X_pca_5': dtype('float64'), 'X_pca_6': dtype('float64'), 'X_pca_7': dtype('float64'), 'X_pca_8': dtype('float64'), 'X_pca_9': dtype('float64'), 'X_pca_10': dtype('float64'), 'X_pca_11': dtype('float64'), 'X_pca_12': dtype('float64'), 'X_pca_13': dtype('float64'), 'X_pca_14': dtype('float64'), 'X_pca_15': dtype('float64'), 'X_pca_16': dtype('float64'), 'X_pca_17': dtype('float64'), 'X_pca_18': dtype('float64'), 'X_pca_19': dtype('float64')}

Loaded dataframe:

>>> soma.obsm['X_pca'].df()
            obs_id   X_pca_1   X_pca_2   X_pca_3   X_pca_4  ...  X_pca_15  X_pca_16  X_pca_17  X_pca_18  X_pca_19
0   AAATTCGAATCACG -0.599730  0.970809  2.640582 -0.295104  ... -0.186299 -0.383766  0.013765 -0.127195  0.026804
1   AAGCAAGAGCTTAG -0.919219 -2.043828 -0.173918  0.109657  ... -0.232017 -0.016800 -0.172934  0.009386 -0.019896
2   AAGCGACTTTGACG -1.380380  1.284101  1.918055  1.247647  ...  0.117602  0.169661  0.130810 -0.041855  0.027550
3   AATGCGTGGACGGA -1.494413  1.783583  0.661433 -0.584200  ... -0.163601 -0.082049 -0.145453 -0.045108  0.055206
4   AATGTTGACAGTCA -0.487798 -1.162107 -0.306267  0.702189  ...  0.045102 -0.267228  0.461754 -0.167920  0.026950
..             ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
75  TTACGTACGTTCAG  8.858789 -0.195728 -0.453889  0.088018  ... -0.069778 -0.208222  0.792830 -0.428616 -0.294552
76  TTGAGGACTACGCA -0.917909  1.610199 -3.206508 -2.071275  ...  0.638907 -0.427769 -0.001314 -0.113981 -0.110469
77  TTGCATTGAGCTAC -0.997103 -0.155518 -0.595261  0.234394  ...  0.113765  0.263183  0.124576  0.056784 -0.027052
78  TTGGTACTGAATCC -1.418764  0.764986 -2.457070  4.852126  ... -0.232547 -0.472155  0.000728 -0.064450  0.099118
79  TTTAGCTGTACTCT -1.447483  1.583223  0.457882 -0.495953  ...  0.013788  0.087907 -0.088049 -0.014984  0.033992

[80 rows x 20 columns]

>>> soma.obsm['X_pca'].df().dtypes
obs_id       object
X_pca_1     float64
X_pca_2     float64
X_pca_3     float64
X_pca_4     float64
X_pca_5     float64
X_pca_6     float64
X_pca_7     float64
X_pca_8     float64
X_pca_9     float64
X_pca_10    float64
X_pca_11    float64
X_pca_12    float64
X_pca_13    float64
X_pca_14    float64
X_pca_15    float64
X_pca_16    float64
X_pca_17    float64
X_pca_18    float64
X_pca_19    float64
dtype: object

bkmartinjr

This PR looks good & reasonable to me as long as we rename foo.schema() to foo.tiledb_schema(), and continue to work toward a portable SOMA schema() function for these slots (e.g., something similar to the Arrow proposal on the table). That would both enable portable use and remove the need to fetch data before introspecting.

johnkerl · 2022-05-20T17:24:21Z

This PR looks good & reasonable to me as long as we rename foo.schema() to foo.tiledb_schema(), and continue to work toward a portable SOMA schema() function for these slots (e.g., something similar to the Arrow proposal on the table). That would both enable portable use and remove the need to fetch data before introspecting.

@bkmartinjr fantastic! :)

foo.schema() is now foo.tiledb_array_schema().

And I created #105 to track discussion around the Arrow-type-system idea -- which I think is a great idea, but IIUC we need some buy-in from folks before coding.

johnkerl added 2 commits May 20, 2022 11:41

matrix.df() accessors

9c3c839

unit-test cases

aac677c

johnkerl requested review from aaronwolen, bkmartinjr and ebezzi May 20, 2022 16:12

johnkerl marked this pull request as ready for review May 20, 2022 16:13

bkmartinjr approved these changes May 20, 2022

johnkerl changed the title ~~Matrix.df() accessors~~ SOMA-component dataframe/schema accessors May 20, 2022

johnkerl force-pushed the kerl/matrix-df-accessors branch from dcbb468 to ebbdb32 Compare May 20, 2022 17:14

soma.foo.schema() -> soma.foo.tiledb_array_schema()

30f8d25

johnkerl force-pushed the kerl/matrix-df-accessors branch from ebbdb32 to 30f8d25 Compare May 20, 2022 18:31

ebezzi approved these changes May 20, 2022

johnkerl merged commit 89923c1 into main May 20, 2022

johnkerl mentioned this pull request May 24, 2022

Progress tracker #113

Closed

61 tasks

johnkerl deleted the kerl/matrix-df-accessors branch June 1, 2022 13:25

johnkerl added a commit that referenced this pull request Sep 1, 2022

Apply #104 to scripts/test

af11d45
