diff --git a/apis/python/README-ingestion.md b/apis/python/README-ingestion.md
new file mode 100644
index 0000000000..13d0677154
--- /dev/null
+++ b/apis/python/README-ingestion.md
@@ -0,0 +1,705 @@
+# TL;DR
+
+```
+./tools/desc-ann ./anndata/pbmc3k_processed.h5ad
+
+./tools/ingestor ./anndata/pbmc3k_processed.h5ad ./tiledb-data/pbmc3k_processed
+
+./tools/desc-soma ./tiledb-data/pbmc3k_processed
+```
+
+# Overview
+
+* Sample data:
+ * [anndata](./anndata) contains some files from [https://cellxgene.cziscience.com](https://cellxgene.cziscience.com)
+ * The most important reference is `anndata/pbmc3k_processed.h5ad`
+* Code:
+ * [./src/tiledbsc](./src/tiledbsc)
+* Inspecting HDF5 input files
+ * `./tools/desc-ann ./anndata/pbmc3k_processed.h5ad`
+* Ingesting
+ * `./tools/ingestor ./anndata/pbmc3k_processed.h5ad`
+ * Output is in `tiledb-data/pbmc3k_processed`
+ * Cloud-upload test:
+ * `tools/ingestor ./anndata/pbmc3k_processed.h5ad tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/wpv2-test-001`
+* Inspecting TileDB output groups
+ * `./tools/desc-soma ./tiledb-data/pbmc3k_processed`
+
+# Details
+
+## TileDB group structure
+
+Also shown: mental map for class names
+
+```
+TileDB group structure SOMA classes AnnData types
+
+soma: group
+|
++-- X: group AssayMatrixGroup
+| +-- data: array AssayMatrix scipy.sparse.csr_matrix, numpy.ndarray
+|
++-- obs: array AnnotationDataFrame pandas.DataFrame
+|
++-- var: array AnnotationDataFrame pandas.DataFrame
+|
++-- obsm: group AnnotationMatrixGroup dict of:
+| +-- omfoo: array AnnotationMatrix numpy.ndarray, scipy.sparse.csr_matrix
+| +-- ombar: array AnnotationMatrix
+|
++-- varm: group AnnotationMatrixGroup dict of:
+| +-- vmfoo: array AnnotationMatrix numpy.ndarray, scipy.sparse.csr_matrix
+| +-- vmbar: array AnnotationMatrix
+|
++-- obsp: group AnnotationPairwiseMatrixGroup dict of:
+| +-- opfoo: array AnnotationPairwiseMatrix scipy.sparse.csr_matrix, numpy.ndarray
+| +-- opbar: array AnnotationPairwiseMatrix
+|
++-- varp: group AnnotationPairwiseMatrixGroup dict of:
+| +-- vpfoo: array AnnotationPairwiseMatrix scipy.sparse.csr_matrix, numpy.ndarray
+| +-- vpbar: array AnnotationPairwiseMatrix
+|
++-- raw: group RawGroup
+| |
+| +-- X: group AssayMatrixGroup
+| | +-- data: array AssayMatrix scipy.sparse.csr_matrix
+| |
+| +-- var: array AnnotationDataFrame pandas.DataFrame
+| |
+| +-- varm: group AnnotationMatrixGroup
+| | +-- vmfoo: array AnnotationMatrix numpy.ndarray, scipy.sparse.csr_matrix
+| | +-- vmbar: array AnnotationMatrix
+|
++-- uns: group UnsGroup
+ +-- ...: group
+ | +--: array pandas.DataFrame, or
+ | +--: array numpy.ndarray, or
+ | +--: array numpy scalars,
+ | +--: array etc.
+ | +--: group
+ | +--: group
+ | +... etc (nestable)
+ |
+ +-- ...: group
+```
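+
+As a rough illustration of the hierarchy above, the sketch below models a SOMA as a nested Python dict and lists the path of every array in it. The dict is a stand-in for illustration only (no TileDB calls are made), `list_arrays` is a hypothetical helper, and the member names are borrowed from the `pbmc-small` example in the next section.
+
+```python
+# Stand-in for the group structure: dict = group, "array" = array.
+# This is an illustrative model, not how tiledbsc stores anything.
+layout = {
+    "X": {"data": "array"},
+    "obs": "array",
+    "var": "array",
+    "obsm": {"X_pca": "array", "X_tsne": "array"},
+    "raw": {"X": {"data": "array"}, "var": "array"},
+    "uns": {"neighbors": {"params": {"method": "array"}}},
+}
+
+
+def list_arrays(node, prefix=""):
+    """Return the '/'-joined path of every array under a nested group."""
+    paths = []
+    for name, child in node.items():
+        path = f"{prefix}/{name}" if prefix else name
+        if isinstance(child, dict):
+            # A nested group: recurse, carrying the path so far.
+            paths.extend(list_arrays(child, path))
+        else:
+            paths.append(path)
+    return paths
+
+
+for path in list_arrays(layout):
+    print(path)
+```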
+
+## Example data
+
+This serves both as a concrete example of what the data looks like and as a demonstration of some of the SOMA methods.
+
+Look at information about a sample `.h5ad` file:
+
+
+
+```
+$ ./tools/desc-ann anndata/pbmc-small.h5ad
+
+================================================================ anndata/pbmc-small.h5ad
+
+----------------------------------------------------------------
+ANNDATA SUMMARY:
+AnnData object with n_obs × n_vars = 80 × 20
+ obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.8', 'letter.idents', 'groups', 'RNA_snn_res.1'
+ var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
+ uns: 'neighbors'
+ obsm: 'X_pca', 'X_tsne'
+ varm: 'PCs'
+ obsp: 'distances'
+X SHAPE (80, 20)
+OBS LEN 80
+VAR LEN 20
+OBS IS A
+ orig.ident int32
+ nCount_RNA float64
+ nFeature_RNA int32
+ RNA_snn_res.0.8 int32
+ letter.idents int32
+ groups category
+ RNA_snn_res.1 int32
+VAR IS A
+ vst.mean float64
+ vst.variance float64
+ vst.variance.expected float64
+ vst.variance.standardized float64
+ vst.variable int32
+RAW X SHAPE (80, 230)
+OBS KEYS ['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.8', 'letter.idents', 'groups', 'RNA_snn_res.1']
+VAR KEYS ['vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable']
+OBSM KEYS ['X_pca', 'X_tsne']
+VARM KEYS ['PCs']
+OBSP KEYS ['distances']
+VARP KEYS []
+uns/neighbors/params/method
+
+----------------------------------------------------------------
+ANNDATA FILE TYPES:
+X/data
+X/data shape (80, 20)
+X/data dtype float64
+X/raw
+X/raw shape (80, 230)
+X/data dtype float64
+X/raw density 0.2422
+obs
+var
+obsm/X_pca
+obsm/X_tsne
+varm/PCs
+obsp/distances
+uns/neighbors/params/method (1,) object
+```
+
+See also:
+
+```
+h5ls anndata/pbmc-small.h5ad
+h5ls anndata/pbmc3k_processed.h5ad
+h5ls -r anndata/pbmc3k_processed.h5ad
+h5ls -vr anndata/pbmc3k_processed.h5ad
+# etc.
+```
+
+
+
+Read a sample `.h5ad` file and write into a TileDB SOMA object:
+
+```
+$ tools/ingestor anndata/pbmc-small.h5ad tiledb-data/pbmc-small
+```
+
+Look at various fields:
+
+
+
+```
+$ python
+
+>>> import tiledbsc
+
+>>> soma = tiledbsc.SOMA('tiledb-data/pbmc-small')
+>>> arr = soma.X.data.open_array()
+>>> arr.df[:]
+ obs_id var_id value
+0 AAATTCGAATCACG AKR1C3 -0.325888
+1 AAATTCGAATCACG CA2 -0.346938
+... ... ... ...
+1598 TTTAGCTGTACTCT TUBB1 -0.350375
+1599 TTTAGCTGTACTCT VDAC3 -0.524551
+
+[1600 rows x 3 columns]
+-- Note this is a sparse matrix in IJV/COO format
+```
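+
+To make the IJV/COO note concrete: each stored cell becomes one `(obs_id, var_id, value)` row. The following plain-Python sketch shows the idea on a tiny made-up matrix; the labels, values, and the `to_ijv` helper are illustrative assumptions, not the ingestor's actual code.
+
+```python
+# Made-up labels and values, for illustration only.
+obs_ids = ["cell_a", "cell_b"]
+var_ids = ["gene_x", "gene_y", "gene_z"]
+dense = [
+    [0.0, 1.5, 0.0],
+    [2.0, 0.0, 3.5],
+]
+
+
+def to_ijv(matrix, row_labels, col_labels):
+    """Convert a dense row-major matrix to (row, column, value) triples,
+    keeping only the nonzero cells."""
+    triples = []
+    for i, row in enumerate(matrix):
+        for j, value in enumerate(row):
+            if value != 0:
+                triples.append((row_labels[i], col_labels[j], value))
+    return triples
+
+
+for triple in to_ijv(dense, obs_ids, var_ids):
+    print(triple)
+```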
+
+```
+>>> arr = soma.obs.open_array()
+>>> arr.df[:]
+ orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups RNA_snn_res.1
+obs_id
+AAATTCGAATCACG 0 327.0 62 1 1 g2 1
+AAGCAAGAGCTTAG 0 126.0 48 0 0 g1 0
+... ... ... ... ... ... ... ...
+TTGGTACTGAATCC 0 135.0 45 0 0 g1 2
+TTTAGCTGTACTCT 0 462.0 86 1 1 g1 1
+
+[80 rows x 7 columns]
+
+>>> arr = soma.var.open_array()
+>>> arr.df[:]
+ vst.mean vst.variance vst.variance.expected vst.variance.standardized vst.variable
+var_id
+AKR1C3 0.2625 1.132753 0.553424 2.021191 1
+CA2 0.4500 3.263291 1.685451 1.765922 1
+CD1C 0.1750 0.576582 0.271217 2.052014 1
+...
+TREML1 0.3375 1.365665 0.761869 1.792519 1
+TUBB1 0.8875 16.202373 6.352400 1.634371 1
+VDAC3 1.1250 30.971519 8.986513 2.137607 1
+```
+
+```
+>>> soma.obsm._get_member_names()
+['X_tsne', 'X_pca']
+>>> arr = soma.obsm['X_pca'].open_array()
+>>> arr.df[:]
+ obs_id X_pca_1 X_pca_2 ... X_pca_18 X_pca_19
+0 AAATTCGAATCACG -0.599730 0.970809 ... -0.127195 0.026804
+1 AAGCAAGAGCTTAG -0.919219 -2.043828 ... 0.009386 -0.019896
+2 AAGCGACTTTGACG -1.380380 1.284101 ... -0.041855 0.027550
+.. ... ... ... ... ... ...
+78 TTGGTACTGAATCC -1.418764 0.764986 ... -0.064450 0.099118
+79 TTTAGCTGTACTCT -1.447483 1.583223 ... -0.014984 0.033992
+
+[80 rows x 20 columns]
+```
+
+```
+>>> arr = soma.raw.X.data.open_array()
+>>> arr.df[:]
+ obs_id var_id value
+0 AAATTCGAATCACG ADAR 3.452557
+1 AAATTCGAATCACG AIF1 3.452557
+... ... ... ...
+4454 TTTAGCTGTACTCT XBP1 3.119940
+4455 TTTAGCTGTACTCT ZFP36L1 3.119940
+
+[4456 rows x 3 columns]
+```
+
+```
+>>> soma.uns._get_member_names()
+['neighbors']
+>>> soma.uns['neighbors']._get_member_names()
+['params']
+>>> soma.uns['neighbors']['params']._get_member_names()
+['method']
+>>> arr = soma.uns['neighbors']['params']['method'].open_array()
+>>> arr.df[:]
+ __dim_0
+0 0 snn
+```
+
+
+
+## Expected ingestion progress
+
+`./tools/ingestor ./anndata/pbmc3k_processed.h5ad ./tiledb-data/pbmc3k_processed`
+
+
+
+```
+START SOMA.from_h5ad ./anndata/pbmc3k_processed.h5ad -> ./tiledb-data/pbmc3k_processed
+START READING ./anndata/pbmc3k_processed.h5ad
+FINISH READING ./anndata/pbmc3k_processed.h5ad TIME 0.227
+START DECATEGORICALIZING
+FINISH DECATEGORICALIZING TIME 0.006
+START WRITING ./tiledb-data/pbmc3k_processed
+Creating TileDB group ./tiledb-data/pbmc3k_processed
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/X
+ START WRITING ./tiledb-data/pbmc3k_processed/X/data
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/X/data TIME 9.262
+ START WRITING ./tiledb-data/pbmc3k_processed/obs
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obs TIME 0.085
+ START WRITING ./tiledb-data/pbmc3k_processed/var
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/var TIME 0.024
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/obsm
+ START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_draw_graph_fr
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_draw_graph_fr TIME 0.053
+ START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_pca
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_pca TIME 0.247
+ START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_tsne
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_tsne TIME 0.059
+ START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_umap
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_umap TIME 0.058
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/varm
+ START WRITING ./tiledb-data/pbmc3k_processed/varm/PCs
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/varm/PCs TIME 0.175
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/obsp
+ START WRITING ./tiledb-data/pbmc3k_processed/obsp/connectivities
+ START __ingest_coo_data_string_dims_rows_chunked
+ START chunk rows 0..2638 of 2638, obs_ids AAACATACAACCAC-1..TTTGCATGCCTCAC-1, nnz=42406, 100.000%
+ FINISH chunk TIME 0.318
+ FINISH __ingest_coo_data_string_dims_rows_chunked TIME 0.825
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obsp/connectivities TIME 0.831
+ START WRITING ./tiledb-data/pbmc3k_processed/obsp/distances
+ START __ingest_coo_data_string_dims_rows_chunked
+ START chunk rows 0..2638 of 2638, obs_ids AAACATACAACCAC-1..TTTGCATGCCTCAC-1, nnz=23742, 100.000%
+ FINISH chunk TIME 0.149
+ FINISH __ingest_coo_data_string_dims_rows_chunked TIME 0.637
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/obsp/distances TIME 0.647
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/varp
+ START WRITING ./tiledb-data/pbmc3k_processed/raw
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/raw
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/raw/X
+ START WRITING ./tiledb-data/pbmc3k_processed/raw/X/data
+ START __ingest_coo_data_string_dims_rows_chunked
+ START chunk rows 0..2638 of 2638, obs_ids AAACATACAACCAC-1..TTTGCATGCCTCAC-1, nnz=2238732, 100.000%
+ FINISH chunk TIME 5.736
+ FINISH __ingest_coo_data_string_dims_rows_chunked TIME 6.301
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/raw/X/data TIME 6.338
+ START WRITING ./tiledb-data/pbmc3k_processed/raw/var
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/raw/var TIME 0.056
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/raw/varm
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/raw TIME 6.400
+ START WRITING ./tiledb-data/pbmc3k_processed/uns
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/draw_graph
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph/params
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/draw_graph/params
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/layout
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/layout TIME 0.018
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/random_state
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/random_state TIME 0.016
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph/params TIME 0.036
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph TIME 0.038
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/louvain
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/louvain
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/louvain/params
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/louvain/params
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/random_state
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/random_state TIME 0.017
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/resolution
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/resolution TIME 0.017
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/louvain/params TIME 0.036
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/louvain TIME 0.039
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain_colors
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain_colors TIME 0.018
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/neighbors
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors/params
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/neighbors/params
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/method
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/method TIME 0.017
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/n_neighbors
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/n_neighbors TIME 0.017
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors/params TIME 0.036
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors TIME 0.038
+ START WRITING ./tiledb-data/pbmc3k_processed/uns/pca
+ Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/pca
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance TIME 0.016
+ START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance_ratio
+ FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance_ratio TIME 0.016
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/pca TIME 0.034
+ Skipping structured array: ./tiledb-data/pbmc3k_processed/uns/rank_genes_groups
+ FINISH WRITING ./tiledb-data/pbmc3k_processed/uns TIME 0.171
+FINISH WRITING ./tiledb-data/pbmc3k_processed TIME 18.037
+FINISH SOMA.from_h5ad ./anndata/pbmc3k_processed.h5ad -> ./tiledb-data/pbmc3k_processed TIME 18.271
+```
+
+
+
+## Expected output format
+
+`./tools/desc-soma ./tiledb-data/pbmc3k_processed`
+
+
+
+```
+----------------------------------------------------------------
+Array: ./tiledb-data/pbmc3k_processed/X/data
+ArraySchema(
+ domain=Domain(*[
+ Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([RleFilter(), ])),
+ Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=22), ])),
+ ]),
+ attrs=[
+ Attr(name='value', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
+ ],
+ cell_order='row-major',
+ tile_order='row-major',
+ capacity=100000,
+ sparse=True,
+ allows_duplicates=True,
+)
+
+----------------------------------------------------------------
+Array: ./tiledb-data/pbmc3k_processed/obs
+ArraySchema(
+ domain=Domain(*[
+ Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
+ ]),
+ attrs=[
+ Attr(name='n_genes', dtype='int64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
+ Attr(name='percent_mito', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
+ Attr(name='n_counts', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
+ Attr(name='louvain', dtype='
+```
+
+See also:
+
+```
+import tiledb
+print(tiledb.group.Group('tiledb-data/pbmc3k_processed'))
+```
+
+## Diversity of formats in HDF5 files
+
+Due to file-size restrictions on GitHub, not all of the following files are cached in this repo.
+They are a selection from
+[https://cellxgene.cziscience.com](https://cellxgene.cziscience.com), plus some
+raw-sensor data (`subset_100_100`):
+
+![anndata-filetypes](./images/isa.png)
+
+# Notes
+
+* `os.path.join` is used here but may not be appropriate if this package is run on Windows:
+ * `/` has been accepted in Windows paths for some years now
+ * `\` is not accepted for forming URIs
+ * So, perhaps safer would be to always join on `/` regardless of platform.
+* For PRs: from within `apis/python`, run `black .` (or `black . --check` to preview, if you prefer), since this format check is run in CI.
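+
+The slash-joining suggestion in the first note can be sketched as follows. `join_uri` is a hypothetical helper name (it is not part of `tiledbsc`), and the `s3://bucket/...` URI is a placeholder.
+
+```python
+def join_uri(base: str, *parts: str) -> str:
+    """Join path or URI components with '/' on every platform.
+
+    Unlike os.path.join, this never emits a backslash, so the result
+    is usable both as a local path and as a URI.
+    """
+    pieces = [base.rstrip("/")]
+    # Skip empty components and trim stray slashes to avoid '//'.
+    pieces.extend(p.strip("/") for p in parts if p)
+    return "/".join(pieces)
+
+
+print(join_uri("tiledb-data/pbmc3k_processed", "X", "data"))
+print(join_uri("s3://bucket/soma/", "obs"))
+```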
diff --git a/apis/python/README.md b/apis/python/README.md
index 0fb7373d67..65e04df15a 100644
--- a/apis/python/README.md
+++ b/apis/python/README.md
@@ -1,19 +1,6 @@
-This is test code for reading ANN data and writing into a TileDB nested group structure.
-
-# TL;DR
-
-```
-./tools/desc-ann ./anndata/pbmc3k_processed.h5ad
-
-./tools/ingestor ./anndata/pbmc3k_processed.h5ad ./tiledb-data/pbmc3k_processed
-
-./tools/desc-soma ./tiledb-data/pbmc3k_processed
-```
-
# Installation
-This requires [`tiledb`](https://github.com/TileDB-Inc/TileDB-Py) 0.14.1 or above, in addition to other dependencies
-in [setup.cfg](./setup.cfg).
+This requires [`tiledb`](https://github.com/TileDB-Inc/TileDB-Py) (see [setup.cfg](./setup.cfg) for the required version), in addition to the other dependencies listed there.
After `cd` to `apis/python`:
@@ -35,702 +22,6 @@ Then:
python -m pytest tests
```
-# Overview
-
-* Sample data:
- * [anndata](./anndata) contains some files from [https://cellxgene.cziscience.com](https://cellxgene.cziscience.com)
- * The most important reference is `anndata/pbmc3k_processed.h5ad`
-* Code:
- * [./src/tiledbsc](./src/tiledbsc)
-* Inspecting HDF5 input files
- * `./tools/desc-ann ./anndata/pbmc3k_processed.h5ad`
-* Ingesting
- * `./tools/ingestor ./anndata/pbmc3k_processed.h5ad`
- * Output is in `tiledb-data/pbmc3k_processed`
- * Cloud-upload test:
- * `tools/ingestor ./anndata/pbmc3k_processed.h5ad tiledb://johnkerl-tiledb/s3://tiledb-johnkerl/wpv2-test-001`
-* Inspecting TileDB output groups
- * `./tools/desc-soma ./tiledb-data/pbmc3k_processed`
-
# Status
Please see [https://github.com/single-cell-data/TileDB-SingleCell/issues](https://github.com/single-cell-data/TileDB-SingleCell/issues).
-
-# Details
-
-## TileDB group structure
-
-Also shown: mental map for class names
-
-```
-TileDB group structure SOMA classes AnnData types
-
-soma: group
-|
-+-- X: group AssayMatrixGroup
-| +-- data: array AssayMatrix scipy.sparse.csr_matrix, numpy.ndarray
-|
-+-- obs: array AnnotationDataFrame pandas.DataFrame
-|
-+-- var: array AnnotationDataFrame pandas.DataFrame
-|
-+-- obsm: group AnnotationMatrixGroup dict of:
-| +-- omfoo: array AnnotationMatrix numpy.ndarray, scipy.sparse.csr_matrix
-| +-- ombar: array AnnotationMatrix
-|
-+-- varm: group AnnotationMatrixGroup dict of:
-| +-- vmfoo: array AnnotationMatrix numpy.ndarray, scipy.sparse.csr_matrix
-| +-- vmbar: array AnnotationMatrix
-|
-+-- obsp: group AnnotationPairwiseMatrixGroup dict of:
-| +-- opfoo: array AnnotationPairwiseMatrix scipy.sparse.csr_matrix, numpy.ndarray
-| +-- opbar: array AnnotationPairwiseMatrix
-|
-+-- varp: group AnnotationPairwiseMatrixGroup dict of:
-| +-- vpfoo: array AnnotationPairwiseMatrix scipy.sparse.csr_matrix, numpy.ndarray
-| +-- vpbar: array AnnotationPairwiseMatrix
-|
-+-- raw: group RawGroup
-| |
-| +-- X: group AssayMatrixGroup
-| | +-- data: array AssayMatrix scipy.sparse.csr_matrix
-| |
-| +-- var: array AnnotationDataFrame pandas.DataFrame
-| |
-| +-- varm: group AnnotationMatrixGroup
-| | +-- vmfoo: array AnnotationMatrix numpy.ndarray, scipy.sparse.csr_matrix
-| | +-- vmbar: array AnnotationMatrix
-|
-+-- raw: group UnsGroup
- +-- ...: group
- | +--: array pandas.DataFrame, or
- | +--: array numpy.ndarray, or
- | +--: array numpy scalars,
- | +--: array etc.
- | +--: group
- | +--: group
- | +... etc (nestable)
- |
- +-- ...: group
-```
-
-## Example data
-
-This serves both as a concrete example of what the data looks like, as well as some of the soma methods.
-
-Look at information about a sample `.h5ad` file:
-
-
-
-```
-$ ./tools/desc-ann anndata/pbmc-small.h5ad
-
-================================================================ anndata/pbmc-small.h5ad
-
-----------------------------------------------------------------
-ANNDATA SUMMARY:
-AnnData object with n_obs × n_vars = 80 × 20
- obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.8', 'letter.idents', 'groups', 'RNA_snn_res.1'
- var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
- uns: 'neighbors'
- obsm: 'X_pca', 'X_tsne'
- varm: 'PCs'
- obsp: 'distances'
-X SHAPE (80, 20)
-OBS LEN 80
-VAR LEN 20
-OBS IS A
- orig.ident int32
- nCount_RNA float64
- nFeature_RNA int32
- RNA_snn_res.0.8 int32
- letter.idents int32
- groups category
- RNA_snn_res.1 int32
-VAR IS A
- vst.mean float64
- vst.variance float64
- vst.variance.expected float64
- vst.variance.standardized float64
- vst.variable int32
-RAW X SHAPE (80, 230)
-OBS KEYS ['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.8', 'letter.idents', 'groups', 'RNA_snn_res.1']
-VAR KEYS ['vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable']
-OBSM KEYS ['X_pca', 'X_tsne']
-VARM KEYS ['PCs']
-OBSP KEYS ['distances']
-VARP KEYS []
-uns/neighbors/params/method
-
-----------------------------------------------------------------
-ANNDATA FILE TYPES:
-X/data
-X/data shape (80, 20)
-X/data dtype float64
-X/raw
-X/raw shape (80, 230)
-X/data dtype float64
-X/raw density 0.2422
-obs
-var
-obsm/X_pca
-obsm/X_tsne
-varm/PCs
-obsp/distances
-uns/neighbors/params/method (1,) object
-```
-
-See also:
-
-```
-h5ls anndata/pmbc-small.h5ad
-h5ls anndata/pbmc3k_processed.h5ad
-h5ls -r anndata/pbmc3k_processed.h5ad
-h5ls -vr anndata/pbmc3k_processed.h5ad
-# etc.
-```
-
-
-
-Read a sample `.h5ad` file and write into a TileDB SOMA object:
-
-```
-$ tools/ingestor anndata/pbmc-small.h5ad tiledb-data/pbmc-small
-```
-
-Look at various fields:
-
-
-
-```
-$ python
-
->>> import tiledbsc
-
->>> soma = tiledbsc.SOMA('tiledb-data/pbmc-small')
->>> arr = soma.X.data.open_array()
->>> arr.df[:]
- obs_id var_id value
-0 AAATTCGAATCACG AKR1C3 -0.325888
-1 AAATTCGAATCACG CA2 -0.346938
-... ... ... ...
-1598 TTTAGCTGTACTCT TUBB1 -0.350375
-1599 TTTAGCTGTACTCT VDAC3 -0.524551
-
-[1600 rows x 3 columns]
--- Note this is a sparse matrix in IJV/COO format
-```
-
-```
->>> arr = soma.obs.open_array()
->>> arr.df[:]
- orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups RNA_snn_res.1
-obs_id
-AAATTCGAATCACG 0 327.0 62 1 1 g2 1
-AAGCAAGAGCTTAG 0 126.0 48 0 0 g1 0
-... ... ... ... ... ... ... ...
-TTGGTACTGAATCC 0 135.0 45 0 0 g1 2
-TTTAGCTGTACTCT 0 462.0 86 1 1 g1 1
-
-[80 rows x 7 columns]
-
->>> arr = soma.var.open_array()
->>> arr.df[:]
- vst.mean vst.variance vst.variance.expected vst.variance.standardized vst.variable
-var_id
-AKR1C3 0.2625 1.132753 0.553424 2.021191 1
-CA2 0.4500 3.263291 1.685451 1.765922 1
-CD1C 0.1750 0.576582 0.271217 2.052014 1
-...
-TREML1 0.3375 1.365665 0.761869 1.792519 1
-TUBB1 0.8875 16.202373 6.352400 1.634371 1
-VDAC3 1.1250 30.971519 8.986513 2.137607 1
-```
-
-```
->>> soma.obsm._get_member_names()
-['X_tsne', 'X_pca']
->>> arr = soma.obsm['X_pca'].open_array()
->>> arr.df[:]
- obs_id X_pca_1 X_pca_2 ... X_pca_18 X_pca_19
-0 AAATTCGAATCACG -0.599730 0.970809 ... -0.127195 0.026804
-1 AAGCAAGAGCTTAG -0.919219 -2.043828 ... 0.009386 -0.019896
-2 AAGCGACTTTGACG -1.380380 1.284101 ... -0.041855 0.027550
-.. ... ... ... ... ... ...
-78 TTGGTACTGAATCC -1.418764 0.764986 ... -0.064450 0.099118
-79 TTTAGCTGTACTCT -1.447483 1.583223 ... -0.014984 0.033992
-
-[80 rows x 20 columns]
-```
-
-```
->>> arr = soma.raw.X.data.open_array()
->>> arr.df[:]
- obs_id var_id value
-0 AAATTCGAATCACG ADAR 3.452557
-1 AAATTCGAATCACG AIF1 3.452557
-... ... ... ...
-4454 TTTAGCTGTACTCT XBP1 3.119940
-4455 TTTAGCTGTACTCT ZFP36L1 3.119940
-
-[4456 rows x 3 columns]
-```
-
-```
->>> soma.uns._get_member_names()
-['neighbors']
->>> soma.uns['neighbors']._get_member_names()
-['params']
->>> soma.uns['neighbors']['params']._get_member_names()
-['method']
->>> arr = soma.uns['neighbors']['params']['method'].open_array()
->>> arr.df[:]
- __dim_0
-0 0 snn
-```
-
-
-
-## Expected ingestion progress
-
-`./tools/ingestor ./anndata/pbmc3k_processed.h5ad ./tiledb-data/pbmc3k_processed`
-
-
-
-```
-START SOMA.from_h5ad ./anndata/pbmc3k_processed.h5ad -> ./tiledb-data/pbmc3k_processed
-START READING ./anndata/pbmc3k_processed.h5ad
-FINISH READING ./anndata/pbmc3k_processed.h5ad TIME 0.227
-START DECATEGORICALIZING
-FINISH DECATEGORICALIZING TIME 0.006
-START WRITING ./tiledb-data/pbmc3k_processed
-Creating TileDB group ./tiledb-data/pbmc3k_processed
- Creating TileDB group ./tiledb-data/pbmc3k_processed/X
- START WRITING ./tiledb-data/pbmc3k_processed/X/data
- FINISH WRITING ./tiledb-data/pbmc3k_processed/X/data TIME 9.262
- START WRITING ./tiledb-data/pbmc3k_processed/obs
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obs TIME 0.085
- START WRITING ./tiledb-data/pbmc3k_processed/var
- FINISH WRITING ./tiledb-data/pbmc3k_processed/var TIME 0.024
- Creating TileDB group ./tiledb-data/pbmc3k_processed/obsm
- START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_draw_graph_fr
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_draw_graph_fr TIME 0.053
- START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_pca
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_pca TIME 0.247
- START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_tsne
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_tsne TIME 0.059
- START WRITING ./tiledb-data/pbmc3k_processed/obsm/X_umap
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obsm/X_umap TIME 0.058
- Creating TileDB group ./tiledb-data/pbmc3k_processed/varm
- START WRITING ./tiledb-data/pbmc3k_processed/varm/PCs
- FINISH WRITING ./tiledb-data/pbmc3k_processed/varm/PCs TIME 0.175
- Creating TileDB group ./tiledb-data/pbmc3k_processed/obsp
- START WRITING ./tiledb-data/pbmc3k_processed/obsp/connectivities
- START __ingest_coo_data_string_dims_rows_chunked
- START chunk rows 0..2638 of 2638, obs_ids AAACATACAACCAC-1..TTTGCATGCCTCAC-1, nnz=42406, 100.000%
- FINISH chunk TIME 0.318
- FINISH __ingest_coo_data_string_dims_rows_chunked TIME 0.825
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obsp/connectivities TIME 0.831
- START WRITING ./tiledb-data/pbmc3k_processed/obsp/distances
- START __ingest_coo_data_string_dims_rows_chunked
- START chunk rows 0..2638 of 2638, obs_ids AAACATACAACCAC-1..TTTGCATGCCTCAC-1, nnz=23742, 100.000%
- FINISH chunk TIME 0.149
- FINISH __ingest_coo_data_string_dims_rows_chunked TIME 0.637
- FINISH WRITING ./tiledb-data/pbmc3k_processed/obsp/distances TIME 0.647
- Creating TileDB group ./tiledb-data/pbmc3k_processed/varp
- START WRITING ./tiledb-data/pbmc3k_processed/raw
- Creating TileDB group ./tiledb-data/pbmc3k_processed/raw
- Creating TileDB group ./tiledb-data/pbmc3k_processed/raw/X
- START WRITING ./tiledb-data/pbmc3k_processed/raw/X/data
- START __ingest_coo_data_string_dims_rows_chunked
- START chunk rows 0..2638 of 2638, obs_ids AAACATACAACCAC-1..TTTGCATGCCTCAC-1, nnz=2238732, 100.000%
- FINISH chunk TIME 5.736
- FINISH __ingest_coo_data_string_dims_rows_chunked TIME 6.301
- FINISH WRITING ./tiledb-data/pbmc3k_processed/raw/X/data TIME 6.338
- START WRITING ./tiledb-data/pbmc3k_processed/raw/var
- FINISH WRITING ./tiledb-data/pbmc3k_processed/raw/var TIME 0.056
- Creating TileDB group ./tiledb-data/pbmc3k_processed/raw/varm
- FINISH WRITING ./tiledb-data/pbmc3k_processed/raw TIME 6.400
- START WRITING ./tiledb-data/pbmc3k_processed/uns
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns
- START WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/draw_graph
- START WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph/params
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/draw_graph/params
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/layout
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/layout TIME 0.018
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/random_state
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/draw_graph/params/random_state TIME 0.016
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph/params TIME 0.036
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/draw_graph TIME 0.038
- START WRITING ./tiledb-data/pbmc3k_processed/uns/louvain
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/louvain
- START WRITING ./tiledb-data/pbmc3k_processed/uns/louvain/params
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/louvain/params
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/random_state
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/random_state TIME 0.017
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/resolution
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain/params/resolution TIME 0.017
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/louvain/params TIME 0.036
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/louvain TIME 0.039
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain_colors
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/louvain_colors TIME 0.018
- START WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/neighbors
- START WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors/params
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/neighbors/params
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/method
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/method TIME 0.017
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/n_neighbors
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/neighbors/params/n_neighbors TIME 0.017
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors/params TIME 0.036
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/neighbors TIME 0.038
- START WRITING ./tiledb-data/pbmc3k_processed/uns/pca
- Creating TileDB group ./tiledb-data/pbmc3k_processed/uns/pca
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance TIME 0.016
- START WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance_ratio
- FINISH WRITING FROM NUMPY.NDARRAY ./tiledb-data/pbmc3k_processed/uns/pca/variance_ratio TIME 0.016
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns/pca TIME 0.034
- Skipping structured array: ./tiledb-data/pbmc3k_processed/uns/rank_genes_groups
- FINISH WRITING ./tiledb-data/pbmc3k_processed/uns TIME 0.171
-FINISH WRITING ./tiledb-data/pbmc3k_processed TIME 18.037
-FINISH SOMA.from_h5ad ./anndata/pbmc3k_processed.h5ad -> ./tiledb-data/pbmc3k_processed TIME 18.271
-```
-
-
-
-## Expected output format
-
-`./tools/desc-soma ./tiledb-data/pbmc3k_processed`
-
-
-
-```
-----------------------------------------------------------------
-Array: ./tiledb-data/pbmc3k_processed/X/data
-ArraySchema(
- domain=Domain(*[
- Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([RleFilter(), ])),
- Dim(name='var_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=22), ])),
- ]),
- attrs=[
- Attr(name='value', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
- ],
- cell_order='row-major',
- tile_order='row-major',
- capacity=100000,
- sparse=True,
- allows_duplicates=True,
-)
-
-----------------------------------------------------------------
-Array: ./tiledb-data/pbmc3k_processed/obs
-ArraySchema(
- domain=Domain(*[
- Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
- ]),
- attrs=[
- Attr(name='n_genes', dtype='int64', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
- Attr(name='percent_mito', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
- Attr(name='n_counts', dtype='float32', var=False, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
- Attr(name='louvain', dtype='
-
-See also:
-
-```
-import tiledb
-print(tiledb.group.Group('tiledb-data/pbmc3k_processed'))
-```
-
-## Diversity of formats in HDF5 files
-
-Due to file-size restrictions on GitHub, not all the following are cached in this repo.
-Nonetheless this is a selection from
-[https://cellxgene.cziscience.com](https://cellxgene.cziscience.com), as well as some
-raw-sensor data (`subset_100_100`):
-
-![anndata-filetypes](./images/isa.png)
-
-# Notes
-
-* `os.path.join` is used here but may not be appropriate if this package is run on Windows:
- * `/` has been accepted in Windows paths for some years now
- * `\` is not accepted for forming URIs
- * So, perhaps safer would be to always join on `/` regardless of platform.
-* For PRs: from within `apis/python`, run `black .` (or `black . --check to preview if you prefer) as this format-checker is run in CI.