
Initial web-doc experiment (#136)
johnkerl authored May 31, 2022
1 parent bbc3022 commit a367a8c
Showing 40 changed files with 3,898 additions and 0 deletions.
20 changes: 20 additions & 0 deletions README-docs.md
@@ -0,0 +1,20 @@
# Status

Temporary and experimental.

# Rationale

* R docs (currently at [https://github.com/TileDB-Inc/tiledbsc](https://github.com/TileDB-Inc/tiledbsc)) have a wonderful combination of API docs (generated from in-source-code doc-blocks) as well as hand-written long-form "vignette" material.
* For Python RST-style docs, I am not yet aware of a nice way to do that -- other than what's presented here.
* Tools like Sphinx and readthedocs are suitable for mapping a _single repo's single-language code-docs_ into a _single doc URL_. However, for this repo, we have Python API docs, Python examples/vignettes, and -- soon -- R docs as well. We wish to publish a _multi-lingual, multi-content doc tree_.

# Flow

* Source is in-source-code doc-blocks within `apis/python/src/tiledbsc`, and hand-written long-form "vignette" material in `apis/python/examples`.
* The former are mapped to `.md` (intentionally not `.rst`) via [apis/python/mkmd.sh](apis/python/mkmd.sh). This requires `pydoc-markdown` already installed locally. (Nothing here in this initial experiment is CI-enabled at this point.)
* Then [Quarto](https://quarto.org) is used to map `.md` to `.html` via [_quarto.yml](_quarto.yml).
* `quarto preview` for local preview.
* `quarto render` to write static HTML into `docs/` which can then be published.
* This `docs/` directory is artifacts-only and doesn't need to be committed to source control.
* Then this is synced to an AWS bucket which is used to serve static HTML content: [https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html](https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html).
* [https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteAccessPermissionsReqd.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteAccessPermissionsReqd.html)
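
The flow above, condensed into a single hypothetical driver script (assumes `pydoc-markdown`, `quarto`, and the AWS CLI are installed locally; the bucket path is illustrative):

```python
# Hypothetical driver for the flow above; nothing here is CI-enabled yet.
import subprocess

def build_and_publish_docs(publish: bool = False) -> None:
    # Map in-source doc-blocks to .md (wraps pydoc-markdown).
    subprocess.run(["bash", "apis/python/mkmd.sh"], check=True)
    # Map .md to static HTML in docs/ per _quarto.yml.
    subprocess.run(["quarto", "render"], check=True)
    if publish:
        # Sync the rendered artifacts to the static-hosting bucket (illustrative path).
        subprocess.run(
            ["aws", "s3", "sync", "docs/", "s3://tiledb-singlecell-docs/docs/"],
            check=True,
        )

if __name__ == "__main__":
    build_and_publish_docs()
```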
110 changes: 110 additions & 0 deletions _quarto.yml
@@ -0,0 +1,110 @@
project:
type: website
output-dir: docs

format:
html:
toc: true
theme:
light: [flatly, "quarto-materials/tiledb.scss"]
# TODO: Inter font needs custom font-install for CI
#mainfont: Inter
mainfont: Helvetica
fontsize: 1rem
linkcolor: "#4d9fff"
code-copy: true
code-overflow: wrap
css: "quarto-materials/tiledb.css"

website:
favicon: "images/favicon.ico"
site-url: https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html
repo-url: https://github.com/single-cell-data/TileDB-SingleCell
# We may want one or both of these, or neither:
repo-actions: [edit, issue]
page-navigation: true
navbar:
background: light
logo: "quarto-materials/tiledb-logo.png"
collapse-below: lg
left:
- text: "Home page"
href: "https://tiledb.com"
- text: "Login"
href: "https://cloud.tiledb.com/auth/login"
- text: "Contact us"
href: "https://tiledb.com/contact"
- text: "Repo"
href: "https://github.com/single-cell-data/TileDB-SingleCell"

sidebar:
- style: "floating"
collapse-level: 2
align: left
contents:
- href: "overview.md"
text: "Overview"

- text: "R examples and API"
href: "https://tiledb-inc.github.io/tiledbsc"

- section: "Python"
contents:

- section: "Python examples"
contents:
- href: "apis/python/examples/obtaining-data-files.md"
text: "Obtaining data files"
- href: "apis/python/examples/ingesting-data-files.md"
text: "Ingesting data files"
- href: "apis/python/examples/anndata-and-tiledb.md"
text: "Comparing AnnData and TileDB files"
- href: "apis/python/examples/inspecting-schema.md"
text: "Inspecting SOMA schemas"
- href: "apis/python/examples/soma-collection-reconnaissance.md"
text: "SOMA-collection reconnaissance"

- section: "Python API"
contents:
- href: "apis/python/doc/overview.md"

- href: "apis/python/doc/soma_collection.md"
text: "SOMACollection"
- href: "apis/python/doc/soma.md"
text: "SOMA"

- href: "apis/python/doc/soma_options.md"
text: "SOMAOptions"

- href: "apis/python/doc/assay_matrix_group.md"
text: "AssayMatrixGroup"
- href: "apis/python/doc/assay_matrix.md"
text: "AssayMatrix"
- href: "apis/python/doc/annotation_dataframe.md"
text: "AnnotationDataFrame"
- href: "apis/python/doc/annotation_matrix_group.md"
text: "AnnotationMatrixGroup"
- href: "apis/python/doc/annotation_matrix.md"
text: "AnnotationMatrix"
- href: "apis/python/doc/annotation_pairwise_matrix_group.md"
text: "AnnotationPairwiseMatrixGroup"
- href: "apis/python/doc/raw_group.md"
text: "RawGroup"
- href: "apis/python/doc/uns_group.md"
text: "UnsGroup"
- href: "apis/python/doc/uns_array.md"
text: "UnsArray"

- href: "apis/python/doc/tiledb_array.md"
text: "TileDBArray"
- href: "apis/python/doc/tiledb_group.md"
text: "TileDBGroup"
- href: "apis/python/doc/tiledb_object.md"
text: "TileDBObject"

- href: "apis/python/doc/util.md"
text: "tiledbsc.util"
- href: "apis/python/doc/util_ann.md"
text: "tiledbsc.util_ann"
- href: "apis/python/doc/util_tiledb.md"
text: "tiledbsc.util_tiledb"
227 changes: 227 additions & 0 deletions apis/python/doc/README-csr-ingest.md
@@ -0,0 +1,227 @@
# Overview

As of 2022-05-03 we offer a chunked ingestor for `X` data within larger data files. This enables
ingestion of larger data files within the RAM limits of the available hardware. It entails a
read-performance penalty under certain query modes, as this note will articulate.

# What is and isn't streamed/chunked

Input `.h5ad` files are read into memory using `anndata.read_h5ad` -- if the input file is 5GB, say,
about that much RAM will be needed. We don't have a way to stream the contents of the `.h5ad` file
itself -- it's read all at once.

Often `X` data is in CSR format; occasionally CSC, or dense (`numpy.ndarray`). Suppose for the rest
of this note that we're looking at CSR `X` data -- a similar analysis will hold (_mutatis mutandis_)
for CSC data.
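
A minimal sketch of that first step (not the repo's ingest code), reading an `.h5ad` entirely into memory and checking which format `X` arrived in:

```python
# Read the whole .h5ad into RAM and report the in-memory format of X.
import anndata
import scipy.sparse

ann = anndata.read_h5ad("input.h5ad")  # entire file is materialized in memory
x = ann.X
if scipy.sparse.isspmatrix_csr(x):
    kind = "CSR"
elif scipy.sparse.isspmatrix_csc(x):
    kind = "CSC"
else:
    kind = "dense (numpy.ndarray)"
print(f"X is {kind}, shape={x.shape}")
```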

Given CSR `X` data, we find that an all-at-once `x.toarray()` can involve a huge explosion of memory
requirement if the input data is sparse, since every zero cell is materialized: the expansion factor
from on-disk `X` size to the in-memory densified array is large and varies with sparsity. For this
reason we don't do that; we use `x.tocoo()` instead.

Given CSR `X` data, we find that an all-at-once `x.tocoo()` involves about a 2x or 2.5x expansion in
RSS as revealed by `htop`. CSR data on disk (and in RAM) stores each row's values and column indices
contiguously, with only per-row offsets kept alongside; COO data is a list of `(i,j,v)` tuples with
the `i` and `j` written out individually for every cell, which of course takes more memory. In
summary, an all-at-once `.tocoo()` still increases memory from on-disk size to the in-memory
COO-ified form, but with a lower (and more predictable) multiplication factor than densification.

The alternative discussed here applies to `.h5ad` data files which are small enough to read into
RAM, but for which the 2.5x-or-so inflation from CSR to COO results in a COO matrix which is too big
for RAM.
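
For intuition on these expansion factors, here is a small, illustrative sketch using synthetic data. It counts only the raw array bytes (RSS as seen by `htop` will be higher), and the exact factors depend on dtype and index width:

```python
# Rough memory accounting for dense vs. CSR vs. COO representations of the
# same sparse matrix (synthetic example; factors vary with dtype/index width).
import numpy as np
import scipy.sparse

x_csr = scipy.sparse.random(10_000, 2_000, density=0.05, format="csr", dtype=np.float32)
x_coo = x_csr.tocoo()

csr_bytes = x_csr.data.nbytes + x_csr.indices.nbytes + x_csr.indptr.nbytes
coo_bytes = x_coo.data.nbytes + x_coo.row.nbytes + x_coo.col.nbytes
dense_bytes = x_csr.shape[0] * x_csr.shape[1] * x_csr.dtype.itemsize

print(f"CSR:   {csr_bytes:>12,} bytes")
print(f"COO:   {coo_bytes:>12,} bytes ({coo_bytes / csr_bytes:.1f}x CSR)")
print(f"dense: {dense_bytes:>12,} bytes ({dense_bytes / csr_bytes:.1f}x CSR)")
```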

# Sketch, and relevant features of TileDB storage

What we will do is take chunks of the CSR -- a few rows at a time -- and convert each CSR submatrix
to COO, writing each "chunk" as a TileDB fragment. This way the 2.5x memory expansion is paid only
from CSR submatrix to COO submatrix, and we can lower the memory footprint needed for the ingestion
into TileDB.
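
A condensed, illustrative sketch of that loop (not the repo's implementation; it assumes a sparse TileDB array with string `obs_id`/`var_id` dims already exists at `uri`, and `chunk_size` is a placeholder):

```python
# Illustrative row-chunked CSR ingestion; assumes a sparse TileDB array with
# string dims obs_id/var_id already exists at `uri`.
import numpy as np
import tiledb

def ingest_x_chunked(x_csr, obs_names, var_names, uri, chunk_size=1000):
    obs_names = np.asarray(obs_names)
    var_names = np.asarray(var_names)
    with tiledb.open(uri, mode="w") as A:
        for start in range(0, x_csr.shape[0], chunk_size):
            stop = min(start + chunk_size, x_csr.shape[0])
            # Only this row block is expanded from CSR to COO, so the
            # ~2.5x expansion is paid per chunk, not for the whole matrix.
            chunk = x_csr[start:stop, :].tocoo()
            # Each write lands as its own TileDB fragment.
            A[obs_names[start:stop][chunk.row], var_names[chunk.col]] = chunk.data
```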

Some facts about this:

* In the `.h5ad` we have `obs`/`var` names mapping from string to int, and integer-indexed sparse/dense `X` matrices.
* In TileDB, by contrast, we have the `obs`/`var` names being _themselves_ string indices into sparse `X` matrices. (A schema sketch appears at the end of this section.)
* TileDB storage orders its dims. That means that if you have an input matrix as on the left, with `obs_id=A,B,C,D` and `var_id=S,T,U,V`, then it will be stored as on the right:

```
Input CSR                TileDB storage
---------                -------------- all one fragment
  T V S U                  S T U V
C 1 2 . .                A 4 . . 3
A . 3 4 .                B 5 . 6 .
B . . 5 6                C . 1 . 2
D 7 . 8 .                D 8 7 . .
```

* TileDB storage is 3-level: _fragments_ (corresponding to different timestamped writes); _tiles_; and _cells_.
* Fragments and tiles both have MBRs. For this example (suppose for the moment that it's written all at once in a single fragment) the fragment MBR is `A..D` in the `obs_id` dimension and `S..V` in the `var_id` dimension.
* Query modes: we expect queries by `obs_id,var_id` pairs, or by `obs_id`, or by `var_id`. Given the above representation, since tiles within the fragment use ordered `obs_id` and `var_id`, all three query modes will be efficient:
  * there's one fragment
  * Queries on `obs_id,var_id` will locate only one tile within the fragment
  * Queries on `obs_id` will locate one row of tiles within the fragment
  * Queries on `var_id` will locate one column of tiles within the fragment

```
TileDB storage
-------------- all one fragment
  S T : U V
A 4 . : . 3
B 5 . : 6 .
......:...... tile boundary
C . 1 : . 2
D 8 7 : . .
```
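
For concreteness, here is a minimal sketch (assuming `tiledb`-py; the dim/attribute names are illustrative, not necessarily the repo's schema) of a sparse `X` array keyed by string `obs_id`/`var_id` dimensions:

```python
# Minimal sketch of a sparse X array keyed by string obs_id/var_id dims
# (assumes tiledb-py; names are illustrative, not the repo's schema).
import numpy as np
import tiledb

def create_x_schema(uri: str) -> None:
    dom = tiledb.Domain(
        tiledb.Dim(name="obs_id", domain=(None, None), tile=None, dtype="ascii"),
        tiledb.Dim(name="var_id", domain=(None, None), tile=None, dtype="ascii"),
    )
    schema = tiledb.ArraySchema(
        domain=dom,
        attrs=[tiledb.Attr(name="value", dtype=np.float32)],
        sparse=True,
    )
    tiledb.Array.create(uri, schema)
```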

# Problem statement by example

## Cursor-sort of rows

We next look at what we need to be concerned about when we write multiple fragments using the chunked-CSR reader.

Suppose the input `X` array is in CSR format as above:

```
T V S U
C 1 2 . .
A . 3 4 .
B . . 5 6
D 7 . 8 .
```

And suppose we want to write it in two chunks of two rows each.

We must cursor-sort the row labels so that (with zero copy) the matrix will effectively look like this:

```
T V S U
A . 3 4 .
B . . 5 6
---------- chunk boundary
C 1 2 . .
D 7 . 8 .
```

This is necessary, since otherwise every fragment would have the same MBRs in both dimensions and all queries -- whether by `obs_id,var_id`, or by `obs_id`, or by `var_id` -- would need to consult all fragments.
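
A sketch of this row-label cursor-sort (illustrative; the CSR buffers are not rewritten up front, and only small row slices are materialized per chunk):

```python
# Cursor-sort sketch: compute the sorted order of obs labels, then take
# chunks of rows in that order; the CSR itself is not permuted up front.
import numpy as np

def sorted_row_chunks(obs_names, chunk_size):
    order = np.argsort(np.asarray(obs_names))  # cursor-sort of row labels
    for start in range(0, len(order), chunk_size):
        yield order[start:start + chunk_size]

# Each yielded index block becomes one fragment's worth of rows, e.g.
# x_csr[idx, :].tocoo() as in the chunked-ingest sketch above.
```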

* Chunk 1 (written as fragment 1) gets these COOs:
* `A,V,3`
* `A,S,4`
* `B,S,5`
* `B,U,6`
* Chunk 2 (written as fragment 2) gets these COOs:
* `C,T,1`
* `C,V,2`
* `D,T,7`
* `D,S,8`
* Fragment 1 MBR is `[A..B, S..V]`
* Fragment 2 MBR is `[C..D, S..V]`
* TileDB guarantees sorting on both dims within the fragment

Here's the performance concern:

* Queries on `obs_id,var_id` will locate only one fragment, since a given `obs_id` can only be in one fragment
* Queries on `obs_id` will locate one fragment, since a given `obs_id` can only be in one fragment
* Queries on `var_id` will locate _all_ fragments. (Note, however, this is the same amount of data as when the TileDB array was all in one fragment.)

## Cursor-sort of columns

Suppose we were to column-sort the CSR too -- it would look like this:

```
S T U V
A 4 . . 3
B 5 . 6 .
---------- chunk boundary
C . 1 . 2
D 8 7 . .
```

* Chunk 1 (written as fragment 1) gets these COOs:
* `A,S,4`
* `A,V,3`
* `B,S,5`
* `B,U,6`
* Chunk 2 (written as fragment 2) gets these COOs:
* `C,T,1`
* `C,V,2`
* `D,S,8`
* `D,T,7`
* Fragment 1 MBR is `[A..B, S..V]` same as before
* Fragment 2 MBR is `[C..D, S..V]` same as before
* TileDB guarantees sorting on both dims within the fragment

But the performance concern is _identical_ to the situation without cursor-sorting the columns: in fact,
cursor-sorting the columns provides no benefit, since TileDB already sorts by both dimensions
within fragments, and the `var_id` slot of the fragment MBRs is `S..V` in both cases.

## Checkerboarding

Another option is to cursor-sort by both dimensions and then checkerboard:

```
S T | U V
A 4 . | . 3
B 5 . | 6 .
------+----- chunk boundary
C . 1 | . 2
D 8 7 | . .
```

* Fragment 1 gets these COOs:
* `A,S,4`
* `B,S,5`
* Fragment 2 gets these COOs:
* `A,V,3`
* `B,U,6`
* Fragment 3 gets these COOs:
* `C,T,1`
* `D,S,8`
* `D,T,7`
* Fragment 4 gets these COOs:
* `C,V,2`

* Fragment 1 MBR is `[A..B, S..T]`
* Fragment 2 MBR is `[A..B, U..V]`
* Fragment 3 MBR is `[C..D, S..T]`
* Fragment 4 MBR is `[C..D, U..V]`

* A query for `obs_id==D` will have to look at fragments 3 and 4
* A query for `var_id==T` will have to look at fragments 1 and 3
* We still cannot achieve having only one fragment for a given `obs_id` _and_ only one fragment for a
  given `var_id` -- that would require a block-diagonal matrix _even when the row & column labels are
  sorted_, which is not reasonable to expect.

## Global-order writes

See also [Python API docs](https://tiledb-inc-tiledb.readthedocs-hosted.com/en/1.6.3/tutorials/writing-sparse.html#writing-in-global-layout).

Idea:

* Write in global order (sorted by `obs_id` then `var_id`)
* Given the above example, we'd write
* Fragment 1 gets these COOs:
* `A,S,4`
* `A,V,3`
* `B,S,5`
* `B,U,6`
* Fragment 2 gets these COOs:
* `C,T,1`
* `C,V,2`
* `D,S,8`
* `D,T,7`
* Easy to do in Python at the row-chunk level (see the sorting sketch after this list)
* Then:
* Fragment writes will be faster.
* Fragments will be auto-concatenated so they won't need consolidation at all.
* Feature exists and is well-supported in C++.
* Not yet present in the Python API.
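
A small sketch of that per-chunk sort into global order (sort by `obs_id`, then `var_id`); the global-order write mode itself is the part not yet exposed in Python:

```python
# Sort one chunk's COO triples into global order (obs_id major, var_id minor)
# before writing; the global-order write mode is the piece missing from Python.
import numpy as np

def to_global_order(obs_ids, var_ids, values):
    obs_ids, var_ids, values = map(np.asarray, (obs_ids, var_ids, values))
    order = np.lexsort((var_ids, obs_ids))  # last key (obs_ids) is primary
    return obs_ids[order], var_ids[order], values[order]

# With the example above, chunk 2's triples come out as
# (C,T,1), (C,V,2), (D,S,8), (D,T,7).
```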

# Suggested approach

* Use row-based chunking (checkerboard is not implemented as of 2022-05-03).
* Given that queries on `obs_id,var_id` or on `obs_id` will be efficient, but that queries on `var_id` will require consulting multiple fragments, ingest larger arrays as row-chunked CSR but consolidate them afterward (see the sketch after this list).
* As of TileDB core 2.8.2, we cannot consolidate arrays with col-major tile order: so we write `X` with row-major tile order.
* Read-performance impact should be measured explicitly.
* Global-order writes need to be looked into.
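
After a row-chunked ingest, post-hoc consolidation might look roughly like this (assuming `tiledb`-py; `uri` is a placeholder):

```python
# Post-ingest consolidation sketch (assumes tiledb-py).
import tiledb

def consolidate_x(uri: str) -> None:
    tiledb.consolidate(uri)  # merge the many row-chunk fragments
    tiledb.vacuum(uri)       # remove now-redundant pre-consolidation fragments
```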