
Initial web-doc experiment (#136)
johnkerl authored May 31, 2022
1 parent bbc3022 commit a367a8c
Showing 40 changed files with 3,898 additions and 0 deletions.
20 changes: 20 additions & 0 deletions README-docs.md
@@ -0,0 +1,20 @@
# Status

Temporary and experimental.

# Rationale

* R docs (currently at [https://github.com/TileDB-Inc/tiledbsc](https://github.com/TileDB-Inc/tiledbsc)) have a wonderful combination of API docs (generated from in-source-code doc-blocks) as well as hand-written long-form "vignette" material.
* For Python RST-style docs, I am not yet aware of a nice way to do that -- other than what's presented here.
* Tools like Sphinx and readthedocs are suitable for mapping a _single repo's single-language code-docs_ into a _single doc URL_. However, for this repo, we have Python API docs, Python examples/vignettes, and -- soon -- R docs as well. We wish to publish a _multi-lingual, multi-content doc tree_.

# Flow

* Source is in-source-code doc-blocks within `apis/python/src/tiledbsc`, and hand-written long-form "vignette" material in `apis/python/examples`.
* The former are mapped to `.md` (intentionally not `.rst`) via [apis/python/mkmd.sh](apis/python/mkmd.sh). This requires `pydoc-markdown` already installed locally. (Nothing here in this initial experiment is CI-enabled at this point.)
* Then [Quarto](https://quarto.org) is used to map `.md` to `.html` via [_quarto.yml](_quarto.yml).
* `quarto preview` for local preview.
* `quarto render` to write static HTML into `docs/` which can then be published.
* This `docs/` directory is artifacts-only and doesn't need to be committed to source control.
* Then this is synced to an AWS bucket which is used to serve static HTML content: [https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html](https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html).
* [https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteAccessPermissionsReqd.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteAccessPermissionsReqd.html)
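
The flow above, condensed into a single hypothetical driver script (assumes `pydoc-markdown`, `quarto`, and the AWS CLI are installed locally; the bucket path is illustrative):

```python
# Hypothetical driver for the flow above; nothing here is CI-enabled yet.
import subprocess

def build_and_publish_docs(publish: bool = False) -> None:
    # Map in-source doc-blocks to .md (wraps pydoc-markdown).
    subprocess.run(["bash", "apis/python/mkmd.sh"], check=True)
    # Map .md to static HTML in docs/ per _quarto.yml.
    subprocess.run(["quarto", "render"], check=True)
    if publish:
        # Sync the rendered artifacts to the static-hosting bucket (illustrative path).
        subprocess.run(
            ["aws", "s3", "sync", "docs/", "s3://tiledb-singlecell-docs/docs/"],
            check=True,
        )

if __name__ == "__main__":
    build_and_publish_docs()
```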
110 changes: 110 additions & 0 deletions _quarto.yml
@@ -0,0 +1,110 @@
project:
type: website
output-dir: docs

format:
html:
toc: true
theme:
light: [flatly, "quarto-materials/tiledb.scss"]
# TODO: Inter font needs custom font-install for CI
#mainfont: Inter
mainfont: Helvetica
fontsize: 1rem
linkcolor: "#4d9fff"
code-copy: true
code-overflow: wrap
css: "quarto-materials/tiledb.css"

website:
favicon: "images/favicon.ico"
site-url: https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html
repo-url: https://github.com/single-cell-data/TileDB-SingleCell
# We may want one or both of these, or neither:
repo-actions: [edit, issue]
page-navigation: true
navbar:
background: light
logo: "quarto-materials/tiledb-logo.png"
collapse-below: lg
left:
- text: "Home page"
href: "https://tiledb.com"
- text: "Login"
href: "https://cloud.tiledb.com/auth/login"
- text: "Contact us"
href: "https://tiledb.com/contact"
- text: "Repo"
href: "https://github.com/single-cell-data/TileDB-SingleCell"

sidebar:
- style: "floating"
collapse-level: 2
align: left
contents:
- href: "overview.md"
text: "Overview"

- text: "R examples and API"
href: "https://tiledb-inc.github.io/tiledbsc"

- section: "Python"
contents:

- section: "Python examples"
contents:
- href: "apis/python/examples/obtaining-data-files.md"
text: "Obtaining data files"
- href: "apis/python/examples/ingesting-data-files.md"
text: "Ingesting data files"
- href: "apis/python/examples/anndata-and-tiledb.md"
text: "Comparing AnnData and TileDB files"
- href: "apis/python/examples/inspecting-schema.md"
text: "Inspecting SOMA schemas"
- href: "apis/python/examples/soma-collection-reconnaissance.md"
text: "SOMA-collection reconnaissance"

- section: "Python API"
contents:
- href: "apis/python/doc/overview.md"

- href: "apis/python/doc/soma_collection.md"
text: "SOMACollection"
- href: "apis/python/doc/soma.md"
text: "SOMA"

- href: "apis/python/doc/soma_options.md"
text: "SOMAOptions"

- href: "apis/python/doc/assay_matrix_group.md"
text: "AssayMatrixGroup"
- href: "apis/python/doc/assay_matrix.md"
text: "AssayMatrix"
- href: "apis/python/doc/annotation_dataframe.md"
text: "AnnotationDataFrame"
- href: "apis/python/doc/annotation_matrix_group.md"
text: "AnnotationMatrixGroup"
- href: "apis/python/doc/annotation_matrix.md"
text: "AnnotationMatrix"
- href: "apis/python/doc/annotation_pairwise_matrix_group.md"
text: "AnnotationPairwiseMatrixGroup"
- href: "apis/python/doc/raw_group.md"
text: "RawGroup"
- href: "apis/python/doc/uns_group.md"
text: "UnsGroup"
- href: "apis/python/doc/uns_array.md"
text: "UnsArray"

- href: "apis/python/doc/tiledb_array.md"
text: "TileDBArray"
- href: "apis/python/doc/tiledb_group.md"
text: "TileDBGroup"
- href: "apis/python/doc/tiledb_object.md"
text: "TileDBObject"

- href: "apis/python/doc/util.md"
text: "tiledbsc.util"
- href: "apis/python/doc/util_ann.md"
text: "tiledbsc.util_ann"
- href: "apis/python/doc/util_tiledb.md"
text: "tiledbsc.util_tiledb"
227 changes: 227 additions & 0 deletions apis/python/doc/README-csr-ingest.md
@@ -0,0 +1,227 @@
# Overview

As of 2022-05-03 we offer a chunked ingestor for `X` data within larger data files. This enables
ingestion of larger data files within the RAM limits of the available hardware. It entails a
read-performance penalty under certain query modes, as this note will articulate.

# What is and isn't streamed/chunked

Input `.h5ad` files are read into memory using `anndata.read_h5ad` -- if the input file is 5GB, say,
about that much RAM will be needed. We don't have a way to stream the contents of the `.h5ad` file
itself -- it's read all at once.

Often `X` data is in CSR format; occasionally CSC, or dense (`numpy.ndarray`). Suppose for the rest
of this note that we're looking at CSR `X` data -- a similar analysis will hold (_mutatis mutandis_)
for CSC data.
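
A minimal sketch of that first step (not the repo's ingest code), reading an `.h5ad` entirely into memory and checking which format `X` arrived in:

```python
# Read the whole .h5ad into RAM and report the in-memory format of X.
import anndata
import scipy.sparse

ann = anndata.read_h5ad("input.h5ad")  # entire file is materialized in memory
x = ann.X
if scipy.sparse.isspmatrix_csr(x):
    kind = "CSR"
elif scipy.sparse.isspmatrix_csc(x):
    kind = "CSC"
else:
    kind = "dense (numpy.ndarray)"
print(f"X is {kind}, shape={x.shape}")
```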

Given CSR `X` data, we find that an all-at-once `x.toarray()` can involve a huge explosion of memory
requirement if the input data is sparse, since every zero cell is materialized: the expansion factor
from on-disk `X` size to the in-memory densified array is large and varies with sparsity. For this
reason we don't do that; we use `x.tocoo()` instead.

Given CSR `X` data, we find that an all-at-once `x.tocoo()` involves about a 2x or 2.5x expansion in
RSS as revealed by `htop`. CSR data on disk (and in RAM) stores each row's values and column indices
contiguously, with only per-row offsets kept alongside; COO data is a list of `(i,j,v)` tuples with
the `i` and `j` written out individually for every cell, which of course takes more memory. In
summary, an all-at-once `.tocoo()` still increases memory from on-disk size to the in-memory
COO-ified form, but with a lower (and more predictable) multiplication factor than densification.

The alternative discussed here applies to `.h5ad` data files which are small enough to read into
RAM, but for which the 2.5x-or-so inflation from CSR to COO results in a COO matrix which is too big
for RAM.
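
For intuition on these expansion factors, here is a small, illustrative sketch using synthetic data. It counts only the raw array bytes (RSS as seen by `htop` will be higher), and the exact factors depend on dtype and index width:

```python
# Rough memory accounting for dense vs. CSR vs. COO representations of the
# same sparse matrix (synthetic example; factors vary with dtype/index width).
import numpy as np
import scipy.sparse

x_csr = scipy.sparse.random(10_000, 2_000, density=0.05, format="csr", dtype=np.float32)
x_coo = x_csr.tocoo()

csr_bytes = x_csr.data.nbytes + x_csr.indices.nbytes + x_csr.indptr.nbytes
coo_bytes = x_coo.data.nbytes + x_coo.row.nbytes + x_coo.col.nbytes
dense_bytes = x_csr.shape[0] * x_csr.shape[1] * x_csr.dtype.itemsize

print(f"CSR:   {csr_bytes:>12,} bytes")
print(f"COO:   {coo_bytes:>12,} bytes ({coo_bytes / csr_bytes:.1f}x CSR)")
print(f"dense: {dense_bytes:>12,} bytes ({dense_bytes / csr_bytes:.1f}x CSR)")
```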

# Sketch, and relevant features of TileDB storage

What we will do is take chunks of the CSR -- a few rows at a time -- and convert each CSR submatrix
to COO, writing each "chunk" as a TileDB fragment. This way the 2.5x memory expansion is paid only
from CSR submatrix to COO submatrix, and we can lower the memory footprint needed for the ingestion
into TileDB.
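
A condensed, illustrative sketch of that loop (not the repo's implementation; it assumes a sparse TileDB array with string `obs_id`/`var_id` dims already exists at `uri`, and `chunk_size` is a placeholder):

```python
# Illustrative row-chunked CSR ingestion; assumes a sparse TileDB array with
# string dims obs_id/var_id already exists at `uri`.
import numpy as np
import tiledb

def ingest_x_chunked(x_csr, obs_names, var_names, uri, chunk_size=1000):
    obs_names = np.asarray(obs_names)
    var_names = np.asarray(var_names)
    with tiledb.open(uri, mode="w") as A:
        for start in range(0, x_csr.shape[0], chunk_size):
            stop = min(start + chunk_size, x_csr.shape[0])
            # Only this row block is expanded from CSR to COO, so the
            # ~2.5x expansion is paid per chunk, not for the whole matrix.
            chunk = x_csr[start:stop, :].tocoo()
            # Each write lands as its own TileDB fragment.
            A[obs_names[start:stop][chunk.row], var_names[chunk.col]] = chunk.data
```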

Some facts about this:

* In the `.h5ad` we have `obs`/`var` names mapping from string to int, and integer-indexed sparse/dense `X` matrices.
* In TileDB, by contrast, we have the `obs`/`var` names being _themselves_ string indices into sparse `X` matrices. (A schema sketch appears at the end of this section.)
* TileDB storage orders its dims. That means that if you have an input matrix as on the left, with `obs_id=A,B,C,D` and `var_id=S,T,U,V`, then it will be stored as on the right:

```
Input CSR                TileDB storage
---------                -------------- all one fragment
  T V S U                  S T U V
C 1 2 . .                A 4 . . 3
A . 3 4 .                B 5 . 6 .
B . . 5 6                C . 1 . 2
D 7 . 8 .                D 8 7 . .
```

* TileDB storage is 3-level: _fragments_ (corresponding to different timestamped writes); _tiles_; and _cells_.
* Fragments and tiles both have MBRs. For this example (suppose for the moment that it's written all at once in a single fragment) the fragment MBR is `A..D` in the `obs_id` dimension and `S..V` in the `var_id` dimension.
* Query modes: we expect queries by `obs_id,var_id` pairs, or by `obs_id`, or by `var_id`. Given the above representation, since tiles within the fragment use ordered `obs_id` and `var_id`, all three query modes will be efficient:
  * there's one fragment
  * Queries on `obs_id,var_id` will locate only one tile within the fragment
  * Queries on `obs_id` will locate one row of tiles within the fragment
  * Queries on `var_id` will locate one column of tiles within the fragment

```
TileDB storage
-------------- all one fragment
  S T : U V
A 4 . : . 3
B 5 . : 6 .
......:...... tile boundary
C . 1 : . 2
D 8 7 : . .
```
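
For concreteness, here is a minimal sketch (assuming `tiledb`-py; the dim/attribute names are illustrative, not necessarily the repo's schema) of a sparse `X` array keyed by string `obs_id`/`var_id` dimensions:

```python
# Minimal sketch of a sparse X array keyed by string obs_id/var_id dims
# (assumes tiledb-py; names are illustrative, not the repo's schema).
import numpy as np
import tiledb

def create_x_schema(uri: str) -> None:
    dom = tiledb.Domain(
        tiledb.Dim(name="obs_id", domain=(None, None), tile=None, dtype="ascii"),
        tiledb.Dim(name="var_id", domain=(None, None), tile=None, dtype="ascii"),
    )
    schema = tiledb.ArraySchema(
        domain=dom,
        attrs=[tiledb.Attr(name="value", dtype=np.float32)],
        sparse=True,
    )
    tiledb.Array.create(uri, schema)
```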

# Problem statement by example

## Cursor-sort of rows

We next look at what we need to be concerned about when we write multiple fragments using the chunked-CSR reader.

Suppose the input `X` array is in CSR format as above:

```
T V S U
C 1 2 . .
A . 3 4 .
B . . 5 6
D 7 . 8 .
```

And suppose we want to write it in two chunks of two rows each.

We must cursor-sort the row labels so that (with zero copy) the matrix will effectively look like this:

```
T V S U
A . 3 4 .
B . . 5 6
---------- chunk boundary
C 1 2 . .
D 7 . 8 .
```

This is necessary, since otherwise every fragment would have the same MBRs in both dimensions and all queries -- whether by `obs_id,var_id`, or by `obs_id`, or by `var_id` -- would need to consult all fragments.
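
A sketch of this row-label cursor-sort (illustrative; the CSR buffers are not rewritten up front, and only small row slices are materialized per chunk):

```python
# Cursor-sort sketch: compute the sorted order of obs labels, then take
# chunks of rows in that order; the CSR itself is not permuted up front.
import numpy as np

def sorted_row_chunks(obs_names, chunk_size):
    order = np.argsort(np.asarray(obs_names))  # cursor-sort of row labels
    for start in range(0, len(order), chunk_size):
        yield order[start:start + chunk_size]

# Each yielded index block becomes one fragment's worth of rows, e.g.
# x_csr[idx, :].tocoo() as in the chunked-ingest sketch above.
```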

* Chunk 1 (written as fragment 1) gets these COOs:
* `A,V,3`
* `A,S,4`
* `B,S,5`
* `B,U,6`
* Chunk 2 (written as fragment 2) gets these COOs:
* `C,T,1`
* `C,V,2`
* `D,T,7`
* `D,S,8`
* Fragment 1 MBR is `[A..B, S..V]`
* Fragment 2 MBR is `[C..D, S..V]`
* TileDB guarantees sorting on both dims within the fragment

Here's the performance concern:

* Queries on `obs_id,var_id` will locate only one fragment, since a given `obs_id` can only be in one fragment
* Queries on `obs_id` will locate one fragment, since a given `obs_id` can only be in one fragment
* Queries on `var_id` will locate _all_ fragments. (Note, however, this is the same amount of data as when the TileDB array was all in one fragment.)

## Cursor-sort of columns

Suppose we were to column-sort the CSR too -- it would look like this:

```
S T U V
A 4 . . 3
B 5 . 6 .
---------- chunk boundary
C . 1 . 2
D 8 7 . .
```

* Chunk 1 (written as fragment 1) gets these COOs:
* `A,S,4`
* `A,V,3`
* `B,S,5`
* `B,U,6`
* Chunk 2 (written as fragment 2) gets these COOs:
* `C,T,1`
* `C,V,2`
* `D,S,8`
* `D,T,7`
* Fragment 1 MBR is `[A..B, S..V]` same as before
* Fragment 2 MBR is `[C..D, S..V]` same as before
* TileDB guarantees sorting on both dims within the fragment

But the performance concern is _identical_ to the situation without cursor-sorting the columns: in fact,
cursor-sorting the columns provides no benefit, since TileDB already sorts by both dimensions
within fragments, and the `var_id` slot of the fragment MBRs is `S..V` in both cases.

## Checkerboarding

Another option is to cursor-sort by both dimensions and then checkerboard:

```
S T | U V
A 4 . | . 3
B 5 . | 6 .
------+----- chunk boundary
C . 1 | . 2
D 8 7 | . .
```

* Fragment 1 gets these COOs:
* `A,S,4`
* `B,S,5`
* Fragment 2 gets these COOs:
* `A,V,3`
* `B,U,6`
* Fragment 3 gets these COOs:
* `C,T,1`
* `D,S,8`
* `D,T,7`
* Fragment 4 gets these COOs:
* `C,V,2`

* Fragment 1 MBR is `[A..B, S..T]`
* Fragment 2 MBR is `[A..B, U..V]`
* Fragment 3 MBR is `[C..D, S..T]`
* Fragment 4 MBR is `[C..D, U..V]`

* A query for `obs_id==D` will have to look at fragments 3 and 4
* A query for `var_id==T` will have to look at fragments 1 and 3
* We still cannot achieve having only one fragment for a given `obs_id` _and_ only one fragment for a
  given `var_id` -- that would require a block-diagonal matrix _even when the row & column labels are
  sorted_, which is not reasonable to expect.

## Global-order writes

See also [Python API docs](https://tiledb-inc-tiledb.readthedocs-hosted.com/en/1.6.3/tutorials/writing-sparse.html#writing-in-global-layout).

Idea:

* Write in global order (sorted by `obs_id` then `var_id`)
* Given the above example, we'd write
* Fragment 1 gets these COOs:
* `A,S,4`
* `A,V,3`
* `B,S,5`
* `B,U,6`
* Fragment 2 gets these COOs:
* `C,T,1`
* `C,V,2`
* `D,S,8`
* `D,T,7`
* Easy to do in Python at the row-chunk level (see the sorting sketch after this list)
* Then:
* Fragment writes will be faster.
* Fragments will be auto-concatenated so they won't need consolidation at all.
* Feature exists and is well-supported in C++.
* Not yet present in the Python API.
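
A small sketch of that per-chunk sort into global order (sort by `obs_id`, then `var_id`); the global-order write mode itself is the part not yet exposed in Python:

```python
# Sort one chunk's COO triples into global order (obs_id major, var_id minor)
# before writing; the global-order write mode is the piece missing from Python.
import numpy as np

def to_global_order(obs_ids, var_ids, values):
    obs_ids, var_ids, values = map(np.asarray, (obs_ids, var_ids, values))
    order = np.lexsort((var_ids, obs_ids))  # last key (obs_ids) is primary
    return obs_ids[order], var_ids[order], values[order]

# With the example above, chunk 2's triples come out as
# (C,T,1), (C,V,2), (D,S,8), (D,T,7).
```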

# Suggested approach

* Use row-based chunking (checkerboard is not implemented as of 2022-05-03).
* Given that queries on `obs_id,var_id` or on `obs_id` will be efficient, but that queries on `var_id` will require consulting multiple fragments, ingest larger arrays as row-chunked CSR but consolidate them afterward (see the sketch after this list).
* As of TileDB core 2.8.2, we cannot consolidate arrays with col-major tile order: so we write `X` with row-major tile order.
* Read-performance impact should be measured explicitly.
* Global-order writes need to be looked into.
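
After a row-chunked ingest, post-hoc consolidation might look roughly like this (assuming `tiledb`-py; `uri` is a placeholder):

```python
# Post-ingest consolidation sketch (assumes tiledb-py).
import tiledb

def consolidate_x(uri: str) -> None:
    tiledb.consolidate(uri)  # merge the many row-chunk fragments
    tiledb.vacuum(uri)       # remove now-redundant pre-consolidation fragments
```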