# Status

Temporary and experimental.

# Rationale

* R docs (currently at [https://github.com/TileDB-Inc/tiledbsc](https://github.com/TileDB-Inc/tiledbsc)) have a wonderful combination of API docs (generated from in-source-code doc-blocks) and hand-written long-form "vignette" material.
* For Python RST-style docs, I am not yet aware of a nice way to do the same -- other than what's presented here.
* Tools like Sphinx and readthedocs are suitable for mapping a _single repo's single-language code-docs_ into a _single doc URL_. However, for this repo, we have Python API docs, Python examples/vignettes, and -- soon -- R docs as well. We wish to publish a _multi-lingual, multi-content doc tree_.

# Flow

* Sources are in-source-code doc-blocks within `apis/python/src/tiledbsc`, and hand-written long-form "vignette" material in `apis/python/examples`.
* The former are mapped to `.md` (intentionally not `.rst`) via [apis/python/mkmd.sh](apis/python/mkmd.sh). This requires `pydoc-markdown` to be installed locally. (Nothing here in this initial experiment is CI-enabled at this point.)
* Then [Quarto](https://quarto.org) is used to map `.md` to `.html` via [_quarto.yml](_quarto.yml).
  * `quarto preview` for local preview.
  * `quarto render` to write static HTML into `docs/`, which can then be published.
  * This `docs/` directory is artifacts-only and doesn't need to be committed to source control.
* Then this is synced to an AWS bucket which is used to serve static HTML content: [https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html](https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html).
  * [https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteAccessPermissionsReqd.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteAccessPermissionsReqd.html)
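The whole flow above can be driven from one small local script. Here is a minimal sketch, assuming `pydoc-markdown`, `quarto`, and the AWS CLI are installed locally and that the commands are run from the repo root; the bucket name and `docs/` prefix are inferred from the URL above and should be treated as placeholders:

```python
# Hypothetical local docs-build script -- a sketch of the flow above, not CI-enabled.
# Assumes pydoc-markdown, quarto, and the AWS CLI are installed and on PATH.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Doc-blocks -> .md via the pydoc-markdown wrapper script
run(["./apis/python/mkmd.sh"])

# 2. .md -> static HTML in docs/ via Quarto (driven by _quarto.yml)
run(["quarto", "render"])

# 3. Publish docs/ to the S3 bucket serving the static site (bucket name assumed)
run(["aws", "s3", "sync", "docs/", "s3://tiledb-singlecell-docs/docs/"])
```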
`_quarto.yml`:

project:
  type: website
  output-dir: docs

format:
  html:
    toc: true
    theme:
      light: [flatly, "quarto-materials/tiledb.scss"]
    # TODO: Inter font needs custom font-install for CI
    #mainfont: Inter
    mainfont: Helvetica
    fontsize: 1rem
    linkcolor: "#4d9fff"
    code-copy: true
    code-overflow: wrap
    css: "quarto-materials/tiledb.css"

website:
  favicon: "images/favicon.ico"
  site-url: https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html
  repo-url: https://github.com/single-cell-data/TileDB-SingleCell
  # We may want one or both of these, or neither:
  repo-actions: [edit, issue]
  page-navigation: true
  navbar:
    background: light
    logo: "quarto-materials/tiledb-logo.png"
    collapse-below: lg
    left:
      - text: "Home page"
        href: "https://tiledb.com"
      - text: "Login"
        href: "https://cloud.tiledb.com/auth/login"
      - text: "Contact us"
        href: "https://tiledb.com/contact"
      - text: "Repo"
        href: "https://github.com/single-cell-data/TileDB-SingleCell"

  sidebar:
    - style: "floating"
      collapse-level: 2
      align: left
      contents:
        - href: "overview.md"
          text: "Overview"

        - text: "R examples and API"
          href: "https://tiledb-inc.github.io/tiledbsc"

        - section: "Python"
          contents:

            - section: "Python examples"
              contents:
                - href: "apis/python/examples/obtaining-data-files.md"
                  text: "Obtaining data files"
                - href: "apis/python/examples/ingesting-data-files.md"
                  text: "Ingesting data files"
                - href: "apis/python/examples/anndata-and-tiledb.md"
                  text: "Comparing AnnData and TileDB files"
                - href: "apis/python/examples/inspecting-schema.md"
                  text: "Inspecting SOMA schemas"
                - href: "apis/python/examples/soma-collection-reconnaissance.md"
                  text: "SOMA-collection reconnaissance"

            - section: "Python API"
              contents:
                - href: "apis/python/doc/overview.md"

                - href: "apis/python/doc/soma_collection.md"
                  text: "SOMACollection"
                - href: "apis/python/doc/soma.md"
                  text: "SOMA"

                - href: "apis/python/doc/soma_options.md"
                  text: "SOMAOptions"

                - href: "apis/python/doc/assay_matrix_group.md"
                  text: "AssayMatrixGroup"
                - href: "apis/python/doc/assay_matrix.md"
                  text: "AssayMatrix"
                - href: "apis/python/doc/annotation_dataframe.md"
                  text: "AnnotationDataFrame"
                - href: "apis/python/doc/annotation_matrix_group.md"
                  text: "AnnotationMatrixGroup"
                - href: "apis/python/doc/annotation_matrix.md"
                  text: "AnnotationMatrix"
                - href: "apis/python/doc/annotation_pairwise_matrix_group.md"
                  text: "AnnotationPairwiseMatrixGroup"
                - href: "apis/python/doc/raw_group.md"
                  text: "RawGroup"
                - href: "apis/python/doc/uns_group.md"
                  text: "UnsGroup"
                - href: "apis/python/doc/uns_array.md"
                  text: "UnsArray"

                - href: "apis/python/doc/tiledb_array.md"
                  text: "TileDBArray"
                - href: "apis/python/doc/tiledb_group.md"
                  text: "TileDBGroup"
                - href: "apis/python/doc/tiledb_object.md"
                  text: "TileDBObject"

                - href: "apis/python/doc/util.md"
                  text: "tiledbsc.util"
                - href: "apis/python/doc/util_ann.md"
                  text: "tiledbsc.util_ann"
                - href: "apis/python/doc/util_tiledb.md"
                  text: "tiledbsc.util_tiledb"
# Overview

As of 2022-05-03 we offer a chunked ingestor for `X` data within larger data files. This enables
ingestion of larger data files within the RAM limits of available hardware. It entails a
read-performance penalty under certain query modes, as this note will articulate.

# What is and isn't streamed/chunked

Input `.h5ad` files are read into memory using `anndata.read_h5ad` -- if the input file is 5GB, say,
about that much RAM will be needed. We don't have a way to stream the contents of the `.h5ad` file
itself -- it's read all at once.
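For concreteness, a minimal sketch of that all-at-once read, along with a check of which format the `X` matrix arrives in (the `input.h5ad` path is a placeholder, not a file shipped with this repo):

```python
# Sketch of the all-at-once read described above; "input.h5ad" is a placeholder path.
import anndata
import scipy.sparse

adata = anndata.read_h5ad("input.h5ad")  # the entire file is materialized in RAM here

x = adata.X
if scipy.sparse.isspmatrix_csr(x):
    print("X is CSR:", x.shape, "with", x.nnz, "stored values")
elif scipy.sparse.isspmatrix_csc(x):
    print("X is CSC:", x.shape, "with", x.nnz, "stored values")
else:
    print("X is dense:", x.shape)  # numpy.ndarray
```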
Often `X` data is in CSR format; occasionally CSC, or dense (`numpy.ndarray`). Suppose for the rest
of this note that we're looking at CSR `X` data -- a similar analysis will hold (_mutatis mutandis_)
for CSC data.

Given CSR `X` data, we find that an all-at-once `x.toarray()` can involve a huge explosion in memory
requirement if the input data is sparse -- for this reason, we don't do that; we use `x.tocoo()`. In
summary, `.toarray()` incurs a variably huge expansion from on-disk `X` size to the in-memory
densified form, and we don't do this.

Given CSR `X` data, we find that an all-at-once `x.tocoo()` involves about a 2x or 2.5x expansion in
RSS as revealed by `htop` -- CSR data on disk (and in RAM) is a list of contiguous row
sub-sequences, with the value and column index spelled out per stored cell but the rows encoded only
as bounding offsets; COO data is a list of `(i,j,v)` tuples with the `i,j` written out individually
for every cell -- which of course takes more memory. In summary, all-at-once `.tocoo()` has a memory
increase from on-disk size to the in-memory COO-ified form, but with a lower (and more predictable)
multiplication factor.
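To see how these expansion factors can be measured, here is a sketch that compares the in-memory sizes of CSR, COO, and dense forms of a synthetic random matrix (the shape, density, and dtype are arbitrary assumptions; the exact ratios depend on index dtypes and allocator overhead, so they won't exactly match the `htop`-observed 2x-2.5x):

```python
# Sketch: compare in-memory sizes of CSR, COO, and dense forms of a synthetic sparse matrix.
import numpy as np
import scipy.sparse

csr = scipy.sparse.random(20_000, 2_000, density=0.05, format="csr", dtype=np.float32)

def csr_nbytes(m):
    # CSR stores values, per-cell column indices, and per-row offsets
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

coo = csr.tocoo()
coo_nbytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes  # i, j, v per cell
dense_nbytes = csr.shape[0] * csr.shape[1] * csr.dtype.itemsize

print("CSR bytes:  ", csr_nbytes(csr))
print("COO bytes:  ", coo_nbytes, "(~%.1fx CSR)" % (coo_nbytes / csr_nbytes(csr)))
print("dense bytes:", dense_nbytes, "(~%.1fx CSR)" % (dense_nbytes / csr_nbytes(csr)))
```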
The alternative discussed here applies to `.h5ad` data files which are small enough to read into
RAM, but for which the 2.5x-or-so inflation from CSR to COO results in a COO matrix which is too big
for RAM.

# Sketch, and relevant features of TileDB storage

What we will do is take chunks of the CSR -- a few rows at a time -- and convert each CSR submatrix
to COO, writing each "chunk" as a TileDB fragment. This way the 2.5x memory expansion is paid only
from CSR submatrix to COO submatrix, and we can lower the memory footprint needed for the ingestion
into TileDB.
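A minimal sketch of that row-chunked CSR-to-COO conversion follows; the `write_coo_chunk` callback is a placeholder for "write one TileDB fragment" and is not the actual `tiledbsc` API:

```python
# Sketch of row-chunked CSR -> COO conversion. write_coo_chunk stands in for
# "write one TileDB fragment"; it is not the real tiledbsc ingestion API.
import scipy.sparse

def ingest_csr_in_chunks(csr, chunk_rows, write_coo_chunk):
    """Convert a CSR matrix to COO a few rows at a time, so the CSR->COO
    expansion is only ever paid for one chunk of rows."""
    nrows = csr.shape[0]
    for row_start in range(0, nrows, chunk_rows):
        row_end = min(row_start + chunk_rows, nrows)
        chunk = csr[row_start:row_end, :].tocoo()   # only this slice is COO-ified
        # chunk.row is relative to the slice; offset it back to global row ids
        write_coo_chunk(chunk.row + row_start, chunk.col, chunk.data)

# Example callback: just report what each fragment-to-be would contain.
def report(rows, cols, vals):
    if len(vals) == 0:
        print("fragment would be empty; skipping")
        return
    print(f"fragment with {len(vals)} cells, rows {rows.min()}..{rows.max()}")

csr = scipy.sparse.random(10, 8, density=0.3, format="csr")
ingest_csr_in_chunks(csr, chunk_rows=4, write_coo_chunk=report)
```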
Some facts about this:

* In the `.h5ad` we have `obs`/`var` names mapping from string to int, and integer-indexed sparse/dense `X` matrices.
* In TileDB, by contrast, we have the `obs`/`var` names being _themselves_ string indices into sparse `X` matrices.
* TileDB storage orders its dims. That means that if you have an input matrix as on the left, with `obs_id=A,B,C,D` and `var_id=S,T,U,V`, then it will be stored as on the right:

```
Input CSR          TileDB storage
---------          -------------- all one fragment
   T V S U            S T U V
C  1 2 . .         A  4 . . 3
A  . 3 4 .         B  5 . 6 .
B  . . 5 6         C  . 1 . 2
D  7 . 8 .         D  8 7 . .
```
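The right-hand picture can be reproduced by sorting the `(obs_id, var_id, value)` triples on both string dimensions, which is the order TileDB uses within a fragment -- a small illustrative sketch:

```python
# Sketch: sort the example's (obs_id, var_id, value) triples on both dimensions,
# mirroring the dimension-ordered layout TileDB uses within a fragment.
triples = [
    ("C", "T", 1), ("C", "V", 2),
    ("A", "V", 3), ("A", "S", 4),
    ("B", "S", 5), ("B", "U", 6),
    ("D", "T", 7), ("D", "S", 8),
]

for obs_id, var_id, value in sorted(triples):  # sorts by obs_id, then var_id
    print(obs_id, var_id, value)
# A S 4
# A V 3
# B S 5
# B U 6
# C T 1
# C V 2
# D S 8
# D T 7
```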
* TileDB storage is 3-level: _fragments_ (corresponding to different timestamped writes); _tiles_; and _cells_.
* Fragments and tiles both have MBRs. For this example (suppose for the moment that it's written all at once in a single fragment) the fragment MBR is `A..D` in the `obs_id` dimension and `S..V` in the `var_id` dimension.
* Query modes: we expect queries by `obs_id,var_id` pairs, or by `obs_id`, or by `var_id`. Given the above representation, since tiles within the fragment use ordered `obs_id` and `var_id`, all three query modes will be efficient:
  * There's one fragment
  * Queries on `obs_id,var_id` will locate only one tile within the fragment
  * Queries on `obs_id` will locate one row of tiles within the fragment
  * Queries on `var_id` will locate one column of tiles within the fragment

```
TileDB storage
-------------- all one fragment
   S T : U V
A  4 . : . 3
B  5 . : 6 .
.......:....... tile boundary
C  . 1 : . 2
D  8 7 : . .
```
# Problem statement by example

## Cursor-sort of rows

We next look at what we need to be concerned about when we write multiple fragments using the chunked-CSR reader.

Suppose the input `X` array is in CSR format as above:

```
   T V S U
C  1 2 . .
A  . 3 4 .
B  . . 5 6
D  7 . 8 .
```

And suppose we want to write it in two chunks of two rows each.

We must cursor-sort the row labels so that (with zero copy) the matrix will effectively look like this:

```
   T V S U
A  . 3 4 .
B  . . 5 6
---------- chunk boundary
C  1 2 . .
D  7 . 8 .
```
This is necessary, since otherwise every fragment would have the same MBRs in both dimensions and all queries -- whether by `obs_id,var_id`, or by `obs_id`, or by `var_id` -- would need to consult all fragments.

* Chunk 1 (written as fragment 1) gets these COOs:
  * `A,V,3`
  * `A,S,4`
  * `B,S,5`
  * `B,U,6`
* Chunk 2 (written as fragment 2) gets these COOs:
  * `C,T,1`
  * `C,V,2`
  * `D,T,7`
  * `D,S,8`
* Fragment 1 MBR is `[A..B, S..V]`
* Fragment 2 MBR is `[C..D, S..V]`
* TileDB guarantees sorting on both dims within the fragment
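A sketch of how this cursor-sort-then-chunk step produces those per-fragment COOs; for clarity it permutes the CSR rows outright, whereas the real ingestor would keep a sorted cursor over row labels rather than copying:

```python
# Sketch of cursor-sorted, row-chunked fragment contents for the example above.
import numpy as np
import scipy.sparse

obs_ids = np.array(["C", "A", "B", "D"])
var_ids = np.array(["T", "V", "S", "U"])
x = scipy.sparse.csr_matrix(
    np.array([
        [1, 2, 0, 0],   # C
        [0, 3, 4, 0],   # A
        [0, 0, 5, 6],   # B
        [7, 0, 8, 0],   # D
    ])
)

perm = np.argsort(obs_ids)          # cursor-sort of row labels: A, B, C, D
x_sorted = x[perm, :]
obs_sorted = obs_ids[perm]

chunk_rows = 2
for f, start in enumerate(range(0, x_sorted.shape[0], chunk_rows), start=1):
    coo = x_sorted[start:start + chunk_rows, :].tocoo()
    print(f"Fragment {f}:")
    for i, j, v in zip(coo.row, coo.col, coo.data):
        print("  ", obs_sorted[start + i], var_ids[j], v)
```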
Here's the performance concern:

* Queries on `obs_id,var_id` will locate only one fragment, since a given `obs_id` can only be in one fragment
* Queries on `obs_id` will locate one fragment, since a given `obs_id` can only be in one fragment
* Queries on `var_id` will locate _all_ fragments. (Note, however, this is the same amount of data as when the TileDB array was all in one fragment.)
## Cursor-sort of columns

Suppose we were to column-sort the CSR too -- it would look like this:

```
   S T U V
A  4 . . 3
B  5 . 6 .
---------- chunk boundary
C  . 1 . 2
D  8 7 . .
```
* Chunk 1 (written as fragment 1) gets these COOs:
  * `A,S,4`
  * `A,V,3`
  * `B,S,5`
  * `B,U,6`
* Chunk 2 (written as fragment 2) gets these COOs:
  * `C,T,1`
  * `C,V,2`
  * `D,S,8`
  * `D,T,7`
* Fragment 1 MBR is `[A..B, S..V]`, same as before
* Fragment 2 MBR is `[C..D, S..V]`, same as before
* TileDB guarantees sorting on both dims within the fragment

But the performance concern is _identical_ to the situation without cursor-sort of columns: in fact,
cursor-sorting the columns provides no benefit, since TileDB is already sorting by both dimensions
within fragments, and the `var_id` slots of the fragment MBRs are `S..V` in both cases.
## Checkerboarding

Another option is to cursor-sort by both dimensions and then checkerboard:

```
   S T | U V
A  4 . | . 3
B  5 . | 6 .
-------+------ chunk boundary
C  . 1 | . 2
D  8 7 | . .
```
* Fragment 1 gets these COOs:
  * `A,S,4`
  * `B,S,5`
* Fragment 2 gets these COOs:
  * `A,V,3`
  * `B,U,6`
* Fragment 3 gets these COOs:
  * `C,T,1`
  * `D,S,8`
  * `D,T,7`
* Fragment 4 gets these COOs:
  * `C,V,2`

* Fragment 1 MBR is `[A..B, S..T]`
* Fragment 2 MBR is `[A..B, U..V]`
* Fragment 3 MBR is `[C..D, S..T]`
* Fragment 4 MBR is `[C..D, U..V]`
* A query for `obs_id==D` will have to look at fragments 3 and 4
* A query for `var_id==T` will have to look at fragments 1 and 3
* We still cannot achieve having only one fragment for a given `obs_id`, and only one fragment for a
  given `var_id` -- we'd need the data to form a block-diagonal matrix _even when the row & column
  labels are sorted_, which is not reasonable to expect.
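For illustration only (not how the ingestor is implemented), here is a sketch of the checkerboard assignment above, bucketing the example's COO cells into 2-row by 2-column blocks over the sorted labels:

```python
# Sketch: checkerboard assignment of COO cells to fragments for the example above.
from collections import defaultdict

triples = [
    ("A", "S", 4), ("A", "V", 3),
    ("B", "S", 5), ("B", "U", 6),
    ("C", "T", 1), ("C", "V", 2),
    ("D", "S", 8), ("D", "T", 7),
]
obs_sorted = ["A", "B", "C", "D"]
var_sorted = ["S", "T", "U", "V"]

fragments = defaultdict(list)
for obs_id, var_id, v in triples:
    row_block = obs_sorted.index(obs_id) // 2     # 0 for A,B; 1 for C,D
    col_block = var_sorted.index(var_id) // 2     # 0 for S,T; 1 for U,V
    fragments[(row_block, col_block)].append((obs_id, var_id, v))

for block, cells in sorted(fragments.items()):
    print(block, cells)
# (0, 0) [('A', 'S', 4), ('B', 'S', 5)]
# (0, 1) [('A', 'V', 3), ('B', 'U', 6)]
# (1, 0) [('C', 'T', 1), ('D', 'S', 8), ('D', 'T', 7)]
# (1, 1) [('C', 'V', 2)]
```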
## Global-order writes

See also the [Python API docs](https://tiledb-inc-tiledb.readthedocs-hosted.com/en/1.6.3/tutorials/writing-sparse.html#writing-in-global-layout).

Idea:

* Write in global order (sorted by `obs_id` then `var_id`)
* Given the above example, we'd write
  * Fragment 1 gets these COOs:
    * `A,S,4`
    * `A,V,3`
    * `B,S,5`
    * `B,U,6`
  * Fragment 2 gets these COOs:
    * `C,T,1`
    * `C,V,2`
    * `D,S,8`
    * `D,T,7`
* Easy to do in Python at the row-chunk level
* Then:
  * Fragment writes will be faster.
  * Fragments will be auto-concatenated so they won't need consolidation at all.
* Feature exists and is well-supported in C++.
* Not yet present in the Python API.
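Putting one row-chunk's triples into global order is indeed easy to do in Python; here is a minimal sketch that only does the sorting, leaving out the write itself since the global-order write path was not yet exposed in the Python API at the time of writing:

```python
# Sketch: put one row-chunk's COO triples into global order (obs_id, then var_id).
def global_order(triples):
    """triples: iterable of (obs_id, var_id, value)."""
    return sorted(triples, key=lambda t: (t[0], t[1]))

chunk2 = [("C", "V", 2), ("C", "T", 1), ("D", "S", 8), ("D", "T", 7)]
print(global_order(chunk2))
# [('C', 'T', 1), ('C', 'V', 2), ('D', 'S', 8), ('D', 'T', 7)]
```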
# Suggested approach

* Use row-based chunking (checkerboarding is not implemented as of 2022-05-03).
* Given that queries on `obs_id,var_id` or on `obs_id` will be efficient, but that queries on `var_id` will require consulting multiple fragments, ingest larger arrays as row-chunked CSR but consolidate them afterward -- see the sketch below this list.
* As of TileDB core 2.8.2, we cannot consolidate arrays with col-major tile order, so we write `X` with row-major tile order.
* Read-performance impact should be measured explicitly.
* Global-order writes need to be looked into.
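A minimal sketch of the post-ingest consolidation step, using TileDB-Py's consolidate/vacuum calls; the array URI is a placeholder and the config key follows TileDB's documented consolidation settings:

```python
# Sketch: post-ingest consolidation of a row-chunked X array, so var_id queries
# touch fewer fragments. The URI is a placeholder, not a path defined by this repo.
import tiledb

uri = "path/to/soma/X/data"  # placeholder

cfg = tiledb.Config({"sm.consolidation.mode": "fragments"})
tiledb.consolidate(uri, config=cfg)
tiledb.vacuum(uri)  # remove the now-redundant pre-consolidation fragments
```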