💄 Polish scrna-tiledbsoma (#150)
falexwolf authored Sep 4, 2024
1 parent 8613ce8 commit 732d54b
Showing 3 changed files with 75 additions and 104 deletions.
173 changes: 72 additions & 101 deletions docs/scrna6.ipynb → docs/scrna-tiledbsoma.ipynb
@@ -20,26 +20,9 @@
"source": [
"In the previous notebooks, we've seen how to incrementally create a collection of scRNA-seq datasets and train models on it.\n",
"\n",
"Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata (see this [blog post](https://lamin.ai/blog/arrayloader-benchmarks)). This is what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated to give rise to a single `tiledbsoma` array store ({doc}`docs:cellxgene`).\n",
"Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata.\n",
"\n",
":::{note}\n",
"\n",
"This notebook shows how `lamindb` can be used with `tiledbsoma` append mode, also expained in [the tiledbsoma documentation](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html).\n",
"\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import lamindb as ln\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import tiledbsoma.io\n",
"from functools import reduce"
"This is also what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated to give rise to a single `tiledbsoma` array store ({doc}`docs:cellxgene`)."
]
},
{
@@ -52,6 +35,12 @@
},
"outputs": [],
"source": [
"import lamindb as ln\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import tiledbsoma.io\n",
"from functools import reduce\n",
"\n",
"ln.context.uid = \"oJN8WmVrxI8m0000\"\n",
"ln.context.track()"
]
@@ -60,7 +49,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Query the collection of `h5ad` files that we'd like to convert into a single array."
"Query the collection of `h5ad` files that we'd like to concatenate into a single array."
]
},
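{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query cell itself is collapsed in this diff; a minimal sketch (the collection name \"scrna-collection\" is a hypothetical placeholder, not from this repository):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: \"scrna-collection\" is a hypothetical placeholder name\n",
"collection = ln.Collection.filter(name=\"scrna-collection\").one()\n",
"collection.describe()"
]
},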
{
@@ -90,7 +79,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to prepare the`AnnData` objects in the collection to be concatenated into one `tiledbsoma.Experiment`. They need to have the same `.var` and `.obs` columns, `.uns` and `.obsp` should be removed."
"To concatenate the `AnnData` objects into a single `tiledbsoma.Experiment`, they need to have the same `.var` and `.obs` columns."
]
},
{
@@ -99,32 +88,20 @@
"metadata": {},
"outputs": [],
"source": [
"adatas = [artifact.load() for artifact in collection.ordered_artifacts]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the intersetion of all columns. All `AnnData` objects should have the same columns in their `.obs`, `.var`, `.raw.var` to be ingested into one `tiledbsoma.Experiment`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])\n",
"var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])\n",
"var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])"
"# load a number of AnnData objects that's small enough to fit into memory\n",
"adatas = [artifact.load() for artifact in collection.ordered_artifacts]\n",
"\n",
"# compute the intersection of columns for these objects\n",
"var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas]) # this only affects metadata columns of features (say, gene annotations)\n",
"var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])\n",
"obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas]) # this actually subsets features (dataset dimensions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare the `AnnData` objects for concatenation. Prepare id fields, sanitize `index` names, intersect columns, drop slots. Here we have to drop `.obsp`, `.uns` and also columns from the dataframes that are not in the intersections obtained above, otherwise the ingestion will fail. We will need to provide `obs` and `var` names in `ln.integrations.save_tiledbsoma_experiment`, so we create these fileds (`obs_id`, `var_id`) from the dataframe indices."
"Prepare the `AnnData` objects for concatenation. Prepare id fields, sanitize `index` names, intersect columns, drop `.obsp`, `.uns` and columns that aren't part of the intersection."
]
},
{
@@ -134,15 +111,15 @@
"outputs": [],
"source": [
"for i, adata in enumerate(adatas):\n",
" del adata.obsp\n",
" del adata.uns\n",
" del adata.obsp # not supported by tiledbsoma\n",
" del adata.uns # not supported by tiledbsoma\n",
" \n",
" adata.obs = adata.obs.filter(obs_columns)\n",
" adata.obs[\"obs_id\"] = adata.obs.index\n",
" adata.obs = adata.obs.filter(obs_columns) # filter columns to intersection\n",
" adata.obs[\"obs_id\"] = adata.obs.index # prepare a column for tiledbsoma to use as an index\n",
" adata.obs[\"dataset\"] = i\n",
" adata.obs.index.name = None\n",
" \n",
" adata.var = adata.var.filter(var_columns)\n",
" adata.var = adata.var.filter(var_columns) # filter columns to intersection\n",
" adata.var[\"var_id\"] = adata.var.index\n",
" adata.var.index.name = None\n",
" \n",
@@ -163,9 +140,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Ingest the `AnnData` objects. This saves the `AnnData` objects in one array store, creates `Artifact` and saves it. This function also writes current `run.uid` to `tiledbsoma.Experiment` `obs`, under `lamin_run_uid`. \n",
"\n",
"If you know `tiledbsoma` API, then note, that `ln.integrations.save_tiledbsoma_experiment` includes both `tiledbsoma.io.register_anndatas` and `tiledbsoma.io.from_anndata`."
"Save the `AnnData` objects in one array store referenced by an `Artifact`."
]
},
{
@@ -188,6 +163,19 @@
")"
]
},
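{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full call is collapsed above; a sketch, assuming the id fields prepared earlier and a hypothetical `description`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- `description` is a hypothetical choice; measurement and id-field\n",
"# names follow the obs_id/var_id columns prepared above\n",
"soma_artifact = ln.integrations.save_tiledbsoma_experiment(\n",
"    adatas,\n",
"    description=\"tiledbsoma experiment\",\n",
"    measurement_name=\"RNA\",\n",
"    obs_id_name=\"obs_id\",\n",
"    var_id_name=\"var_id\",\n",
")"
]
},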
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Provenance is tracked by writing the current `run.uid` to `tiledbsoma.Experiment.obs` as `lamin_run_uid`.\n",
"\n",
"If you know `tiledbsoma` API, then note that {func}`~docs:lamindb.integrations.save_tiledbsoma_experiment` abstracts over both `tiledbsoma.io.register_anndatas` and `tiledbsoma.io.from_anndata`.\n",
"\n",
":::"
]
},
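{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, the run uids can be read back from `obs` (a sketch, assuming the store layout used throughout this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- read only the provenance column from obs\n",
"with soma_artifact.open() as soma_store:\n",
"    run_uids = soma_store[\"obs\"].read(column_names=[\"lamin_run_uid\"]).concat().to_pandas()"
]
},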
{
"cell_type": "markdown",
"metadata": {},
@@ -199,7 +187,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Open and query the experiment. We can use the registered `Artifact`. Here we query `obs` from the array store."
"Here we query the `obs` from the array store."
]
},
{
@@ -221,14 +209,14 @@
" \n",
" obs_store_df = obs.read().concat().to_pandas()\n",
" \n",
" print(obs_store_df)"
" display(obs_store_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Append `AnnData` to the array store"
"## Append to the array store"
]
},
{
@@ -244,15 +232,8 @@
"metadata": {},
"outputs": [],
"source": [
"adata = ln.core.datasets.anndata_with_obs()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adata = ln.core.datasets.anndata_with_obs()\n",
"\n",
"adata.obs_names_make_unique()\n",
"adata.var_names_make_unique()\n",
"\n",
@@ -265,23 +246,16 @@
"adata.obs = adata.obs[obs_columns_same]\n",
"\n",
"var_columns_same = [var_col for var_col in adata.var.columns if var_col in var_columns_store]\n",
"adata.var = adata.var[var_columns_same]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adata.var = adata.var[var_columns_same]\n",
"\n",
"adata.write_h5ad(\"adata_to_append.h5ad\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Append the `AnnData` object from disk. This also creates a new version of `soma_artifact`."
"Append the `AnnData` object from disk by revising `soma_artifact`."
]
},
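{
"cell_type": "markdown",
"metadata": {},
"source": [
"The append call is collapsed in this diff; a minimal sketch, assuming the same measurement and id-field names as above and the standard `revises` argument for versioning:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- append the on-disk AnnData and revise the artifact\n",
"soma_artifact = ln.integrations.save_tiledbsoma_experiment(\n",
"    [\"adata_to_append.h5ad\"],\n",
"    revises=soma_artifact,\n",
"    measurement_name=\"RNA\",\n",
"    obs_id_name=\"obs_id\",\n",
"    var_id_name=\"var_id\",\n",
")"
]
},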
{
@@ -314,7 +288,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Read `X` from the store."
"Add a new embedding to the existing array store."
]
},
{
@@ -323,43 +297,22 @@
"metadata": {},
"outputs": [],
"source": [
"with soma_artifact.open() as soma_store: # mode=\"r\" by default\n",
"# read the data matrix\n",
"with soma_artifact.open() as soma_store:\n",
" ms_rna = soma_store[\"ms\"][\"RNA\"]\n",
" n_obs = len(soma_store[\"obs\"])\n",
" n_var = len(ms_rna[\"var\"])\n",
" X = ms_rna[\"X\"][\"data\"].read().coos((n_obs, n_var)).concat().to_scipy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate PCA from the queried `X`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" X = ms_rna[\"X\"][\"data\"].read().coos((n_obs, n_var)).concat().to_scipy()\n",
"\n",
"# calculate PCA embedding from the queried `X`\n",
"pca_array = sc.pp.pca(X, n_comps=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"soma_artifact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the array store in write mode and add PCA. When the store is updated, the corresponding artifact also gets updated with a new version. "
"Open the array store in write mode and add PCA."
]
},
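{
"cell_type": "markdown",
"metadata": {},
"source": [
"The write cell is collapsed in this diff; a sketch using `tiledbsoma.io.add_matrix_to_collection`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- write the PCA embedding into the store's obsm collection\n",
"with soma_artifact.open(mode=\"w\") as soma_store:\n",
"    tiledbsoma.io.add_matrix_to_collection(\n",
"        exp=soma_store,\n",
"        measurement_name=\"RNA\",\n",
"        collection_name=\"obsm\",\n",
"        matrix_name=\"pca\",\n",
"        matrix_data=pca_array,\n",
"    )"
]
},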
{
@@ -386,7 +339,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the artifact has been changed."
"## See array store mutations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"During the append-to and update operations, the data in the array store was changed. LaminDB automatically tracks these revisions recording the number of objects, hashes, and provenance."
]
},
{
@@ -395,7 +355,18 @@
"metadata": {},
"outputs": [],
"source": [
"soma_artifact"
"soma_artifact.versions.df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"For the underlying API, see [the tiledbsoma documentation](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html).\n",
"\n",
":::"
]
}
],
4 changes: 2 additions & 2 deletions docs/scrna.ipynb
@@ -33,7 +33,7 @@
"3. query & inspect artifacts by metadata individually ([![scrna3/6](https://img.shields.io/badge/scrna3/6-lightgrey)](/scrna3))\n",
"4. load the joint collection and save analytical results ([![scrna4/6](https://img.shields.io/badge/scrna4/6-lightgrey)](/scrna4))\n",
"5. iterate over the collection and train a model ([![scrna5/6](https://img.shields.io/badge/scrna5/6-lightgrey)](/scrna5))\n",
"6. discuss converting a collection to a single TileDB SOMA store of the same data ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna6))\n",
"6. concatenate the collection to a single `tiledbsoma` array store ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna-tiledbsoma))\n",
"\n",
"```{toctree}\n",
":maxdepth: 1\n",
Expand All @@ -43,7 +43,7 @@
"scrna3\n",
"scrna4\n",
"scrna5\n",
"scrna6\n",
"scrna-tiledbsoma\n",
"```"
]
},
2 changes: 1 addition & 1 deletion noxfile.py
@@ -22,7 +22,7 @@
"scrna3.ipynb",
"scrna4.ipynb",
"scrna5.ipynb",
"scrna6.ipynb",
"scrna-tiledbsoma.ipynb",
"bulkrna.ipynb",
"facs.ipynb",
"facs2.ipynb",