💄 Polish scrna-tiledbsoma (#150)
falexwolf authored Sep 4, 2024
1 parent 8613ce8 commit 732d54b
Showing 3 changed files with 75 additions and 104 deletions.
173 changes: 72 additions & 101 deletions docs/scrna6.ipynb → docs/scrna-tiledbsoma.ipynb
@@ -20,26 +20,9 @@
"source": [
"In the previous notebooks, we've seen how to incrementally create a collection of scRNA-seq datasets and train models on it.\n",
"\n",
"Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata (see this [blog post](https://lamin.ai/blog/arrayloader-benchmarks)). This is what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated to give rise to a single `tiledbsoma` array store ({doc}`docs:cellxgene`).\n",
"Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata.\n",
"\n",
":::{note}\n",
"\n",
"This notebook shows how `lamindb` can be used with `tiledbsoma` append mode, also expained in [the tiledbsoma documentation](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html).\n",
"\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import lamindb as ln\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import tiledbsoma.io\n",
"from functools import reduce"
"This is also what CELLxGENE does to create Census: a number of `.h5ad` files are concatenated to give rise to a single `tiledbsoma` array store ({doc}`docs:cellxgene`)."
]
},
{
@@ -52,6 +35,12 @@
},
"outputs": [],
"source": [
"import lamindb as ln\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import tiledbsoma.io\n",
"from functools import reduce\n",
"\n",
"ln.context.uid = \"oJN8WmVrxI8m0000\"\n",
"ln.context.track()"
]
@@ -60,7 +49,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Query the collection of `h5ad` files that we'd like to convert into a single array."
"Query the collection of `h5ad` files that we'd like to concatenate into a single array."
]
},
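{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query cell itself is collapsed in this diff; a minimal sketch (the collection name \"scrna-collection\" is a hypothetical placeholder, not from this repository):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: \"scrna-collection\" is a hypothetical placeholder name\n",
"collection = ln.Collection.filter(name=\"scrna-collection\").one()\n",
"collection.describe()"
]
},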
{
@@ -90,7 +79,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to prepare the`AnnData` objects in the collection to be concatenated into one `tiledbsoma.Experiment`. They need to have the same `.var` and `.obs` columns, `.uns` and `.obsp` should be removed."
"To concatenate the `AnnData` objects into a single `tiledbsoma.Experiment`, they need to have the same `.var` and `.obs` columns."
]
},
{
@@ -99,32 +88,20 @@
"metadata": {},
"outputs": [],
"source": [
"adatas = [artifact.load() for artifact in collection.ordered_artifacts]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the intersetion of all columns. All `AnnData` objects should have the same columns in their `.obs`, `.var`, `.raw.var` to be ingested into one `tiledbsoma.Experiment`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])\n",
"var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])\n",
"var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])"
"# load a number of AnnData objects that's small enough to fit into memory\n",
"adatas = [artifact.load() for artifact in collection.ordered_artifacts]\n",
"\n",
"# compute the intersection of columns for these objects\n",
"var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas]) # this only affects metadata columns of features (say, gene annotations)\n",
"var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])\n",
"obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas]) # this actually subsets features (dataset dimensions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare the `AnnData` objects for concatenation. Prepare id fields, sanitize `index` names, intersect columns, drop slots. Here we have to drop `.obsp`, `.uns` and also columns from the dataframes that are not in the intersections obtained above, otherwise the ingestion will fail. We will need to provide `obs` and `var` names in `ln.integrations.save_tiledbsoma_experiment`, so we create these fileds (`obs_id`, `var_id`) from the dataframe indices."
"Prepare the `AnnData` objects for concatenation. Prepare id fields, sanitize `index` names, intersect columns, drop `.obsp`, `.uns` and columns that aren't part of the intersection."
]
},
{
@@ -134,15 +111,15 @@
"outputs": [],
"source": [
"for i, adata in enumerate(adatas):\n",
" del adata.obsp\n",
" del adata.uns\n",
" del adata.obsp # not supported by tiledbsoma\n",
" del adata.uns # not supported by tiledbsoma\n",
" \n",
" adata.obs = adata.obs.filter(obs_columns)\n",
" adata.obs[\"obs_id\"] = adata.obs.index\n",
" adata.obs = adata.obs.filter(obs_columns) # filter columns to intersection\n",
" adata.obs[\"obs_id\"] = adata.obs.index # prepare a column for tiledbsoma to use as an index\n",
" adata.obs[\"dataset\"] = i\n",
" adata.obs.index.name = None\n",
" \n",
" adata.var = adata.var.filter(var_columns)\n",
" adata.var = adata.var.filter(var_columns) # filter columns to intersection\n",
" adata.var[\"var_id\"] = adata.var.index\n",
" adata.var.index.name = None\n",
" \n",
@@ -163,9 +140,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Ingest the `AnnData` objects. This saves the `AnnData` objects in one array store, creates `Artifact` and saves it. This function also writes current `run.uid` to `tiledbsoma.Experiment` `obs`, under `lamin_run_uid`. \n",
"\n",
"If you know `tiledbsoma` API, then note, that `ln.integrations.save_tiledbsoma_experiment` includes both `tiledbsoma.io.register_anndatas` and `tiledbsoma.io.from_anndata`."
"Save the `AnnData` objects in one array store referenced by an `Artifact`."
]
},
{
@@ -188,6 +163,19 @@
")"
]
},
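{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full call is collapsed above; a sketch, assuming the id fields prepared earlier and a hypothetical `description`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- `description` is a hypothetical choice; measurement and id-field\n",
"# names follow the obs_id/var_id columns prepared above\n",
"soma_artifact = ln.integrations.save_tiledbsoma_experiment(\n",
"    adatas,\n",
"    description=\"tiledbsoma experiment\",\n",
"    measurement_name=\"RNA\",\n",
"    obs_id_name=\"obs_id\",\n",
"    var_id_name=\"var_id\",\n",
")"
]
},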
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Provenance is tracked by writing the current `run.uid` to `tiledbsoma.Experiment.obs` as `lamin_run_uid`.\n",
"\n",
"If you know `tiledbsoma` API, then note that {func}`~docs:lamindb.integrations.save_tiledbsoma_experiment` abstracts over both `tiledbsoma.io.register_anndatas` and `tiledbsoma.io.from_anndata`.\n",
"\n",
":::"
]
},
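{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, the run uids can be read back from `obs` (a sketch, assuming the store layout used throughout this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- read only the provenance column from obs\n",
"with soma_artifact.open() as soma_store:\n",
"    run_uids = soma_store[\"obs\"].read(column_names=[\"lamin_run_uid\"]).concat().to_pandas()"
]
},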
{
"cell_type": "markdown",
"metadata": {},
@@ -199,7 +187,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Open and query the experiment. We can use the registered `Artifact`. Here we query `obs` from the array store."
"Here we query the `obs` from the array store."
]
},
{
@@ -221,14 +209,14 @@
" \n",
" obs_store_df = obs.read().concat().to_pandas()\n",
" \n",
" print(obs_store_df)"
" display(obs_store_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Append `AnnData` to the array store"
"## Append to the array store"
]
},
{
@@ -244,15 +232,8 @@
"metadata": {},
"outputs": [],
"source": [
"adata = ln.core.datasets.anndata_with_obs()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adata = ln.core.datasets.anndata_with_obs()\n",
"\n",
"adata.obs_names_make_unique()\n",
"adata.var_names_make_unique()\n",
"\n",
@@ -265,23 +246,16 @@
"adata.obs = adata.obs[obs_columns_same]\n",
"\n",
"var_columns_same = [var_col for var_col in adata.var.columns if var_col in var_columns_store]\n",
"adata.var = adata.var[var_columns_same]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adata.var = adata.var[var_columns_same]\n",
"\n",
"adata.write_h5ad(\"adata_to_append.h5ad\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Append the `AnnData` object from disk. This also creates a new version of `soma_artifact`."
"Append the `AnnData` object from disk by revising `soma_artifact`."
]
},
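{
"cell_type": "markdown",
"metadata": {},
"source": [
"The append call is collapsed in this diff; a minimal sketch, assuming the same measurement and id-field names as above and the standard `revises` argument for versioning:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- append the on-disk AnnData and revise the artifact\n",
"soma_artifact = ln.integrations.save_tiledbsoma_experiment(\n",
"    [\"adata_to_append.h5ad\"],\n",
"    revises=soma_artifact,\n",
"    measurement_name=\"RNA\",\n",
"    obs_id_name=\"obs_id\",\n",
"    var_id_name=\"var_id\",\n",
")"
]
},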
{
@@ -314,7 +288,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Read `X` from the store."
"Add a new embedding to the existing array store."
]
},
{
@@ -323,43 +297,22 @@
"metadata": {},
"outputs": [],
"source": [
"with soma_artifact.open() as soma_store: # mode=\"r\" by default\n",
"# read the data matrix\n",
"with soma_artifact.open() as soma_store:\n",
" ms_rna = soma_store[\"ms\"][\"RNA\"]\n",
" n_obs = len(soma_store[\"obs\"])\n",
" n_var = len(ms_rna[\"var\"])\n",
" X = ms_rna[\"X\"][\"data\"].read().coos((n_obs, n_var)).concat().to_scipy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate PCA from the queried `X`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" X = ms_rna[\"X\"][\"data\"].read().coos((n_obs, n_var)).concat().to_scipy()\n",
"\n",
"# calculate PCA embedding from the queried `X`\n",
"pca_array = sc.pp.pca(X, n_comps=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"soma_artifact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the array store in write mode and add PCA. When the store is updated, the corresponding artifact also gets updated with a new version. "
"Open the array store in write mode and add PCA."
]
},
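{
"cell_type": "markdown",
"metadata": {},
"source": [
"The write cell is collapsed in this diff; a sketch using `tiledbsoma.io.add_matrix_to_collection`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch -- write the PCA embedding into the store's obsm collection\n",
"with soma_artifact.open(mode=\"w\") as soma_store:\n",
"    tiledbsoma.io.add_matrix_to_collection(\n",
"        exp=soma_store,\n",
"        measurement_name=\"RNA\",\n",
"        collection_name=\"obsm\",\n",
"        matrix_name=\"pca\",\n",
"        matrix_data=pca_array,\n",
"    )"
]
},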
{
@@ -386,7 +339,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the artifact has been changed."
"## See array store mutations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"During the append-to and update operations, the data in the array store was changed. LaminDB automatically tracks these revisions recording the number of objects, hashes, and provenance."
]
},
{
@@ -395,7 +355,18 @@
"metadata": {},
"outputs": [],
"source": [
"soma_artifact"
"soma_artifact.versions.df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"For the underlying API, see [the tiledbsoma documentation](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html).\n",
"\n",
":::"
]
}
],
4 changes: 2 additions & 2 deletions docs/scrna.ipynb
@@ -33,7 +33,7 @@
"3. query & inspect artifacts by metadata individually ([![scrna3/6](https://img.shields.io/badge/scrna3/6-lightgrey)](/scrna3))\n",
"4. load the joint collection and save analytical results ([![scrna4/6](https://img.shields.io/badge/scrna4/6-lightgrey)](/scrna4))\n",
"5. iterate over the collection and train a model ([![scrna5/6](https://img.shields.io/badge/scrna5/6-lightgrey)](/scrna5))\n",
"6. discuss converting a collection to a single TileDB SOMA store of the same data ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna6))\n",
"6. concatenate the collection to a single `tiledbsoma` array store ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna-tiledbsoma))\n",
"\n",
"```{toctree}\n",
":maxdepth: 1\n",
Expand All @@ -43,7 +43,7 @@
"scrna3\n",
"scrna4\n",
"scrna5\n",
"scrna6\n",
"scrna-tiledbsoma\n",
"```"
]
},
2 changes: 1 addition & 1 deletion noxfile.py
@@ -22,7 +22,7 @@
"scrna3.ipynb",
"scrna4.ipynb",
"scrna5.ipynb",
"scrna6.ipynb",
"scrna-tiledbsoma.ipynb",
"bulkrna.ipynb",
"facs.ipynb",
"facs2.ipynb",