📝 Further improve scrna (#53)

laminlabs · Aug 28, 2023 · 92b6742
1 parent 6cea86c
commit 92b6742
Showing 1 changed file with 40 additions and 62 deletions.
diff --git a/docs/scrna.ipynb b/docs/scrna.ipynb
@@ -19,9 +19,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Single-cell RNA-seq (scRNA-seq) measures gene expression of individual cells and generates datasets that are often used to define cell states that associated with functional phenotypes. Data formats, such as [AnnData](https://anndata.readthedocs.io/en/latest/) and [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) objects help storing metadata and data as an entity. However, non-validated metadata are often stored which made it hard to integrate with other datasets.\n",
+    "scRNA-seq measures gene expression of individual cells. It generates datasets used to define cell states associated with phenotypes.\n",
     "\n",
-    "In this notebook, we show how Lamin can help with manage scRNA-seq data.\n",
+    "Their analysis is typically based on data objects like [AnnData](https://anndata.readthedocs.io/en/latest/), [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) & [Seurat objects](https://github.com/satijalab/seurat).\n",
+    "\n",
+    "These objects, however, often contain non-validated metadata, making data integration hard.\n",
+    "\n",
+    "In this notebook, LaminDB is used to turn `AnnData` objects into validated & queryable assets.\n",
     "\n",
     "```{toctree}\n",
     ":maxdepth: 1\n",
@@ -74,7 +78,9 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "tags": []
+    "tags": [
+     "hide-output"
+    ]
    },
    "outputs": [],
    "source": [
@@ -87,14 +93,14 @@
    "source": [
     "### Transform ![](https://img.shields.io/badge/Transform-10b981)\n",
     "\n",
-    "(Here we skip steps of data transformations, which often includes filtering, normalizing, or formatting data.)"
+    "(Here we skip typical transformation steps that involve filtering, normalizing, and formatting.)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let’s look at a scRNA-seq count matrix in form of an AnnData object:"
+    "Let’s look at an scRNA-seq count matrix in the form of an `AnnData` object:"
    ]
   },
   {
@@ -108,7 +114,7 @@
    "outputs": [],
    "source": [
     "adata = ln.dev.datasets.anndata_human_immune_cells(\n",
-    "    populate_registries=True  # pre-populate registries to simulate an used instance\n",
+    "    populate_registries=True  # this pre-populates registries\n",
     ")"
    ]
   },
@@ -148,7 +154,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We’re seeing that 148 gene identifiers can’t be validated (not currently in the Gene registry). We’d like to validate all features in this dataset, hence, let’s inspect them to see what to do:"
+    "148 gene identifiers can’t be validated (not currently in the `Gene` registry). Let’s inspect them to see what to do:"
    ]
   },
   {
@@ -161,14 +167,14 @@
    },
    "outputs": [],
    "source": [
-    "inspect_result = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)"
+    "inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Inspect logging says 35 of the non-validated ensembl_gene_ids can be found in Bionty reference. Let's register them:"
+    "The log says that 35 of the non-validated IDs can be found in the Bionty reference. Let's register them:"
    ]
   },
   {
@@ -181,19 +187,17 @@
    },
    "outputs": [],
    "source": [
-    "records_bionty = lb.Gene.from_values(\n",
-    "    inspect_result.non_validated, lb.Gene.ensembl_gene_id\n",
-    ")\n",
-    "ln.save(records_bionty)"
+    "records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)\n",
+    "ln.save(records)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The rest 113 aren't present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)). \n",
+    "The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)).\n",
     "\n",
-    "We'd still like to register them, so let's create Gene records with those ensembl_gene_ids:"
+    "We'd still like to register them:"
    ]
   },
   {
@@ -206,13 +210,9 @@
    },
    "outputs": [],
    "source": [
-    "validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id, mute=True)\n",
-    "nonval_ensembl_ids = adata.var.index[~validated]\n",
-    "new_records = [\n",
-    "    lb.Gene(ensembl_gene_id=ens_id, species=lb.settings.species)\n",
-    "    for ens_id in nonval_ensembl_ids\n",
-    "]\n",
-    "ln.save(new_records)"
+    "validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)\n",
+    "records = [lb.Gene(ensembl_gene_id=ens_id) for ens_id in adata.var.index[~validated]]\n",
+    "ln.save(records)"
    ]
   },
   {
@@ -247,13 +247,6 @@
     "adata.obs.columns"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "1 feature is not validated: donor"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -267,7 +260,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's register it:"
+    "1 feature is not validated: `\"donor\"`. Let's register it:"
    ]
   },
   {
@@ -276,16 +269,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "features = ln.Feature.from_df(adata.obs)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "ln.save(features)"
+    "feature = ln.Feature.from_df(adata.obs.loc[:, ~validated])[0]\n",
+    "ln.save(feature)"
    ]
   },
   {
@@ -308,9 +293,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next, let's validate the corresponding labels of each feature:\n",
+    "Next, let's validate the corresponding labels of each feature.\n",
     "\n",
-    "Some of the metadata labels can be typed using dedicated registries: (e.g. bionty offers ontology-based registries for biological entities)"
+    "Some of the metadata labels can be typed using dedicated registries like {class}`~docs:lnschema_bionty.CellType`:"
    ]
   },
   {
@@ -326,7 +311,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Register non-validated cell types from Bionty:"
+    "Register the non-validated cell types, all of which can be loaded from a public ontology through Bionty:"
    ]
   },
   {
@@ -359,7 +344,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Metadata that can’t be typed with dedicated registries (in this example, we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a Donor registry), we can use the {class}`~lamindb.Label` registry to track donor ids."
+    "Because we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a `Donor` registry, we use the {class}`~lamindb.Label` registry to track donor IDs:"
    ]
   },
   {
@@ -403,7 +388,7 @@
    "source": [
     "#### Validate external metadata\n",
     "\n",
-    "In addition to what’s already in the file, we’d like to link this file with external features including \"species\" and \"assay\":"
+    "In addition to what’s already in the file, we’d like to link this file to external features including \"species\" and \"assay\":"
    ]
   },
   {
@@ -420,14 +405,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Validate corresponding labels of these features:"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Sometimes we don't remember what the term is called exactly, search can help:"
+    "Let's search for the scRNA-seq assay label:"
    ]
   },
   {
@@ -466,7 +444,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:"
+    "When we create a `File` object from an `AnnData`, we’ll automatically link its feature sets and get information about unmapped categories:"
    ]
   },
   {
@@ -562,19 +540,20 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "file.features"
+    "Note that adding labels to an external feature will create an external feature set."
    ]
   },
   {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
-    "Note that adding labels to an external feature will create an external feature set."
+    "file.add_labels(lb.settings.species, feature=\"species\")\n",
+    "file.add_labels(scrna, feature=\"assay\")"
    ]
   },
   {
@@ -583,8 +562,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "file.add_labels(lb.settings.species, feature=\"species\")\n",
-    "file.add_labels(scrna, feature=\"assay\")"
+    "file.features"
    ]
   },
   {