📝 Further improve scrna (#53)
sunnyosun authored Aug 28, 2023
1 parent 6cea86c commit 92b6742
Showing 1 changed file with 40 additions and 62 deletions.
102 changes: 40 additions & 62 deletions docs/scrna.ipynb
@@ -19,9 +19,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Single-cell RNA-seq (scRNA-seq) measures gene expression of individual cells and generates datasets that are often used to define cell states that associated with functional phenotypes. Data formats, such as [AnnData](https://anndata.readthedocs.io/en/latest/) and [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) objects help storing metadata and data as an entity. However, non-validated metadata are often stored which made it hard to integrate with other datasets.\n",
"scRNA-seq measures gene expression of individual cells. It generates datasets used to define cell states associated with phenotypes.\n",
"\n",
"In this notebook, we show how Lamin can help with manage scRNA-seq data.\n",
"Their analysis is typically based on data objects like [AnnData](https://anndata.readthedocs.io/en/latest/), [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) & [Seurat objects](https://github.com/satijalab/seurat).\n",
"\n",
"These objects, however, often contain non-validated metadata, making data integration hard.\n",
"\n",
"In this notebook, LaminDB is used to turn `AnnData` objects into validated & queryable assets.\n",
"\n",
"```{toctree}\n",
":maxdepth: 1\n",
@@ -74,7 +78,9 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
@@ -87,14 +93,14 @@
"source": [
"### Transform ![](https://img.shields.io/badge/Transform-10b981)\n",
"\n",
"(Here we skip steps of data transformations, which often includes filtering, normalizing, or formatting data.)"
"(Here we skip typical transformation steps that involve filtering, normalizing, and formatting.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s look at a scRNA-seq count matrix in form of an AnnData object:"
"Let’s look at an scRNA-seq count matrix in the form of an AnnData object:"
]
},
{
@@ -108,7 +114,7 @@
"outputs": [],
"source": [
"adata = ln.dev.datasets.anndata_human_immune_cells(\n",
" populate_registries=True # pre-populate registries to simulate an used instance\n",
" populate_registries=True # this pre-populates registries\n",
")"
]
},
@@ -148,7 +154,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We’re seeing that 148 gene identifiers can’t be validated (not currently in the Gene registry). We’d like to validate all features in this dataset, hence, let’s inspect them to see what to do:"
"148 gene identifiers can’t be validated (not currently in the `Gene` registry). Let’s inspect them to see what to do:"
]
},
{
@@ -161,14 +167,14 @@
},
"outputs": [],
"source": [
"inspect_result = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)"
"inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspect logging says 35 of the non-validated ensembl_gene_ids can be found in Bionty reference. Let's register them:"
"Logging says 35 of the non-validated ids can be found in the Bionty reference. Let's register them:"
]
},
{
@@ -181,19 +187,17 @@
},
"outputs": [],
"source": [
"records_bionty = lb.Gene.from_values(\n",
" inspect_result.non_validated, lb.Gene.ensembl_gene_id\n",
")\n",
"ln.save(records_bionty)"
"records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)\n",
"ln.save(records)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rest 113 aren't present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)). \n",
"The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)).\n",
"\n",
"We'd still like to register them, so let's create Gene records with those ensembl_gene_ids:"
"We'd still like to register them:"
]
},
{
@@ -206,13 +210,9 @@
},
"outputs": [],
"source": [
"validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id, mute=True)\n",
"nonval_ensembl_ids = adata.var.index[~validated]\n",
"new_records = [\n",
" lb.Gene(ensembl_gene_id=ens_id, species=lb.settings.species)\n",
" for ens_id in nonval_ensembl_ids\n",
"]\n",
"ln.save(new_records)"
"validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)\n",
"records = [lb.Gene(ensembl_gene_id=id) for id in adata.var.index[~validated]]\n",
"ln.save(records)"
]
},
{
@@ -247,13 +247,6 @@
"adata.obs.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1 feature is not validated: donor"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -267,7 +260,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's register it:"
"1 feature is not validated: `\"donor\"`. Let's register it:"
]
},
{
@@ -276,16 +269,8 @@
"metadata": {},
"outputs": [],
"source": [
"features = ln.Feature.from_df(adata.obs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ln.save(features)"
"feature = ln.Feature.from_df(adata.obs.loc[:, ~validated])[0]\n",
"ln.save(feature)"
]
},
{
@@ -308,9 +293,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's validate the corresponding labels of each feature:\n",
"Next, let's validate the corresponding labels of each feature.\n",
"\n",
"Some of the metadata labels can be typed using dedicated registries: (e.g. bionty offers ontology-based registries for biological entities)"
"Some of the metadata labels can be typed using dedicated registries like {class}`~docs:lnschema_bionty.CellType`:"
]
},
{
@@ -326,7 +311,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Register non-validated cell types from Bionty:"
"Register non-validated cell types - they can all be loaded from a public ontology through Bionty:"
]
},
{
@@ -359,7 +344,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Metadata that can’t be typed with dedicated registries (in this example, we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a Donor registry), we can use the {class}`~lamindb.Label` registry to track donor ids."
"Because we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a `Donor` registry, we use the {class}`~lamindb.Label` registry to track donor ids:"
]
},
{
@@ -403,7 +388,7 @@
"source": [
"#### Validate external metadata\n",
"\n",
"In addition to what’s already in the file, we’d like to link this file with external features including \"species\" and \"assay\":"
"In addition to what’s already in the file, we’d like to link this file to external features including \"species\" and \"assay\":"
]
},
{
@@ -420,14 +405,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Validate corresponding labels of these features:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes we don't remember what the term is called exactly, search can help:"
"Let's search for the scRNA-seq assay label:"
]
},
{
@@ -466,7 +444,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:"
"When we create a `File` object from an `AnnData`, we’ll automatically link its feature sets and get information about unmapped categories:"
]
},
{
@@ -562,19 +540,20 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"file.features"
"Note that adding labels to an external feature will create an external feature set."
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Note that adding labels to an external feature will create an external feature set."
"file.add_labels(lb.settings.species, feature=\"species\")\n",
"file.add_labels(scrna, feature=\"assay\")"
]
},
{
@@ -583,8 +562,7 @@
"metadata": {},
"outputs": [],
"source": [
"file.add_labels(lb.settings.species, feature=\"species\")\n",
"file.add_labels(scrna, feature=\"assay\")"
"file.features"
]
},
{
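
The gene-validation flow that the notebook in this diff walks through (inspect identifiers, register those found in a public reference, then register the remaining legacy IDs) can be illustrated without LaminDB. The following is only a minimal sketch in plain Python: `partition_ids`, `InspectResult`, the `registry` and `reference` sets, and the `GENE_A`/`GENE_B` identifiers are all hypothetical stand-ins, not part of the `lamindb` or `lnschema_bionty` API.

```python
from typing import Iterable, List, NamedTuple, Set


class InspectResult(NamedTuple):
    """Hypothetical result type; not the object returned by lb.Gene.inspect."""

    validated: List[str]      # already present in the local registry
    in_reference: List[str]   # missing locally, but known to the public reference
    unknown: List[str]        # in neither, e.g. legacy identifiers


def partition_ids(
    ids: Iterable[str], registry: Set[str], reference: Set[str]
) -> InspectResult:
    """Split identifiers by where they can be found, preserving input order."""
    validated, in_reference, unknown = [], [], []
    for identifier in ids:
        if identifier in registry:
            validated.append(identifier)
        elif identifier in reference:
            in_reference.append(identifier)
        else:
            unknown.append(identifier)
    return InspectResult(validated, in_reference, unknown)


# Made-up registry and reference contents, for illustration only.
registry = {"GENE_A"}
reference = {"GENE_A", "GENE_B"}
ids = ["GENE_A", "GENE_B", "ENSG00000112096"]

result = partition_ids(ids, registry, reference)

# Step 1: register what the public reference knows about.
registry.update(result.in_reference)
# Step 2: register the remaining legacy identifiers as bare records.
registry.update(result.unknown)

# After both steps, every identifier validates against the registry.
recheck = partition_ids(ids, registry, reference)
```

Keeping the two registration steps separate mirrors the diff, where identifiers found in the reference are created with `from_values` and only the leftover legacy IDs are created directly from their `ensembl_gene_id`.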