Improve Dimensionality Reduction task (#401)
* clean up configs

* add pymde

* add R-based implementation of diffusionmap

* remove comment

* change label

* fix task description

* add more methods and metrics to workflow

* update authors

* rename folder

* fix filename

* add workaround script.py

* fix spectral features

* update task info

* add author info

* set lmds dims to 2
rcannood authored Mar 13, 2024
1 parent 0aaef65 commit 6272797
Showing 24 changed files with 245 additions and 70 deletions.
17 changes: 17 additions & 0 deletions src/common/library.bib
@@ -1695,3 +1695,20 @@ @article{huang2018savergene
month = jun,
pages = {539–542}
}


@article{chari2023speciousart,
title = {The specious art of single-cell genomics},
volume = {19},
issn = {1553-7358},
url = {http://dx.doi.org/10.1371/journal.pcbi.1011288},
doi = {10.1371/journal.pcbi.1011288},
number = {8},
journal = {PLOS Computational Biology},
publisher = {Public Library of Science (PLoS)},
author = {Chari, Tara and Pachter, Lior},
editor = {Papin, Jason A.},
year = {2023},
month = aug,
pages = {e1011288}
}
67 changes: 44 additions & 23 deletions src/tasks/dimensionality_reduction/api/task_info.yaml
@@ -1,27 +1,38 @@
name: dimensionality_reduction
label: "Dimensionality reduction for visualization"
label: "Dimensionality reduction for 2D visualization"
v1:
path: openproblems/tasks/dimensionality_reduction/README.md
commit: b353a462f6ea353e0fc43d0f9fcbbe621edc3a0b
summary: Reduction of high-dimensional datasets to 2D for visualization & interpretation
image: "thumbnail.svg"
motivation: |
Dimensionality reduction is one of the key challenges in single-cell data representation.
Routine single-cell RNA sequencing (scRNA-seq) experiments measure cells in roughly
20,000-30,000 dimensions (i.e., features - mostly gene transcripts but also other functional
elements encoded in mRNA such as lncRNAs). Since its inception, scRNA-seq has been growing
in terms of the number of cells measured per experiment. Originally, cutting-edge SmartSeq
experiments would yield a few hundred cells, at best. Now, it is not uncommon to see
experiments that yield over [100,000 cells](<https://www.nature.com/articles/s41586-018-0590-4>)
or even [> 1 million cells](https://doi.org/10.1126/science.aba7721).
Data visualisation is an important part of all stages of single-cell analysis, from
initial quality control to interpretation and presentation of final results. For bulk RNA-seq
studies, linear dimensionality reduction techniques such as PCA and MDS are commonly used
to visualise the variation between samples. While these methods are highly effective, they
can only show the first few components of variation, which cannot fully represent
the increased complexity and number of observations in single-cell datasets. For this reason,
non-linear techniques (most notably t-SNE and UMAP) have become the standard for visualising
single-cell studies. These methods compress a dataset into a two-dimensional space
while attempting to capture as much of the variance between observations as possible. Many
methods for solving this problem now exist. In general, these methods try to preserve distances,
while some additionally consider aspects such as density within the embedded space or conservation
of continuous trajectories. Despite almost every single-cell study using one of these visualisations,
there has been debate as to whether they can effectively capture the variation in single-cell
datasets [@chari2023speciousart].
description: |
Each *feature* in a dataset functions as a single dimension. While each of the ~30,000
dimensions measured in each cell contribute to an underlying data structure, the overall
structure of the data is challenging to display in few dimensions due to data sparsity
and the [*"curse of dimensionality"*](https://en.wikipedia.org/wiki/Curse_of_dimensionality)
(distances in high-dimensional data don't distinguish data points well). Thus, we need to find
a way to [dimensionally reduce](https://en.wikipedia.org/wiki/Dimensionality_reduction)
the data for visualization and interpretation.
The dimensionality reduction task attempts to quantify the ability of methods to embed the
information present in complex single-cell studies into a two-dimensional space. Thus, this task
is specifically designed for dimensionality reduction for visualisation and does not consider other
uses of dimensionality reduction in standard single-cell workflows such as improving the
signal-to-noise ratio (and in fact several of the methods use PCA as a pre-processing step for this
reason). Unlike most tasks, methods for the dimensionality reduction task must accept a matrix
containing expression values normalised to 10,000 counts per cell and log transformed (log-10k) and
produce a two-dimensional coordinate for each cell. Pre-normalised matrices are required to
enforce consistency between the metric evaluation (which generally requires normalised data) and
the method runs. When these are not consistent, methods that use the same normalisation as
the metric tend to score more highly. For some methods we also evaluate the pre-processing
recommended by the method.
authors:
- name: Luke Zappia
roles: [ maintainer, author ]
@@ -31,7 +42,7 @@ authors:
roles: [ author ]
info:
github: michalk8
- name: "Scott Gigante"
- name: Scott Gigante
roles: [ author ]
info:
github: scottgigante
@@ -40,13 +51,23 @@
roles: [ author ]
info:
github: bendemeo
- name: "Juan A. Cordero Varela"
- name: Robrecht Cannoodt
roles: [ author ]
info:
github: rcannood
orcid: 0000-0003-3641-729X
- name: Kai Waldrant
roles: [ contributor ]
info:
github: jacorvar
orcid: 0000-0002-7373-5433
- name: Robrecht Cannoodt
github: KaiWaldrant
orcid: 0009-0003-8555-1361
- name: Sai Nirmayi Yasa
roles: [ contributor ]
info:
github: rcannood
orcid: "0000-0003-3641-729X"
github: sainirmayi
orcid: 0009-0003-6319-9803
- name: Juan A. Cordero Varela
roles: [ contributor ]
info:
github: jacorvar
orcid: 0000-0002-7373-5433
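As a rough sketch of the component contract described in the task description above (log-10k input, one 2D coordinate per cell as output), the following Python snippet illustrates the idea. It assumes raw counts in `adata.X` and uses scanpy's PCA purely as a stand-in method; the `X_emb` slot name follows the scripts elsewhere in this commit.

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical input with raw counts in .X

# log-10k: scale each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Any embedding method goes here; PCA is just a stand-in
sc.pp.pca(adata, n_comps=2)
adata.obsm["X_emb"] = adata.obsm["X_pca"]  # one 2-D coordinate per cell

adata.write_h5ad("reduced.h5ad", compression="gzip")
```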
@@ -19,4 +19,4 @@ platforms:
image: ghcr.io/openproblems-bio/base_python:1.0.2
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -26,7 +26,7 @@ functionality:
description: "Number of times to retry if the embedding fails, each time adding noise."
resources:
- type: python_script
path: /src/tasks/dimensionality_reduction/methods/diffusion_map/script.py
path: script.py
platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
@@ -38,4 +38,4 @@ platforms:
- numpy
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -19,4 +19,4 @@ platforms:
image: ghcr.io/openproblems-bio/base_python:1.0.2
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -42,4 +42,4 @@ platforms:
- type: native
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -1,44 +1,31 @@
__merge__: ../../api/comp_method.yaml
functionality:
name: "diffusion_maps"
name: diffusion_map
info:
label: Diffusion maps
summary: "Positive control by Use 1000-dimensional diffusions maps as an embedding."
description: "This serves as a positive control since it uses 1000-dimensional diffusions maps as an embedding"
label: Diffusion Map
summary: Finding meaningful geometric descriptions of datasets using diffusion maps.
description: Implements the diffusion map method of data parametrization, including creation and visualization of diffusion maps, clustering with diffusion K-means and regression using an adaptive regression model.
reference: coifman2006diffusion
documentation_url: https://github.com/openproblems-bio/openproblems
repository_url: https://github.com/openproblems-bio/openproblems
documentation_url: https://bioconductor.org/packages/release/bioc/html/destiny.html
repository_url: https://github.com/theislab/destiny
v1:
path: openproblems/tasks/dimensionality_reduction/methods/diffusion_map.py
commit: b3456fd73c04c28516f6df34c57e6e3e8b0dab32
preferred_normalization: log_cp10k
variants:
diffusion_map:
resources:
- type: r_script
path: script.R
arguments:
- name: "--n_comps"
type: integer
default: 2
description: "Number of components to use for the embedding."
- name: t
- name: "--n_dim"
type: integer
default: 1
description: "Number to power the eigenvalues by."
- name: n_retries
type: integer
default: 1
description: "Number of times to retry if the embedding fails, each time adding noise."
resources:
- type: python_script
path: script.py
description: Number of dimensions.
default: 3
platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
image: ghcr.io/openproblems-bio/base_r:1.0.2
setup:
- type: python
pypi:
- umap-learn
- scipy
- numpy
- type: r
bioc: destiny
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
37 changes: 37 additions & 0 deletions src/tasks/dimensionality_reduction/methods/diffusion_map/script.R
@@ -0,0 +1,37 @@
requireNamespace("anndata", quietly = TRUE)
requireNamespace("destiny", quietly = TRUE)
requireNamespace("SingleCellExperiment", quietly = TRUE)

## VIASH START
par <- list(
input = "resources_test/dimensionality_reduction/pancreas/dataset.h5ad",
output = "output.h5ad",
n_dim = 3
)
## VIASH END

cat("Reading input files\n")
input <- anndata::read_h5ad(par$input)

cat("Running destiny diffusion map\n")
# create SingleCellExperiment object
sce <- SingleCellExperiment::SingleCellExperiment(
assays = list(
logcounts = t(as.matrix(input$layers[["normalized"]]))
)
)
dm <- destiny::DiffusionMap(sce)
X_emb <- destiny::eigenvectors(dm)[, seq_len(par$n_dim)]

cat("Write output AnnData to file\n")
output <- anndata::AnnData(
uns = list(
dataset_id = input$uns[["dataset_id"]],
normalization_id = input$uns[["normalization_id"]],
method_id = meta$functionality_name
),
obsm = list(
X_emb = X_emb
),
shape = input$shape
)
output$write_h5ad(par$output, compression = "gzip")
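For readers unfamiliar with destiny, the core idea of a diffusion map can be sketched in a few lines of Python. This is a simplified dense version assuming a plain Gaussian kernel with a single median-distance bandwidth; destiny itself uses local bandwidths and sparse k-nearest-neighbour kernels, so its results will differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, n_dim=3, eps=None):
    # Gaussian kernel on pairwise Euclidean distances
    D = squareform(pdist(X))
    if eps is None:
        eps = np.median(D) ** 2  # crude bandwidth heuristic
    K = np.exp(-(D ** 2) / eps)
    # Symmetric conjugate of the Markov matrix P = diag(1/d) @ K,
    # so a symmetric eigensolver can be used
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]  # right eigenvectors of P
    # Skip the trivial eigenvalue 1; scale components by their eigenvalues
    return psi[:, 1:n_dim + 1] * vals[1:n_dim + 1]
```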
@@ -42,4 +42,4 @@ platforms:
- ivis[cpu]
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -18,7 +18,7 @@ functionality:
- name: "--n_dim"
type: integer
description: Number of dimensions.
default: 3
default: 2
- name: "--n_landmarks"
type: integer
description: Number of landmarks.
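This hunk appears to belong to the landmark MDS (LMDS) component, whose default embedding dimension drops from 3 to 2. For reference, the technique runs classical MDS on a small set of landmark cells and then triangulates all remaining cells from their distances to the landmarks; a minimal NumPy sketch follows (the function name and defaults are illustrative, not the component's actual implementation).

```python
import numpy as np
from scipy.spatial.distance import cdist

def landmark_mds(X, n_landmarks=1000, n_dim=2, seed=0):
    rng = np.random.default_rng(seed)
    lm = rng.choice(X.shape[0], size=min(n_landmarks, X.shape[0]), replace=False)
    D2 = cdist(X[lm], X[lm]) ** 2  # squared distances among landmarks
    mu = D2.mean(axis=1)
    # Classical MDS: double-centred Gram matrix of the landmarks
    B = -0.5 * (D2 - mu[None, :] - mu[:, None] + mu.mean())
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_dim]
    vals = np.maximum(vals[top], 1e-12)  # guard against non-positive eigenvalues
    vecs = vecs[:, top]
    # Triangulate every cell from its squared distances to the landmarks
    d2 = cdist(X, X[lm]) ** 2
    return -0.5 * (d2 - mu[None, :]) @ (vecs / np.sqrt(vals))
```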
@@ -52,4 +52,4 @@ platforms:
- "git+https://github.com/michalk8/neuralee@8946abf"
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -12,8 +12,8 @@ functionality:
is calculated on the logCPM expression matrix with and without selecting 1000
HVGs.
reference: pearson1901pca
repository_url: "https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html"
documentation_url: "https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html"
repository_url: https://github.com/scikit-learn/scikit-learn
documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
v1:
path: openproblems/tasks/dimensionality_reduction/methods/pca.py
commit: 154ccb9fd99113f3d28d9c3f139194539a0290f9
@@ -37,4 +37,4 @@ platforms:
packages: scanpy
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
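The PCA component description above (logCPM input, with and without selecting 1000 HVGs) corresponds roughly to the following scanpy sketch, assuming logCPM values are already in `.X`:

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical input, logCPM in .X

# HVG variant: keep only the 1000 most variable genes before PCA
sc.pp.highly_variable_genes(adata, n_top_genes=1000, subset=True)

sc.pp.pca(adata, n_comps=50, svd_solver="arpack")
# The 2-D embedding is simply the first two principal components
adata.obsm["X_emb"] = adata.obsm["X_pca"][:, :2]
```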
@@ -55,4 +55,4 @@ platforms:
- "scikit-learn<1.2"
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
41 changes: 41 additions & 0 deletions src/tasks/dimensionality_reduction/methods/pymde/config.vsh.yaml
@@ -0,0 +1,41 @@
__merge__: ../../api/comp_method.yaml
functionality:
name: pymde
info:
label: PyMDE
summary: "A Python implementation of Minimum-Distortion Embedding"
description: |
PyMDE is a Python implementation of Minimum-Distortion Embedding. It is a non-linear
method that preserves distances between cells or neighbourhoods in the original space.
reference: agrawal2021mde
repository_url: https://github.com/cvxgrp/pymde
documentation_url: https://pymde.org
v1:
path: openproblems/tasks/dimensionality_reduction/methods/pymde.py
commit: b3456fd73c04c28516f6df34c57e6e3e8b0dab32
preferred_normalization: log_cp10k
arguments:
- name: --embed_method
type: string
description: The method to use for embedding. Options are 'neighbors' and 'distances'.
default: neighbors
choices: [ neighbors, distances ]
- name: --n_hvg
type: integer
description: Number of highly variable genes to subset to. If not specified, the input matrix will not be subset.
- name: --n_pca_dims
type: integer
description: Number of principal components to use for the initial PCA step.
default: 100
resources:
- type: python_script
path: script.py
platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
setup:
- type: python
packages: pymde
- type: nextflow
directives:
label: [ midtime, highmem, highcpu ]
59 changes: 59 additions & 0 deletions src/tasks/dimensionality_reduction/methods/pymde/script.py
@@ -0,0 +1,59 @@
import anndata as ad
import scanpy as sc
import pymde

## VIASH START
par = {
"input": "resources_test/dimensionality_reduction/pancreas/dataset.h5ad",
"output": "reduced.h5ad",
"embed_method": "neighbors",
"n_hvg": 1000,
"n_pca_dims": 50,
}
meta = {
"functionality_name": "foo",
}
## VIASH END

if par["embed_method"] == "neighbors":
mde_fn = pymde.preserve_neighbors
elif par["embed_method"] == "distances":
mde_fn = pymde.preserve_distances
else:
raise ValueError(f"Unknown embedding method: {par['embed_method']}")

print("Load input data", flush=True)
input = ad.read_h5ad(par["input"])
X_mat = input.layers["normalized"]

if par["n_hvg"]:
print(f"Select top {par['n_hvg']} highly variable genes", flush=True)
idx = input.var["hvg_score"].to_numpy().argsort()[::-1][:par["n_hvg"]]
X_mat = X_mat[:, idx]

print("Compute PCA", flush=True)
X_pca = sc.tl.pca(X_mat, n_comps=par["n_pca_dims"], svd_solver="arpack")

print("Run MDE", flush=True)
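# pymde builds an MDE problem from the PCA coordinates and optimises it.
# embed() returns a torch.Tensor, hence the detach()/numpy() conversion below.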
X_emb = (
mde_fn(X_pca, embedding_dim=2, verbose=True)
.embed(verbose=True)
.detach()
.numpy()
)

print("Create output AnnData", flush=True)
output = ad.AnnData(
obs=input.obs[[]],
obsm={
"X_emb": X_emb
},
uns={
"dataset_id": input.uns["dataset_id"],
"normalization_id": input.uns["normalization_id"],
"method_id": meta["functionality_name"]
}
)

print("Write output to file", flush=True)
output.write_h5ad(par["output"], compression="gzip")