Improve Dimensionality Reduction task (#401)
* clean up configs

* add pymde

* add R-based implementation of diffusionmap

* remove comment

* change label

* fix task description

* add more methods and metrics to workflow

* update authors

* rename folder

* fix filename

* add workaround script.py

* fix spectral features

* update task info

* add author info

* set lmds dims to 2
rcannood authored Mar 13, 2024
1 parent 0aaef65 commit 6272797
Showing 24 changed files with 245 additions and 70 deletions.
17 changes: 17 additions & 0 deletions src/common/library.bib
@@ -1695,3 +1695,20 @@ @article{huang2018savergene
month = jun,
pages = {539–542}
}


@article{chari2023speciousart,
title = {The specious art of single-cell genomics},
volume = {19},
issn = {1553-7358},
url = {http://dx.doi.org/10.1371/journal.pcbi.1011288},
doi = {10.1371/journal.pcbi.1011288},
number = {8},
journal = {PLOS Computational Biology},
publisher = {Public Library of Science (PLoS)},
author = {Chari, Tara and Pachter, Lior},
editor = {Papin, Jason A.},
year = {2023},
month = aug,
pages = {e1011288}
}
67 changes: 44 additions & 23 deletions src/tasks/dimensionality_reduction/api/task_info.yaml
@@ -1,27 +1,38 @@
name: dimensionality_reduction
label: "Dimensionality reduction for visualization"
label: "Dimensionality reduction for 2D visualization"
v1:
path: openproblems/tasks/dimensionality_reduction/README.md
commit: b353a462f6ea353e0fc43d0f9fcbbe621edc3a0b
summary: Reduction of high-dimensional datasets to 2D for visualization & interpretation
image: "thumbnail.svg"
motivation: |
Dimensionality reduction is one of the key challenges in single-cell data representation.
Routine single-cell RNA sequencing (scRNA-seq) experiments measure cells in roughly
20,000-30,000 dimensions (i.e., features - mostly gene transcripts but also other functional
elements encoded in mRNA such as lncRNAs). Since its inception, scRNA-seq has been growing
in terms of the number of cells measured per experiment. Originally, cutting-edge SmartSeq
experiments would yield a few hundred cells, at best. Now, it is not uncommon to see
experiments that yield over [100,000 cells](<https://www.nature.com/articles/s41586-018-0590-4>)
or even [> 1 million cells](https://doi.org/10.1126/science.aba7721).
Data visualisation is an important part of all stages of single-cell analysis, from
initial quality control to interpretation and presentation of final results. For bulk RNA-seq
studies, linear dimensionality reduction techniques such as PCA and MDS are commonly used
to visualise the variation between samples. While these methods are highly effective, they
can only show the first few components of variation, which cannot fully represent
the increased complexity and number of observations in single-cell datasets. For this reason,
non-linear techniques (most notably t-SNE and UMAP) have become the standard for visualising
single-cell studies. These methods compress a dataset into a two-dimensional space
while attempting to capture as much of the variance between observations as possible. Many
methods for solving this problem now exist. In general, these methods try to preserve distances,
while some additionally consider aspects such as density within the embedded space or conservation
of continuous trajectories. Despite almost every single-cell study using one of these visualisations,
there has been debate as to whether they can effectively capture the variation in single-cell
datasets [@chari2023speciousart].
description: |
Each *feature* in a dataset functions as a single dimension. While each of the ~30,000
dimensions measured in each cell contribute to an underlying data structure, the overall
structure of the data is challenging to display in few dimensions due to data sparsity
and the [*"curse of dimensionality"*](https://en.wikipedia.org/wiki/Curse_of_dimensionality)
(distances in high-dimensional data don't distinguish data points well). Thus, we need to find
a way to [dimensionally reduce](https://en.wikipedia.org/wiki/Dimensionality_reduction)
the data for visualization and interpretation.
The dimensionality reduction task attempts to quantify the ability of methods to embed the
information present in complex single-cell studies into a two-dimensional space. Thus, this task
is specifically designed for dimensionality reduction for visualisation and does not consider other
uses of dimensionality reduction in standard single-cell workflows such as improving the
signal-to-noise ratio (and in fact several of the methods use PCA as a pre-processing step for this
reason). Unlike most tasks, methods for the dimensionality reduction task must accept a matrix
containing expression values normalised to 10,000 counts per cell and log transformed (log-10k) and
produce a two-dimensional coordinate for each cell. Pre-normalised matrices are required to
enforce consistency between the metric evaluation (which generally requires normalised data) and
the method runs. When these are not consistent, methods that use the same normalisation as
the metric tend to score more highly. For some methods we also evaluate the pre-processing
recommended by the method.
authors:
- name: Luke Zappia
roles: [ maintainer, author ]
@@ -31,7 +42,7 @@ authors:
roles: [ author ]
info:
github: michalk8
- name: "Scott Gigante"
- name: Scott Gigante
roles: [ author ]
info:
github: scottgigante
@@ -40,13 +51,23 @@
roles: [ author ]
info:
github: bendemeo
- name: "Juan A. Cordero Varela"
- name: Robrecht Cannoodt
roles: [ author ]
info:
github: rcannood
orcid: 0000-0003-3641-729X
- name: Kai Waldrant
roles: [ contributor ]
info:
github: jacorvar
orcid: 0000-0002-7373-5433
- name: Robrecht Cannoodt
github: KaiWaldrant
orcid: 0009-0003-8555-1361
- name: Sai Nirmayi Yasa
roles: [ contributor ]
info:
github: rcannood
orcid: "0000-0003-3641-729X"
github: sainirmayi
orcid: 0009-0003-6319-9803
- name: Juan A. Cordero Varela
roles: [ contributor ]
info:
github: jacorvar
orcid: 0000-0002-7373-5433
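As a rough sketch of the component contract described in the task description above (log-10k input, one 2D coordinate per cell as output), the following Python snippet illustrates the idea. It assumes raw counts in `adata.X` and uses scanpy's PCA purely as a stand-in method; the `X_emb` slot name follows the scripts elsewhere in this commit.

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical input with raw counts in .X

# log-10k: scale each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Any embedding method goes here; PCA is just a stand-in
sc.pp.pca(adata, n_comps=2)
adata.obsm["X_emb"] = adata.obsm["X_pca"]  # one 2-D coordinate per cell

adata.write_h5ad("reduced.h5ad", compression="gzip")
```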
@@ -19,4 +19,4 @@ platforms:
image: ghcr.io/openproblems-bio/base_python:1.0.2
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -26,7 +26,7 @@ functionality:
description: "Number of times to retry if the embedding fails, each time adding noise."
resources:
- type: python_script
path: /src/tasks/dimensionality_reduction/methods/diffusion_map/script.py
path: script.py
platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
@@ -38,4 +38,4 @@ platforms:
- numpy
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -19,4 +19,4 @@ platforms:
image: ghcr.io/openproblems-bio/base_python:1.0.2
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -42,4 +42,4 @@ platforms:
- type: native
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -1,44 +1,31 @@
__merge__: ../../api/comp_method.yaml
functionality:
name: "diffusion_maps"
name: diffusion_map
info:
label: Diffusion maps
summary: "Positive control by Use 1000-dimensional diffusions maps as an embedding."
description: "This serves as a positive control since it uses 1000-dimensional diffusions maps as an embedding"
label: Diffusion Map
summary: Finding meaningful geometric descriptions of datasets using diffusion maps.
description: Implements the diffusion map method of data parametrization, including creation and visualization of diffusion maps, clustering with diffusion K-means and regression using an adaptive regression model.
reference: coifman2006diffusion
documentation_url: https://github.com/openproblems-bio/openproblems
repository_url: https://github.com/openproblems-bio/openproblems
documentation_url: https://bioconductor.org/packages/release/bioc/html/destiny.html
repository_url: https://github.com/theislab/destiny
v1:
path: openproblems/tasks/dimensionality_reduction/methods/diffusion_map.py
commit: b3456fd73c04c28516f6df34c57e6e3e8b0dab32
preferred_normalization: log_cp10k
variants:
diffusion_map:
resources:
- type: r_script
path: script.R
arguments:
- name: "--n_comps"
type: integer
default: 2
description: "Number of components to use for the embedding."
- name: t
- name: "--n_dim"
type: integer
default: 1
description: "Number to power the eigenvalues by."
- name: n_retries
type: integer
default: 1
description: "Number of times to retry if the embedding fails, each time adding noise."
resources:
- type: python_script
path: script.py
description: Number of dimensions.
default: 3
platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
image: ghcr.io/openproblems-bio/base_r:1.0.2
setup:
- type: python
pypi:
- umap-learn
- scipy
- numpy
- type: r
bioc: destiny
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
37 changes: 37 additions & 0 deletions src/tasks/dimensionality_reduction/methods/diffusion_map/script.R
@@ -0,0 +1,37 @@
requireNamespace("anndata", quietly = TRUE)
requireNamespace("destiny", quietly = TRUE)
requireNamespace("SingleCellExperiment", quietly = TRUE)

## VIASH START
par <- list(
input = "resources_test/dimensionality_reduction/pancreas/dataset.h5ad",
output = "output.h5ad",
n_dim = 3
)
## VIASH END

cat("Reading input files\n")
input <- anndata::read_h5ad(par$input)

cat("Running destiny diffusion map\n")
# create SingleCellExperiment object
sce <- SingleCellExperiment::SingleCellExperiment(
assays = list(
logcounts = t(as.matrix(input$layers[["normalized"]]))
)
)
dm <- destiny::DiffusionMap(sce)
X_emb <- destiny::eigenvectors(dm)[, seq_len(par$n_dim)]

cat("Write output AnnData to file\n")
output <- anndata::AnnData(
uns = list(
dataset_id = input$uns[["dataset_id"]],
normalization_id = input$uns[["normalization_id"]],
method_id = meta$functionality_name
),
obsm = list(
X_emb = X_emb
),
shape = input$shape
)
output$write_h5ad(par$output, compression = "gzip")
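For readers unfamiliar with destiny, the core idea of a diffusion map can be sketched in a few lines of Python. This is a simplified dense version assuming a plain Gaussian kernel with a single median-distance bandwidth; destiny itself uses local bandwidths and sparse k-nearest-neighbour kernels, so its results will differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, n_dim=3, eps=None):
    # Gaussian kernel on pairwise Euclidean distances
    D = squareform(pdist(X))
    if eps is None:
        eps = np.median(D) ** 2  # crude bandwidth heuristic
    K = np.exp(-(D ** 2) / eps)
    # Symmetric conjugate of the Markov matrix P = diag(1/d) @ K,
    # so a symmetric eigensolver can be used
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]  # right eigenvectors of P
    # Skip the trivial eigenvalue 1; scale components by their eigenvalues
    return psi[:, 1:n_dim + 1] * vals[1:n_dim + 1]
```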
@@ -42,4 +42,4 @@ platforms:
- ivis[cpu]
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -18,7 +18,7 @@ functionality:
- name: "--n_dim"
type: integer
description: Number of dimensions.
default: 3
default: 2
- name: "--n_landmarks"
type: integer
description: Number of landmarks.
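This hunk appears to belong to the landmark MDS (LMDS) component, whose default embedding dimension drops from 3 to 2. For reference, the technique runs classical MDS on a small set of landmark cells and then triangulates all remaining cells from their distances to the landmarks; a minimal NumPy sketch follows (the function name and defaults are illustrative, not the component's actual implementation).

```python
import numpy as np
from scipy.spatial.distance import cdist

def landmark_mds(X, n_landmarks=1000, n_dim=2, seed=0):
    rng = np.random.default_rng(seed)
    lm = rng.choice(X.shape[0], size=min(n_landmarks, X.shape[0]), replace=False)
    D2 = cdist(X[lm], X[lm]) ** 2  # squared distances among landmarks
    mu = D2.mean(axis=1)
    # Classical MDS: double-centred Gram matrix of the landmarks
    B = -0.5 * (D2 - mu[None, :] - mu[:, None] + mu.mean())
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_dim]
    vals = np.maximum(vals[top], 1e-12)  # guard against non-positive eigenvalues
    vecs = vecs[:, top]
    # Triangulate every cell from its squared distances to the landmarks
    d2 = cdist(X, X[lm]) ** 2
    return -0.5 * (d2 - mu[None, :]) @ (vecs / np.sqrt(vals))
```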
@@ -52,4 +52,4 @@ platforms:
- "git+https://github.com/michalk8/neuralee@8946abf"
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
@@ -12,8 +12,8 @@ functionality:
is calculated on the logCPM expression matrix with and without selecting 1000
HVGs.
reference: pearson1901pca
repository_url: "https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html"
documentation_url: "https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html"
repository_url: https://github.com/scikit-learn/scikit-learn
documentation_url: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
v1:
path: openproblems/tasks/dimensionality_reduction/methods/pca.py
commit: 154ccb9fd99113f3d28d9c3f139194539a0290f9
@@ -37,4 +37,4 @@ platforms:
packages: scanpy
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
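The PCA component description above (logCPM input, with and without selecting 1000 HVGs) corresponds roughly to the following scanpy sketch, assuming logCPM values are already in `.X`:

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical input, logCPM in .X

# HVG variant: keep only the 1000 most variable genes before PCA
sc.pp.highly_variable_genes(adata, n_top_genes=1000, subset=True)

sc.pp.pca(adata, n_comps=50, svd_solver="arpack")
# The 2-D embedding is simply the first two principal components
adata.obsm["X_emb"] = adata.obsm["X_pca"][:, :2]
```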
@@ -55,4 +55,4 @@ platforms:
- "scikit-learn<1.2"
- type: nextflow
directives:
label: [ "midtime", highmem, highcpu ]
label: [ midtime, highmem, highcpu ]
41 changes: 41 additions & 0 deletions src/tasks/dimensionality_reduction/methods/pymde/config.vsh.yaml
@@ -0,0 +1,41 @@
__merge__: ../../api/comp_method.yaml
functionality:
name: pymde
info:
label: PyMDE
summary: "A Python implementation of Minimum-Distortion Embedding"
description: |
PyMDE is a Python implementation of Minimum-Distortion Embedding. It is a non-linear
method that preserves distances between cells or neighbourhoods in the original space.
reference: agrawal2021mde
repository_url: https://github.com/cvxgrp/pymde
documentation_url: https://pymde.org
v1:
path: openproblems/tasks/dimensionality_reduction/methods/pymde.py
commit: b3456fd73c04c28516f6df34c57e6e3e8b0dab32
preferred_normalization: log_cp10k
arguments:
- name: --embed_method
type: string
description: The method to use for embedding. Options are 'neighbors' and 'distances'.
default: neighbors
choices: [ neighbors, distances ]
- name: --n_hvg
type: integer
description: Number of highly variable genes to subset to. If not specified, the input matrix will not be subset.
- name: --n_pca_dims
type: integer
description: Number of principal components to use for the initial PCA step.
default: 100
resources:
- type: python_script
path: script.py
platforms:
- type: docker
image: ghcr.io/openproblems-bio/base_python:1.0.2
setup:
- type: python
packages: pymde
- type: nextflow
directives:
label: [ midtime, highmem, highcpu ]
59 changes: 59 additions & 0 deletions src/tasks/dimensionality_reduction/methods/pymde/script.py
@@ -0,0 +1,59 @@
import anndata as ad
import scanpy as sc
import pymde

## VIASH START
par = {
"input": "resources_test/dimensionality_reduction/pancreas/dataset.h5ad",
"output": "reduced.h5ad",
"embed_method": "neighbors",
"n_hvg": 1000,
"n_pca_dims": 50,
}
meta = {
"functionality_name": "foo",
}
## VIASH END

if par["embed_method"] == "neighbors":
mde_fn = pymde.preserve_neighbors
elif par["embed_method"] == "distances":
mde_fn = pymde.preserve_distances
else:
raise ValueError(f"Unknown embedding method: {par['embed_method']}")

print("Load input data", flush=True)
input = ad.read_h5ad(par["input"])
X_mat = input.layers["normalized"]

if par["n_hvg"]:
print(f"Select top {par['n_hvg']} highly variable genes", flush=True)
idx = input.var["hvg_score"].to_numpy().argsort()[::-1][:par["n_hvg"]]
X_mat = X_mat[:, idx]

print("Compute PCA", flush=True)
X_pca = sc.tl.pca(X_mat, n_comps=par["n_pca_dims"], svd_solver="arpack")

print("Run MDE", flush=True)
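# pymde builds an MDE problem from the PCA coordinates and optimises it.
# embed() returns a torch.Tensor, hence the detach()/numpy() conversion below.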
X_emb = (
mde_fn(X_pca, embedding_dim=2, verbose=True)
.embed(verbose=True)
.detach()
.numpy()
)

print("Create output AnnData", flush=True)
output = ad.AnnData(
obs=input.obs[[]],
obsm={
"X_emb": X_emb
},
uns={
"dataset_id": input.uns["dataset_id"],
"normalization_id": input.uns["normalization_id"],
"method_id": meta["functionality_name"]
}
)

print("Write output to file", flush=True)
output.write_h5ad(par["output"], compression="gzip")