Update readme

openproblems-bio · Jun 28, 2024 · 7079b24
1 parent 7da412a
commit 7079b24
Showing 1 changed file with 352 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -1,27 +1,362 @@
-# Task Template
+# Denoising
 
-This repo is a template to create a new task for the OpenProblems v2. This repo contains several example files and components that can be used when updated with the task info.
 
-> [!WARNING] 
-> This README will be overwritten when performing the `create_task_readme` script.
+<!--
+This file is automatically generated from the task's api/*.yaml files.
+Do not edit this file directly.
+-->
 
-## Create a repository from this template
+Removing noise in sparse single-cell RNA-sequencing count data.
 
-> [!IMPORTANT] 
-> Before creating a new repository, make sure you are part of the openProblems task team. This will be done when you create an issue for the task and you got the go ahead to create the task.
-> For more information on how to create a new task, check out the [Create a new task](https://openproblems.bio/documentation/create_task/) documentation.
+Path to source:
+[`src`](https://github.com/openproblems-bio/task_denoising/src)
 
-The instructions below will guide you through creating a new repository from this template ([creating-a-repository-from-a-template](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template#creating-a-repository-from-a-template)).
+## README
 
+## Installation
 
-* Click the "Use this template" button on the top right of the repository.
-* Use the Owner dropdown menu to select the `openproblems-bio` account.
-* Type a name for your repository (task_...), and a description.
-* Set the repository visibility to public.
-* Click "Create repository from template".
+You need to have Docker, Java, and Viash installed. Follow [these
+instructions](https://openproblems.bio/documentation/fundamentals/requirements)
+to install the required dependencies.
 
-## What to do next
+## Add a method
 
-Check out the [instructions](INSTRUCTIONS.md) for more information on how to update the example files and components. These instructions also contain information on how to build out the task and basic commands.
+To add a method to the repository, follow the instructions in the
+`scripts/add_a_method.sh` script.
+
+## Frequently used commands
+
+To get started, you can run the following commands:
+
+``` bash
+git clone git@github.com:openproblems-bio/task_denoising.git
+
+cd task_denoising
+
+# initialise submodule
+scripts/init_submodule.sh
+
+# download resources
+scripts/download_resources.sh
+```
+
+To run the benchmark, you first need to build the components.
+Afterwards, you can run the benchmark:
+
+``` bash
+viash ns build --parallel --setup cachedbuild
+
+scripts/run_benchmark.sh
+```
+
+After adding a component, it is recommended to run the tests to ensure
+that the component is working correctly:
+
+``` bash
+viash ns test --parallel
+```
+
+Optionally, you can provide the `--query` argument to test only a subset
+of components:
+
+``` bash
+viash ns test --parallel --query "component_name"
+```
+
+## Motivation
+
+Single-cell RNA-Seq protocols only detect a fraction of the mRNA
+molecules present in each cell. As a result, the measurements (UMI
+counts) observed for each gene and each cell are associated with
+generally high levels of technical noise ([Grün et al.,
+2014](https://www.nature.com/articles/nmeth.2930)). Denoising describes
+the task of estimating the true expression level of each gene in each
+cell. In the single-cell literature, this task is also referred to as
+*imputation*, a term which is typically used for missing data problems
+in statistics. Similar to the use of the terms “dropout”, “missing
+data”, and “technical zeros”, this terminology can create confusion
+about the underlying measurement process ([Sarkar and Stephens,
+2020](https://www.biorxiv.org/content/10.1101/2020.04.07.030007v2)).
+
+## Description
+
+A key challenge in evaluating denoising methods is the general lack of a
+ground truth. A recent benchmark study ([Hou et al.,
+2020](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02132-x))
+relied on flow-sorted datasets, mixture control experiments ([Tian et
+al., 2019](https://www.nature.com/articles/s41592-019-0425-8)), and
+comparisons with bulk RNA-Seq data. Since each of these approaches
+suffers from specific limitations, it is difficult to combine these
+different approaches into a single quantitative measure of denoising
+accuracy. Here, we instead rely on an approach termed molecular
+cross-validation (MCV), which was specifically developed to quantify
+denoising accuracy in the absence of a ground truth ([Batson et al.,
+2019](https://www.biorxiv.org/content/10.1101/786269v1)). In MCV, the
+observed molecules in a given scRNA-Seq dataset are first partitioned
+between a *training* and a *test* dataset. Next, a denoising method is
+applied to the training dataset. Finally, denoising accuracy is measured
+by comparing the result to the test dataset. The authors show that both
+in theory and in practice, the measured denoising accuracy is
+representative of the accuracy that would be obtained on a ground truth
+dataset.
+
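The partitioning step of molecular cross-validation described above can be sketched in a few lines. This is a minimal illustration, not the task's actual data processor: the function name `split_molecules`, the 90/10 split, and the toy matrix are all assumptions made for the example. Each observed molecule is assigned independently to the training set with a fixed probability (binomial thinning), so the two resulting matrices always sum to the original counts.

```python
import numpy as np


def split_molecules(counts, train_frac=0.9, seed=0):
    """Partition each observed molecule into a training or a test matrix.

    Every molecule goes to the training set independently with probability
    `train_frac` (an assumed value), so train + test equals the input counts.
    """
    counts = np.asarray(counts)
    if not np.issubdtype(counts.dtype, np.integer):
        raise TypeError("counts must be integers")
    rng = np.random.default_rng(seed)
    # Binomial thinning: draw how many of each entry's molecules go to training.
    train = rng.binomial(counts, train_frac)
    # The remaining molecules form the held-out test matrix.
    test = counts - train
    return train, test


# Toy cells-by-genes count matrix (made-up numbers).
counts = np.array([[5, 0, 2], [0, 3, 1], [10, 4, 0]])
train, test = split_molecules(counts)
```

A denoising method would then see only `train`, and its output would be compared against `test`.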
+## Authors & contributors
+
+| name              | roles              |
+|:------------------|:-------------------|
+| Wesley Lewis      | author, maintainer |
+| Scott Gigante     | author, maintainer |
+| Robrecht Cannoodt | author             |
+| Kai Waldrant      | contributor        |
+
+## API
+
+``` mermaid
+flowchart LR
+  file_common_dataset("Common Dataset")
+  comp_process_dataset[/"Data processor"/]
+  file_train_h5ad("Training data")
+  file_test_h5ad("Test data")
+  comp_control_method[/"Control Method"/]
+  comp_method[/"Method"/]
+  comp_metric[/"Metric"/]
+  file_prediction("Denoised data")
+  file_score("Score")
+  file_common_dataset---comp_process_dataset
+  comp_process_dataset-->file_train_h5ad
+  comp_process_dataset-->file_test_h5ad
+  file_train_h5ad---comp_control_method
+  file_train_h5ad---comp_method
+  file_test_h5ad---comp_control_method
+  file_test_h5ad---comp_metric
+  comp_control_method-->file_prediction
+  comp_method-->file_prediction
+  comp_metric-->file_score
+  file_prediction---comp_metric
+```
+
+## File format: Common Dataset
+
+A subset of the common dataset.
+
+Example file: `resources_test/common/pancreas/dataset.h5ad`
+
+Format:
+
+<div class="small">
+
+    AnnData object
+     layers: 'counts'
+     uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
+
+</div>
+
+Slot description:
+
+<div class="small">
+
+| Slot                         | Type      | Description                                                                    |
+|:-----------------------------|:----------|:-------------------------------------------------------------------------------|
+| `layers["counts"]`           | `integer` | Raw counts.                                                                    |
+| `uns["dataset_id"]`          | `string`  | A unique identifier for the dataset.                                           |
+| `uns["dataset_name"]`        | `string`  | Nicely formatted name.                                                         |
+| `uns["dataset_url"]`         | `string`  | (*Optional*) Link to the original source of the dataset.                       |
+| `uns["dataset_reference"]`   | `string`  | (*Optional*) BibTeX reference of the paper in which the dataset was published. |
+| `uns["dataset_summary"]`     | `string`  | Short description of the dataset.                                              |
+| `uns["dataset_description"]` | `string`  | Long description of the dataset.                                               |
+| `uns["dataset_organism"]`    | `string`  | (*Optional*) The organism of the sample in the dataset.                        |
+
+</div>
+
+## Component type: Data processor
+
+Path:
+[`src/`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/)
+
+A denoising dataset processor.
+
+Arguments:
+
+<div class="small">
+
+| Name             | Type   | Description                                                       |
+|:-----------------|:-------|:------------------------------------------------------------------|
+| `--input`        | `file` | A subset of the common dataset.                                   |
+| `--output_train` | `file` | (*Output*) The subset of molecules used for the training dataset. |
+| `--output_test`  | `file` | (*Output*) The subset of molecules used for the test dataset.     |
+
+</div>
+
+## File format: Training data
+
+The subset of molecules used for the training dataset.
+
+Example file: `resources_test/denoising/pancreas/train.h5ad`
+
+Format:
+
+<div class="small">
+
+    AnnData object
+     layers: 'counts'
+     uns: 'dataset_id'
+
+</div>
+
+Slot description:
+
+<div class="small">
+
+| Slot                | Type      | Description                          |
+|:--------------------|:----------|:-------------------------------------|
+| `layers["counts"]`  | `integer` | Raw counts.                          |
+| `uns["dataset_id"]` | `string`  | A unique identifier for the dataset. |
+
+</div>
+
+## File format: Test data
+
+The subset of molecules used for the test dataset.
+
+Example file: `resources_test/denoising/pancreas/test.h5ad`
+
+Format:
+
+<div class="small">
+
+    AnnData object
+     layers: 'counts'
+     uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'
+
+</div>
+
+Slot description:
+
+<div class="small">
+
+| Slot                         | Type      | Description                                                                    |
+|:-----------------------------|:----------|:-------------------------------------------------------------------------------|
+| `layers["counts"]`           | `integer` | Raw counts.                                                                    |
+| `uns["dataset_id"]`          | `string`  | A unique identifier for the dataset.                                           |
+| `uns["dataset_name"]`        | `string`  | Nicely formatted name.                                                         |
+| `uns["dataset_url"]`         | `string`  | (*Optional*) Link to the original source of the dataset.                       |
+| `uns["dataset_reference"]`   | `string`  | (*Optional*) BibTeX reference of the paper in which the dataset was published. |
+| `uns["dataset_summary"]`     | `string`  | Short description of the dataset.                                              |
+| `uns["dataset_description"]` | `string`  | Long description of the dataset.                                               |
+| `uns["dataset_organism"]`    | `string`  | (*Optional*) The organism of the sample in the dataset.                        |
+| `uns["train_sum"]`           | `integer` | The total number of counts in the training dataset.                            |
+
+</div>
+
+## Component type: Control Method
+
+Path:
+[`src/control_methods`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/control_methods)
+
+A control method.
+
+Arguments:
+
+<div class="small">
+
+| Name            | Type   | Description                                            |
+|:----------------|:-------|:-------------------------------------------------------|
+| `--input_train` | `file` | The subset of molecules used for the training dataset. |
+| `--input_test`  | `file` | The subset of molecules used for the test dataset.     |
+| `--output`      | `file` | (*Output*) A denoised dataset as output by a method.   |
+
+</div>
+
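Unlike a regular method, a control method receives both `--input_train` and `--input_test`, which makes reference baselines possible. The sketch below shows two hypothetical controls: one returns the training counts unchanged, the other returns the test counts that real methods never see. The function names, the `method_id` strings, and the use of plain dicts in place of AnnData objects are assumptions made to keep the example self-contained.

```python
import numpy as np


def _prediction(values, train, method_id):
    # Mirror the slots of the "Denoised data" file format.
    return {
        "layers": {"denoised": np.asarray(values).copy()},
        "uns": {"dataset_id": train["uns"]["dataset_id"], "method_id": method_id},
    }


def no_denoising(train, test):
    """Hypothetical negative control: return the training counts unchanged."""
    return _prediction(train["layers"]["counts"], train, "no_denoising")


def perfect_denoising(train, test):
    """Hypothetical positive control: return the held-out test counts."""
    return _prediction(test["layers"]["counts"], train, "perfect_denoising")


# Toy inputs shaped like the training and test file formats (made-up numbers).
train = {"layers": {"counts": np.array([[4, 0], [1, 2]])}, "uns": {"dataset_id": "toy"}}
test = {"layers": {"counts": np.array([[1, 0], [0, 1]])}, "uns": {"dataset_id": "toy"}}

low = no_denoising(train, test)
high = perfect_denoising(train, test)
```

Baselines of this kind give the scores of real methods a reference range to be read against.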
+## Component type: Method
+
+Path:
+[`src/methods`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/methods)
+
+A method.
+
+Arguments:
+
+<div class="small">
+
+| Name            | Type   | Description                                            |
+|:----------------|:-------|:-------------------------------------------------------|
+| `--input_train` | `file` | The subset of molecules used for the training dataset. |
+| `--output`      | `file` | (*Output*) A denoised dataset as output by a method.   |
+
+</div>
+
+## Component type: Metric
+
+Path:
+[`src/metrics`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/metrics)
+
+A metric.
+
+Arguments:
+
+<div class="small">
+
+| Name                 | Type   | Description                                        |
+|:---------------------|:-------|:---------------------------------------------------|
+| `--input_test`       | `file` | The subset of molecules used for the test dataset. |
+| `--input_prediction` | `file` | A denoised dataset as output by a method.          |
+| `--output`           | `file` | (*Output*) File indicating the score of a metric.  |
+
+</div>
+
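As an illustration of what a metric component computes, the sketch below compares a prediction with the test counts using mean squared error. Both the choice of error measure and the rescaling step are assumptions, not a description of the metrics shipped with this task: because the prediction is estimated from the training molecules, the example scales it to the depth of the test data with a `train_sum` value like the one stored in the test file.

```python
import numpy as np


def mse_metric(test_counts, denoised, train_sum):
    """Mean squared error between rescaled denoised values and test counts."""
    test_counts = np.asarray(test_counts, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    # Assumption: bring the prediction to the sequencing depth of the test
    # data before comparing, since it was estimated from the training data.
    scale = test_counts.sum() / train_sum
    return float(np.mean((denoised * scale - test_counts) ** 2))


# Toy example (made-up numbers): the prediction is exactly 4x the test counts,
# and the training data holds 4x as many molecules as the test data.
test_counts = np.array([[1, 0], [0, 1]])
denoised = np.array([[4.0, 0.0], [0.0, 4.0]])
score = mse_metric(test_counts, denoised, train_sum=8)
```

Here the rescaled prediction matches the test counts exactly, so the error is zero; lower values indicate better agreement.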
+## File format: Denoised data
+
+A denoised dataset as output by a method.
+
+Example file: `resources_test/denoising/pancreas/denoised.h5ad`
+
+Format:
+
+<div class="small">
+
+    AnnData object
+     layers: 'denoised'
+     uns: 'dataset_id', 'method_id'
+
+</div>
+
+Slot description:
+
+<div class="small">
+
+| Slot                 | Type      | Description                          |
+|:---------------------|:----------|:-------------------------------------|
+| `layers["denoised"]` | `integer` | Denoised data.                       |
+| `uns["dataset_id"]`  | `string`  | A unique identifier for the dataset. |
+| `uns["method_id"]`   | `string`  | A unique identifier for the method.  |
+
+</div>
+
+## File format: Score
+
+File indicating the score of a metric.
+
+Example file: `resources_test/denoising/pancreas/score.h5ad`
+
+Format:
+
+<div class="small">
+
+    AnnData object
+     uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'
+
+</div>
+
+Slot description:
+
+<div class="small">
+
+| Slot                   | Type     | Description                                                                                  |
+|:-----------------------|:---------|:---------------------------------------------------------------------------------------------|
+| `uns["dataset_id"]`    | `string` | A unique identifier for the dataset.                                                         |
+| `uns["method_id"]`     | `string` | A unique identifier for the method.                                                          |
+| `uns["metric_ids"]`    | `string` | One or more unique metric identifiers.                                                       |
+| `uns["metric_values"]` | `double` | The metric values obtained for the given prediction. Must match the length of `metric_ids`.  |
+
+</div>
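The length constraint between `metric_ids` and `metric_values` is easy to check before a score file is written. The helper below is a hypothetical sketch (the name `make_score` and the example identifiers are not part of the task) that assembles the four `uns` slots as a plain dict and rejects mismatched lengths.

```python
def make_score(dataset_id, method_id, metric_ids, metric_values):
    """Assemble the `uns` slots of a score file and check their consistency."""
    if len(metric_ids) != len(metric_values):
        raise ValueError("metric_values must be of the same length as metric_ids")
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": list(metric_ids),
        # Store values as floats to match the documented `double` type.
        "metric_values": [float(v) for v in metric_values],
    }


# Example identifiers are made up for illustration.
uns = make_score("toy", "no_denoising", ["mse"], [0.5])
```

Passing two metric identifiers with a single value would raise a `ValueError` instead of producing an inconsistent file.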
 
-For more information on the OpenProblems v2, check out the [Documentation](https://openproblems.bio/documentation/) on the Open Problems website.