Commit

Update readme
KaiWaldrant committed Jun 28, 2024
1 parent 7da412a commit 7079b24
Showing 1 changed file with 352 additions and 17 deletions.
369 changes: 352 additions & 17 deletions README.md
@@ -1,27 +1,362 @@
# Task Template
# Denoising

This repo is a template for creating a new task for OpenProblems v2. It contains several example files and components that can be used once they are updated with the task info.

> [!WARNING]
> This README will be overwritten when the `create_task_readme` script is run.
<!--
This file is automatically generated from the task's api/*.yaml files.
Do not edit this file directly.
-->

## Create a repository from this template
Removing noise in sparse single-cell RNA-sequencing count data

> [!IMPORTANT]
> Before creating a new repository, make sure you are part of the OpenProblems task team. You will be added once you have created an issue for the task and received the go-ahead to create it.
> For more information on how to create a new task, check out the [Create a new task](https://openproblems.bio/documentation/create_task/) documentation.
Path to source:
[`src`](https://github.com/openproblems-bio/task_denoising/src)

The instructions below will guide you through creating a new repository from this template ([creating-a-repository-from-a-template](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template#creating-a-repository-from-a-template)).
## README

## Installation

* Click the "Use this template" button on the top right of the repository.
* Use the Owner dropdown menu to select the `openproblems-bio` account.
* Type a name for your repository (task_...), and a description.
* Set the repository visibility to public.
* Click "Create repository from template".
You need to have Docker, Java, and Viash installed. Follow [these
instructions](https://openproblems.bio/documentation/fundamentals/requirements)
to install the required dependencies.

## What to do next
## Add a method

Check out the [instructions](INSTRUCTIONS.md) for more information on how to update the example files and components. These instructions also contain information on how to build out the task and basic commands.
To add a method to the repository, follow the instructions in the
`scripts/add_a_method.sh` script.

## Frequently used commands

To get started, you can run the following commands:

``` bash
git clone git@github.com:openproblems-bio/task_denoising.git

cd task_denoising

# initialise submodule
scripts/init_submodule.sh

# download resources
scripts/download_resources.sh
```

To run the benchmark, you first need to build the components.
Afterwards, you can run the benchmark:

``` bash
viash ns build --parallel --setup cachedbuild

scripts/run_benchmark.sh
```

After adding a component, it is recommended to run the tests to ensure
that the component is working correctly:

``` bash
viash ns test --parallel
```

Optionally, you can provide the `--query` argument to test only a subset
of components:

``` bash
viash ns test --parallel --query "component_name"
```

## Motivation

Single-cell RNA-Seq protocols only detect a fraction of the mRNA
molecules present in each cell. As a result, the measurements (UMI
counts) observed for each gene and each cell are associated with
generally high levels of technical noise ([Grün et al.,
2014](https://www.nature.com/articles/nmeth.2930)). Denoising describes
the task of estimating the true expression level of each gene in each
cell. In the single-cell literature, this task is also referred to as
*imputation*, a term which is typically used for missing data problems
in statistics. Similar to the use of the terms “dropout”, “missing
data”, and “technical zeros”, this terminology can create confusion
about the underlying measurement process ([Sarkar and Stephens,
2020](https://www.biorxiv.org/content/10.1101/2020.04.07.030007v2)).

## Description

A key challenge in evaluating denoising methods is the general lack of a
ground truth. A recent benchmark study ([Hou et al.,
2020](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02132-x))
relied on flow-sorted datasets, mixture control experiments ([Tian et
al., 2019](https://www.nature.com/articles/s41592-019-0425-8)), and
comparisons with bulk RNA-Seq data. Since each of these approaches
suffers from specific limitations, it is difficult to combine these
different approaches into a single quantitative measure of denoising
accuracy. Here, we instead rely on an approach termed molecular
cross-validation (MCV), which was specifically developed to quantify
denoising accuracy in the absence of a ground truth ([Batson et al.,
2019](https://www.biorxiv.org/content/10.1101/786269v1)). In MCV, the
observed molecules in a given scRNA-Seq dataset are first partitioned
between a *training* and a *test* dataset. Next, a denoising method is
applied to the training dataset. Finally, denoising accuracy is measured
by comparing the result to the test dataset. The authors show that both
in theory and in practice, the measured denoising accuracy is
representative of the accuracy that would be obtained on a ground truth
dataset.
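
The three MCV steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the task's actual data processor: the 90/10 split fraction, the rescaling "denoiser", and the mean squared error comparison are all placeholders chosen for the example, and nested lists stand in for the count matrix.

```python
import random


def split_molecules(counts, train_fraction, rng):
    """Assign every observed molecule independently to train or test."""
    train, test = [], []
    for row in counts:
        train_row, test_row = [], []
        for c in row:
            # One Bernoulli draw per molecule (binomial thinning).
            kept = sum(rng.random() < train_fraction for _ in range(c))
            train_row.append(kept)
            test_row.append(c - kept)
        train.append(train_row)
        test.append(test_row)
    return train, test


def mean_squared_error(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
counts = [[4, 0, 7], [1, 3, 0]]  # toy cells x genes UMI matrix

# Step 1: partition the observed molecules (split fraction is an assumption).
train, test = split_molecules(counts, train_fraction=0.9, rng=rng)

# Step 2: apply a "denoiser" to the training data. Placeholder only:
# it rescales the training counts to the sequencing depth of the test data.
train_total = sum(map(sum, train))
test_total = sum(map(sum, test))
scale = test_total / train_total if train_total else 0.0
denoised = [[c * scale for c in row] for row in train]

# Step 3: compare the denoised result with the held-out test molecules.
print(mean_squared_error(denoised, test))
```

Because the two parts are drawn from the same molecules, each entry of `train` plus the matching entry of `test` always adds up to the original count.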

## Authors & contributors

| name | roles |
|:------------------|:-------------------|
| Wesley Lewis | author, maintainer |
| Scott Gigante | author, maintainer |
| Robrecht Cannoodt | author |
| Kai Waldrant | contributor |

## API

``` mermaid
flowchart LR
file_common_dataset("Common Dataset")
comp_process_dataset[/"Data processor"/]
file_train_h5ad("Training data")
file_test_h5ad("Test data")
comp_control_method[/"Control Method"/]
comp_method[/"Method"/]
comp_metric[/"Metric"/]
file_prediction("Denoised data")
file_score("Score")
file_common_dataset---comp_process_dataset
comp_process_dataset-->file_train_h5ad
comp_process_dataset-->file_test_h5ad
file_train_h5ad---comp_control_method
file_train_h5ad---comp_method
file_test_h5ad---comp_control_method
file_test_h5ad---comp_metric
comp_control_method-->file_prediction
comp_method-->file_prediction
comp_metric-->file_score
file_prediction---comp_metric
```
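
To make the dependencies in the flowchart explicit, the sketch below copies its edges into a small Python structure and derives a valid execution order. It assumes that the undirected `---` links denote "is an input to", which is how the argument tables further down read; the node names are taken from the chart, everything else is illustrative.

```python
from graphlib import TopologicalSorter

# Edges copied from the flowchart as (source, target) pairs.
edges = [
    ("file_common_dataset", "comp_process_dataset"),
    ("comp_process_dataset", "file_train_h5ad"),
    ("comp_process_dataset", "file_test_h5ad"),
    ("file_train_h5ad", "comp_control_method"),
    ("file_train_h5ad", "comp_method"),
    ("file_test_h5ad", "comp_control_method"),
    ("file_test_h5ad", "comp_metric"),
    ("comp_control_method", "file_prediction"),
    ("comp_method", "file_prediction"),
    ("file_prediction", "comp_metric"),
    ("comp_metric", "file_score"),
]

# Map every node to the set of nodes it directly depends on.
inputs = {}
for source, target in edges:
    inputs.setdefault(source, set())
    inputs.setdefault(target, set()).add(source)

# Any order in which each node appears after all of its inputs.
order = list(TopologicalSorter(inputs).static_order())
print(order)

# Regular methods never see the test data; control methods do.
print("file_test_h5ad" in inputs["comp_method"])
print("file_test_h5ad" in inputs["comp_control_method"])
```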

## File format: Common Dataset

A subset of the common dataset.

Example file: `resources_test/common/pancreas/dataset.h5ad`

Format:

<div class="small">

AnnData object
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

</div>

Slot description:

<div class="small">

| Slot | Type | Description |
|:-----------------------------|:----------|:-------------------------------------------------------------------------------|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |

</div>
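
As an illustration of the slot table above, the following sketch checks which slots are required and which are optional. It is not part of the task's tooling: plain dictionaries and nested lists stand in for the `layers` and `uns` members of an AnnData object, and `check_common_dataset` is a hypothetical helper name.

```python
REQUIRED_UNS = ("dataset_id", "dataset_name", "dataset_summary", "dataset_description")
OPTIONAL_UNS = ("dataset_url", "dataset_reference", "dataset_organism")


def check_common_dataset(layers, uns):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    counts = layers.get("counts")
    if counts is None:
        problems.append('missing layers["counts"]')
    elif not all(isinstance(v, int) for row in counts for v in row):
        problems.append('layers["counts"] must contain integers')
    for key in REQUIRED_UNS:
        if not isinstance(uns.get(key), str):
            problems.append(f'uns["{key}"] must be a string')
    for key in OPTIONAL_UNS:
        # Optional slots may be absent, but must be strings when present.
        if key in uns and not isinstance(uns[key], str):
            problems.append(f'uns["{key}"] must be a string when present')
    return problems


layers = {"counts": [[2, 0, 1], [0, 5, 0]]}
uns = {
    "dataset_id": "toy_dataset",  # hypothetical identifier
    "dataset_name": "Toy dataset",
    "dataset_summary": "Two cells, three genes.",
    "dataset_description": "A made-up dataset used only for this sketch.",
}
print(check_common_dataset(layers, uns))
```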

## Component type: Data processor

Path:
[`src/`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/)

A denoising dataset processor.

Arguments:

<div class="small">

| Name | Type | Description |
|:-----------------|:-------|:------------------------------------------------------------------|
| `--input` | `file` | A subset of the common dataset. |
| `--output_train` | `file` | (*Output*) The subset of molecules used for the training dataset. |
| `--output_test` | `file` | (*Output*) The subset of molecules used for the test dataset. |

</div>

## File format: Training data

The subset of molecules used for the training dataset.

Example file: `resources_test/denoising/pancreas/train.h5ad`

Format:

<div class="small">

AnnData object
layers: 'counts'
uns: 'dataset_id'

</div>

Slot description:

<div class="small">

| Slot | Type | Description |
|:--------------------|:----------|:-------------------------------------|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |

</div>

## File format: Test data

The subset of molecules used for the test dataset.

Example file: `resources_test/denoising/pancreas/test.h5ad`

Format:

<div class="small">

AnnData object
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'

</div>

Slot description:

<div class="small">

| Slot | Type | Description |
|:-----------------------------|:----------|:-------------------------------------------------------------------------------|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |
| `uns["train_sum"]` | `integer` | The total number of counts in the training dataset. |

</div>
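
The `uns["train_sum"]` slot ties the test file to its training counterpart. A minimal consistency check might look like the sketch below, which assumes that "total number of counts" means the sum over all cells and genes of the training `counts` layer; the helper names are hypothetical and nested lists stand in for the matrix.

```python
def total_counts(counts):
    """Sum of all entries of a cells x genes count matrix."""
    return sum(sum(row) for row in counts)


def train_sum_matches(train_counts, test_uns):
    """Check that the value stored with the test data matches the training data."""
    return test_uns.get("train_sum") == total_counts(train_counts)


train_counts = [[3, 0, 6], [1, 2, 0]]
test_uns = {"dataset_id": "toy_dataset", "train_sum": total_counts(train_counts)}
print(train_sum_matches(train_counts, test_uns))
```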

## Component type: Control Method

Path:
[`src/control_methods`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/control_methods)

A control method.

Arguments:

<div class="small">

| Name | Type | Description |
|:----------------|:-------|:-------------------------------------------------------|
| `--input_train` | `file` | The subset of molecules used for the training dataset. |
| `--input_test` | `file` | The subset of molecules used for the test dataset. |
| `--output` | `file` | (*Output*) A denoised dataset as output by a method. |

</div>

## Component type: Method

Path:
[`src/methods`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/methods)

A method.

Arguments:

<div class="small">

| Name | Type | Description |
|:----------------|:-------|:-------------------------------------------------------|
| `--input_train` | `file` | The subset of molecules used for the training dataset. |
| `--output` | `file` | (*Output*) A denoised dataset as output by a method. |

</div>
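
The contract in the table above (training data in, denoised data out) can be sketched as a single function. This is not one of the task's methods: the "method" below returns the training counts unchanged, `identity_sketch` is a made-up `method_id`, and dictionaries stand in for the AnnData files. The output slots follow the "Denoised data" file format.

```python
def run_method(train):
    """Hypothetical method: pass the training counts through unchanged."""
    denoised = [list(row) for row in train["layers"]["counts"]]
    return {
        "layers": {"denoised": denoised},
        "uns": {
            "dataset_id": train["uns"]["dataset_id"],  # carried over from the input
            "method_id": "identity_sketch",            # made-up identifier
        },
    }


train = {
    "layers": {"counts": [[3, 0, 6], [1, 2, 0]]},
    "uns": {"dataset_id": "toy_dataset"},
}
prediction = run_method(train)
print(sorted(prediction["layers"]), sorted(prediction["uns"]))
```

Note that the function receives only the training data, which mirrors the absence of an `--input_test` argument for regular methods.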

## Component type: Metric

Path:
[`src/metrics`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/metrics)

A metric.

Arguments:

<div class="small">

| Name | Type | Description |
|:---------------------|:-------|:---------------------------------------------------|
| `--input_test` | `file` | The subset of molecules used for the test dataset. |
| `--input_prediction` | `file` | A denoised dataset as output by a method. |
| `--output` | `file` | (*Output*) File indicating the score of a metric. |

</div>

## File format: Denoised data

A denoised dataset as output by a method.

Example file: `resources_test/denoising/pancreas/denoised.h5ad`

Format:

<div class="small">

AnnData object
layers: 'denoised'
uns: 'dataset_id', 'method_id'

</div>

Slot description:

<div class="small">

| Slot | Type | Description |
|:---------------------|:----------|:-------------------------------------|
| `layers["denoised"]` | `integer` | Denoised data. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["method_id"]` | `string` | A unique identifier for the method. |

</div>

## File format: Score

File indicating the score of a metric.

Example file: `resources_test/denoising/pancreas/score.h5ad`

Format:

<div class="small">

AnnData object
uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

</div>

Slot description:

<div class="small">

| Slot | Type | Description |
|:-----------------------|:---------|:---------------------------------------------------------------------------------------------|
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["method_id"]` | `string` | A unique identifier for the method. |
| `uns["metric_ids"]` | `string` | One or more unique metric identifiers. |
| `uns["metric_values"]` | `double` | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |

</div>
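
The requirement that `metric_values` has the same length as `metric_ids` is easy to enforce when the score is assembled. The sketch below builds the `uns` content from a mapping of metric names to values; `make_score_uns`, `metric_a`, and `metric_b` are hypothetical names used only for illustration.

```python
def make_score_uns(dataset_id, method_id, metrics):
    """Build the uns slots of a score file from a {metric_id: value} mapping."""
    metric_ids = list(metrics)
    metric_values = [float(metrics[m]) for m in metric_ids]
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": metric_values,
    }


def check_score_uns(uns):
    """True when every metric identifier has exactly one value."""
    return len(uns["metric_ids"]) == len(uns["metric_values"])


score = make_score_uns("toy_dataset", "identity_sketch", {"metric_a": 0.25, "metric_b": 1})
print(score["metric_ids"], score["metric_values"], check_score_uns(score))
```

Building both lists from one mapping keeps the identifiers and values aligned by construction.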

For more information on OpenProblems v2, check out the [Documentation](https://openproblems.bio/documentation/) on the Open Problems website.
