Merge remote-tracking branch 'origin/main' into feature/ref-2/fix-benchmark-results
KaiWaldrant committed Sep 19, 2024
2 parents 1c3d189 + 77fa24b commit ac70d4a
Showing 44 changed files with 420 additions and 365 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -6,12 +6,20 @@

* Directory structure has been updated.

* Update to viash 0.9.0 (PR #13).

## NEW FUNCTIONALITY

* Add `CHANGELOG.md` (PR #7).

* Update `process_dataset` component to subsample large datasets (PR #14).

## MAJOR CHANGES

* Revamp `scripts` directory (PR #13).

* Relocated `process_datasets` to `data_processors/process_datasets` (PR #13).

## MINOR CHANGES

* Remove dtype parameter in `.Anndata()` (PR #6).
@@ -22,6 +30,11 @@

* Update docker containers used in components (PR #12).

* Set `numpy<2` for some failing methods (PR #13).

* Small changes to api file names (PR #13).


## transfer from openproblems-v2 repository

### NEW FUNCTIONALITY
181 changes: 51 additions & 130 deletions README.md
@@ -8,75 +8,8 @@ Do not edit this file directly.

Removing noise in sparse single-cell RNA-sequencing count data

Path to source:
[`src`](https://github.com/openproblems-bio/task_denoising/src)

## README

## Installation

You need to have Docker, Java, and Viash installed. Follow [these
instructions](https://openproblems.bio/documentation/fundamentals/requirements)
to install the required dependencies.

## Add a method

To add a method to the repository, follow the instructions in the
`scripts/add_a_method.sh` script.

## Frequently used commands

To get started, you can run the following commands:

``` bash
git clone git@github.com:openproblems-bio/task_denoising.git

cd task_denoising

# initialise submodule
scripts/init_submodule.sh

# download resources
scripts/download_resources.sh
```

To run the benchmark, you first need to build the components.
Afterwards, you can run the benchmark:

``` bash
viash ns build --parallel --setup cachedbuild

scripts/run_benchmark.sh
```

After adding a component, it is recommended to run the tests to ensure
that the component is working correctly:

``` bash
viash ns test --parallel
```

Optionally, you can provide the `--query` argument to test only a subset
of components:

``` bash
viash ns test --parallel --query 'component_name'
```

## Motivation

Single-cell RNA-Seq protocols only detect a fraction of the mRNA
molecules present in each cell. As a result, the measurements (UMI
counts) observed for each gene and each cell are associated with
generally high levels of technical noise ([Grün et al.,
2014](https://www.nature.com/articles/nmeth.2930)). Denoising describes
the task of estimating the true expression level of each gene in each
cell. In the single-cell literature, this task is also referred to as
*imputation*, a term which is typically used for missing data problems
in statistics. Similar to the use of the terms “dropout”, “missing
data”, and “technical zeros”, this terminology can create confusion
about the underlying measurement process ([Sarkar and Stephens,
2020](https://www.biorxiv.org/content/10.1101/2020.04.07.030007v2)).
Repository:
[openproblems-bio/task_denoising](https://github.com/openproblems-bio/task_denoising)

## Description

@@ -114,24 +47,24 @@ dataset.
``` mermaid
flowchart LR
file_common_dataset("Common Dataset")
comp_process_dataset[/"Data processor"/]
file_train_h5ad("Training data")
file_test_h5ad("Test data")
comp_data_processor[/"Data processor"/]
file_test("Test data")
file_train("Training data")
comp_control_method[/"Control Method"/]
comp_method[/"Method"/]
comp_metric[/"Metric"/]
comp_method[/"Method"/]
file_prediction("Denoised data")
file_score("Score")
file_common_dataset---comp_process_dataset
comp_process_dataset-->file_train_h5ad
comp_process_dataset-->file_test_h5ad
file_train_h5ad---comp_control_method
file_train_h5ad---comp_method
file_test_h5ad---comp_control_method
file_test_h5ad---comp_metric
file_common_dataset---comp_data_processor
comp_data_processor-->file_test
comp_data_processor-->file_train
file_test---comp_control_method
file_test---comp_metric
file_train---comp_control_method
file_train---comp_method
comp_control_method-->file_prediction
comp_method-->file_prediction
comp_metric-->file_score
comp_method-->file_prediction
file_prediction---comp_metric
```

@@ -151,7 +84,7 @@ Format:

</div>

Slot description:
Data structure:

<div class="small">

@@ -170,9 +103,6 @@ Slot description:

## Component type: Data processor

Path:
[`src/process_dataset`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/process_dataset)

A denoising dataset processor.

Arguments:
@@ -187,72 +117,69 @@ Arguments:

</div>

## File format: Training data
## File format: Test data

The subset of molecules used for the training dataset
The subset of molecules used for the test dataset

Example file: `resources_test/denoising/pancreas/train.h5ad`
Example file: `resources_test/denoising/pancreas/test.h5ad`

Format:

<div class="small">

AnnData object
layers: 'counts'
uns: 'dataset_id'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'

</div>

Slot description:
Data structure:

<div class="small">

| Slot | Type | Description |
|:--------------------|:----------|:-------------------------------------|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| Slot | Type | Description |
|:---|:---|:---|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |
| `uns["train_sum"]` | `integer` | The total number of counts in the training dataset. |

</div>
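
The table above separates required `uns` slots from those marked (*Optional*). A minimal sketch of that check follows, assuming a plain dict stands in for the `uns` mapping of the test file. The helper name `missing_uns_keys` and the example values are hypothetical and not part of the repository.

```python
# Required and optional `uns` keys, copied from the "Test data" table.
REQUIRED_UNS = [
    "dataset_id",
    "dataset_name",
    "dataset_summary",
    "dataset_description",
    "train_sum",
]
OPTIONAL_UNS = ["dataset_url", "dataset_reference", "dataset_organism"]


def missing_uns_keys(uns):
    """Return the required keys absent from `uns`, in table order.

    Hypothetical helper: `uns` is any dict-like object.
    """
    return [key for key in REQUIRED_UNS if key not in uns]


# Made-up example values, for illustration only.
example_uns = {
    "dataset_id": "pancreas",
    "dataset_name": "Pancreas",
    "dataset_summary": "Short summary.",
    "dataset_description": "Longer description.",
    "train_sum": 1000,
}

print(missing_uns_keys(example_uns))
```

Optional keys are deliberately ignored by the check, so a file without `dataset_url` still passes.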

## File format: Test data
## File format: Training data

The subset of molecules used for the test dataset
The subset of molecules used for the training dataset

Example file: `resources_test/denoising/pancreas/test.h5ad`
Example file: `resources_test/denoising/pancreas/train.h5ad`

Format:

<div class="small">

AnnData object
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'
uns: 'dataset_id'

</div>

Slot description:
Data structure:

<div class="small">

| Slot | Type | Description |
|:---|:---|:---|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |
| `uns["train_sum"]` | `integer` | The total number of counts in the training dataset. |
| Slot | Type | Description |
|:--------------------|:----------|:-------------------------------------|
| `layers["counts"]` | `integer` | Raw counts. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |

</div>
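
The training and test files each hold a subset of the molecules in the common dataset. One way to picture such a split is sketched below, under stated assumptions: each molecule is assigned to the training set independently with probability `train_frac`. The function name, the 0.9 fraction, and the nested-list count matrix are illustrative choices, not the actual `process_dataset` implementation.

```python
import random


def split_counts(counts, train_frac=0.9, seed=0):
    """Split a cells-by-genes count matrix into train and test counts.

    Assumption: every molecule goes to the training set independently
    with probability `train_frac` (binomial thinning).
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        train_row, test_row = [], []
        for n in row:
            # Draw how many of the n molecules land in the training set.
            k = sum(1 for _ in range(n) if rng.random() < train_frac)
            train_row.append(k)
            test_row.append(n - k)
        train.append(train_row)
        test.append(test_row)
    return train, test


# Made-up toy matrix: 2 cells x 3 genes.
counts = [[3, 0, 5], [1, 2, 0]]
train, test = split_counts(counts)

# Corresponds to the `train_sum` slot: total counts in the training data.
train_sum = sum(sum(row) for row in train)
```

Because every molecule lands in exactly one of the two matrices, the train and test counts always add up to the original counts.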

## Component type: Control Method

Path:
[`src/control_methods`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/control_methods)

A control method.

Arguments:
@@ -267,40 +194,34 @@ Arguments:

</div>

## Component type: Method

Path:
[`src/methods`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/methods)
## Component type: Metric

A method.
A metric.

Arguments:

<div class="small">

| Name | Type | Description |
|:---|:---|:---|
| `--input_train` | `file` | The subset of molecules used for the training dataset. |
| `--output` | `file` | (*Output*) A denoised dataset as output by a method. |
| `--input_test` | `file` | The subset of molecules used for the test dataset. |
| `--input_prediction` | `file` | A denoised dataset as output by a method. |
| `--output` | `file` | (*Output*) File indicating the score of a metric. |

</div>

## Component type: Metric

Path:
[`src/metrics`](https://github.com/openproblems-bio/openproblems-v2/tree/main/src/metrics)
## Component type: Method

A metric.
A method.

Arguments:

<div class="small">

| Name | Type | Description |
|:---|:---|:---|
| `--input_test` | `file` | The subset of molecules used for the test dataset. |
| `--input_prediction` | `file` | A denoised dataset as output by a method. |
| `--output` | `file` | (*Output*) File indicating the score of a metric. |
| `--input_train` | `file` | The subset of molecules used for the training dataset. |
| `--output` | `file` | (*Output*) A denoised dataset as output by a method. |

</div>

@@ -320,7 +241,7 @@ Format:

</div>

Slot description:
Data structure:

<div class="small">

@@ -347,7 +268,7 @@ Format:

</div>

Slot description:
Data structure:

<div class="small">
