Prepare task for adding foundation models (#24)
* Update common submodule

* Use checkItemAllowed() for benchmark method check

* Replace cxg_mouse_pancreas_atlas with cxg_immune_cell_atlas

* Update README

* Update CHANGELOG

* Add a base method API schema

* Update CHANGELOG

* Add config check to base method schema

* Add dataset_organism to training dataset files
lazappi authored Dec 9, 2024
1 parent bfa2730 commit 9c77313
Showing 27 changed files with 141 additions and 100 deletions.
18 changes: 15 additions & 3 deletions CHANGELOG.md
@@ -38,6 +38,18 @@

* Update workflows to use core repository dependency (PR #20).

* Update the `common` submodule (PR #24)

* Use the common `checkItemAllowed()` for the method check in the benchmark workflow (PR #24)

* Use the `cxg_immune_cell_atlas` dataset instead of the `cxg_mouse_pancreas_atlas` for testing (PR #24)

* Update `README` (PR #24)

* Add a base method API schema (PR #24)

* Add `dataset_organism` to training input files (PR #24)

## BUG FIXES

* Update the nextflow workflow dependencies (PR #17).
@@ -57,7 +69,7 @@
* `process_dataset`: Added a component for processing common datasets into task-ready dataset objects.

* `resources_test/denoising/pancreas` with `src/tasks/denoising/resources_test_scripts/pancreas.sh`.

* `workflows/run`: Added nf-tower test script. (PR #205)

### V1 MIGRATION
@@ -81,7 +93,7 @@
### Changes from V1

* Anndata layers are used to store data instead of obsm

* extended the use of sparse data in methods unless it was not possible

* process_dataset also removes unnecessary data from train and test datasets not needed by the methods and metrics.
* process_dataset also removes unnecessary data from train and test datasets not needed by the methods and metrics.
31 changes: 15 additions & 16 deletions README.md
@@ -45,16 +45,16 @@ dataset.
## API

``` mermaid
flowchart LR
file_common_dataset("Common Dataset")
comp_data_processor[/"Data processor"/]
file_test("Test data")
file_train("Training data")
comp_control_method[/"Control Method"/]
comp_metric[/"Metric"/]
comp_method[/"Method"/]
file_prediction("Denoised data")
file_score("Score")
flowchart TB
file_common_dataset("<a href='https://github.com/openproblems-bio/task_denoising#file-format-common-dataset'>Common Dataset</a>")
comp_data_processor[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-data-processor'>Data processor</a>"/]
file_test("<a href='https://github.com/openproblems-bio/task_denoising#file-format-test-data'>Test data</a>")
file_train("<a href='https://github.com/openproblems-bio/task_denoising#file-format-training-data'>Training data</a>")
comp_control_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-control-method'>Control Method</a>"/]
comp_metric[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-metric'>Metric</a>"/]
comp_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-method'>Method</a>"/]
file_prediction("<a href='https://github.com/openproblems-bio/task_denoising#file-format-denoised-data'>Denoised data</a>")
file_score("<a href='https://github.com/openproblems-bio/task_denoising#file-format-score'>Score</a>")
file_common_dataset---comp_data_processor
comp_data_processor-->file_test
comp_data_processor-->file_train
@@ -72,8 +72,7 @@ flowchart LR

A subset of the common dataset.

Example file:
`resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad`
Example file: `resources_test/common/cxg_immune_cell_atlas/dataset.h5ad`

Format:

@@ -125,7 +124,7 @@ Arguments:
The subset of molecules used for the test dataset

Example file:
`resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad`
`resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad`

Format:

@@ -160,7 +159,7 @@ Data structure:
The subset of molecules used for the training dataset

Example file:
`resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad`
`resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad`

Format:

@@ -235,7 +234,7 @@ Arguments:
A denoised dataset as output by a method.

Example file:
`resources_test/task_denoising/cxg_mouse_pancreas_atlas/denoised.h5ad`
`resources_test/task_denoising/cxg_immune_cell_atlas/denoised.h5ad`

Format:

@@ -264,7 +263,7 @@ Data structure:
File indicating the score of a metric.

Example file:
`resources_test/task_denoising/cxg_mouse_pancreas_atlas/score.h5ad`
`resources_test/task_denoising/cxg_immune_cell_atlas/score.h5ad`

Format:

2 changes: 1 addition & 1 deletion common
22 changes: 11 additions & 11 deletions scripts/create_resources/test_resources.sh
@@ -15,31 +15,31 @@ mkdir -p $DATASET_DIR

# process dataset
viash run src/data_processors/process_dataset/config.vsh.yaml -- \
--input $RAW_DATA/cxg_mouse_pancreas_atlas/dataset.h5ad \
--output_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
--output_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad
--input $RAW_DATA/cxg_immune_cell_atlas/dataset.h5ad \
--output_train $DATASET_DIR/cxg_immune_cell_atlas/train.h5ad \
--output_test $DATASET_DIR/cxg_immune_cell_atlas/test.h5ad

# run one method
viash run src/methods/magic/config.vsh.yaml -- \
--input_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
--output $DATASET_DIR/cxg_mouse_pancreas_atlas/denoised.h5ad
--input_train $DATASET_DIR/cxg_immune_cell_atlas/train.h5ad \
--output $DATASET_DIR/cxg_immune_cell_atlas/denoised.h5ad

# run one metric
viash run src/metrics/poisson/config.vsh.yaml -- \
--input_prediction $DATASET_DIR/cxg_mouse_pancreas_atlas/denoised.h5ad \
--input_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \
--output $DATASET_DIR/cxg_mouse_pancreas_atlas/score.h5ad
--input_prediction $DATASET_DIR/cxg_immune_cell_atlas/denoised.h5ad \
--input_test $DATASET_DIR/cxg_immune_cell_atlas/test.h5ad \
--output $DATASET_DIR/cxg_immune_cell_atlas/score.h5ad

# write manual state.yaml. this is not actually necessary but you never know it might be useful
cat > $DATASET_DIR/cxg_mouse_pancreas_atlas/state.yaml << HERE
id: cxg_mouse_pancreas_atlas
cat > $DATASET_DIR/cxg_immune_cell_atlas/state.yaml << HERE
id: cxg_immune_cell_atlas
train: !file train.h5ad
test: !file test.h5ad
prediction: !file denoised.h5ad
score: !file score.h5ad
HERE

# only run this if you have access to the openproblems-data bucket
aws s3 sync --profile OP \
aws s3 sync --profile op \
"$DATASET_DIR" s3://openproblems-data/resources_test/task_denoising \
--delete --dryrun
6 changes: 3 additions & 3 deletions scripts/run_benchmark/run_test_local.sh
@@ -20,8 +20,8 @@ nextflow run . \
-profile docker \
-resume \
-c common/nextflow_helpers/labels_ci.config \
--id cxg_mouse_pancreas_atlas \
--input_train resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad \
--input_test resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad \
--id cxg_immune_cell_atlas \
--input_train resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad \
--input_test resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad \
--output_state state.yaml \
--publish_dir "$publish_dir"
6 changes: 3 additions & 3 deletions scripts/run_benchmark/run_test_seqeracloud.sh
@@ -13,9 +13,9 @@ publish_dir_s3="s3://openproblems-nextflow/temp/results/task_denoising/$(date +%

# write the parameters to file
cat > /tmp/params.yaml << HERE
id: cxg_mouse_pancreas_atlas
input_train: $resources_test_s3/cxg_mouse_pancreas_atlas/train.h5ad
input_test: $resources_test_s3/cxg_mouse_pancreas_atlas/test.h5ad
id: cxg_immune_cell_atlas
input_train: $resources_test_s3/cxg_immune_cell_atlas/train.h5ad
input_test: $resources_test_s3/cxg_immune_cell_atlas/test.h5ad
output_state: "state.yaml"
publish_dir: $publish_dir_s3
HERE
20 changes: 20 additions & 0 deletions src/api/base_method.yaml
@@ -0,0 +1,20 @@
namespace: "methods"
info:
type: method
type_info:
label: Method
summary: A method.
description: |
A denoising method to remove noise (i.e. technical artifacts) from a dataset.
arguments:
- name: --input_train
__merge__: file_train.yaml
required: true
direction: input
- name: --output
__merge__: file_prediction.yaml
required: true
direction: output
test_resources:
- type: python_script
path: /common/component_tests/check_config.py
6 changes: 3 additions & 3 deletions src/api/comp_control_method.yaml
@@ -9,7 +9,7 @@ info:
but also receive the solution object as input. It serves as a
starting point to test the relative accuracy of new methods in
the task, and also as a quality control for the metrics defined
in the task.
in the task.
arguments:
- name: --input_train
__merge__: file_train.yaml
@@ -29,5 +29,5 @@ test_resources:
- type: python_script
path: /common/component_tests/check_config.py
- path: /common/library.bib
- path: /resources_test/task_denoising/cxg_mouse_pancreas_atlas
dest: resources_test/task_denoising/cxg_mouse_pancreas_atlas
- path: /resources_test/task_denoising/cxg_immune_cell_atlas
dest: resources_test/task_denoising/cxg_immune_cell_atlas
4 changes: 2 additions & 2 deletions src/api/comp_data_processor.yaml
@@ -22,5 +22,5 @@ arguments:
test_resources:
- type: python_script
path: /common/component_tests/run_and_check_output.py
- path: /resources_test/common/cxg_mouse_pancreas_atlas
dest: resources_test/common/cxg_mouse_pancreas_atlas
- path: /resources_test/common/cxg_immune_cell_atlas
dest: resources_test/common/cxg_immune_cell_atlas
22 changes: 3 additions & 19 deletions src/api/comp_method.yaml
@@ -1,25 +1,9 @@
namespace: "methods"
info:
type: method
type_info:
label: Method
summary: A method.
description: |
A denoising method to remove noise (i.e. technical artifacts) from a dataset.
arguments:
- name: --input_train
__merge__: file_train.yaml
required: true
direction: input
- name: --output
__merge__: file_prediction.yaml
required: true
direction: output
__merge__: base_method.yaml
test_resources:
- type: python_script
path: /common/component_tests/run_and_check_output.py
- type: python_script
path: /common/component_tests/check_config.py
- path: /common/library.bib
- path: /resources_test/task_denoising/cxg_mouse_pancreas_atlas
dest: resources_test/task_denoising/cxg_mouse_pancreas_atlas
- path: /resources_test/task_denoising/cxg_immune_cell_atlas
dest: resources_test/task_denoising/cxg_immune_cell_atlas
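
The refactor above moves the shared method arguments into `base_method.yaml` and pulls them back into `comp_method.yaml` through `__merge__`. As a rough mental model only, the effect can be sketched in plain Python. This is not Viash's implementation, and the merge rules used here are an assumption: nested mappings are merged recursively, lists are concatenated with the base entries first, and scalars are overridden by the including file. The `deep_merge` helper is a hypothetical name.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Illustrative merge: dicts recurse, lists concatenate, scalars are overridden.

    ASSUMPTION: this mimics the intent of `__merge__`; it is not Viash's code.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        elif isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + deepcopy(value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Trimmed-down stand-ins for src/api/base_method.yaml and src/api/comp_method.yaml.
base_method = {
    "namespace": "methods",
    "info": {"type": "method"},
    "arguments": [{"name": "--input_train"}, {"name": "--output"}],
    "test_resources": [
        {"type": "python_script", "path": "/common/component_tests/check_config.py"},
    ],
}
comp_method = {
    "test_resources": [
        {"type": "python_script", "path": "/common/component_tests/run_and_check_output.py"},
        {"path": "/common/library.bib"},
    ],
}

resolved = deep_merge(base_method, comp_method)
print([resource["path"] for resource in resolved["test_resources"]])
```

Under these assumed rules, `check_config.py` only needs to be listed once in the base schema, which matches its removal from `comp_method.yaml` in this commit.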
4 changes: 2 additions & 2 deletions src/api/comp_metric.yaml
@@ -25,5 +25,5 @@ test_resources:
- type: python_script
path: /common/component_tests/run_and_check_output.py
- path: /common/library.bib
- path: /resources_test/task_denoising/cxg_mouse_pancreas_atlas
dest: resources_test/task_denoising/cxg_mouse_pancreas_atlas
- path: /resources_test/task_denoising/cxg_immune_cell_atlas
dest: resources_test/task_denoising/cxg_immune_cell_atlas
6 changes: 3 additions & 3 deletions src/api/file_common_dataset.yaml
@@ -1,11 +1,11 @@
type: file
example: "resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad"
example: "resources_test/common/cxg_immune_cell_atlas/dataset.h5ad"
label: "Common Dataset"
summary: A subset of the common dataset.
info:
format:
type: h5ad
layers:
layers:
- type: integer
name: counts
description: Raw counts
@@ -15,7 +15,7 @@ info:
name: batch
description: Batch information
required: false

uns:
- type: string
name: dataset_id
4 changes: 2 additions & 2 deletions src/api/file_prediction.yaml
@@ -1,5 +1,5 @@
type: file
example: "resources_test/task_denoising/cxg_mouse_pancreas_atlas/denoised.h5ad"
example: "resources_test/task_denoising/cxg_immune_cell_atlas/denoised.h5ad"
label: "Denoised data"
summary: A denoised dataset as output by a method.
info:
@@ -18,4 +18,4 @@ info:
- type: string
name: method_id
description: "A unique identifier for the method"
required: true
required: true
4 changes: 2 additions & 2 deletions src/api/file_score.yaml
@@ -1,5 +1,5 @@
type: file
example: resources_test/task_denoising/cxg_mouse_pancreas_atlas/score.h5ad
example: resources_test/task_denoising/cxg_immune_cell_atlas/score.h5ad
label: Score
summary: "File indicating the score of a metric."
info:
@@ -23,4 +23,4 @@ info:
name: metric_values
description: "The metric values obtained for the given prediction. Must be of same length as 'metric_ids'."
multiple: true
required: true
required: true
6 changes: 3 additions & 3 deletions src/api/file_test.yaml
@@ -1,11 +1,11 @@
type: file
example: "resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad"
example: "resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad"
label: "Test data"
summary: The subset of molecules used for the test dataset
info:
format:
type: h5ad
layers:
layers:
- type: integer
name: counts
description: Raw counts
@@ -42,4 +42,4 @@ info:
- name: train_sum
type: integer
description: The total number of counts in the training dataset.
required: true
required: true
10 changes: 7 additions & 3 deletions src/api/file_train.yaml
@@ -1,11 +1,11 @@
type: file
example: "resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad"
example: "resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad"
label: "Training data"
summary: The subset of molecules used for the training dataset
info:
format:
type: h5ad
layers:
layers:
- type: integer
name: counts
description: Raw counts
@@ -14,4 +14,8 @@ info:
- type: string
name: dataset_id
description: "A unique identifier for the dataset"
required: true
required: true
- name: dataset_organism
type: string
description: The organism of the sample in the dataset.
required: false
4 changes: 2 additions & 2 deletions src/control_methods/perfect_denoising/script.py
@@ -2,8 +2,8 @@

## VIASH START
par = {
'input_train': 'resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad',
'input_test': 'resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad',
'input_train': 'resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad',
'input_test': 'resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad',
'output': 'output_PD.h5ad',
}
meta = {
5 changes: 3 additions & 2 deletions src/data_processors/process_dataset/script.py
@@ -45,7 +45,7 @@
obs_filt = np.ones(dtype=np.bool_, shape=adata_output.n_obs)
obs_index = np.random.choice(np.where(obs_filt)[0], par["n_obs_limit"], replace=False)
adata_output = adata_output[obs_index].copy()

# remove all layers except for counts
print(">> Remove all layers except for counts", flush=True)
for key in list(adata_output.layers.keys()):
@@ -70,11 +70,12 @@

# copy adata to train_set, test_set
print(">> Create AnnData output objects", flush=True)
train_uns_keys = ["dataset_id", "dataset_organism"]
output_train = ad.AnnData(
layers={"counts": X_train},
obs=adata_output.obs[[]],
var=adata_output.var[[]],
uns={"dataset_id": adata_output.uns["dataset_id"]}
uns={key: adata_output.uns[key] for key in train_uns_keys}
)
test_uns_keys = ["dataset_id", "dataset_name", "dataset_url", "dataset_reference", "dataset_summary", "dataset_description", "dataset_organism"]
output_test = ad.AnnData(
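
The change to `process_dataset/script.py` builds the training `uns` from a key list instead of a single hard-coded entry. `file_train.yaml` marks `dataset_organism` as `required: false`, whereas the dictionary comprehension in the diff would raise a `KeyError` if an input dataset lacked that key. A minimal sketch of a more tolerant variant follows, using plain dictionaries in place of `AnnData.uns`. The `select_uns` helper and its `required` / `optional` parameters are hypothetical names, not part of the repository.

```python
def select_uns(uns, required, optional=()):
    """Copy selected metadata keys; fail on missing required keys, skip missing optional ones."""
    missing = [key for key in required if key not in uns]
    if missing:
        raise KeyError(f"missing required uns keys: {missing}")
    selected = {key: uns[key] for key in required}
    selected.update({key: uns[key] for key in optional if key in uns})
    return selected


# Stand-in for adata_output.uns (values are made up for illustration).
uns = {
    "dataset_id": "example_dataset",
    "dataset_name": "Example",
    "dataset_organism": "homo_sapiens",
}

train_uns = select_uns(uns, required=["dataset_id"], optional=["dataset_organism"])
print(train_uns)
```

Whether the stricter behaviour in the commit is preferable depends on whether every common dataset is guaranteed to carry `dataset_organism`; the diff alone does not say.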