Prepare task for adding foundation models (#24)

* Update common submodule
* Use checkItemAllowed() for benchmark method check
* Replace cxg_mouse_pancreas_atlas with cxg_immune_cell_atlas
* Update README
* Update CHANGELOG
* Add a base method API schema
* Update CHANGELOG
* Add config check to base method schema
* Add dataset_organism to training dataset files
openproblems-bio · Dec 9, 2024 · 9c77313
1 parent bfa2730
commit 9c77313

Showing 27 changed files with 141 additions and 100 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -38,6 +38,18 @@
 
 * Update workflows to use core repository dependency (PR #20).
 
+* Update the `common` submodule (PR #24)
+
+* Use the common `checkItemAllowed()` for the method check in the benchmark workflow (PR #24)
+
+* Use the `cxg_immune_cell_atlas` dataset instead of the `cxg_mouse_pancreas_atlas` for testing (PR #24)
+
+* Update `README` (PR #24)
+
+* Add a base method API schema (PR #24)
+
+* Add `dataset_organism` to training input files (PR #24)
+
 ## BUG FIXES
 
 * Update the nextflow workflow dependencies (PR #17).
@@ -57,7 +69,7 @@
 * `process_dataset`: Added a component for processing common datasets into task-ready dataset objects.
 
 * `resources_test/denoising/pancreas` with `src/tasks/denoising/resources_test_scripts/pancreas.sh`.
-  
+
 * `workflows/run`: Added nf-tower test script. (PR #205)
 
 ### V1 MIGRATION
@@ -81,7 +93,7 @@
 ### Changes from V1
 
 * Anndata layers are used to store data instead of obsm
-  
+
 * extended the use of sparse data in methods unless it was not possible
 
-* process_dataset also removes unnecessary data from train and test datasets not needed by the methods and metrics.
+* process_dataset also removes unnecessary data from train and test datasets not needed by the methods and metrics.
diff --git a/README.md b/README.md
@@ -45,16 +45,16 @@ dataset.
 ## API
 
 ``` mermaid
-flowchart LR
-  file_common_dataset("Common Dataset")
-  comp_data_processor[/"Data processor"/]
-  file_test("Test data")
-  file_train("Training data")
-  comp_control_method[/"Control Method"/]
-  comp_metric[/"Metric"/]
-  comp_method[/"Method"/]
-  file_prediction("Denoised data")
-  file_score("Score")
+flowchart TB
+  file_common_dataset("<a href='https://github.com/openproblems-bio/task_denoising#file-format-common-dataset'>Common Dataset</a>")
+  comp_data_processor[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-data-processor'>Data processor</a>"/]
+  file_test("<a href='https://github.com/openproblems-bio/task_denoising#file-format-test-data'>Test data</a>")
+  file_train("<a href='https://github.com/openproblems-bio/task_denoising#file-format-training-data'>Training data</a>")
+  comp_control_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-control-method'>Control Method</a>"/]
+  comp_metric[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-metric'>Metric</a>"/]
+  comp_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-method'>Method</a>"/]
+  file_prediction("<a href='https://github.com/openproblems-bio/task_denoising#file-format-denoised-data'>Denoised data</a>")
+  file_score("<a href='https://github.com/openproblems-bio/task_denoising#file-format-score'>Score</a>")
   file_common_dataset---comp_data_processor
   comp_data_processor-->file_test
   comp_data_processor-->file_train
@@ -72,8 +72,7 @@ flowchart LR
 
 A subset of the common dataset.
 
-Example file:
-`resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad`
+Example file: `resources_test/common/cxg_immune_cell_atlas/dataset.h5ad`
 
 Format:
 
@@ -125,7 +124,7 @@ Arguments:
 The subset of molecules used for the test dataset
 
 Example file:
-`resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad`
+`resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad`
 
 Format:
 
@@ -160,7 +159,7 @@ Data structure:
 The subset of molecules used for the training dataset
 
 Example file:
-`resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad`
+`resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad`
 
 Format:
 
@@ -235,7 +234,7 @@ Arguments:
 A denoised dataset as output by a method.
 
 Example file:
-`resources_test/task_denoising/cxg_mouse_pancreas_atlas/denoised.h5ad`
+`resources_test/task_denoising/cxg_immune_cell_atlas/denoised.h5ad`
 
 Format:
 
@@ -264,7 +263,7 @@ Data structure:
 File indicating the score of a metric.
 
 Example file:
-`resources_test/task_denoising/cxg_mouse_pancreas_atlas/score.h5ad`
+`resources_test/task_denoising/cxg_immune_cell_atlas/score.h5ad`
 
 Format:
 

diff --git a/common b/common
diff --git a/scripts/create_resources/test_resources.sh b/scripts/create_resources/test_resources.sh
@@ -15,31 +15,31 @@ mkdir -p $DATASET_DIR
 
 # process dataset
 viash run src/data_processors/process_dataset/config.vsh.yaml -- \
-  --input $RAW_DATA/cxg_mouse_pancreas_atlas/dataset.h5ad \
-  --output_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
-  --output_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad
+  --input $RAW_DATA/cxg_immune_cell_atlas/dataset.h5ad \
+  --output_train $DATASET_DIR/cxg_immune_cell_atlas/train.h5ad \
+  --output_test $DATASET_DIR/cxg_immune_cell_atlas/test.h5ad
 
 # run one method
 viash run src/methods/magic/config.vsh.yaml -- \
-    --input_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
-    --output $DATASET_DIR/cxg_mouse_pancreas_atlas/denoised.h5ad
+    --input_train $DATASET_DIR/cxg_immune_cell_atlas/train.h5ad \
+    --output $DATASET_DIR/cxg_immune_cell_atlas/denoised.h5ad
 
 # run one metric
 viash run src/metrics/poisson/config.vsh.yaml -- \
-    --input_prediction $DATASET_DIR/cxg_mouse_pancreas_atlas/denoised.h5ad \
-    --input_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \
-    --output $DATASET_DIR/cxg_mouse_pancreas_atlas/score.h5ad
+    --input_prediction $DATASET_DIR/cxg_immune_cell_atlas/denoised.h5ad \
+    --input_test $DATASET_DIR/cxg_immune_cell_atlas/test.h5ad \
+    --output $DATASET_DIR/cxg_immune_cell_atlas/score.h5ad
 
 # write manual state.yaml. this is not actually necessary but you never know it might be useful
-cat > $DATASET_DIR/cxg_mouse_pancreas_atlas/state.yaml << HERE
-id: cxg_mouse_pancreas_atlas
+cat > $DATASET_DIR/cxg_immune_cell_atlas/state.yaml << HERE
+id: cxg_immune_cell_atlas
 train: !file train.h5ad
 test: !file test.h5ad
 prediction: !file denoised.h5ad
 score: !file score.h5ad
 HERE
 
 # only run this if you have access to the openproblems-data bucket
-aws s3 sync --profile OP \
+aws s3 sync --profile op \
   "$DATASET_DIR" s3://openproblems-data/resources_test/task_denoising \
   --delete --dryrun
diff --git a/scripts/run_benchmark/run_test_local.sh b/scripts/run_benchmark/run_test_local.sh
@@ -20,8 +20,8 @@ nextflow run . \
   -profile docker \
   -resume \
   -c common/nextflow_helpers/labels_ci.config \
-  --id cxg_mouse_pancreas_atlas \
-  --input_train resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad \
-  --input_test resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad \
+  --id cxg_immune_cell_atlas \
+  --input_train resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad \
+  --input_test resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad \
   --output_state state.yaml \
   --publish_dir "$publish_dir"
diff --git a/scripts/run_benchmark/run_test_seqeracloud.sh b/scripts/run_benchmark/run_test_seqeracloud.sh
@@ -13,9 +13,9 @@ publish_dir_s3="s3://openproblems-nextflow/temp/results/task_denoising/$(date +%
 
 # write the parameters to file
 cat > /tmp/params.yaml << HERE
-id: cxg_mouse_pancreas_atlas
-input_train: $resources_test_s3/cxg_mouse_pancreas_atlas/train.h5ad
-input_test: $resources_test_s3/cxg_mouse_pancreas_atlas/test.h5ad
+id: cxg_immune_cell_atlas
+input_train: $resources_test_s3/cxg_immune_cell_atlas/train.h5ad
+input_test: $resources_test_s3/cxg_immune_cell_atlas/test.h5ad
 output_state: "state.yaml"
 publish_dir: $publish_dir_s3
 HERE

diff --git a/src/api/base_method.yaml b/src/api/base_method.yaml
@@ -0,0 +1,20 @@
+namespace: "methods"
+info:
+  type: method
+  type_info:
+    label: Method
+    summary: A method.
+    description: |
+      A denoising method to remove noise (i.e. technical artifacts) from a dataset.
+arguments:
+  - name: --input_train
+    __merge__: file_train.yaml
+    required: true
+    direction: input
+  - name: --output
+    __merge__: file_prediction.yaml
+    required: true
+    direction: output
+test_resources:
+  - type: python_script
+    path: /common/component_tests/check_config.py
diff --git a/src/api/comp_control_method.yaml b/src/api/comp_control_method.yaml
@@ -9,7 +9,7 @@ info:
       but also receive the solution object as input. It serves as a
       starting point to test the relative accuracy of new methods in
       the task, and also as a quality control for the metrics defined
-      in the task. 
+      in the task.
 arguments:
   - name: --input_train
     __merge__: file_train.yaml
@@ -29,5 +29,5 @@ test_resources:
   - type: python_script
     path: /common/component_tests/check_config.py
   - path: /common/library.bib
-  - path: /resources_test/task_denoising/cxg_mouse_pancreas_atlas
-    dest: resources_test/task_denoising/cxg_mouse_pancreas_atlas
+  - path: /resources_test/task_denoising/cxg_immune_cell_atlas
+    dest: resources_test/task_denoising/cxg_immune_cell_atlas
diff --git a/src/api/comp_data_processor.yaml b/src/api/comp_data_processor.yaml
@@ -22,5 +22,5 @@ arguments:
 test_resources:
   - type: python_script
     path: /common/component_tests/run_and_check_output.py
-  - path: /resources_test/common/cxg_mouse_pancreas_atlas
-    dest: resources_test/common/cxg_mouse_pancreas_atlas
+  - path: /resources_test/common/cxg_immune_cell_atlas
+    dest: resources_test/common/cxg_immune_cell_atlas
diff --git a/src/api/comp_method.yaml b/src/api/comp_method.yaml
@@ -1,25 +1,9 @@
-namespace: "methods"
-info:
-  type: method
-  type_info:
-    label: Method
-    summary: A method.
-    description: |
-      A denoising method to remove noise (i.e. technical artifacts) from a dataset.
-arguments:
-  - name: --input_train
-    __merge__: file_train.yaml
-    required: true
-    direction: input
-  - name: --output
-    __merge__: file_prediction.yaml
-    required: true
-    direction: output
+__merge__: base_method.yaml
 test_resources:
   - type: python_script
     path: /common/component_tests/run_and_check_output.py
   - type: python_script
     path: /common/component_tests/check_config.py
   - path: /common/library.bib
-  - path: /resources_test/task_denoising/cxg_mouse_pancreas_atlas
-    dest: resources_test/task_denoising/cxg_mouse_pancreas_atlas
+  - path: /resources_test/task_denoising/cxg_immune_cell_atlas
+    dest: resources_test/task_denoising/cxg_immune_cell_atlas
diff --git a/src/api/comp_metric.yaml b/src/api/comp_metric.yaml
@@ -25,5 +25,5 @@ test_resources:
   - type: python_script
     path: /common/component_tests/run_and_check_output.py
   - path: /common/library.bib
-  - path: /resources_test/task_denoising/cxg_mouse_pancreas_atlas
-    dest: resources_test/task_denoising/cxg_mouse_pancreas_atlas
+  - path: /resources_test/task_denoising/cxg_immune_cell_atlas
+    dest: resources_test/task_denoising/cxg_immune_cell_atlas
diff --git a/src/api/file_common_dataset.yaml b/src/api/file_common_dataset.yaml
@@ -1,11 +1,11 @@
 type: file
-example: "resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad"
+example: "resources_test/common/cxg_immune_cell_atlas/dataset.h5ad"
 label: "Common Dataset"
 summary: A subset of the common dataset.
 info:
   format:
     type: h5ad
-    layers: 
+    layers:
       - type: integer
         name: counts
         description: Raw counts
@@ -15,7 +15,7 @@ info:
         name: batch
         description: Batch information
         required: false
-      
+
     uns:
       - type: string
         name: dataset_id

diff --git a/src/api/file_prediction.yaml b/src/api/file_prediction.yaml
@@ -1,5 +1,5 @@
 type: file
-example: "resources_test/task_denoising/cxg_mouse_pancreas_atlas/denoised.h5ad"
+example: "resources_test/task_denoising/cxg_immune_cell_atlas/denoised.h5ad"
 label: "Denoised data"
 summary: A denoised dataset as output by a method.
 info:
@@ -18,4 +18,4 @@ info:
       - type: string
         name: method_id
         description: "A unique identifier for the method"
-        required: true
+        required: true
diff --git a/src/api/file_score.yaml b/src/api/file_score.yaml
@@ -1,5 +1,5 @@
 type: file
-example: resources_test/task_denoising/cxg_mouse_pancreas_atlas/score.h5ad
+example: resources_test/task_denoising/cxg_immune_cell_atlas/score.h5ad
 label: Score
 summary: "File indicating the score of a metric."
 info:
@@ -23,4 +23,4 @@ info:
         name: metric_values
         description: "The metric values obtained for the given prediction. Must be of same length as 'metric_ids'."
         multiple: true
-        required: true
+        required: true
diff --git a/src/api/file_test.yaml b/src/api/file_test.yaml
@@ -1,11 +1,11 @@
 type: file
-example: "resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad"
+example: "resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad"
 label: "Test data"
 summary: The subset of molecules used for the test dataset
 info:
   format:
     type: h5ad
-    layers: 
+    layers:
       - type: integer
         name: counts
         description: Raw counts
@@ -42,4 +42,4 @@ info:
       - name: train_sum
         type: integer
         description: The total number of counts in the training dataset.
-        required: true
+        required: true
diff --git a/src/api/file_train.yaml b/src/api/file_train.yaml
@@ -1,11 +1,11 @@
 type: file
-example: "resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad"
+example: "resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad"
 label: "Training data"
 summary: The subset of molecules used for the training dataset
 info:
   format:
     type: h5ad
-    layers: 
+    layers:
       - type: integer
         name: counts
         description: Raw counts
@@ -14,4 +14,8 @@ info:
       - type: string
         name: dataset_id
         description: "A unique identifier for the dataset"
-        required: true
+        required: true
+      - name: dataset_organism
+        type: string
+        description: The organism of the sample in the dataset.
+        required: false
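
The schema change above declares `dataset_organism` as optional (`required: false`) in the training file, while the `process_dataset` change in this commit copies every key in `train_uns_keys` with a plain dictionary lookup, which would raise a `KeyError` for a dataset that lacks the field. A minimal sketch of a more tolerant copy follows. It is illustrative only and not part of the commit: `copy_uns` is a hypothetical helper name, and a plain `dict` stands in for `adata_output.uns`.

```python
def copy_uns(uns, required_keys, optional_keys=()):
    """Copy selected metadata keys from a dict-like `uns`.

    Hypothetical helper, not part of the repository: required keys must be
    present, optional keys are copied only when they exist.
    """
    missing = [key for key in required_keys if key not in uns]
    if missing:
        raise KeyError(f"missing required uns keys: {missing}")
    out = {key: uns[key] for key in required_keys}
    # Optional keys are skipped silently when absent.
    out.update({key: uns[key] for key in optional_keys if key in uns})
    return out


# Plain dicts stand in for `adata_output.uns` (assumption for illustration).
with_organism = {"dataset_id": "cxg_immune_cell_atlas", "dataset_organism": "homo_sapiens"}
without_organism = {"dataset_id": "cxg_immune_cell_atlas"}

train_uns = copy_uns(with_organism, ["dataset_id"], ["dataset_organism"])
train_uns_minimal = copy_uns(without_organism, ["dataset_id"], ["dataset_organism"])
```

Whether the strict lookup or a tolerant copy is preferable depends on whether the common datasets are guaranteed to carry `dataset_organism`; the diff alone does not settle that. The organism value used here is a made-up placeholder.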
diff --git a/src/control_methods/perfect_denoising/script.py b/src/control_methods/perfect_denoising/script.py
@@ -2,8 +2,8 @@
 
 ## VIASH START
 par = {
-    'input_train': 'resources_test/task_denoising/cxg_mouse_pancreas_atlas/train.h5ad',
-    'input_test': 'resources_test/task_denoising/cxg_mouse_pancreas_atlas/test.h5ad',
+    'input_train': 'resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad',
+    'input_test': 'resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad',
     'output': 'output_PD.h5ad',
 }
 meta = {

diff --git a/src/data_processors/process_dataset/script.py b/src/data_processors/process_dataset/script.py
@@ -45,7 +45,7 @@
     obs_filt = np.ones(dtype=np.bool_, shape=adata_output.n_obs)
     obs_index = np.random.choice(np.where(obs_filt)[0], par["n_obs_limit"], replace=False)
     adata_output = adata_output[obs_index].copy()
-        
+
 # remove all layers except for counts
 print(">> Remove all layers except for counts", flush=True)
 for key in list(adata_output.layers.keys()):
@@ -70,11 +70,12 @@
 
 # copy adata to train_set, test_set
 print(">> Create AnnData output objects", flush=True)
+train_uns_keys = ["dataset_id", "dataset_organism"]
 output_train = ad.AnnData(
     layers={"counts": X_train},
     obs=adata_output.obs[[]],
     var=adata_output.var[[]],
-    uns={"dataset_id": adata_output.uns["dataset_id"]}
+    uns={key: adata_output.uns[key] for key in train_uns_keys}
 )
 test_uns_keys = ["dataset_id", "dataset_name", "dataset_url", "dataset_reference", "dataset_summary", "dataset_description", "dataset_organism"]
 output_test = ad.AnnData(