deploy: 252731b

openproblems-bio · Dec 19, 2024 · 7aea717
commit 7aea717
Showing 155 changed files with 81,852 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,9 @@
+resources
+resources_test
+work
+.nextflow*
+.vscode
+.DS_Store
+output
+trace-*
+.ipynb_checkpoints
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "common"]
+	path = common
+	url = git@github.com:openproblems-bio/common-resources.git
diff --git a/.nojekyll b/.nojekyll
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,101 @@
+# denoising 0.1.0
+
+## BREAKING CHANGES
+
+* Update to viash 0.9.0 RC6
+
+* Directory structure has been updated.
+
+* Update to viash 0.9.0 (PR #13).
+
+## NEW FUNCTIONALITY
+
+* Add `CHANGELOG.md` (PR #7).
+
+* Update `process_dataset` component to subsample large datasets (PR #14).
+
+* Add the scPRINT method (PR #25)
+
+## MAJOR CHANGES
+
+* Revamp `scripts` directory (PR #13).
+
+* Relocated `process_datasets` to `data_processors/process_datasets` (PR #13).
+
+## MINOR CHANGES
+
+* Remove `dtype` parameter in `AnnData()` (PR #6).
+
+* Fix target_sum deprecation warning in `mse` metric (PR #8).
+
+* Update `task_name` variable to denoising in component scripts (PR #9).
+
+* Update docker containers used in components (PR #12).
+
+* Set `numpy<2` for some failing methods (PR #13).
+
+* Small changes to api file names (PR #13).
+
+* Update test_resources path in components (PR #18).
+
+* Update workflows to use core repository dependency (PR #20).
+
+* Update the `common` submodule (PR #24)
+
+* Use the common `checkItemAllowed()` for the method check in the benchmark workflow (PR #24)
+
+* Use the `cxg_immune_cell_atlas` dataset instead of the `cxg_mouse_pancreas_atlas` for testing (PR #24)
+
+* Update `README` (PR #24)
+
+* Add a base method API schema (PR #24)
+
+* Add `dataset_organism` to training input files (PR #24)
+
+## BUG FIXES
+
+* Update the nextflow workflow dependencies (PR #17).
+
+* Fix paths in scripts (PR #18).
+
+* Subsample datasets by batch if batch is defined (PR #22).
+
+## transfer from openproblems-v2 repository
+
+### NEW FUNCTIONALITY
+
+* `api/file_*`: Created file format specifications for the h5ad files throughout the pipeline.
+
+* `api/comp_*`: Created an api definition for the split, method and metric components.
+
+* `process_dataset`: Added a component for processing common datasets into task-ready dataset objects.
+
+* `resources_test/denoising/pancreas` with `src/tasks/denoising/resources_test_scripts/pancreas.sh`.
+
+* `workflows/run`: Added nf-tower test script. (PR #205)
+
+### V1 MIGRATION
+
+* `control_methods/no_denoising`: Migrated from v1. Extracted from baseline method
+
+* `control_methods/perfect_denoising`: Migrated from v1. Extracted from baseline method
+
+* `methods/alra`: Migrated from v1. Changed from python to R and uses lg_cpm normalised data instead of L1 sqrt
+
+* `methods/dca`: Migrated and adapted from v1.
+
+* `methods/knn_smoothing`: Migrated and adapted from v1.
+
+* `methods/magic`: Migrated from v1.
+
+* `metrics/mse`: Migrated from v1.
+
+* `metrics/poisson`: Migrated from v1.
+
+### Changes from V1
+
+* Anndata layers are used to store data instead of obsm
+
+* Extended the use of sparse data in methods wherever possible.
+
+* `process_dataset` also removes data not needed by the methods and metrics from the train and test datasets.
diff --git a/INSTRUCTIONS.md b/INSTRUCTIONS.md
@@ -0,0 +1,73 @@
+# Instructions
+
+This guide explains what to do after you have created a new task repository from the template. More in-depth information about how to create a new task can be found in the [OpenProblems Documentation](https://openproblems.bio/documentation/create_task/).
+
+## First things first
+
+* Update the `_viash.yaml` file with the correct task information.
+* Update the `src/api/task_info.yaml` file with the information you have provided in the task issue.
+
+## Resources
+
+The OpenProblems team has provided some test resources that can be used to test the task. These resources are stored in the `resources` folder. The `scripts/download_resources.sh` script can be used to download these resources.
+
+If these resources are not sufficient, you can add more resources to the `resources` folder. The `scripts/download_resources.sh` script can be updated to download these resources.
+
+
+
+
+
+<!-- Add to readme 
+* update _viash.yaml
+* update src/api/task_info.yaml
+* update scripts/download_resources
+-->
+
+#!/bin/bash
+
+echo "This script is not supposed to be run directly."
+echo "Please run the script step-by-step."
+exit 1
+
+# sync resources
+scripts/download_resources.sh
+
+# create a new component
+method_id="my_metric"
+method_lang="python" # change this to "r" if need be
+
+common/create_component/create_component -- \
+  --language "$method_lang" \
+  --name "$method_id"
+
+# TODO: fill in required fields in src/task/methods/$method_id/config.vsh.yaml
+# TODO: edit src/task/methods/$method_id/script.py (or script.R)
+
+# test the component
+viash test src/task/methods/$method_id/config.vsh.yaml
+
+# rebuild the container (only if you change something to the docker platform)
+# You can reduce the memory and cpu allotted to jobs in _viash.yaml by modifying .platforms[.type == "nextflow"].config.labels
+viash run src/task/methods/$method_id/config.vsh.yaml -- \
+  ---setup cachedbuild ---verbose
+
+# run the method (using parquet as input)
+viash run src/task/methods/$method_id/config.vsh.yaml -- \
+  --de_train "resources/neurips-2023-kaggle/de_train.parquet" \
+  --id_map "resources/neurips-2023-kaggle/id_map.csv" \
+  --output "output/prediction.parquet"
+
+# run the method (using h5ad as input)
+viash run src/task/methods/$method_id/config.vsh.yaml -- \
+  --de_train_h5ad "resources/neurips-2023-kaggle/2023-09-12_de_by_cell_type_train.h5ad" \
+  --id_map "resources/neurips-2023-kaggle/id_map.csv" \
+  --output "output/prediction.parquet"
+
+# run evaluation metric
+viash run src/task/metrics/mean_rowwise_error/config.vsh.yaml -- \
+  --de_test "resources/neurips-2023-kaggle/de_test.parquet" \
+  --prediction "output/prediction.parquet" \
+  --output "output/score.h5ad"
+
+# print score on kaggle test dataset
+python -c 'import anndata; print(anndata.read_h5ad("output/score.h5ad").uns)'
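
The last step of the script above dumps the whole `uns` mapping of the score file. A minimal sketch of a more readable report follows, assuming `uns` stores parallel `metric_ids` and `metric_values` entries (an assumption; the diff only shows `uns` being printed). The names `format_scores` and `print_scores` are hypothetical helpers, not part of this repository.

```python
def format_scores(uns):
    """Return one 'metric_id: value' line per metric found in `uns`.

    Assumption: `uns` is a dict-like object with parallel `metric_ids`
    and `metric_values` entries. Missing entries yield an empty report.
    """
    ids = uns.get("metric_ids", [])
    values = uns.get("metric_values", [])

    # A single metric may be stored as a bare string rather than a list.
    if isinstance(ids, str):
        ids = [ids]
    else:
        ids = list(ids)

    # A single value may be stored as a scalar, which is not iterable.
    try:
        values = list(values)
    except TypeError:
        values = [values]

    if len(ids) != len(values):
        raise ValueError("metric_ids and metric_values differ in length")

    return [f"{metric_id}: {float(value):.4f}" for metric_id, value in zip(ids, values)]


def print_scores(path="output/score.h5ad"):
    """Read a score file and print one line per metric."""
    # Third-party dependency, imported lazily so format_scores stays
    # usable without anndata installed.
    import anndata

    for line in format_scores(anndata.read_h5ad(path).uns):
        print(line)
```

Calling `print_scores("output/score.h5ad")` after the evaluation step would print each metric on its own line, provided the file follows the assumed layout.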
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Open Problems in Single-Cell Analysis
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.