diff --git a/README.md b/README.md
index c85ded8..453ed56 100644
--- a/README.md
+++ b/README.md
@@ -28,6 +28,19 @@ discovery. Any method that can predict a modality from another must have
 accounted for these regulatory processes, but the demand for multi-modal
 data shows that this is not trivial.
 
+## Authors & contributors
+
+| name               | roles              |
+|:-------------------|:-------------------|
+| Alejandro Granados | author             |
+| Alex Tong          | author             |
+| Bastian Rieck      | author             |
+| Daniel Burkhardt   | author             |
+| Kai Waldrant       | contributor        |
+| Kaiwen Deng        | contributor        |
+| Louise Deconinck   | author             |
+| Robrecht Cannoodt  | author, maintainer |
+
 ## API
 
 ``` mermaid
@@ -348,8 +361,8 @@ Arguments:
 
 | Name | Type | Description |
 |:---|:---|:---|
-| `--input_train_mod1` | `file` | The mod1 expression values of the train cells. |
-| `--input_train_mod2` | `file` | The mod2 expression values of the train cells. |
+| `--input_train_mod1` | `file` | (*Optional*) The mod1 expression values of the train cells. |
+| `--input_train_mod2` | `file` | (*Optional*) The mod2 expression values of the train cells. |
 | `--input_test_mod1` | `file` | The mod1 expression values of the test cells. |
 | `--input_model` | `file` | A pretrained model for predicting the expression of one modality from another. |
 | `--output` | `file` | (*Output*) A prediction of the mod2 expression values of the test cells. |
@@ -516,3 +529,4 @@ Data structure:
 | `uns["gene_activity_var_names"]` | `string` | (*Optional*) Names of the gene activity matrix. |
 
 
+
diff --git a/README.qmd b/README.qmd
deleted file mode 100644
index 796cd31..0000000
--- a/README.qmd
+++ /dev/null
@@ -1,529 +0,0 @@
----
-title: "Predict Modality"
-format: gfm
----
-
-
-
-Predicting the profiles of one modality (e.g. protein abundance) from another (e.g. mRNA expression).
-
-Repository: [openproblems-bio/task_predict_modality](https://github.com/openproblems-bio/task_predict_modality)
-
-
-
-## Description
-
-Experimental techniques to measure multiple modalities within the same single cell are increasingly becoming available.
-The demand for these measurements is driven by the promise to provide a deeper insight into the state of a cell.
-Yet, the modalities are also intrinsically linked. We know that DNA must be accessible (ATAC data) to produce mRNA
-(expression data), and mRNA in turn is used as a template to produce protein (protein abundance). These processes
-are regulated often by the same molecules that they produce: for example, a protein may bind DNA to prevent the production
-of more mRNA. Understanding these regulatory processes would be transformative for synthetic biology and drug target discovery.
-Any method that can predict a modality from another must have accounted for these regulatory processes, but the demand for
-multi-modal data shows that this is not trivial.
-
-
-
-
-## API
-
-```mermaid
-flowchart LR
- file_common_dataset_mod1("Raw dataset RNA")
- comp_process_datasets[/"Process Dataset"/]
- file_test_mod1("Test mod1")
- file_test_mod2("Test mod2")
- file_train_mod1("Train mod1")
- file_train_mod2("Train mod2")
- comp_control_method[/"Control method"/]
- comp_method_predict[/"Predict"/]
- comp_method_train[/"Train"/]
- comp_method[/"Method"/]
- comp_metric[/"Metric"/]
- file_prediction("Prediction")
- file_pretrained_model("Pretrained model")
- file_score("Score")
- file_common_dataset_mod2("Raw dataset mod2")
- file_common_dataset_mod1---comp_process_datasets
- comp_process_datasets-->file_test_mod1
- comp_process_datasets-->file_test_mod2
- comp_process_datasets-->file_train_mod1
- comp_process_datasets-->file_train_mod2
- file_test_mod1---comp_control_method
- file_test_mod1---comp_method_predict
- file_test_mod1---comp_method_train
- file_test_mod1---comp_method
- file_test_mod2---comp_control_method
- file_test_mod2---comp_metric
- file_train_mod1---comp_control_method
- file_train_mod1---comp_method_predict
- file_train_mod1---comp_method_train
- file_train_mod1---comp_method
- file_train_mod2---comp_control_method
- file_train_mod2---comp_method_predict
- file_train_mod2---comp_method_train
- file_train_mod2---comp_method
- comp_control_method-->file_prediction
- comp_method_predict-->file_prediction
- comp_method_train-->file_pretrained_model
- comp_method-->file_prediction
- comp_metric-->file_score
- file_prediction---comp_metric
- file_pretrained_model---comp_method_predict
- file_common_dataset_mod2---comp_process_datasets
-```
-
-
-## File format: Raw dataset RNA
-
-The RNA modality of the raw dataset.
-
-Example file: `resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod1.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- obs: 'batch', 'size_factors'
- var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
- obsm: 'gene_activity'
- layers: 'counts', 'normalized'
- uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`obs["batch"]` |`string` |Batch information. |
-`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. |
-`var["feature_id"]` |`string` |Unique identifier for the feature, usually a ENSEMBL gene id. |
-`var["feature_name"]` |`string` |(_Optional_) A human-readable name for the feature, usually a gene symbol. |
-`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. |
-`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. |
-`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. |
-`layers["counts"]` |`integer` |Raw counts. |
-`layers["normalized"]` |`double` |Normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["dataset_name"]` |`string` |Nicely formatted name. |
-`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. |
-`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. |
-`uns["dataset_summary"]` |`string` |Short description of the dataset. |
-`uns["dataset_description"]` |`string` |Long description of the dataset. |
-`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. |
-`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. |
-`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. |
-:::
-
-
-
-## Component type: Process Dataset
-
-
-
-A predict modality dataset processor.
-
-Arguments:
-
-:::{.small}
-Name |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`--input_mod1` |`file` |The RNA modality of the raw dataset. |
-`--input_mod2` |`file` |The second modality of the raw dataset. Must be an ADT or an ATAC dataset. |
-`--output_train_mod1` |`file` |(_Output_) The mod1 expression values of the train cells. |
-`--output_train_mod2` |`file` |(_Output_) The mod2 expression values of the train cells. |
-`--output_test_mod1` |`file` |(_Output_) The mod1 expression values of the test cells. |
-`--output_test_mod2` |`file` |(_Output_) The mod2 expression values of the test cells. |
-`--seed` |`integer` |(_Optional_) NA. Default: `1`. |
-:::
-
-
-
-## File format: Test mod1
-
-The mod1 expression values of the test cells.
-
-Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod1.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- obs: 'batch', 'size_factors'
- var: 'gene_ids', 'hvg', 'hvg_score'
- obsm: 'gene_activity'
- layers: 'counts', 'normalized'
- uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`obs["batch"]` |`string` |Batch information. |
-`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. |
-`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). |
-`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. |
-`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. |
-`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. |
-`layers["counts"]` |`integer` |Raw counts. |
-`layers["normalized"]` |`double` |Normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. |
-`uns["dataset_name"]` |`string` |Nicely formatted name. |
-`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. |
-`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. |
-`uns["dataset_summary"]` |`string` |Short description of the dataset. |
-`uns["dataset_description"]` |`string` |Long description of the dataset. |
-`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. |
-`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. |
-`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. |
-:::
-
-
-
-## File format: Test mod2
-
-The mod2 expression values of the test cells.
-
-Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/test_mod2.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- obs: 'batch', 'size_factors'
- var: 'gene_ids', 'hvg', 'hvg_score'
- obsm: 'gene_activity'
- layers: 'counts', 'normalized'
- uns: 'dataset_id', 'common_dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'gene_activity_var_names'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`obs["batch"]` |`string` |Batch information. |
-`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. |
-`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). |
-`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. |
-`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. |
-`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. |
-`layers["counts"]` |`integer` |Raw counts. |
-`layers["normalized"]` |`double` |Normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. |
-`uns["dataset_name"]` |`string` |Nicely formatted name. |
-`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. |
-`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. |
-`uns["dataset_summary"]` |`string` |Short description of the dataset. |
-`uns["dataset_description"]` |`string` |Long description of the dataset. |
-`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. |
-`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. |
-:::
-
-
-
-## File format: Train mod1
-
-The mod1 expression values of the train cells.
-
-Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod1.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- obs: 'batch', 'size_factors'
- var: 'gene_ids', 'hvg', 'hvg_score'
- obsm: 'gene_activity'
- layers: 'counts', 'normalized'
- uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`obs["batch"]` |`string` |Batch information. |
-`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. |
-`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). |
-`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. |
-`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. |
-`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. |
-`layers["counts"]` |`integer` |Raw counts. |
-`layers["normalized"]` |`double` |Normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. |
-`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. |
-`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. |
-`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. |
-:::
-
-
-
-## File format: Train mod2
-
-The mod2 expression values of the train cells.
-
-Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/train_mod2.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- obs: 'batch', 'size_factors'
- var: 'gene_ids', 'hvg', 'hvg_score'
- obsm: 'gene_activity'
- layers: 'counts', 'normalized'
- uns: 'dataset_id', 'common_dataset_id', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`obs["batch"]` |`string` |Batch information. |
-`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. |
-`var["gene_ids"]` |`string` |(_Optional_) The gene identifiers (if available). |
-`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. |
-`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. |
-`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. |
-`layers["counts"]` |`integer` |Raw counts. |
-`layers["normalized"]` |`double` |Normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["common_dataset_id"]` |`string` |(_Optional_) A common identifier for the dataset. |
-`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. |
-`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. |
-`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. |
-:::
-
-
-
-## Component type: Control method
-
-
-
-Quality control methods for verifying the pipeline.
-
-Arguments:
-
-:::{.small}
-Name |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`--input_train_mod1` |`file` |The mod1 expression values of the train cells. |
-`--input_train_mod2` |`file` |The mod2 expression values of the train cells. |
-`--input_test_mod1` |`file` |The mod1 expression values of the test cells. |
-`--input_test_mod2` |`file` |The mod2 expression values of the test cells. |
-`--output` |`file` |(_Output_) A prediction of the mod2 expression values of the test cells. |
-:::
-
-
-
-## Component type: Predict
-
-
-
-Make predictions using a trained model.
-
-Arguments:
-
-:::{.small}
-Name |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`--input_train_mod1` |`file` |The mod1 expression values of the train cells. |
-`--input_train_mod2` |`file` |The mod2 expression values of the train cells. |
-`--input_test_mod1` |`file` |The mod1 expression values of the test cells. |
-`--input_model` |`file` |A pretrained model for predicting the expression of one modality from another. |
-`--output` |`file` |(_Output_) A prediction of the mod2 expression values of the test cells. |
-:::
-
-
-
-## Component type: Train
-
-
-
-Train a model to predict the expression of one modality from another.
-
-Arguments:
-
-:::{.small}
-Name |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`--input_train_mod1` |`file` |The mod1 expression values of the train cells. |
-`--input_train_mod2` |`file` |The mod2 expression values of the train cells. |
-`--input_test_mod1` |`file` |(_Optional_) The mod1 expression values of the test cells. |
-`--output` |`file` |(_Output_) A pretrained model for predicting the expression of one modality from another. |
-:::
-
-
-
-## Component type: Method
-
-
-
-A regression method.
-
-Arguments:
-
-:::{.small}
-Name |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`--input_train_mod1` |`file` |The mod1 expression values of the train cells. |
-`--input_train_mod2` |`file` |The mod2 expression values of the train cells. |
-`--input_test_mod1` |`file` |The mod1 expression values of the test cells. |
-`--output` |`file` |(_Output_) A prediction of the mod2 expression values of the test cells. |
-:::
-
-
-
-## Component type: Metric
-
-
-
-A predict modality metric.
-
-Arguments:
-
-:::{.small}
-Name |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`--input_prediction` |`file` |A prediction of the mod2 expression values of the test cells. |
-`--input_test_mod2` |`file` |The mod2 expression values of the test cells. |
-`--output` |`file` |(_Output_) Metric score file. |
-:::
-
-
-
-## File format: Prediction
-
-A prediction of the mod2 expression values of the test cells
-
-Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/prediction.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- layers: 'normalized'
- uns: 'dataset_id', 'method_id'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`layers["normalized"]` |`double` |Predicted normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["method_id"]` |`string` |A unique identifier for the method. |
-:::
-
-
-
-## File format: Pretrained model
-
-A pretrained model for predicting the expression of one modality from another.
-
-
-
-
-
-
-
-
-## File format: Score
-
-Metric score file
-
-Example file: `resources_test/predict_modality/openproblems_neurips2021/bmmc_cite/swap/score.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["method_id"]` |`string` |A unique identifier for the method. |
-`uns["metric_ids"]` |`string` |One or more unique metric identifiers. |
-`uns["metric_values"]` |`double` |The metric values obtained for the given prediction. Must be of same length as 'metric_ids'. |
-:::
-
-
-
-## File format: Raw dataset mod2
-
-The second modality of the raw dataset. Must be an ADT or an ATAC dataset
-
-Example file: `resources_test/common/openproblems_neurips2021/bmmc_cite/dataset_mod2.h5ad`
-
-
-
-Format:
-
-:::{.small}
- AnnData object
- obs: 'batch', 'size_factors'
- var: 'feature_id', 'feature_name', 'hvg', 'hvg_score'
- obsm: 'gene_activity'
- layers: 'counts', 'normalized'
- uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'gene_activity_var_names'
-:::
-
-Data structure:
-
-:::{.small}
-Slot |Type |Description |
-:-------------------------|:--------|:------------------------------------------------------------|
-`obs["batch"]` |`string` |Batch information. |
-`obs["size_factors"]` |`double` |(_Optional_) The size factors of the cells prior to normalization. |
-`var["feature_id"]` |`string` |Unique identifier for the feature, usually a ENSEMBL gene id. |
-`var["feature_name"]` |`string` |(_Optional_) A human-readable name for the feature, usually a gene symbol. |
-`var["hvg"]` |`boolean` |Whether or not the feature is considered to be a 'highly variable gene'. |
-`var["hvg_score"]` |`double` |A score for the feature indicating how highly variable it is. |
-`obsm["gene_activity"]` |`double` |(_Optional_) ATAC gene activity. |
-`layers["counts"]` |`integer` |Raw counts. |
-`layers["normalized"]` |`double` |Normalized expression values. |
-`uns["dataset_id"]` |`string` |A unique identifier for the dataset. |
-`uns["dataset_name"]` |`string` |Nicely formatted name. |
-`uns["dataset_url"]` |`string` |(_Optional_) Link to the original source of the dataset. |
-`uns["dataset_reference"]` |`string` |(_Optional_) Bibtex reference of the paper in which the dataset was published. |
-`uns["dataset_summary"]` |`string` |Short description of the dataset. |
-`uns["dataset_description"]` |`string` |Long description of the dataset. |
-`uns["dataset_organism"]` |`string` |(_Optional_) The organism of the sample in the dataset. |
-`uns["normalization_id"]` |`string` |The unique identifier of the normalization method used. |
-`uns["gene_activity_var_names"]` |`string` |(_Optional_) Names of the gene activity matrix. |
-:::
-
-
-