Merge pull request #372 from AlexsLemonade/development
Release v0.1.6 - data integration module
allyhawkins authored Aug 16, 2023
2 parents cbcea03 + 736d5f0 commit 60d265d
Showing 18 changed files with 1,321 additions and 248 deletions.
74 changes: 41 additions & 33 deletions README.md

Large diffs are not rendered by default.

21 changes: 20 additions & 1 deletion additional-docs/additional-parameters.md
@@ -12,6 +12,7 @@ These parameters are all included in the config files and can optionally be alte
- [Dimensionality reduction and clustering parameters](#dimensionality-reduction-and-clustering-parameters)
- [Clustering analysis parameters](#clustering-analysis-parameters)
- [Genes of interest analysis parameters](#genes-of-interest-analysis-parameters)
- [Integration analysis parameters](#integration-analysis-parameters)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

@@ -89,4 +90,22 @@ The following gene mapping parameters found in the `config/goi_config.yaml` file


|[View Genes of Interest Config File](../config/goi_config.yaml)|
|---|

## Integration analysis parameters

The [configuration file](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html), `config/integration_config.yaml`, sets the defaults for all parameters needed to run the data integration workflow.
It is **not required** to alter these parameters to run the workflow.
If you would like to change the integration method(s) or the number of multi-processing threads to use, you can do so by editing this file in a text editor of your choice or by overriding parameters at the command line, as described in our documentation [here](./command-line-options.md).

The parameters found in the `config/integration_config.yaml` file can be optionally modified and are as follows:

| Parameter | Description | Default value |
|------------------|-------------|---------------|
| `threads` | the number of multiprocessing threads to use | 1 |
| `integration_method` | the method(s) to be used for integration; to include multiple integration methods, use a comma-separated list. Currently, the workflow only supports `fastMNN` or `harmony`. | `"fastMNN,harmony"` |
| `batch_column` | the name of the column in the `SingleCellExperiment` object indicating the original library each cell was derived from | `"library_id"` |
| `cell_id_column` | the name of the column in the `SingleCellExperiment` object containing the cell barcode | `"cell_id"` |

|[View Integration Config File](../config/integration_config.yaml)|
|---|
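Per the table above, `integration_method` is supplied as a single comma-separated string. A minimal sketch of how such a value could be split and validated — `parse_integration_methods` is a hypothetical helper for illustration, not part of the workflow's actual code:

```python
# Hypothetical helper: split the comma-separated `integration_method`
# config value and check each entry against the supported methods.
SUPPORTED_METHODS = {"fastMNN", "harmony"}

def parse_integration_methods(value):
    """Return the list of requested methods, rejecting unknown ones."""
    methods = [m.strip() for m in value.split(",") if m.strip()]
    unknown = [m for m in methods if m not in SUPPORTED_METHODS]
    if unknown:
        raise ValueError(f"Unsupported integration method(s): {unknown}")
    return methods

print(parse_integration_methods("fastMNN,harmony"))  # ['fastMNN', 'harmony']
```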
9 changes: 8 additions & 1 deletion additional-docs/independent-installation-instructions.md
@@ -2,7 +2,14 @@

If you would like to perform package and dependency installation without the conda environments as described in the main `README.md` file [here](./README.md#snakemakeconda-installation), you can do so after confirming that you have R version 4.2 installed.
Then follow the instructions below to ensure that you also have all of the R packages necessary to run the workflow installed.
First install the `optparse` and `renv` packages by your preferred method.
First install the following packages by your preferred method (the package version we used for development is in parentheses):

- `optparse` (1.7.3)
- `renv` (0.17.0)
- `rmarkdown` (2.20)
- `here` (1.0.1)
- `pandoc` (2.19.2)

Then, from within the `scpca-downstream-analyses` directory, run the following command to install all of the additional required packages:

```
3 changes: 2 additions & 1 deletion components/dependencies.R
@@ -7,5 +7,6 @@
#
# library(dplyr)
#
library(uwot)
library(remotes)
library(markdown)
library(uwot)
11 changes: 11 additions & 0 deletions config/integration_config.yaml
@@ -0,0 +1,11 @@
# All parameters included in this file can be altered at the command line using the `--config` flag or by editing this file directly.

### Project-specific parameters
results_dir: "example-results"
integration_project_metadata: "example-data/project-metadata/example-integration-library-metadata.tsv"

### Processing parameters
threads: 1 # number of multiprocessing threads to use
integration_method: "fastMNN,harmony" # method(s) to be used for integration
batch_column: "library_id" # the name of the SCE column that contains batch labels
cell_id_column: "cell_id" # the name of the SCE column variable indicating the original cell barcode
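These values are only defaults; per Snakemake's `--config` flag, key=value pairs given on the command line take precedence over the config file. A sketch of that precedence using a plain dict (simulated behavior, not Snakemake's actual internals):

```python
# Defaults as declared in config/integration_config.yaml
config = {
    "threads": 1,
    "integration_method": "fastMNN,harmony",
    "batch_column": "library_id",
    "cell_id_column": "cell_id",
}

# e.g. `snakemake --config threads=4 integration_method=harmony`
# command-line values override the file's defaults
config.update({"threads": 4, "integration_method": "harmony"})
print(config["integration_method"])  # harmony
```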
13 changes: 11 additions & 2 deletions envs/scpca-renv.post-deploy.sh
@@ -6,11 +6,20 @@ if [[ "$(uname)" == 'Darwin' && "$(uname -m)" == 'arm64' ]]; then
fi


SCPCATOOLS_VERS='v0.2.1'
SCPCATOOLS_VERS='v0.2.3'

# install github packages
# install packages not on conda-forge
Rscript --vanilla -e \
"
# install Harmony from CRAN
remotes::install_version(
'harmony',
version = '0.1.1',
repos = 'https://cloud.r-project.org',
upgrade = FALSE
)
# install ScPCA tools from Github
remotes::install_github('AlexsLemonade/scpcaTools', ref='${SCPCATOOLS_VERS}', upgrade='never')
require(scpcaTools) # check installation
"
24 changes: 16 additions & 8 deletions envs/scpca-renv.yaml
@@ -5,6 +5,7 @@ channels:
- defaults
dependencies:
- bioconductor-annotationdbi=1.60.0
- bioconductor-batchelor=1.14.0
- bioconductor-beachmat=2.14.0
- bioconductor-biobase=2.58.0
- bioconductor-biocgenerics=0.44.0
@@ -35,6 +36,7 @@ dependencies:
- bioconductor-org.hs.eg.db=3.16.0
- bioconductor-org.mm.eg.db=3.16.0
- bioconductor-qvalue=2.30.0
- bioconductor-residualmatrix=1.8.0
- bioconductor-rhdf5=2.42.0
- bioconductor-rhdf5filters=1.10.0
- bioconductor-rhdf5lib=1.20.0
@@ -49,16 +51,18 @@
- bioconductor-tximport=1.26.0
- bioconductor-xvector=0.38.0
- bioconductor-zlibbioc=1.44.0
- conda-ecosystem-user-package-isolation=1.0
- pandoc=2.19.2
- r-abind=1.4_5
- r-askpass=1.1
- r-assertthat=0.2.1
- r-backports=1.4.1
- r-base=4.2.2
- r-base=4.2.3
- r-base64enc=0.1_3
- r-beeswarm=0.4.0
- r-bh=1.81.0_1
- r-biocmanager=1.30.20
- r-bigd=0.2.0
- r-biocmanager=1.30.22
- r-bit=4.0.5
- r-bit64=4.0.5
- r-bitops=1.0_7
@@ -121,10 +125,10 @@ dependencies:
- r-future=1.31.0
- r-future.apply=1.10.0
- r-generics=0.1.3
- r-geometry=0.4.7
- r-gert=1.9.2
- r-getopt=1.20.3
- r-getoptlong=1.0.5
- r-geometry=0.4.7
- r-ggbeeswarm=0.7.1
- r-ggforce=0.4.1
- r-ggplot2=3.4.1
@@ -141,6 +145,7 @@
- r-gower=1.0.1
- r-gridextra=2.3
- r-grr=0.9.5
- r-gt=0.9.0
- r-gtable=0.3.1
- r-gtools=3.9.4
- r-hardhat=1.2.0
@@ -161,6 +166,7 @@
- r-iterators=1.0.14
- r-jquerylib=0.1.4
- r-jsonlite=1.8.4
- r-juicyjuice=0.1.0
- r-kableextra=1.3.4
- r-kernsmooth=2.23_20
- r-knitr=1.42
@@ -207,7 +213,7 @@ dependencies:
- r-pkgbuild=1.4.0
- r-pkgconfig=2.0.3
- r-pkgdown=2.0.7
- r-pkgload=1.3.2
- r-pkgload=1.3.2.1
- r-plogr=0.2.0
- r-plyr=1.8.8
- r-png=0.1_8
@@ -243,13 +249,15 @@ dependencies:
- r-rcppparallel=5.1.6
- r-rcppprogress=0.4.2
- r-rcurl=1.98_1.10
- r-reactable=0.4.4
- r-reactr=0.4.4
- r-readr=2.1.4
- r-readxl=1.4.2
- r-recipes=1.0.5
- r-rematch=1.0.1
- r-rematch2=2.1.2
- r-remotes=2.4.2
- r-renv=0.17.0
- r-remotes=2.4.2.1
- r-renv=1.0.0
- r-reshape2=1.4.4
- r-rio=0.5.29
- r-rjson=0.2.21
@@ -271,7 +279,7 @@
- r-selectr=0.4_2
- r-sessioninfo=1.2.2
- r-shape=1.4.6
- r-shiny=1.7.4
- r-shiny=1.7.4.1
- r-sitmo=2.0.2
- r-snow=0.4_4
- r-sourcetools=0.1.7_1
@@ -300,6 +308,7 @@ dependencies:
- r-usethis=2.1.6
- r-utf8=1.2.3
- r-uwot=0.1.14
- r-v8=4.3.3
- r-vctrs=0.5.2
- r-vipor=0.4.5
- r-viridis=0.6.2
@@ -316,5 +325,4 @@
- r-yaml=2.3.7
- r-zip=2.2.2
variables:
R_LIBS_USER: null
RENV_PROJECT: null
@@ -0,0 +1,3 @@
sample_id library_id processed_sce_filepath integration_group
sample01 library01 example-results/sample01/library01_processed.rds group01
sample02 library02 example-results/sample02/library02_processed.rds group01
81 changes: 81 additions & 0 deletions integration.snakefile
@@ -0,0 +1,81 @@
import pandas as pd

configfile: "config/config.yaml"
configfile: "config/integration_config.yaml"

# getting the samples information
if os.path.exists(config['integration_project_metadata']):
    samples_information = pd.read_csv(config['integration_project_metadata'], sep='\t', index_col=False)

    # get a list of the file paths and integration groups
    GROUP = list(samples_information['integration_group'])
else:
    # If the metadata file is missing, warn and fill with empty lists
    print(f"Warning: Project metadata file '{config['integration_project_metadata']}' is missing.")
    samples_information = None
    GROUP = list()
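The pandas call above just pulls the `integration_group` column out of the metadata TSV; a standard-library-only sketch of the same step, using the example metadata rows added in this PR:

```python
import csv
import io

# Rows mirroring example-integration-library-metadata.tsv
tsv_text = (
    "sample_id\tlibrary_id\tprocessed_sce_filepath\tintegration_group\n"
    "sample01\tlibrary01\texample-results/sample01/library01_processed.rds\tgroup01\n"
    "sample02\tlibrary02\texample-results/sample02/library02_processed.rds\tgroup01\n"
)

# DictReader maps each row to its header, like pandas' read_csv(sep='\t')
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
GROUP = [row["integration_group"] for row in reader]
print(GROUP)  # ['group01', 'group01']
```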

rule target:
    input:
        expand(os.path.join(config["results_dir"], "{group}_integrated_sce.rds"),
               zip,
               group = GROUP),
        expand(os.path.join(config["results_dir"], "{group}_integration_report.html"),
               zip,
               group = GROUP)
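Because `GROUP` has one entry per metadata row, the same label can appear multiple times (both example libraries belong to `group01`). A sketch of the target paths the `expand(..., zip, ...)` calls would request — repeated identical targets are treated as a single job by Snakemake (assumed behavior):

```python
import os

results_dir = "example-results"
GROUP = ["group01", "group01"]  # one entry per metadata row

# expand(..., zip, group = GROUP) substitutes wildcard values element-wise
targets = [os.path.join(results_dir, f"{g}_integrated_sce.rds") for g in GROUP]

# duplicate labels yield duplicate paths; deduplicate to see the distinct jobs
unique_targets = list(dict.fromkeys(targets))
print(unique_targets)  # ['example-results/group01_integrated_sce.rds']
```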

rule merge_sces:
    input:
        config["integration_project_metadata"]
    output:
        temp(os.path.join(config["results_dir"], "{group}_merged_sce.rds"))
    log: os.path.join("logs", config["results_dir"], "{group}_merge_sce.log")
    conda: "envs/scpca-renv.yaml"
    shell:
        " Rscript --vanilla 'optional-integration-analysis/merge-sce.R'"
        " --input_metadata_tsv {input}"
        " --integration_group {wildcards.group}"
        " --output_sce_file {output}"
        " --n_hvg {config[n_genes_pca]}"
        " --threads {config[threads]}"
        " --project_root $PWD"
        " &> {log}"

rule perform_integration:
    input:
        "{basedir}/{group}_merged_sce.rds"
    output:
        "{basedir}/{group}_integrated_sce.rds"
    log: "logs/{basedir}/{group}_perform_integration.log"
    conda: "envs/scpca-renv.yaml"
    shell:
        " Rscript --vanilla 'optional-integration-analysis/perform-integration.R'"
        " --merged_sce_file {input}"
        " --integration_method {config[integration_method]}"
        " --fastmnn_auto_merge"
        " --output_sce_file {output}"
        " --project_root $PWD"
        " &> {log}"

rule generate_integration_report:
    input:
        integrated_sce = "{basedir}/{group}_integrated_sce.rds"
    output:
        "{basedir}/{group}_integration_report.html"
    log: "logs/{basedir}/{group}/integration_report.log"
    conda: "envs/scpca-renv.yaml"
    shell:
        """
        Rscript --vanilla -e "
        rmarkdown::render('optional-integration-analysis/integration-report-template.Rmd',
                          clean = TRUE,
                          output_file = '{output}',
                          output_dir = dirname('{output}'),
                          params = list(integration_group = '{wildcards.group}',
                                        integrated_sce = '{input.integrated_sce}',
                                        integration_method = '{config[integration_method]}',
                                        batch_column = '{config[batch_column]}',
                                        cell_id_column = '{config[cell_id_column]}'),
                          envir = new.env())
        " &> {log}
        """
26 changes: 13 additions & 13 deletions optional-clustering-analysis/README.md
@@ -1,6 +1,6 @@
# Optional Clustering Analysis

This directory includes a clustering analysis workflow that can help users identify the optimal clustering method and parameters for each library in their dataset.

**The clustering analysis workflow cannot be implemented until after users have successfully run the main downstream analysis core workflow as described in this repository's main [README.md](../README.md) file or have downloaded data from the [ScPCA portal](https://scpca.alexslemonade.org/).**

@@ -39,7 +39,7 @@ Additionally, metrics associated with each of the clustering results such as sil
The plots are displayed in a html report for ease of reference.

**Note** that the same [software requirements for the core workflow](../README.md#3-additional-dependencies) are also required for this clustering workflow.
R 4.2 is required for running our pipeline, along with Bioconductor 3.15.
R 4.2 is required for running our pipeline, along with Bioconductor 3.16.
Package dependencies for the analysis workflows in this repository are managed using [`renv`](https://rstudio.github.io/renv/index.html), which must be installed locally prior to running the workflow.
If you are using conda, dependencies can be installed as [part of the initial setup](../README.md#snakemakeconda-installation).

@@ -61,32 +61,32 @@ Learn more about snakemake configuration files [here](https://snakemake.readthed

The config file contains two sets of parameters:

- **[Project-specific Parameters](../config/config.yaml#L3)**: This set of parameters is for specifying dataset- or project-related details.
These parameters are **required** to run the workflow on your data.
- **[Processing Parameters](../config/cluster_config.yaml)**: This set of parameters specify configurations for the type(s) of graph-based clustering to be performed, as well as the range of nearest neighbors values to use.
You can change them to explore your data but it is optional.
You can modify the relevant parameters by manually updating the `config/cluster_config.yaml` file using a text editor of your choice.

To run the workflow on your data, modify the following parameters in the `config/config.yaml` file:

| Parameter | Description |
|------------------|-------------|
| `input_data_dir` | full path to the directory where the input data files can be found (default will be the `results_dir` used in the core workflow) |
| `results_dir` | full path to the directory where output files will be stored |
| `project_metadata` | full path to your specific project metadata TSV file (use the same `project_metadata` used in the prerequisite core workflow) |
| Parameter | Description |
| ------------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| `input_data_dir` | full path to the directory where the input data files can be found (default will be the `results_dir` used in the core workflow) |
| `results_dir` | full path to the directory where output files will be stored |
| `project_metadata` | full path to your specific project metadata TSV file (use the same `project_metadata` used in the prerequisite core workflow) |

|[View Config File](../config/config.yaml)|
|---|
| [View Config File](../config/config.yaml) |
| ----------------------------------------- |

The [`config/cluster_config.yaml`](../config/cluster_config.yaml) file also contains additional processing parameters like the type of graph-based clustering to be performed and the nearest neighbors values that should be used.
We have set default values for these parameters.
Learn more about the [processing parameters](../additional-docs/processing-parameters.md#clustering-analysis-parameters) and how to modify them.

## Running the workflow

The execution file with the clustering Snakemake workflow is named `cluster.snakefile` and can be found in the root directory. To tell snakemake to run the specific clustering workflow be sure to use the `--snakefile` or `-s` option followed by the name of the snakefile, `cluster.snakefile`.

After you have successfully modified the required project-specific parameters in the config file and navigated to within the root directory of the `scpca-downstream-analyses` repository, you can run the clustering Snakemake workflow with just the `--cores` and `--use-conda` flags as in the following example:

```
snakemake --snakefile cluster.snakefile --cores 2 --use-conda
@@ -127,7 +127,7 @@ You can also download a ZIP file with an example of the output from running the

### What to expect in the output `SingleCellExperiment` object

In the [`colData`](https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html#handling-metadata) of the output `SingleCellExperiment` object, you can find the following:
In the [`colData`](https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html#handling-metadata) of the output `SingleCellExperiment` object, you can find the following:

- Clustering results stored in a metadata column named using the associated clustering type and nearest neighbours values.
For example, where `n` is a value within a range of nearest neighbors values provided to perform Louvain clustering, the column name would be `louvain_n` and can be accessed using `colData(sce)$louvain_n`.
