Merge pull request #372 from AlexsLemonade/development
Release v0.1.6 - data integration module
allyhawkins authored Aug 16, 2023
2 parents cbcea03 + 736d5f0 commit 60d265d
Showing 18 changed files with 1,321 additions and 248 deletions.
74 changes: 41 additions & 33 deletions README.md

Large diffs are not rendered by default.

21 changes: 20 additions & 1 deletion additional-docs/additional-parameters.md
@@ -12,6 +12,7 @@ These parameters are all included in the config files and can optionally be alte
- [Dimensionality reduction and clustering parameters](#dimensionality-reduction-and-clustering-parameters)
- [Clustering analysis parameters](#clustering-analysis-parameters)
- [Genes of interest analysis parameters](#genes-of-interest-analysis-parameters)
- [Integration analysis parameters](#integration-analysis-parameters)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

@@ -89,4 +90,22 @@ The following gene mapping parameters found in the `config/goi_config.yaml` file


|[View Genes of Interest Config File](../config/goi_config.yaml)|
|---|

## Integration analysis parameters

The [configuration file](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html), `config/integration_config.yaml`, sets the defaults for all parameters needed to run the data integration workflow.
It is **not required** to alter these parameters to run the workflow.
If you would like to change the integration method(s) or the number of multi-processing threads to use, you can do so by editing this file in a text editor of your choice or by overriding parameters at the command line, as described in our documentation [here](./command-line-options.md).

The parameters found in the `config/integration_config.yaml` file can be optionally modified and are as follows:

| Parameter | Description | Default value |
|------------------|-------------|---------------|
| `threads` | the number of multiprocessing threads to use | 1 |
| `integration_method` | the method(s) to be used for integration; to include multiple integration methods, use a comma-separated list. Currently, the workflow only supports `fastMNN` or `harmony`. | `"fastMNN,harmony"` |
| `batch_column` | the name of the column in the `SingleCellExperiment` object indicating the original library each cell was derived from | `"library_id"` |
| `cell_id_column` | the name of the column in the `SingleCellExperiment` object containing the cell barcode | `"cell_id"` |

|[View Integration Config File](../config/integration_config.yaml)|
|---|
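Per the table above, `integration_method` is supplied as a single comma-separated string. A minimal sketch of how such a value could be split and validated — `parse_integration_methods` is a hypothetical helper for illustration, not part of the workflow's actual code:

```python
# Hypothetical helper: split the comma-separated `integration_method`
# config value and check each entry against the supported methods.
SUPPORTED_METHODS = {"fastMNN", "harmony"}

def parse_integration_methods(value):
    """Return the list of requested methods, rejecting unknown ones."""
    methods = [m.strip() for m in value.split(",") if m.strip()]
    unknown = [m for m in methods if m not in SUPPORTED_METHODS]
    if unknown:
        raise ValueError(f"Unsupported integration method(s): {unknown}")
    return methods

print(parse_integration_methods("fastMNN,harmony"))  # ['fastMNN', 'harmony']
```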
9 changes: 8 additions & 1 deletion additional-docs/independent-installation-instructions.md
@@ -2,7 +2,14 @@

If you would like to perform package and dependency installation without the conda environments as described in the main `README.md` file [here](./README.md#snakemakeconda-installation), you can do so after confirming that you have R version 4.2 installed.
Then follow the instructions below to ensure that you also have all of the R packages necessary to run the workflow installed.
First install the `optparse` and `renv` packages by your preferred method.
First install the following packages by your preferred method (the package version we used for development is in parentheses):

- `optparse` (1.7.3)
- `renv` (0.17.0)
- `rmarkdown` (2.20)
- `here` (1.0.1)
- `pandoc` (2.19.2)

Then, from within the `scpca-downstream-analyses` directory, run the following command to install all of the additional required packages:

```
3 changes: 2 additions & 1 deletion components/dependencies.R
@@ -7,5 +7,6 @@
#
# library(dplyr)
#
library(uwot)
library(remotes)
library(markdown)
library(uwot)
11 changes: 11 additions & 0 deletions config/integration_config.yaml
@@ -0,0 +1,11 @@
# All parameters included in this file can be altered at the command line using the `--config` flag or by editing this file directly.

### Project-specific parameters
results_dir: "example-results"
integration_project_metadata: "example-data/project-metadata/example-integration-library-metadata.tsv"

### Processing parameters
threads: 1 # number of multiprocessing threads to use
integration_method: "fastMNN,harmony" # method(s) to be used for integration
batch_column: "library_id" # the name of the SCE column that contains batch labels
cell_id_column: "cell_id" # the name of the SCE column variable indicating the original cell barcode
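These values are only defaults; per Snakemake's `--config` flag, key=value pairs given on the command line take precedence over the config file. A sketch of that precedence using a plain dict (simulated behavior, not Snakemake's actual internals):

```python
# Defaults as declared in config/integration_config.yaml
config = {
    "threads": 1,
    "integration_method": "fastMNN,harmony",
    "batch_column": "library_id",
    "cell_id_column": "cell_id",
}

# e.g. `snakemake --config threads=4 integration_method=harmony`
# command-line values override the file's defaults
config.update({"threads": 4, "integration_method": "harmony"})
print(config["integration_method"])  # harmony
```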
13 changes: 11 additions & 2 deletions envs/scpca-renv.post-deploy.sh
@@ -6,11 +6,20 @@ if [[ "$(uname)" == 'Darwin' && "$(uname -m)" == 'arm64' ]]; then
fi


SCPCATOOLS_VERS='v0.2.1'
SCPCATOOLS_VERS='v0.2.3'

# install github packages
# install packages not on conda-forge
Rscript --vanilla -e \
"
# install Harmony from CRAN
remotes::install_version(
'harmony',
version = '0.1.1',
repos = 'https://cloud.r-project.org',
upgrade = FALSE
)
# install ScPCA tools from Github
remotes::install_github('AlexsLemonade/scpcaTools', ref='${SCPCATOOLS_VERS}', upgrade='never')
require(scpcaTools) # check installation
"
24 changes: 16 additions & 8 deletions envs/scpca-renv.yaml
@@ -5,6 +5,7 @@ channels:
- defaults
dependencies:
- bioconductor-annotationdbi=1.60.0
- bioconductor-batchelor=1.14.0
- bioconductor-beachmat=2.14.0
- bioconductor-biobase=2.58.0
- bioconductor-biocgenerics=0.44.0
@@ -35,6 +36,7 @@ dependencies:
- bioconductor-org.hs.eg.db=3.16.0
- bioconductor-org.mm.eg.db=3.16.0
- bioconductor-qvalue=2.30.0
- bioconductor-residualmatrix=1.8.0
- bioconductor-rhdf5=2.42.0
- bioconductor-rhdf5filters=1.10.0
- bioconductor-rhdf5lib=1.20.0
@@ -49,16 +51,18 @@
- bioconductor-tximport=1.26.0
- bioconductor-xvector=0.38.0
- bioconductor-zlibbioc=1.44.0
- conda-ecosystem-user-package-isolation=1.0
- pandoc=2.19.2
- r-abind=1.4_5
- r-askpass=1.1
- r-assertthat=0.2.1
- r-backports=1.4.1
- r-base=4.2.2
- r-base=4.2.3
- r-base64enc=0.1_3
- r-beeswarm=0.4.0
- r-bh=1.81.0_1
- r-biocmanager=1.30.20
- r-bigd=0.2.0
- r-biocmanager=1.30.22
- r-bit=4.0.5
- r-bit64=4.0.5
- r-bitops=1.0_7
@@ -121,10 +125,10 @@ dependencies:
- r-future=1.31.0
- r-future.apply=1.10.0
- r-generics=0.1.3
- r-geometry=0.4.7
- r-gert=1.9.2
- r-getopt=1.20.3
- r-getoptlong=1.0.5
- r-geometry=0.4.7
- r-ggbeeswarm=0.7.1
- r-ggforce=0.4.1
- r-ggplot2=3.4.1
@@ -141,6 +145,7 @@
- r-gower=1.0.1
- r-gridextra=2.3
- r-grr=0.9.5
- r-gt=0.9.0
- r-gtable=0.3.1
- r-gtools=3.9.4
- r-hardhat=1.2.0
@@ -161,6 +166,7 @@
- r-iterators=1.0.14
- r-jquerylib=0.1.4
- r-jsonlite=1.8.4
- r-juicyjuice=0.1.0
- r-kableextra=1.3.4
- r-kernsmooth=2.23_20
- r-knitr=1.42
@@ -207,7 +213,7 @@ dependencies:
- r-pkgbuild=1.4.0
- r-pkgconfig=2.0.3
- r-pkgdown=2.0.7
- r-pkgload=1.3.2
- r-pkgload=1.3.2.1
- r-plogr=0.2.0
- r-plyr=1.8.8
- r-png=0.1_8
@@ -243,13 +249,15 @@ dependencies:
- r-rcppparallel=5.1.6
- r-rcppprogress=0.4.2
- r-rcurl=1.98_1.10
- r-reactable=0.4.4
- r-reactr=0.4.4
- r-readr=2.1.4
- r-readxl=1.4.2
- r-recipes=1.0.5
- r-rematch=1.0.1
- r-rematch2=2.1.2
- r-remotes=2.4.2
- r-renv=0.17.0
- r-remotes=2.4.2.1
- r-renv=1.0.0
- r-reshape2=1.4.4
- r-rio=0.5.29
- r-rjson=0.2.21
@@ -271,7 +279,7 @@
- r-selectr=0.4_2
- r-sessioninfo=1.2.2
- r-shape=1.4.6
- r-shiny=1.7.4
- r-shiny=1.7.4.1
- r-sitmo=2.0.2
- r-snow=0.4_4
- r-sourcetools=0.1.7_1
@@ -300,6 +308,7 @@ dependencies:
- r-usethis=2.1.6
- r-utf8=1.2.3
- r-uwot=0.1.14
- r-v8=4.3.3
- r-vctrs=0.5.2
- r-vipor=0.4.5
- r-viridis=0.6.2
@@ -316,5 +325,4 @@
- r-yaml=2.3.7
- r-zip=2.2.2
variables:
R_LIBS_USER: null
RENV_PROJECT: null
@@ -0,0 +1,3 @@
sample_id library_id processed_sce_filepath integration_group
sample01 library01 example-results/sample01/library01_processed.rds group01
sample02 library02 example-results/sample02/library02_processed.rds group01
81 changes: 81 additions & 0 deletions integration.snakefile
@@ -0,0 +1,81 @@
import pandas as pd

configfile: "config/config.yaml"
configfile: "config/integration_config.yaml"

# getting the samples information
if os.path.exists(config['integration_project_metadata']):
    samples_information = pd.read_csv(config['integration_project_metadata'], sep='\t', index_col=False)

    # get a list of the file paths and integration groups
    GROUP = list(samples_information['integration_group'])
else:
    # If the metadata file is missing, warn and fill with empty lists
    print(f"Warning: Project metadata file '{config['integration_project_metadata']}' is missing.")
    samples_information = None
    GROUP = list()
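The pandas call above just pulls the `integration_group` column out of the metadata TSV; a standard-library-only sketch of the same step, using the example metadata rows added in this PR:

```python
import csv
import io

# Rows mirroring example-integration-library-metadata.tsv
tsv_text = (
    "sample_id\tlibrary_id\tprocessed_sce_filepath\tintegration_group\n"
    "sample01\tlibrary01\texample-results/sample01/library01_processed.rds\tgroup01\n"
    "sample02\tlibrary02\texample-results/sample02/library02_processed.rds\tgroup01\n"
)

# DictReader maps each row to its header, like pandas' read_csv(sep='\t')
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
GROUP = [row["integration_group"] for row in reader]
print(GROUP)  # ['group01', 'group01']
```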

rule target:
    input:
        expand(os.path.join(config["results_dir"], "{group}_integrated_sce.rds"),
               zip,
               group = GROUP),
        expand(os.path.join(config["results_dir"], "{group}_integration_report.html"),
               zip,
               group = GROUP)
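Because `GROUP` has one entry per metadata row, the same label can appear multiple times (both example libraries belong to `group01`). A sketch of the target paths the `expand(..., zip, ...)` calls would request — repeated identical targets are treated as a single job by Snakemake (assumed behavior):

```python
import os

results_dir = "example-results"
GROUP = ["group01", "group01"]  # one entry per metadata row

# expand(..., zip, group = GROUP) substitutes wildcard values element-wise
targets = [os.path.join(results_dir, f"{g}_integrated_sce.rds") for g in GROUP]

# duplicate labels yield duplicate paths; deduplicate to see the distinct jobs
unique_targets = list(dict.fromkeys(targets))
print(unique_targets)  # ['example-results/group01_integrated_sce.rds']
```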

rule merge_sces:
    input:
        config["integration_project_metadata"]
    output:
        temp(os.path.join(config["results_dir"], "{group}_merged_sce.rds"))
    log: os.path.join("logs", config["results_dir"], "{group}_merge_sce.log")
    conda: "envs/scpca-renv.yaml"
    shell:
        " Rscript --vanilla 'optional-integration-analysis/merge-sce.R'"
        " --input_metadata_tsv {input}"
        " --integration_group {wildcards.group}"
        " --output_sce_file {output}"
        " --n_hvg {config[n_genes_pca]}"
        " --threads {config[threads]}"
        " --project_root $PWD"
        " &> {log}"

rule perform_integration:
    input:
        "{basedir}/{group}_merged_sce.rds"
    output:
        "{basedir}/{group}_integrated_sce.rds"
    log: "logs/{basedir}/{group}_perform_integration.log"
    conda: "envs/scpca-renv.yaml"
    shell:
        " Rscript --vanilla 'optional-integration-analysis/perform-integration.R'"
        " --merged_sce_file {input}"
        " --integration_method {config[integration_method]}"
        " --fastmnn_auto_merge"
        " --output_sce_file {output}"
        " --project_root $PWD"
        " &> {log}"

rule generate_integration_report:
    input:
        integrated_sce = "{basedir}/{group}_integrated_sce.rds"
    output:
        "{basedir}/{group}_integration_report.html"
    log: "logs/{basedir}/{group}/integration_report.log"
    conda: "envs/scpca-renv.yaml"
    shell:
        """
        Rscript --vanilla -e "
        rmarkdown::render('optional-integration-analysis/integration-report-template.Rmd',
                          clean = TRUE,
                          output_file = '{output}',
                          output_dir = dirname('{output}'),
                          params = list(integration_group = '{wildcards.group}',
                                        integrated_sce = '{input.integrated_sce}',
                                        integration_method = '{config[integration_method]}',
                                        batch_column = '{config[batch_column]}',
                                        cell_id_column = '{config[cell_id_column]}'),
                          envir = new.env())
        " &> {log}
        """
26 changes: 13 additions & 13 deletions optional-clustering-analysis/README.md
@@ -1,6 +1,6 @@
# Optional Clustering Analysis

This directory includes a clustering analysis workflow that can help users identify the optimal clustering method and parameters for each library in their dataset.

**The clustering analysis workflow cannot be implemented until after users have successfully run the main downstream analysis core workflow as described in this repository's main [README.md](../README.md) file or have downloaded data from the [ScPCA portal](https://scpca.alexslemonade.org/).**

@@ -39,7 +39,7 @@ Additionally, metrics associated with each of the clustering results such as sil
The plots are displayed in a html report for ease of reference.

**Note** that the same [software requirements for the core workflow](../README.md#3-additional-dependencies) are also required for this clustering workflow.
R 4.2 is required for running our pipeline, along with Bioconductor 3.15.
R 4.2 is required for running our pipeline, along with Bioconductor 3.16.
Package dependencies for the analysis workflows in this repository are managed using [`renv`](https://rstudio.github.io/renv/index.html), which must be installed locally prior to running the workflow.
If you are using conda, dependencies can be installed as [part of the initial setup](../README.md#snakemakeconda-installation).

@@ -61,32 +61,32 @@ Learn more about snakemake configuration files [here](https://snakemake.readthed

The config file contains two sets of parameters:

- **[Project-specific Parameters](../config/config.yaml#L3)**: This set of parameters is for specifying dataset- or project-related details.
These parameters are **required** to run the workflow on your data.
- **[Processing Parameters](../config/cluster_config.yaml)**: This set of parameters specify configurations for the type(s) of graph-based clustering to be performed, as well as the range of nearest neighbors values to use.
You can change them to explore your data but it is optional.
You can modify the relevant parameters by manually updating the `config/cluster_config.yaml` file using a text editor of your choice.

To run the workflow on your data, modify the following parameters in the `config/config.yaml` file:

| Parameter | Description |
|------------------|-------------|
| `input_data_dir` | full path to the directory where the input data files can be found (default will be the `results_dir` used in the core workflow) |
| `results_dir` | full path to the directory where output files will be stored |
| `project_metadata` | full path to your specific project metadata TSV file (use the same `project_metadata` used in the prerequisite core workflow) |
| Parameter | Description |
| ------------------ | -------------------------------------------------------------------------------------------------------------------------------- |
| `input_data_dir` | full path to the directory where the input data files can be found (default will be the `results_dir` used in the core workflow) |
| `results_dir` | full path to the directory where output files will be stored |
| `project_metadata` | full path to your specific project metadata TSV file (use the same `project_metadata` used in the prerequisite core workflow) |

|[View Config File](../config/config.yaml)|
|---|
| [View Config File](../config/config.yaml) |
| ----------------------------------------- |

The [`config/cluster_config.yaml`](../config/cluster_config.yaml) file also contains additional processing parameters like the type of graph-based clustering to be performed and the nearest neighbors values that should be used.
We have set default values for these parameters.
Learn more about the [processing parameters](../additional-docs/processing-parameters.md#clustering-analysis-parameters) and how to modify them.

## Running the workflow

The execution file with the clustering Snakemake workflow is named `cluster.snakefile` and can be found in the root directory. To tell snakemake to run the specific clustering workflow be sure to use the `--snakefile` or `-s` option followed by the name of the snakefile, `cluster.snakefile`.

After you have successfully modified the required project-specific parameters in the config file and navigated to within the root directory of the `scpca-downstream-analyses` repository, you can run the clustering Snakemake workflow with just the `--cores` and `--use-conda` flags as in the following example:

```
snakemake --snakefile cluster.snakefile --cores 2 --use-conda
@@ -127,7 +127,7 @@ You can also download a ZIP file with an example of the output from running the

### What to expect in the output `SingleCellExperiment` object

In the [`colData`](https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html#handling-metadata) of the output `SingleCellExperiment` object, you can find the following:
In the [`colData`](https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html#handling-metadata) of the output `SingleCellExperiment` object, you can find the following:

- Clustering results stored in a metadata column named using the associated clustering type and nearest neighbours values.
For example, where `n` is a value within a range of nearest neighbors values provided to perform Louvain clustering, the column name would be `louvain_n` and can be accessed using `colData(sce)$louvain_n`.
