Repo for testing and developing a common postmortem-derived brain sequencing (PMDBS) workflow harmonized across ASAP.
Workflows are defined in the workflows directory.
This workflow is set up to implement the Harmony RNA snakemake workflow in WDL. The WDL version of the workflow aims to maintain backwards compatibility with the snakemake scripts. Scripts used by the WDL workflow were modified from the Harmony RNA snakemake repo; originals may be found here, and their modified R versions in the docker/multiome/scripts directory. Python versions can be found in the docker/scvi/scripts directory.
Entrypoint: workflows/main.wdl
Input template: workflows/inputs.json
The workflow is broken up into two main chunks:
- Preprocessing: run once per sample; only rerun when the preprocessing workflow version is updated. Preprocessing outputs are stored in the originating team's raw and staging data buckets.
- Cohort analysis: run once per team (all samples from a single team) if project.run_project_cohort_analysis is set to true, and once for the whole cohort (all samples from all teams) if run_cross_team_cohort_analysis is set to true. This can be rerun using different sample subsets; including additional samples requires this entire analysis to be rerun. Intermediate files from previous runs are not reused and are stored in timestamped directories.
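
As a rough illustration of how the two flags interact, the Python sketch below lists the cohort analyses that would be triggered for a set of projects. The `Project` dataclass and the `plan_cohort_analyses` function are hypothetical names that mirror the workflow inputs; they are not part of the WDL code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Project:
    # Hypothetical stand-in for the workflow's Project input struct.
    project_id: str
    run_project_cohort_analysis: bool = False


def plan_cohort_analyses(
    projects: List[Project],
    cohort_id: str,
    run_cross_team_cohort_analysis: bool = False,
) -> List[str]:
    """Return the IDs under which cohort analysis would run."""
    planned = []
    for project in projects:
        # Project-level cohort analysis is opt-in per project.
        if project.run_project_cohort_analysis:
            planned.append(project.project_id)
    # The cross-team analysis covers all samples from all projects.
    if run_cross_team_cohort_analysis:
        planned.append(cohort_id)
    return planned


projects = [
    Project("team-a", run_project_cohort_analysis=True),
    Project("team-b", run_project_cohort_analysis=False),
]
planned = plan_cohort_analyses(projects, "example-cohort", True)
```

Here `planned` contains `team-a` and `example-cohort`; preprocessing is not modelled because it runs for every sample regardless of these flags.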
An input template file can be found at workflows/inputs.json.
Type | Name | Description |
---|---|---|
String | cohort_id | Name of the cohort; used to name output files during cross-team cohort analysis. |
Array[Project] | projects | The project ID, set of samples and their associated reads and metadata, output bucket locations, and whether or not to run project-level cohort analysis. |
File | cellranger_reference_data | Cellranger transcriptome reference data; see https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest. |
Float? | cellbender_fpr | Cellbender false positive rate for signal removal. [0.0] |
Boolean? | run_cross_team_cohort_analysis | Whether to run downstream harmonization steps on all samples across projects. If set to false, only preprocessing steps (cellranger and generating the initial adata object(s)) will run for samples. [false] |
String | cohort_raw_data_bucket | Bucket to upload cross-team cohort intermediate files to. |
String | cohort_staging_data_bucket | Bucket to upload cross-team cohort analysis outputs to. |
Int? | n_top_genes | Number of highly variable genes (HVGs) to keep. [8000] |
String? | scvi_latent_key | Latent key to save the scVI latent to. ['X_scvi'] |
String? | batch_key | Key in AnnData object for batch information. ['batch_id'] |
File | cell_type_markers_list | CSV file containing a list of major cell type markers; used to annotate cells. |
Array[String]? | groups | Groups to produce umap plots for. ['sample', 'batch', 'cell_type'] |
Array[String]? | features | Features to produce umap plots for. ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_rb', 'doublet_score', 'S_score', 'G2M_score'] |
String | container_registry | Container registry where workflow Docker images are hosted. |

Fields of the Project struct:

Type | Name | Description |
---|---|---|
String | project_id | Unique identifier for project; used for naming output files |
Array[Sample] | samples | The set of samples associated with this project |
Boolean | run_project_cohort_analysis | Whether or not to run cohort analysis within the project |
String | raw_data_bucket | Raw data bucket; intermediate output files that are not final workflow outputs are stored here |
String | staging_data_bucket | Staging data bucket; final project-level outputs are stored here |

Fields of the Sample struct:

Type | Name | Description |
---|---|---|
String | sample_id | Unique identifier for the sample within the project |
String? | batch | The sample's batch. If unset, the analysis will stop after running cellranger_count. |
File | fastq_R1 | Path to the sample's read 1 FASTQ file |
File | fastq_R2 | Path to the sample's read 2 FASTQ file |
File? | fastq_I1 | Optional fastq index 1 |
File? | fastq_I2 | Optional fastq index 2 |
The inputs JSON may be generated manually; however, when running a large number of samples, this can become unwieldy. The generate_inputs utility script may be used to automatically generate the inputs JSON. The script requires the libraries outlined in the requirements.txt file and the following inputs:
- project-tsv: One or more project TSVs with one row per sample and columns project_id, sample_id, batch, fastq_path. All samples from all projects may be included in the same project TSV, or multiple project TSVs may be provided.
    - project_id: A unique identifier for the project from which the sample(s) arose
    - sample_id: A unique identifier for the sample within the project
    - batch: The sample's batch
    - fastq_path: The directory in which paired sample FASTQs may be found, including the gs:// bucket name and path
- fastq-locs-txt: FASTQ locations for all samples provided in the project-tsv, one per line. Each sample is expected to have one set of paired fastqs located at `${fastq_path}/${sample_id}*`. The read 1 file should include 'R1' somewhere in the filename; the read 2 file should include 'R2' somewhere in the filename. Generate this file e.g. by running `gsutil ls gs://fastq_bucket/some/path/**.fastq.gz >> fastq_locs.txt`
- inputs-template: The inputs template JSON file into which the projects information derived from the project-tsv will be inserted. Must have a key ending in `*.projects`. Other default values filled out in the inputs template will be written to the output inputs.json file.
- run-project-cohort-analysis: Optionally run project-level cohort analysis for provided projects. This value will apply to all projects. [false]
- output-file: Optional output file name. [inputs.json]
Example usage:
./util/generate_inputs \
--project-tsv sample_info.tsv \
--fastq-locs-txt fastq_locs.txt \
--inputs-template workflows/inputs.json \
--run-project-cohort-analysis \
--output-file harmony_workflow_inputs.json
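
The Python sketch below shows roughly how a project TSV and a FASTQ location list could be combined into the projects array and written into a template. It is not the generate_inputs implementation: `build_projects`, `insert_projects`, and the `hypothetical_workflow` key prefix are illustrative names, and the bucket fields of each project are left out because how the script derives them is not described here.

```python
import csv
import io
from collections import defaultdict


def build_projects(project_tsv_text, fastq_locs, run_project_cohort_analysis=False):
    """Group TSV rows by project and attach each sample's R1/R2 FASTQ paths."""
    samples_by_project = defaultdict(list)
    reader = csv.DictReader(io.StringIO(project_tsv_text), delimiter="\t")
    for row in reader:
        # Mirrors the ${fastq_path}/${sample_id}* pattern; note that a plain
        # prefix match also catches sample IDs that extend this one.
        prefix = f"{row['fastq_path'].rstrip('/')}/{row['sample_id']}"
        matches = [loc for loc in fastq_locs if loc.startswith(prefix)]
        # Look for R1/R2 in the filename only, not in the directory path.
        r1 = [m for m in matches if "R1" in m.rsplit("/", 1)[-1]]
        r2 = [m for m in matches if "R2" in m.rsplit("/", 1)[-1]]
        if len(r1) != 1 or len(r2) != 1:
            raise ValueError(f"Expected one R1 and one R2 file for {row['sample_id']}")
        samples_by_project[row["project_id"]].append(
            {
                "sample_id": row["sample_id"],
                "batch": row["batch"],
                "fastq_R1": r1[0],
                "fastq_R2": r2[0],
            }
        )
    return [
        {
            "project_id": project_id,
            "samples": samples,
            "run_project_cohort_analysis": run_project_cohort_analysis,
        }
        for project_id, samples in samples_by_project.items()
    ]


def insert_projects(template, projects):
    """Return a copy of the template with its '*.projects' key filled in."""
    keys = [key for key in template if key.endswith(".projects")]
    if len(keys) != 1:
        raise ValueError("Template must contain exactly one key ending in '.projects'")
    filled = dict(template)
    filled[keys[0]] = projects
    return filled


tsv_text = (
    "project_id\tsample_id\tbatch\tfastq_path\n"
    "team-a\tS1\tbatch1\tgs://fastq_bucket/some/path\n"
)
fastq_locs = [
    "gs://fastq_bucket/some/path/S1_R1.fastq.gz",
    "gs://fastq_bucket/some/path/S1_R2.fastq.gz",
]
template = {"hypothetical_workflow.projects": [], "hypothetical_workflow.cohort_id": "example"}
inputs = insert_projects(template, build_projects(tsv_text, fastq_locs, True))
```

Other keys already present in the template are carried through unchanged, which matches the behaviour described for default values above.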
- cohort_id: either the project_id for project-level cohort analysis, or the cohort_id for the full cohort
- workflow_run_timestamp: format: `%Y-%m-%dT%H-%M-%SZ`
- The list of samples used to generate the cohort analysis will be output alongside other cohort analysis outputs in the staging data bucket (`${cohort_id}.sample_list.tsv`)
- The MANIFEST.tsv file in the staging data bucket describes the workflow name, version, and timestamp for the run used to generate each file in that directory
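
For reference, the timestamp format above can be produced and parsed with Python's standard library, as in this minimal sketch. The helper names are hypothetical, and treating the trailing `Z` as UTC is an assumption based on the usual meaning of that suffix.

```python
from datetime import datetime, timezone

# Format string taken from the workflow_run_timestamp description.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H-%M-%SZ"


def format_run_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime in the run timestamp format (assumes UTC)."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def parse_run_timestamp(text: str) -> datetime:
    """Parse a run timestamp back into a UTC datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


example = format_run_timestamp(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
# example == "2024-01-02T03-04-05Z"
```

Because every field is zero-padded and ordered from year to second, sorting these strings alphabetically also sorts the runs chronologically.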
The raw data bucket will contain some artifacts generated as part of workflow execution. Following successful workflow execution, the artifacts will also be copied into the staging bucket as final outputs.
In the workflow, task outputs are specified either as String (final outputs, which are copied to the raw data and staging buckets) or as File (intermediate outputs that are periodically cleaned up, which live in the cromwell-output bucket). This was implemented to reduce storage costs. Preprocess final outputs are defined in the workflow at main.wdl and cohort_analysis.wdl, and cohort analysis final outputs are defined at cohort_analysis.wdl.
asap-raw-data-{cohort,team-xxyy}
└── workflow_execution
    ├── cohort_analysis
    │   └── ${cohort_analysis_workflow_version}
    │       └── ${workflow_run_timestamp}
    │           └── <cohort outputs>
    └── preprocess // only produced in project raw data buckets, not in the full cohort bucket
        ├── cellranger
        │   └── ${cellranger_task_version}
        │       └── <cellranger output>
        ├── remove_technical_artifacts
        │   └── ${preprocess_workflow_version}
        │       └── <remove_technical_artifacts output>
        └── counts_to_adata
            └── ${preprocess_workflow_version}
                └── <counts_to_adata output>
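
The layout above can be expressed as two small path builders. This Python sketch is only an illustration of the directory scheme; the function names are hypothetical and the `gs://` prefix assumes the buckets are Google Cloud Storage buckets, as the FASTQ paths elsewhere in this document suggest.

```python
# Task directories listed under workflow_execution/preprocess.
PREPROCESS_TASKS = ("cellranger", "remove_technical_artifacts", "counts_to_adata")


def cohort_analysis_raw_path(bucket: str, workflow_version: str, run_timestamp: str) -> str:
    """Directory holding cohort analysis outputs for one timestamped run."""
    return (
        f"gs://{bucket}/workflow_execution/cohort_analysis/"
        f"{workflow_version}/{run_timestamp}"
    )


def preprocess_raw_path(bucket: str, task: str, version: str) -> str:
    """Directory holding the outputs of one preprocessing task at one version."""
    if task not in PREPROCESS_TASKS:
        raise ValueError(f"Unknown preprocess task: {task}")
    return f"gs://{bucket}/workflow_execution/preprocess/{task}/{version}"


cohort_dir = cohort_analysis_raw_path("asap-raw-data-cohort", "v1.0.0", "2024-01-02T03-04-05Z")
cellranger_dir = preprocess_raw_path("asap-raw-data-team-xxyy", "cellranger", "v1.0.0")
```

Preprocessing paths are keyed by version only, while cohort analysis paths also include the run timestamp, which is consistent with preprocessing being rerun only on version changes and cohort analysis runs never reusing earlier intermediate files. The version strings shown are placeholders.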
Staging data (intermediate workflow objects and final workflow outputs for the latest run of the workflow)
Following QC by researchers, the objects in the dev or uat bucket are synced into the curated data buckets, maintaining the same file structure. Curated data buckets are named asap-curated-data-{cohort,team-xxyy}.
Data may be synced using the promote_staging_data script.
asap-dev-data-{cohort,team-xxyy}
├── cohort_analysis
│   ├── ${cohort_id}.sample_list.tsv
│   ├── ${cohort_id}.doublet_score.violin.png
│   ├── ${cohort_id}.n_genes_by_counts.violin.png
│   ├── ${cohort_id}.pct_counts_mt.violin.png
│   ├── ${cohort_id}.pct_counts_rb.violin.png
│   ├── ${cohort_id}.total_counts.violin.png
│   ├── ${cohort_id}.validation_metrics.csv
│   ├── ${cohort_id}.cell_types.csv
│   ├── ${cohort_id}.annotate_cells.metadata.csv
│   ├── ${cohort_id}.harmony_integrated.h5ad
│   ├── ${cohort_id}.scib_report.csv
│   ├── ${cohort_id}.features.umap.png
│   ├── ${cohort_id}.groups.umap.png
│   └── MANIFEST.tsv
└── preprocess
    ├── ${cohort_id}.scvi_model.tar.gz
    ├── ${sampleA_id}.filtered_feature_bc_matrix.h5
    ├── ${sampleA_id}.metrics_summary.csv
    ├── ${sampleA_id}.molecule_info.h5
    ├── ${sampleA_id}.raw_feature_bc_matrix.h5
    ├── ${sampleA_id}.cellbender_report.html
    ├── ${sampleA_id}.cellbender_metrics.csv
    ├── ${sampleA_id}.cellbender_filtered.h5
    ├── ${sampleA_id}.cellbender_ckpt.tar.gz
    ├── ${sampleA_id}.cellbender_cell_barcodes.csv
    ├── ${sampleA_id}.cellbender.pdf
    ├── ${sampleA_id}.cellbender.log
    ├── ${sampleA_id}.cellbender.h5
    ├── ${sampleA_id}.cellbend_posterior.h5
    ├── ${sampleA_id}.adata_object.h5ad
    ├── ${sampleB_id}.filtered_feature_bc_matrix.h5
    ├── ${sampleB_id}.metrics_summary.csv
    ├── ${sampleB_id}.molecule_info.h5
    ├── ${sampleB_id}.raw_feature_bc_matrix.h5
    ├── ${sampleB_id}.cellbender_report.html
    ├── ${sampleB_id}.cellbender_metrics.csv
    ├── ${sampleB_id}.cellbender_filtered.h5
    ├── ${sampleB_id}.cellbender_ckpt.tar.gz
    ├── ${sampleB_id}.cellbender_cell_barcodes.csv
    ├── ${sampleB_id}.cellbender.pdf
    ├── ${sampleB_id}.cellbender.log
    ├── ${sampleB_id}.cellbender.h5
    ├── ${sampleB_id}.cellbend_posterior.h5
    ├── ${sampleB_id}.adata_object.h5ad
    ├── ...
    ├── ${sampleN_id}.filtered_feature_bc_matrix.h5
    ├── ${sampleN_id}.metrics_summary.csv
    ├── ${sampleN_id}.molecule_info.h5
    ├── ${sampleN_id}.raw_feature_bc_matrix.h5
    ├── ${sampleN_id}.cellbender_report.html
    ├── ${sampleN_id}.cellbender_metrics.csv
    ├── ${sampleN_id}.cellbender_filtered.h5
    ├── ${sampleN_id}.cellbender_ckpt.tar.gz
    ├── ${sampleN_id}.cellbender_cell_barcodes.csv
    ├── ${sampleN_id}.cellbender.pdf
    ├── ${sampleN_id}.cellbender.log
    ├── ${sampleN_id}.cellbender.h5
    ├── ${sampleN_id}.cellbend_posterior.h5
    ├── ${sampleN_id}.adata_object.h5ad
    └── MANIFEST.tsv
The promote_staging_data script can be used to promote staging data that has been approved to the curated data bucket for a team or set of teams.
This script rsyncs all files in the staging bucket to the curated bucket's preprocess and cohort_analysis directories. Exercise caution when using this script; files that are not present in the source (staging) bucket will be deleted at the destination (curated) bucket.
The script defaults to a dry run, printing out the files that would be copied or deleted for each selected team.
-h Display this message and exit
-t Comma-separated set of teams to promote data for
-a Promote all teams' data
-l List available teams
-p Promote data. If this option is not selected, data that would be copied or deleted is printed out, but files are not actually changed (dry run)
-s Staging bucket type; options are 'uat' or 'dev' ['uat']
# List available teams
./util/promote_staging_data -l
# Print out the files that would be copied or deleted from the staging bucket to the curated bucket for teams team-hafler, team-lee, and cohort
./util/promote_staging_data -t team-hafler,team-lee,cohort
# Promote data for team-hafler, team-hardy, team-jakobsson, team-lee, team-scherzer, team-sulzer, and cohort
./util/promote_staging_data -a -p -s dev
# Print out the files that would be copied or deleted from the staging bucket to the curated bucket for unembargoed cohort (team-hafler, team-lee, team-jakobsson, and team-scherzer)
./util/promote_staging_data -t cohort
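
To make the copy-and-delete behaviour concrete, the Python sketch below computes what a mirroring sync would do given listings of the two buckets. It models the semantics described above rather than the script itself: `plan_promotion` is a hypothetical helper, and comparing files by checksum is an assumption about how changed files are detected.

```python
def plan_promotion(staging, curated):
    """Work out which files a mirroring sync would copy and delete.

    Both arguments map a relative file path to a checksum string.
    """
    # Copy anything missing from the curated bucket or differing in content.
    to_copy = sorted(
        path for path, checksum in staging.items() if curated.get(path) != checksum
    )
    # Delete anything in the curated bucket that staging no longer contains.
    to_delete = sorted(path for path in curated if path not in staging)
    return to_copy, to_delete


staging = {
    "preprocess/sampleA.adata_object.h5ad": "aaa",
    "cohort_analysis/MANIFEST.tsv": "bbb",
}
curated = {
    "preprocess/sampleA.adata_object.h5ad": "aaa",
    "cohort_analysis/MANIFEST.tsv": "old",
    "preprocess/stale_sample.adata_object.h5ad": "ccc",
}
to_copy, to_delete = plan_promotion(staging, curated)
```

In this example the updated MANIFEST.tsv would be copied and the stale sample file would be deleted from the curated bucket, which is why reviewing the dry-run output before passing `-p` is worthwhile.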
Docker images are defined in the docker directory. Each image must minimally define a build.env file and a Dockerfile.
Example directory structure:
docker
├── scvi
│   ├── build.env
│   └── Dockerfile
└── samtools
    ├── build.env
    └── Dockerfile
Each target image is defined using the build.env file, which is used to specify the name and version tag for the corresponding Docker image. It must contain at minimum the following variables:
- IMAGE_NAME
- IMAGE_TAG

All variables defined in the build.env file will be made available as build arguments during Docker image build.
The DOCKERFILE variable may be used to specify the path to a Dockerfile if that file is not found alongside the build.env file, for example when multiple images use the same base Dockerfile definition.
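
As an illustration of how a build.env file could translate into a `docker build` invocation, consider the Python sketch below. It is not the build script's actual logic: `parse_build_env` and `docker_build_command` are hypothetical helpers, the file is assumed to contain plain `KEY=VALUE` lines, and prefixing the image name with the container registry is an assumption about how images are tagged.

```python
REQUIRED_VARIABLES = ("IMAGE_NAME", "IMAGE_TAG")


def parse_build_env(text):
    """Parse KEY=VALUE lines (assumed format) and check the required variables."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"Not a KEY=VALUE line: {line}")
        variables[key.strip()] = value.strip().strip('"').strip("'")
    missing = [name for name in REQUIRED_VARIABLES if name not in variables]
    if missing:
        raise ValueError(f"build.env is missing: {', '.join(missing)}")
    return variables


def docker_build_command(variables, image_dir, container_registry=None):
    """Assemble a docker build command as an argument list."""
    image = f"{variables['IMAGE_NAME']}:{variables['IMAGE_TAG']}"
    if container_registry:
        image = f"{container_registry}/{image}"
    # DOCKERFILE overrides the default Dockerfile next to build.env.
    dockerfile = variables.get("DOCKERFILE", f"{image_dir}/Dockerfile")
    command = ["docker", "build", "-t", image, "-f", dockerfile]
    # Every variable in build.env is passed through as a build argument.
    for key, value in variables.items():
        command += ["--build-arg", f"{key}={value}"]
    command.append(image_dir)
    return command


variables = parse_build_env("# example values\nIMAGE_NAME=scvi\nIMAGE_TAG=1.0.0\n")
command = docker_build_command(variables, "docker/scvi", container_registry="dnastack")
```

The resulting list can be passed to `subprocess.run`; building it as a list rather than a single string avoids shell quoting problems when values contain spaces. The tag `1.0.0` is a placeholder value.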
Docker images can be built using the build_docker_images utility script.
# Build a single image
./util/build_docker_images -d docker/scvi
# Build all images in the `docker` directory
./util/build_docker_images -d docker
# Build and push all images in the docker directory, using the `dnastack` container registry
./util/build_docker_images -d docker -c dnastack -p