Easy to use pipelines for the Sentieon tools on the Google Cloud
For a tutorial, see Google's tutorial on running a Sentieon DNAseq pipeline. For more customized pipelines and additional details on the Sentieon software, please visit https://www.sentieon.com.
- Highlights
- Prerequisites
- Running a pipeline
- Recommended configurations
- Additional options - germline
- Additional options - somatic
- Where to get help
- Easily run the Sentieon Pipelines on the Google Cloud.
- Pipelines are optimized by Sentieon to be well-tuned for efficiently processing WES and WGS samples.
- All Sentieon pipelines and variant callers are available, including DNAseq, DNAscope, TNseq, and TNscope.
- Results match the GATK Germline and Somatic Best Practices Pipelines.
- Automatic 14-day free trial of the Sentieon software on the Google Cloud.
- Install Python 2.7+.
- Select or create a GCP project.
- Make sure that billing is enabled for your Google Cloud Platform project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
- Install and initialize the Cloud SDK.
- Update and install gcloud components:
gcloud components update &&
gcloud components install alpha
- Install git to download the required files.
- By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time. For best results in this tutorial, you should request additional quota above your project's default. Recommendations for quota increases are provided in the following list, as well as the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:
  - CPUs: 64
  - Persistent Disk Standard (GB): 375

  You can leave other quota request fields empty to keep your current quotas.
Set up a Python virtualenv to manage the environment. First, install virtualenv if necessary:
pip install --upgrade virtualenv
Create and activate a virtual environment, then install the required Python dependencies:
virtualenv env
source env/bin/activate
pip install --upgrade \
pyyaml \
google-api-python-client \
google-auth \
google-cloud-storage \
google-auth-httplib2
Download the pipeline script and move into the new directory.
git clone https://github.com/sentieon/sentieon-google-genomics.git
cd sentieon-google-genomics
The runner script accepts a JSON file as input. In the repository you downloaded, there is an examples/example.json file with the following content:
{
"FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
"FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
"REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
The following table describes the JSON keys in the file:
JSON key | Description |
---|---|
FQ1 | The FASTQ file containing the first read of each pair. |
FQ2 | The FASTQ file containing the second read of each pair. |
BAM | The input BAM file, if applicable. |
REF | The reference genome. If set, the reference index files are assumed to exist. |
OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. |
ZONES | A comma-separated list of GCP zones to use for the worker node. |
PROJECT_ID | Your GCP project ID. |
REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets. |
EMAIL | Your email address. |
The FQ1, FQ2, REF, and ZONES fields will work with the defaults. However, the OUTPUT_BUCKET, PROJECT_ID, REQUESTER_PROJECT, and EMAIL fields will need to be updated to point to your specific output bucket/path, Project ID, and email address.
Edit the OUTPUT_BUCKET, PROJECT_ID, REQUESTER_PROJECT, and EMAIL fields in examples/example.json to your output bucket/path, the GCP Project ID that you set up earlier, and the email address you want associated with your Sentieon license. By supplying the EMAIL field, your PROJECT_ID will automatically receive a 14-day free trial for the Sentieon software on the Google Cloud.
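The edit described above can also be scripted. The following is a minimal sketch, not part of the repository: the `fill_config` helper, its parameter names, and the choice to default REQUESTER_PROJECT to the same project are assumptions made for illustration. It replaces the four placeholder fields and refuses to continue if any `YOUR_..._HERE` value is left behind.

```python
import json

# Fields that the tutorial says must be edited before the first run.
PLACEHOLDER_KEYS = ("OUTPUT_BUCKET", "PROJECT_ID", "REQUESTER_PROJECT", "EMAIL")


def fill_config(template, output_bucket, project_id, email, requester_project=None):
    """Return a copy of `template` with the placeholder fields replaced.

    Hypothetical helper: REQUESTER_PROJECT defaults to `project_id`,
    which is an assumption that mirrors the example file.
    """
    config = dict(template)
    config["OUTPUT_BUCKET"] = output_bucket
    config["PROJECT_ID"] = project_id
    config["REQUESTER_PROJECT"] = requester_project or project_id
    config["EMAIL"] = email
    # Catch values that were passed through without being edited.
    leftover = [key for key in PLACEHOLDER_KEYS
                if str(config[key]).startswith("YOUR_")]
    if leftover:
        raise ValueError("unedited fields: " + ", ".join(leftover))
    return config


# Example with an in-memory template; the values are illustrative only.
template = {
    "OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
    "PROJECT_ID": "YOUR_PROJECT_HERE",
    "REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
    "EMAIL": "YOUR_EMAIL_HERE",
}
config = fill_config(template, "gs://my-bucket/output", "my-project", "me@example.com")
print(json.dumps(config, indent=2, sort_keys=True))
```

In practice the template would be loaded from examples/example.json with `json.load` and the result written back with `json.dump`.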
After modifying the examples/example.json file, you can use the following command to run the DNAseq pipeline on a small test dataset:
python runner/sentieon_runner.py --requester_project $PROJECT_ID examples/example.json
The --requester_project argument configures the software to use the specified PROJECT_ID when checking locally that the input files exist. Alternatively, you can set --no_check_inputs_exist to skip this check.
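When launching runs from another script, the command line can be assembled programmatically. This is a small sketch under stated assumptions: `build_runner_command` is a hypothetical helper, and it uses only the script path and the two flags named in this tutorial.

```python
def build_runner_command(config_path, requester_project=None, check_inputs=True):
    """Assemble the argument list for runner/sentieon_runner.py.

    Hypothetical helper; only the flags described in this tutorial are used.
    """
    command = ["python", "runner/sentieon_runner.py"]
    if requester_project:
        command += ["--requester_project", requester_project]
    if not check_inputs:
        # Skip the local check that the input files exist.
        command.append("--no_check_inputs_exist")
    command.append(config_path)
    return command


print(" ".join(build_runner_command("examples/example.json", "my-project")))
```

The returned list can be passed to `subprocess.check_call` from the repository root.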
If execution is successful, the runner script will print some logging information followed by Operation succeeded to the terminal. Output files from the pipeline can then be found in the OUTPUT_BUCKET location in Google Cloud Storage, including alignment (BAM) files, variant calls, sample metrics, and logging information.
In the event of run failure, some diagnostic information will be printed to the screen followed by an error message. For assistance, please send the diagnostic information along with any log files in OUTPUT_BUCKET/worker_logs/ to [email protected].
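To locate the log files mentioned above, the worker_logs/ path can be derived from the OUTPUT_BUCKET value. The sketch below assumes OUTPUT_BUCKET is written as a `gs://` URL, as in the recommended configurations; `worker_logs_uri` is a hypothetical name.

```python
def worker_logs_uri(output_bucket):
    """Return the worker_logs/ location under OUTPUT_BUCKET.

    Assumes OUTPUT_BUCKET is given as a gs:// URL.
    """
    if not output_bucket.startswith("gs://"):
        raise ValueError("expected a gs:// URL, got: " + output_bucket)
    # Normalise a trailing slash so the result has exactly one separator.
    return output_bucket.rstrip("/") + "/worker_logs/"


print(worker_logs_uri("gs://my-bucket/output"))
```

The resulting URL can be handed to `gsutil ls` or `gsutil cp -r` to list or download the logs before sending them.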
In the examples directory, you can find the following example configurations:
Configuration | Pipeline |
---|---|
100x_wes.json | DNAseq pipeline from FASTQ to VCF for Whole Exome Sequencing Data |
30x_wgs.json | DNAseq pipeline from FASTQ to VCF for Whole Genome Sequencing Data |
tn_example.json | TNseq pipeline from FASTQ to VCF for Tumor Normal Pairs |
Below are some recommended configurations for some common use-cases. The cost and runtime estimates below assume that jobs are run on preemptible instances that were not preempted during job execution.
The following configuration will run a 30x human genome at a cost of approximately $1.35 and will take about 2 hours. This configuration can also be used to run a 100x whole exome at a cost of approximately $0.30 and will take about 35 minutes.
{
"FQ1": "gs://my-bucket/sample1_1.fastq.gz",
"FQ2": "gs://my-bucket/sample1_2.fastq.gz",
"REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
"OUTPUT_BUCKET": "gs://BUCKET",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "PROJECT_ID",
"EMAIL": "EMAIL",
"BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
"DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
"PREEMPTIBLE_TRIES": "2",
"NONPREEMPTIBLE_TRY": true,
"STREAM_INPUT": "True",
"DISK_SIZE": 300,
"PIPELINE": "GERMLINE",
"CALLING_ALGO": "Haplotyper"
}
The CALLING_ALGO key can be changed to DNAscope to use the Sentieon DNAscope variant caller for improved variant calling accuracy. For large input files, DISK_SIZE should be increased.
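As a rough guide for choosing DISK_SIZE, the machine options table in this document suggests a disk about 3x the size of the input files. The sketch below applies that rule. Two assumptions are made for illustration: the 300 GB value from the example configuration is treated as a floor, and sizes use decimal gigabytes. `recommended_disk_size_gb` is a hypothetical helper.

```python
GB = 10 ** 9  # decimal gigabytes; an assumption made for this sketch


def recommended_disk_size_gb(input_sizes_bytes, minimum_gb=300):
    """Estimate DISK_SIZE as 3x the total input size, rounded up to whole GB.

    `minimum_gb` defaults to the value used in the example configuration;
    treating it as a lower bound is an assumption, not a documented rule.
    """
    total_bytes = int(sum(input_sizes_bytes))
    needed_gb = (3 * total_bytes + GB - 1) // GB  # ceiling division
    return max(minimum_gb, needed_gb)


# Two 80 GB FASTQ files need about 480 GB of disk under the 3x rule.
print(recommended_disk_size_gb([80 * GB, 80 * GB]))
```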
The following configuration will run a paired 60-30x human genome at a cost of approximately $3.70 and will take about 7 hours. This configuration can also be used to run a paired 150-150x human exome at a cost of approximately $0.60 and will take about 1.5 hours.
{
"TUMOR_FQ1": "gs://my-bucket/tumor1_1.fastq.gz",
"TUMOR_FQ2": "gs://my-bucket/tumor1_2.fastq.gz",
"FQ1": "gs://my-bucket/normal1_1.fastq.gz",
"FQ2": "gs://my-bucket/normal1_2.fastq.gz",
"REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
"OUTPUT_BUCKET": "gs://BUCKET",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "PROJECT_ID",
"EMAIL": "EMAIL",
"BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
"DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
"PREEMPTIBLE_TRIES": "2",
"NONPREEMPTIBLE_TRY": true,
"STREAM_INPUT": "True",
"DISK_SIZE": 300,
"PIPELINE": "SOMATIC",
"CALLING_ALGO": "TNhaplotyper"
}
The CALLING_ALGO key can be changed to TNsnv, TNhaplotyper, TNhaplotyper2, or TNscope to use Sentieon's TNsnv, TNhaplotyper, TNhaplotyper2, or TNscope variant callers, respectively. For large input files, DISK_SIZE should be increased.
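Because the accepted CALLING_ALGO values depend on PIPELINE, a quick check before submitting a job can catch mismatches early. This is a sketch based only on the values listed in this document; `check_calling_algo` is a hypothetical helper, and the runner itself may validate differently.

```python
# Allowed callers per pipeline, taken from the option tables in this document.
VALID_ALGOS = {
    "GERMLINE": ("Haplotyper", "DNAscope"),
    "SOMATIC": ("TNsnv", "TNhaplotyper", "TNhaplotyper2", "TNscope"),
}


def check_calling_algo(config):
    """Return (pipeline, algo) or raise ValueError if they do not match."""
    pipeline = config.get("PIPELINE")
    algo = config.get("CALLING_ALGO")
    if pipeline not in VALID_ALGOS:
        raise ValueError("unknown PIPELINE: {!r}".format(pipeline))
    if algo not in VALID_ALGOS[pipeline]:
        raise ValueError("{!r} is not a {} caller; expected one of {}".format(
            algo, pipeline, ", ".join(VALID_ALGOS[pipeline])))
    return pipeline, algo


print(check_calling_algo({"PIPELINE": "SOMATIC", "CALLING_ALGO": "TNscope"}))
```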
JSON Key | Description |
---|---|
FQ1 | A comma-separated list of input R1 FASTQ files |
FQ2 | A comma-separated list of input R2 FASTQ files |
READGROUP | A comma-separated list of readgroup headers to add to the read data during alignment |
BAM | A comma-separated list of input BAM files |
REF | The path to the reference genome |
BQSR_SITES | A comma-separated list of known sites for BQSR |
DBSNP | A dbSNP file to use during variant calling |
INTERVAL | A string of interval(s) to use during variant calling |
INTERVAL_FILE | A file of interval(s) to use during variant calling |
DNASCOPE_MODEL | A trained model to use during DNAscope variant calling |
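For samples sequenced across several lanes, the FQ1 and FQ2 values in the table above are comma-separated lists. The sketch below builds both strings from a list of (R1, R2) pairs. It assumes that the i-th entry of FQ1 is paired with the i-th entry of FQ2; `fastq_lists` is a hypothetical helper and the file names are placeholders.

```python
def fastq_lists(pairs):
    """Build comma-separated FQ1/FQ2 values from (r1, r2) path pairs.

    Assumes entries are matched by position between the two lists.
    """
    r1_paths, r2_paths = [], []
    for r1, r2 in pairs:
        for path in (r1, r2):
            # A comma inside a path would corrupt the comma-separated list.
            if "," in path:
                raise ValueError("path contains a comma: " + path)
        r1_paths.append(r1)
        r2_paths.append(r2)
    return {"FQ1": ",".join(r1_paths), "FQ2": ",".join(r2_paths)}


lanes = [
    ("gs://my-bucket/lane1_1.fastq.gz", "gs://my-bucket/lane1_2.fastq.gz"),
    ("gs://my-bucket/lane2_1.fastq.gz", "gs://my-bucket/lane2_2.fastq.gz"),
]
print(fastq_lists(lanes))
```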
JSON Key | Description |
---|---|
ZONES | GCE Zones to potentially launch the job in |
DISK_SIZE | The size of the hard disk to use (should be 3x the size of the input files) |
MACHINE_TYPE | The type of GCE machine to use to run the pipeline |
JSON Key | Description |
---|---|
SENTIEON_VERSION | The version of the Sentieon software package to use |
DEDUP | Type of duplicate removal to run (nodup, markdup or rmdup) |
NO_METRICS | Skip running metrics collection |
NO_BAM_OUTPUT | Skip outputting a preprocessed BAM file |
NO_HAPLOTYPER | Skip variant calling |
GVCF_OUTPUT | Output variant calls in gVCF format rather than VCF format |
STREAM_INPUT | Stream the input FASTQ files directly from Google Cloud Storage |
RECALIBRATED_OUTPUT | Apply BQSR to the output preprocessed alignments (not recommended) |
CALLING_ARGS | A string of additional arguments to pass to the variant caller |
PIPELINE | Set to GERMLINE to run the germline variant calling pipeline |
CALLING_ALGO | The Sentieon variant calling algo to run. Either Haplotyper or DNAscope |
JSON Key | Description |
---|---|
OUTPUT_BUCKET | The Google Cloud Storage Bucket and path prefix to use for the output files |
EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
PROJECT_ID | Your GCP Project ID to use when running jobs |
REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets |
PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
NONPREEMPTIBLE_TRY | After PREEMPTIBLE_TRIES are exhausted, whether to try one additional run with standard instances |
JSON Key | Description |
---|---|
TUMOR_FQ1 | A comma-separated list of input R1 tumor FASTQ files |
TUMOR_FQ2 | A comma-separated list of input R2 tumor FASTQ files |
FQ1 | A comma-separated list of input R1 normal FASTQ files |
FQ2 | A comma-separated list of input R2 normal FASTQ files |
TUMOR_READGROUP | A comma-separated list of readgroup headers to add to the tumor read data during alignment |
READGROUP | A comma-separated list of readgroup headers to add to the normal read data during alignment |
TUMOR_BAM | A comma-separated list of input tumor BAM files |
BAM | A comma-separated list of input normal BAM files |
REF | The path to the reference genome |
BQSR_SITES | A comma-separated list of known sites for BQSR |
DBSNP | A dbSNP file to use during variant calling |
INTERVAL | A string of interval(s) to use during variant calling |
INTERVAL_FILE | A file of interval(s) to use during variant calling |
JSON Key | Description |
---|---|
ZONES | GCE Zones to potentially launch the job in |
DISK_SIZE | The size of the hard disk to use (should be 3x the size of the input files) |
MACHINE_TYPE | The type of GCE machine to use to run the pipeline |
JSON Key | Description |
---|---|
SENTIEON_VERSION | The version of the Sentieon software package to use |
DEDUP | Type of duplicate removal to run (nodup, markdup or rmdup) |
NO_METRICS | Skip running metrics collection |
NO_BAM_OUTPUT | Skip outputting a preprocessed BAM file |
NO_VCF | Skip variant calling |
STREAM_INPUT | Stream the input FASTQ files directly from Google Cloud Storage |
RECALIBRATED_OUTPUT | Apply BQSR to the output preprocessed alignments (not recommended) |
CALLING_ARGS | A string of additional arguments to pass to the variant caller |
PIPELINE | Set to SOMATIC to run the somatic variant calling pipeline |
RUN_TNSNV | If using the TNseq pipeline, use TNsnv for variant calling |
CALLING_ALGO | The Sentieon somatic variant calling algo to run. Either TNsnv, TNhaplotyper, TNhaplotyper2, or TNscope |
JSON Key | Description |
---|---|
OUTPUT_BUCKET | The Google Cloud Storage Bucket and path prefix to use for the output files |
EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
PROJECT_ID | Your GCP Project ID to use when running jobs |
REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets |
PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
NONPREEMPTIBLE_TRY | After PREEMPTIBLE_TRIES are exhausted, whether to try one additional run with standard instances |
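The interaction between PREEMPTIBLE_TRIES and NONPREEMPTIBLE_TRY can be pictured as an ordered list of attempts. The sketch below illustrates the behavior described in the table. It is not the runner's actual implementation; `plan_attempts` and the attempt labels are hypothetical.

```python
def plan_attempts(preemptible_tries, nonpreemptible_try):
    """List the instance types that would be tried, in order.

    PREEMPTIBLE_TRIES may appear as a string in the JSON configuration
    (for example "2"), so it is converted with int() here.
    """
    attempts = ["preemptible"] * int(preemptible_tries)
    if nonpreemptible_try:
        # One final run on a standard (non-preemptible) instance.
        attempts.append("standard")
    return attempts


print(plan_attempts("2", True))
```

With the recommended configurations, a job would stop at the first attempt that succeeds, so the cost estimates above correspond to a run that finishes on its first preemptible attempt.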
Please email [email protected] with any questions.