From 4897a1b05b6531e2a9cbb2bb2e35b47517ce1542 Mon Sep 17 00:00:00 2001 From: TCLamnidis Date: Wed, 2 Oct 2019 16:37:30 +0200 Subject: [PATCH 1/6] Added sex_deterrmine process. Fixed minor bug with skip_mapping for bwa process. --- CHANGELOG.md | 1 + README.md | 1 + docs/output.md | 2 ++ docs/usage.md | 12 ++++++++++++ main.nf | 50 +++++++++++++++++++++++++++++++++++++++++++++++--- 5 files changed, 63 insertions(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index f13b0e0df..0bcab09c4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0. * Added Support for automated tests using [GitHub Actions](https://github.com/features/actions) * Added genotyping capability through GATK UnifiedGenotyper (v3.5), GATK HaplotypeCaller (v4.1) and FreeBayes +* Added human sex determination module. ### `Fixed` diff --git a/README.md b/README.md index e9aebd88f..0fd66ce2b 100644 --- a/README.md +++ b/README.md @@ -115,3 +115,4 @@ If you've contributed and you're missing in here, please let me know and I'll ad * **FastP** Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics , 34(17), i884–i890. [https://doi.org/10.1093/bioinformatics/bty560](https://doi.org/10.1093/bioinformatics/bty560) Download: [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp) * **GATK 3.8** DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., … Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. [https://doi.org/10.1038/ng.806](https://doi.org/10.1038/ng.806.) [Download](https://software.broadinstitute.org/gatk/download/) * **GATK 4.X** - no citation available yet +* **Sex.DetERRmine.py** Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature communications, 9(1), p.5018. Available at: [http://dx.doi.org/10.1038/s41467-018-07483-5](http://dx.doi.org/10.1038/s41467-018-07483-5). Download: [https://github.com/TCLamnidis/Sex.DetERRmine.git](https://github.com/TCLamnidis/Sex.DetERRmine.git). diff --git a/docs/output.md b/docs/output.md index e5188e1cc..86505aa70 100644 --- a/docs/output.md +++ b/docs/output.md @@ -410,3 +410,5 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir * `pmdtools/` this contains raw output statistics of pmdtools (estimates of frequencies of subsititutions), and BAM files which have been filtered to remove reads that do not have a Post-mortem damage (PMD) score of `--pmdtools_threshold`. The BAM files do not have corresponding index files. * `trimmed_bam/` this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_left` and `--bamutils_clip_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read). * `genotyping/` this contains all the (gzipped) VCF files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have VCF files corresponding to your deduplicated BAM files, as well as any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). +* `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. + diff --git a/docs/usage.md b/docs/usage.md index 80f6376e4..ed29024a2 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -668,6 +668,18 @@ Specify to skip over regions of high depth by discarding alignments overlapping Specify ploidy of sample in FreeBayes. Default is 2, diploid. +## Sex Determination + +An optional process for human DNA. It can be used to calculate the relative coverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard errors for these measurements are also calculated, assuming a binomial distribution of reads across the SNPs. + +### `--run_sexdeterrmine` + +Specify to run the optional process of sex determination. + +### `--sexdeterrmine_bedfile` + +Specify an optional bedfile of the list of SNPs to be used for X-/Y-rate calculation. Running without this parameter will considerably increase runtime, and render the resulting error bars unstrustworthy. Theoretically, any set of SNPs that are distant enough that two SNPs are unlikely to be covered by the same read can be used here. The programme was coded with the 1240K panel in mind. + ## Automatic Resubmission By default, if a pipeline step fails, EAGER2 will resubmit the job with twice the amount of CPU and memory. This will occur two times before failing. diff --git a/main.nf b/main.nf index 30d3e9599..64ab1e4bf 100644 --- a/main.nf +++ b/main.nf @@ -127,6 +127,10 @@ def helpMessage() { --gatk_ug_genotype_model Specify UnifiedGenotyper likelihood model. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'. Default: 'SNP'. --gatk_hc_emitrefconf Specify HaplotypeCaller mode for emitting reference confidence calls . Options: 'NONE', 'BP_RESOLUTION', 'GVCF'. Default: 'GVCF'. + Sex Determination + --run_sexdeterrmine Turn on sex determination. + --sexdeterrmine_bedfile Specify SNP panel in bed format for error bar calculation. (Optional, see documentation) + Other options: --outdir The output directory where the results will be saved --email Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits @@ -259,6 +263,10 @@ params.gatk_downsample = '250' params.gatk_out_mode = 'EMIT_VARIANTS_ONLY' params.gatk_dbsnp = '' +//Sex.DetERRmine settings +params.run_sexdeterrmine = false +params.sexdeterrmine_bedfile = '' + multiqc_config = file(params.multiqc_config) output_docs = file("$baseDir/docs/output.md") where_are_my_files = file("$baseDir/assets/where_are_my_files.txt") @@ -842,7 +850,7 @@ process bwa { tag "${name}" publishDir "${params.outdir}/mapping/bwa", mode: 'copy' - when: !params.circularmapper && !params.bwamem || !params.skip_mapping + when: !params.circularmapper && params.bwamem && !params.skip_mapping input: set val(name), file(reads) from ch_adapteremoval_for_bwa @@ -1251,7 +1259,7 @@ process markDup{ if (!params.skip_deduplication) { ch_filtering_for_skiprmdup.mix(ch_output_from_dedup, ch_output_from_markdup) .filter { it =~/.*_rmdup.bam/ } - .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_damageprofiler; ch_rmdup_for_qualimap; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils } + .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_damageprofiler; ch_rmdup_for_qualimap; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils; ch_for_sexdeterrmine } ch_filteringindex_for_skiprmdup.mix(ch_outputindex_from_dedup, ch_outputindex_from_markdup) .filter { it =~/.*_rmdup.bam.bai|.*_rmdup.bam.csi/ } @@ -1259,7 +1267,7 @@ if (!params.skip_deduplication) { } else { ch_filtering_for_skiprmdup - .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_damageprofiler; ch_rmdup_for_qualimap; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils } + .into { ch_rmdup_for_skipdamagemanipulation; ch_rmdup_for_damageprofiler; ch_rmdup_for_qualimap; ch_rmdup_for_pmdtools; ch_rmdup_for_bamutils; ch_for_sexdeterrmine } ch_filteringindex_for_skiprmdup .into { ch_rmdupindex_for_skipdamagemanipulation; ch_rmdupindex_for_damageprofiler; ch_rmdupindex_for_qualimap; ch_rmdupindex_for_pmdtools; ch_rmdupindex_for_bamutils } @@ -1584,6 +1592,42 @@ ch_gatk_download = Channel.value("download") } + /* + Step XX Sex determintion with error bar calculation. + */ + + process sex_deterrmine{ + publishDir "${params.outdir}/sex_determination", mode:"copy" + + when: + params.run_sexdeterrmine + + input: + val 'Bams' from ch_for_sexdeterrmine.collect() + + output: + file 'SexDet.txt' + + script: + if (params.sexdeterrmine_bedfile == '') { + """ + for i in ${Bams.join(' ')}; do + echo \$i >> bamlist.txt + done + + samtools depth -aa -q30 -Q30 -f bamlist.txt| sexdeterrmine -f bamlist.txt >SexDet.txt + """ + } else { + """ + for i in ${Bams.join(' ')}; do + echo \$i >> bamlist.txt + done + + samtools depth -aa -q30 -Q30 -b ${params.sexdeterrmine_bedfile} -f bamlist.txt| sexdeterrmine -f bamlist.txt >SexDet.txt + """ + } + } + /* Genotyping tools: - angsd From 79ad25fc19102aa470c0dd61eb6eb411bbadb65b Mon Sep 17 00:00:00 2001 From: TCLamnidis Date: Wed, 2 Oct 2019 16:47:54 +0200 Subject: [PATCH 2/6] Linting fixed. --- docs/usage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/usage.md b/docs/usage.md index ed29024a2..2d01228cf 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -670,7 +670,7 @@ Specify ploidy of sample in FreeBayes. Default is 2, diploid. ## Sex Determination -An optional process for human DNA. It can be used to calculate the relative coverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard errors for these measurements are also calculated, assuming a binomial distribution of reads across the SNPs. +An optional process for human DNA. It can be used to calculate the relative coverage of X and Y chromosomes compared to the autosomes (X-/Y-rate). Standard errors for these measurements are also calculated, assuming a binomial distribution of reads across the SNPs. ### `--run_sexdeterrmine` From ea9c1437c01c92dc542409e7c0f02f3ce52b70c0 Mon Sep 17 00:00:00 2001 From: TCLamnidis Date: Wed, 2 Oct 2019 17:24:17 +0200 Subject: [PATCH 3/6] Fixed bug in while clause for fastqc process. --- main.nf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/main.nf b/main.nf index 64ab1e4bf..f109c08b9 100644 --- a/main.nf +++ b/main.nf @@ -687,7 +687,7 @@ process fastqc { saveAs: {filename -> filename.indexOf(".zip") > 0 ? "zips/$filename" : "$filename"} when: - !params.bam && !params.skip_fastqc || params.bam && !params.run_convertbam + !params.bam && !params.skip_fastqc || params.bam && params.run_convertbam input: set val(name), file(reads) from ch_convertbam_for_fastqc From 46d643022051371bd251eb7913b6d9b9f06480c1 Mon Sep 17 00:00:00 2001 From: TCLamnidis Date: Wed, 2 Oct 2019 17:39:07 +0200 Subject: [PATCH 4/6] Added a missing } --- main.nf | 1 + 1 file changed, 1 insertion(+) diff --git a/main.nf b/main.nf index 449a2a4b8..4c3d2210c 100644 --- a/main.nf +++ b/main.nf @@ -1707,6 +1707,7 @@ if (params.additional_vcf_files == '') { samtools depth -aa -q30 -Q30 -b ${params.sexdeterrmine_bedfile} -f bamlist.txt| sexdeterrmine -f bamlist.txt >SexDet.txt """ } + } /* Genotyping tools: From 51ed357c4ecea425b52dd2c90a435bda7d1098c8 Mon Sep 17 00:00:00 2001 From: "Thiseas C. Lamnidis" Date: Fri, 4 Oct 2019 11:05:02 +0200 Subject: [PATCH 5/6] Update output.md Added comment about error bars being unreliable when 'sexdeterrmine_bedfile' is not provided. --- docs/output.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/output.md b/docs/output.md index e43110656..a3878c0f9 100644 --- a/docs/output.md +++ b/docs/output.md @@ -411,4 +411,4 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir * `trimmed_bam/` this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_left` and `--bamutils_clip_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read). * `genotyping/` this contains all the (gzipped) VCF files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have VCF files corresponding to your deduplicated BAM files, as well as any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). * `MultiVCFAnalyzer` this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files. -* `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. +* `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer. From b4a8e0224bfe7214b4677542a727fdf1845b5d6b Mon Sep 17 00:00:00 2001 From: "Thiseas C. Lamnidis" Date: Fri, 4 Oct 2019 11:46:36 +0200 Subject: [PATCH 6/6] Linting --- docs/output.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/output.md b/docs/output.md index a3878c0f9..c06769968 100644 --- a/docs/output.md +++ b/docs/output.md @@ -411,4 +411,4 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir * `trimmed_bam/` this contains the BAM files with X number of bases trimmed off as defined with the `--bamutils_clip_left` and `--bamutils_clip_right` flags and corresponding index files. You can use these BAM files for downstream analysis such as re-mapping data with more stringent parameters (if you set trimming to remove the most likely places containing damage in the read). * `genotyping/` this contains all the (gzipped) VCF files produced by your genotyping module. The file suffix will have the genotyping tool name. You will have VCF files corresponding to your deduplicated BAM files, as well as any turned-on downstream processes that create BAMs (e.g. trimmed bams or pmd tools). * `MultiVCFAnalyzer` this contains all output from MultiVCFAnalyzer, including SNP calling statistics, various SNP table(s) and FASTA alignment files. -* `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer. +* `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.