Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSL2: add reference masking for PMDtools #1030

Merged
merged 8 commits into from
Dec 8, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,12 @@ process {
]
}

withName: 'GUNZIP_PMDFASTA|GUNZIP_PMDSNP|GUNZIP_SNPBED' {
publishDir = [
enabled: false
]
}

withName: SAMTOOLS_FAIDX {
publishDir = [
path: { "${params.outdir}/reference/${meta.id}/" },
Expand Down Expand Up @@ -656,6 +662,16 @@ process {
//
// DAMAGE MANIPULATION
//

withName: BEDTOOLS_MASKFASTA {
ext.prefix = { "${meta.id}.masked" }
publishDir = [
path: { "${params.outdir}/damage_manipulation/" },
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
mode: params.publish_dir_mode,
pattern: '*.masked.fa'
]
}

withName: MAPDAMAGE2 {
tag = { "${meta.reference}|${meta.sample_id}_${meta.library_id}" }
ext.args = { [
Expand Down
44 changes: 44 additions & 0 deletions docs/development/dev_docs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Documentation and how tos for developing eager

## How to add new input files and options to the reference sheet

To add new input files or option to the reference sheet, you have to complete all the following steps.

### Single-reference input workflow

1. Add your new parameter to nextflow.config.
2. Add parameter description to schema (nf-core schema build).
3. Read in new parameter (params.NEW) as input within the reference_indexing_single local subworkflow.
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
1. Add new line to the large .map{} operation starting on line 80 and add check if the file exists.
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
`def PARAM_NAME = params.NEW != null ? file(params.NEW, checkIfExists: true ) : ""`
2. Add PARAM_NAME to the result of the map operation. Double-check the order!
3. With the ch_ref_index_single.multiMap{} below you add the reference name as a meta. You can also combine your new parameter with others if useful for the workflow step.
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
`NEW_SUBCHANNEL: [ meta, PARAM_NAME ]`
4. Add your ch_ref_index_single.NEW_SUBCHANNEL to the final emit.
`NEW_EMIT = ch_ref_index_single.NEW_SUBCHANNEL`

### Multi-reference input workflow

1. Add new column named SOFTWARE_FILETYPE and test data to the test reference sheet (https://github.com/nf-core/test-datasets/blob/eager/reference/reference_sheet_multiref.csv).
2. Read in new input within the reference_indexing_multi local subworkflow.
1. Add new line to the large .map{} operation starting on line 30. Add check if the file exists if appropriate.
`def PARAM_NAME = row["SOFTWARE_FILETYPE"] != "" ? file(row["SOFTWARE_FILETYPE"], checkIfExists: true) : ""`
2. Add PARAM_NAME to the result of the map operation. Double-check the order!
3. With the ch_input_from_referencesheet.multiMap{} below you add the reference name as a meta. You can also combine your new parameter with others if useful for the workflow step.
`NEW_SUBCHANNEL: [ meta, PARAM_NAME ]`
4. Add ch_input_from_referencesheet.NEW_SUBCHANNEL to the final emit.
`NEW_EMIT = ch_input_from_referencesheet.NEW_SUBCHANNEL`

### Combining in the Reference Indexing workflow

1. Add you new parameter channel to the if condition selecting between the direct parameter input or the reference sheet input.
1. below line 25 for reference sheet input
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
`ch_NEW_CHANNEL = REFERENCE_INDEXING_MULTI.out.NEW_EMIT`
2. below "REFERENCE_INDEXING_SINGLE"
`ch_NEW_CHANNEL = REFERENCE_INDEXING_SINGLE.out.NEW_EMIT`
3. Filter out options that have not been provided.
`ch_NEW_CHANNEL = ch_NEW_CHANNEL.filter{ it[1] != "" }`
4. Add unzipping of zipped input files with GUNZIP.
5. Add ch_NEW_CHANNEL to the final emit.
`NEW_EMIT = ch_NEW_CHANNEL`
6. Call new inputs within the main eager.nf with `REFERENCE_INDEXING.out.NEW_EMIT`.
14 changes: 14 additions & 0 deletions docs/development/manual_tests.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,8 @@ Tool Specific combinations

- with default parameters
- with stricter threshold
- with fasta masking
- with fasta masking for 1 of 2 references

- BAM trimming

Expand Down Expand Up @@ -615,6 +617,18 @@ nextflow run . -profile test,docker --run_pmd_filtering -resume --outdir ./resul
# JK2802_JK2802_AGAATAACCTACCA_pmdfiltered.bam: 30
```

```bash
## PMD filtering with fasta masking
## Expect: damage_manipulation directory with *.masked.fa and bam and bai and flagstat per library
nextflow run . -profile test_humanbam,docker --run_pmd_filtering --damage_manipulation_pmdtools_reference_mask https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz -resume --outdir ./results
```

```bash
## PMD filtering with fasta masking for 1 of 2 references
## Expect: damage_manipulation directory with hs37d5_chr21-MT.masked.fa and bam and bai and flagstat per library and reference (22 files total). hs37d5_chr21-MT first masked with 1240K.pos.list_hs37d5.0based.bed.gz from reference sheet, PMD filtering run with masked reference fasta for hs37d5 and non-masked reference fasta for Mammoth_MT
nextflow run . -profile test_multiref,docker --run_pmd_filtering --outdir ./results
```

## BAM trimming

```bash
Expand Down
3 changes: 3 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -464,11 +464,14 @@ Be advised that this process introduces reference bias in the resulting rescaled

- `*_pmdfiltered.bam`: Reads aligned to a reference genome that pass the post-mortem-damage threshold, in BAM format.
- `*_pmdfiltered.bam.{bai,csi}`: Index file corresponding to the BAM file.
- `*_pmdfiltered.flagstat`: Statistics of aligned reads after from SAMtools `flagstat`, after filtering for post-mortem damage.
- `*.masked.fa`: _Only present if a BED file has been supplied with `--damage_manipulation_pmdtools_reference_mask` or within a reference sheet._ Reference FASTA file with positions from the BED file masked with BEDTools `maskfasta`.
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved

</details>

[pmdtools](https://github.com/pontussk/PMDtools) implements a likelihood framework incorporating a postmortem damage (PMD) score, base quality scores and biological polymorphism to identify degraded DNA sequences that are unlikely to originate from modern contamination. Using the model, each sequence is assigned a PMD score, for which positive values indicate support for the sequence being genuinely ancient.
By filtering a BAM file for damaged reads in this way, it is possible to ameliorate the effects of present-day contamination, and isolate sequences of likely ancient origin to be used downstream.
By default, all positions are assessed for damage, but it is possible to provide a FASTA file masked for specific references (`--damage_manipulation_pmdtools_masked_reference`) or a BED file to mask the reference FASTA within nf-core/eager (`--damage_manipulation_pmdtools_reference_mask`). This can alleviate reference by when running PMD filtering on capture data, where you might not want the allele of a SNP to be counted as damage when it is a transition.
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved

### Base Trimming

Expand Down
5 changes: 5 additions & 0 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,11 @@
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
"installed_by": ["modules"]
},
"bedtools/maskfasta": {
"branch": "master",
"git_sha": "516189e968feb4ebdd9921806988b4c12b4ac2dc",
"installed_by": ["modules"]
},
"bowtie2/align": {
"branch": "master",
"git_sha": "603ecbd9f45300c9788f197d2a15a005685b4220",
Expand Down
6 changes: 6 additions & 0 deletions modules/nf-core/bedtools/maskfasta/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

36 changes: 36 additions & 0 deletions modules/nf-core/bedtools/maskfasta/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

46 changes: 46 additions & 0 deletions modules/nf-core/bedtools/maskfasta/meta.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -181,6 +181,7 @@ params {
damage_manipulation_rescale_length_3p = 0
run_pmd_filtering = false
damage_manipulation_pmdtools_threshold = 3
damage_manipulation_pmdtools_masked_reference = null
damage_manipulation_pmdtools_reference_mask = null
run_trim_bam = false
damage_manipulation_bamutils_trim_double_stranded_none_udg_left = 0
Expand Down
28 changes: 16 additions & 12 deletions nextflow_schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -767,12 +767,20 @@
"description": "Specify PMDScore threshold for PMDtools.",
"help_text": "Specifies the PMDScore threshold to use in the pipeline when filtering BAM files for DNA damage. Only reads which surpass this damage score are considered for downstream DNA analysis.\n\n> Modifies PMDtools parameter: `--threshold`"
},
"damage_manipulation_pmdtools_masked_reference": {
"type": "string",
"fa_icon": "fas fa-mask",
"help_text": "Supplying a FASTA file to this parameter will result in filtering not conducted for the masked positions. /nThis can alleviate reference bias when running PMD filtering on capture data, where you might not want the allele of a SNP to be counted as damage when it is a transition.",
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
"description": "Specify a FASTA file with positions to be used with pmdtools.",
"pattern": "^\\S+\\.fa?(\\sta)$",
"format": "file-path"
},
"damage_manipulation_pmdtools_reference_mask": {
"type": "string",
"fa_icon": "fas fa-mask",
"help_text": "Supplying a bedfile to this parameter activates masking of the reference fasta at the contained sites prior to running PMDtools. Positions that are in the provided bedfile will be replaced by Ns in the reference genome. \nThis can alleviate reference bias when running PMD filtering on capture data, where you might not want the allele of a transition SNP to be counted as damage. Masking of the reference is done using `bedtools maskfasta`.",
"description": "Specify a bedfile to be used to mask the reference fasta prior to running pmdtools.",
"pattern": "^\\S+\\.bed$",
"pattern": "^\\S+\\.bed?(\\.gz)$",
"format": "file-path"
},
"run_trim_bam": {
Expand Down Expand Up @@ -947,8 +955,7 @@
"fa_icon": "fab fa-creative-commons-sampling-plus"
},
"skip_qualimap": {
"type": "boolean",
"default": "false"
"type": "boolean"
},
"snpcapture_bed": {
"type": "string",
Expand Down Expand Up @@ -1118,9 +1125,6 @@
{
"$ref": "#/definitions/mapping"
},
{
"$ref": "#/definitions/adna_damage_analysis"
},
{
"$ref": "#/definitions/bam_filtering"
},
Expand All @@ -1131,28 +1135,28 @@
"$ref": "#/definitions/deduplication"
},
{
"$ref": "#/definitions/mitochondrial_to_nuclear_ratio"
"$ref": "#/definitions/damage_manipulation"
},
{
"$ref": "#/definitions/mapping_statistics"
"$ref": "#/definitions/genotyping"
},
{
"$ref": "#/definitions/damage_manipulation"
"$ref": "#/definitions/mitochondrial_to_nuclear_ratio"
},
{
"$ref": "#/definitions/genotyping"
"$ref": "#/definitions/mapping_statistics"
},
{
"$ref": "#/definitions/adna_damage_analysis"
},
{
"$ref": "#/definitions/contamination_estimation"
"$ref": "#/definitions/feature_annotation_statistics"
},
{
"$ref": "#/definitions/host_removal"
},
{
"$ref": "#/definitions/feature_annotation_statistics"
"$ref": "#/definitions/contamination_estimation"
}
]
}
64 changes: 50 additions & 14 deletions subworkflows/local/manipulate_damage.nf
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
// Calculate PMD scores, trim, or rescale DNA damage from mapped reads.
//

include { BEDTOOLS_MASKFASTA } from '../../modules/nf-core/bedtools/maskfasta/main'
include { MAPDAMAGE2 } from '../../modules/nf-core/mapdamage2/main'
include { PMDTOOLS_FILTER } from '../../modules/nf-core/pmdtools/filter/main'
include { BAMUTIL_TRIMBAM } from '../../modules/nf-core/bamutil/trimbam/main'
Expand All @@ -13,8 +14,9 @@ include { SAMTOOLS_FLAGSTAT as SAMTOOLS_FLAGSTAT_DAMAGE_FILTERED } from '../../
// TODO: Add required channels and channel manipulations for reference-dependent bed masking before pmdtools. Requires multi-ref support before implementation.
workflow MANIPULATE_DAMAGE {
take:
ch_bam_bai // [ [ meta ], bam , bai ]
ch_fasta // [ [ meta ], fasta ]
ch_bam_bai // [ [ meta ], bam , bai ]
ch_fasta // [ [ meta ], fasta ]
ch_pmd_masking // [ [ meta ], masked_fasta, bed_for_masking ]

main:
ch_versions = Channel.empty()
Expand All @@ -35,7 +37,7 @@ workflow MANIPULATE_DAMAGE {
// Prepend a new meta that contains the meta.reference value as the new_meta.reference attribute
WorkflowEager.addNewMetaFromAttributes( it, "reference" , "reference" , false )
}
.combine(ch_refs, by: 0 ) // [ [combine_meta], [meta], bam, bai, [ref_meta], fasta ]
.combine( ch_refs, by: 0 ) // [ [combine_meta], [meta], bam, bai, [ref_meta], fasta ]

if ( params.run_mapdamage_rescaling ) {
ch_mapdamage_prep = ch_input_for_damage_manipulation
Expand Down Expand Up @@ -71,17 +73,51 @@ workflow MANIPULATE_DAMAGE {
}

if ( params.run_pmd_filtering ) {
// TODO Add module to produce Masked reference from given references and bed file (with meta specifying the reference it matches)?
// if ( params.pmdtools_reference_mask) {
// MASK_REFERENCE_BY_BED()
// }

ch_pmdtools_input = ch_input_for_damage_manipulation
.multiMap {
ignore_me, meta, bam, bai, ref_meta, fasta ->
bam: [ meta, bam, bai ]
fasta: fasta
}
ch_masking_prep = ch_pmd_masking
.combine( ch_fasta, by: 0 ) // [ [meta], masked_fasta, bed, fasta ]
.branch {
meta, masked_fasta, bed, fasta ->
alreadymasked: masked_fasta != ""
tobemasked: masked_fasta == "" && bed != ""
nomasking: masked_fasta == "" && bed == ""
}

ch_masking_input = ch_masking_prep.tobemasked
.multiMap{
meta, masked_fasta, bed, fasta ->
bed: [ meta, bed ]
fasta: fasta
}

BEDTOOLS_MASKFASTA( ch_masking_input.bed, ch_masking_input.fasta )
ch_masking_output = BEDTOOLS_MASKFASTA.out.fasta

ch_already_masked = ch_masking_prep.alreadymasked
.map {
meta, masked_fasta, bed, fasta ->
[ meta, masked_fasta ]
}

ch_no_masking = ch_masking_prep.nomasking
.map {
meta, masked_fasta, bed, fasta ->
[ meta, fasta ]
}

ch_pmd_fastas = ch_masking_output.concat( ch_already_masked, ch_no_masking )
scarlhoff marked this conversation as resolved.
Show resolved Hide resolved
.map {
// Prepend a new meta that contains the meta.id value as the new_meta.reference attribute
WorkflowEager.addNewMetaFromAttributes( it, "id" , "reference" , false )
}

ch_pmdtools_input = ch_bam_bai
.map { WorkflowEager.addNewMetaFromAttributes( it, "reference" , "reference" , false ) }
.combine( ch_pmd_fastas, by: 0 ) // [ [combine_meta], [meta], bam, bai, [ref_meta] fasta ]
.multiMap {
combine_meta, meta, bam, bai, ref_meta, fasta ->
bam: [ meta, bam, bai ]
fasta: fasta
}

PMDTOOLS_FILTER( ch_pmdtools_input.bam, params.damage_manipulation_pmdtools_threshold, ch_pmdtools_input.fasta )
ch_versions = ch_versions.mix( PMDTOOLS_FILTER.out.versions.first() )
Expand Down
Loading