Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize the results directory #67

Merged
merged 26 commits into from
Dec 14, 2023
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
fb759f5
Disable all publishDir directives. Add back the publishing of 'keep' …
matthuska Nov 15, 2023
df2497e
Place fastq files that map to 'keep' genome and to the host/control/o…
matthuska Nov 16, 2023
f739f8b
Re-publish multiqc report to the qc/ directory (by default)
matthuska Nov 16, 2023
b2ec355
Only publish sorted.bam{,.bai} intermediate bam files
matthuska Nov 16, 2023
d91f94f
Publish idxstats and flagstats so they are next to their bam files
matthuska Nov 16, 2023
82b60c0
Handle --strict_dcs and --min_clip output in results dir
matthuska Nov 17, 2023
fc1318b
First steps towards making illumina/bbduk results dir look right
matthuska Nov 17, 2023
f03c8a5
Add bed_samtools container bump minimap2 container (from main)
matthuska Nov 19, 2023
619c205
--keep is now working with bbduk. results/ is properly populated.
matthuska Nov 21, 2023
4611647
Add seqkit conda environment file
matthuska Nov 21, 2023
e5e5659
Output fasta when input is fasta
matthuska Nov 21, 2023
753f97c
Add --no_intermediate argument to skip publishing intermediate files …
matthuska Nov 22, 2023
bb499ac
Add singularity/ dir to .gitignore
matthuska Nov 22, 2023
14840fb
Tweak filter_fastq_by_name to work properly, avoid reading and writin…
matthuska Nov 22, 2023
bb074a7
Replace scary zcat/paste/grep/tr with seqkit grep
matthuska Nov 22, 2023
40276ea
Add clarifying comments to the filter_fastq_by_name process
matthuska Nov 22, 2023
dc59d28
Remove overwrite: false from publishDir directives
matthuska Nov 22, 2023
39bd52e
fix stub touch
MarieLataretu Dec 10, 2023
d29a5ef
added seqkit to citations
MarieLataretu Dec 10, 2023
d2ae118
Add -keep to help mssg. Reformated --help a bit. Removed old profiles…
Dec 10, 2023
95ce2ee
adjusted the README a bit
Dec 10, 2023
975e05a
change sed command for --keep and add empty BED channel for phix control
Dec 10, 2023
814e37b
Publish host genome and index for loading into IGV. Skipped if interm…
matthuska Dec 11, 2023
5d87ac4
Switch to seqkit for check_own and concat_contamination
matthuska Dec 11, 2023
4228c4b
bump seqkit container
Dec 11, 2023
26bd1d0
Update fastq_from_bam to publish illumina data properly. Regexes are …
matthuska Dec 13, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
43 changes: 43 additions & 0 deletions README.md
100755 → 100644
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,49 @@ Included in this repository are:

<sub><sub>The icons and diagram components that make up the schematic view were originally designed by James A. Fellow Yates & nf-core under a CCO license (public domain).</sub></sub>

## Results

Running the pipeline will create a directory called `results/` in the current directory with some or all of the following directories and files:

```text
results/
clean/
<sample_name>.fastq.gz
removed/
<sample_name>.fastq.gz
intermedate/
to-remove/
mapped.fastq.gz
unmapped.fastq.gz
mapped.bam
unmapped.bam
dcs-strict/
no-dcs.bam
true-dcs.bam
false-dcs.bam
soft-clipped/
soft-clipped.bam
passed-clipped.bam
to-keep/
mapped.fastq.gz
unmapped.fastq.gz
mapped.bam
unmapped.bam
dcs-strict/
no-dcs.bam
true-dcs.bam
false-dcs.bam
soft-clipped/
soft-clipped.bam
passed-clipped.bam
logs/\*.html
qc/multiqc_report.html
```

The most important files you are likely interested in are `results/clean/<sample_name>.fastq.gz`, which are the "cleaned" reads. These are the input reads that *do not* map to the host, control, own fasta or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the `--keep` option. Any files that were removed from your input fasta file are placed in `results/removed/<sample_name>.fastq.gz`.

For debugging purposes we also provide various intermediate results in the `intermediate/` folder.

## Citations

If you use `CLEAN` in your work, please consider citing our preprint:
Expand Down
5 changes: 3 additions & 2 deletions configs/container.config
Original file line number Diff line number Diff line change
@@ -1,9 +1,10 @@
process {
withLabel: smallTask { container = 'nanozoo/samtools:1.14--d8fb865' }
withLabel: minimap2 { container = 'nanozoo/minimap2:2.26--ef54a1d' }
withLabel: minimap2 { container = 'nanozoo/minimap2:2.26--d9ef6b6' }
withLabel: bbmap { container = 'nanozoo/bbmap:38.79--8e915d7' }
withLabel: multiqc { container = 'nanozoo/multiqc:1.9--aba729b' }
withLabel: fastqc { container = 'nanozoo/fastqc:0.11.9--f61b8b4' }
withLabel: nanoplot { container = 'nanozoo/nanoplot:1.32.0--1ae6f5d' }
withLabel: quast { container = 'nanozoo/quast:5.0.2--e7f0cfe' }
}
withLabel: bed_samtools { container = 'nanozoo/bed_samtools:2.30.0--cc7d1b9' }
}
116 changes: 92 additions & 24 deletions modules/alignment_processing.nf
Original file line number Diff line number Diff line change
Expand Up @@ -50,68 +50,113 @@ process filter_soft_clipped_alignments {
label 'samclipy'
label 'smallTask'

publishDir "${params.output}/${params.tool}", mode: params.publish_dir_mode, pattern: "*.bam*"
publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "${name}*.bam{,.bai}",
overwrite: false,
matthuska marked this conversation as resolved.
Show resolved Hide resolved
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/soft-clipped/${fn.replaceAll(~'^keep_', '')}" : "map-to-remove/soft-clipped/${fn}"
}
)

input:
tuple val(name), path (bam)
val (minClip)

output:
tuple val(name), val('unmapped'), path ('*.softclipped.bam'), emit: bam_clipped
tuple val(name), val('mapped'), path ('*.passedclipped.bam'), emit: bam_ok_clipped
tuple val(name), val('unmapped'), path ('*.soft-clipped.bam'), emit: bam_clipped
tuple val(name), val('mapped'), path ('*.passed-clipped.bam'), emit: bam_ok_clipped
tuple val(name), path ('*.bam.bai')

script:
"""
git clone https://github.com/MarieLataretu/samclipy.git --branch v0.0.2 || git clone [email protected]:MarieLataretu/samclipy.git --branch v0.0.2
samtools view -h ${bam} | python samclipy/samclipy.py --invert --minClip ${minClip} | samtools sort > ${name}.softclipped.bam
samtools view -h ${bam} | python samclipy/samclipy.py --minClip ${minClip} | samtools sort > ${name}.passedclipped.bam
samtools index ${name}.softclipped.bam
samtools index ${name}.passedclipped.bam
samtools view -h ${bam} | python samclipy/samclipy.py --invert --minClip ${minClip} | samtools sort > ${name}.soft-clipped.bam
samtools view -h ${bam} | python samclipy/samclipy.py --minClip ${minClip} | samtools sort > ${name}.passed-clipped.bam
samtools index ${name}.soft-clipped.bam
samtools index ${name}.passed-clipped.bam
"""
stub:
"""
touch ${name}.softclipped.bam ${name}.passedclipped.bam ${name}.softclipped.bam.bai ${name}.passedclipped.bam.bai
touch ${name}.soft-clipped.bam ${name}.passed-clipped.bam ${name}.soft-clipped.bam.bai ${name}.passed-clipped.bam.bai
"""
}

process filter_true_dcs_alignments {
label 'bed_samtools'

publishDir "${params.output}/${params.tool}", mode: params.publish_dir_mode, pattern: "*.bam*"
publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "${name}*.bam{,.bai}",
overwrite: false,
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/strict-dcs/${fn.replaceAll(~'^keep_', '')}" : "map-to-remove/strict-dcs/${fn}"
}
)

input:
tuple val(name), path (bam)
path (dcs_ends_bed)

output:
tuple val(name), val('mapped'), path ("${name}.no_dcs.bam"), emit: no_dcs
tuple val(name), val('mapped'), path ("${name}.true_dcs.bam"), emit: true_dcs
tuple val(name), val('unmapped'), path ("${name}.false_dcs.bam"), emit: false_dcs
tuple val(name), val('mapped'), path ("${name}.no-dcs.bam"), emit: no_dcs
tuple val(name), val('mapped'), path ("${name}.true-dcs.bam"), emit: true_dcs
tuple val(name), val('unmapped'), path ("${name}.false-dcs.bam"), emit: false_dcs
tuple val(name), path ('*.bam.bai')
path('dcs.bam')

script:
"""
# true spike in: 1-65 || 1-92; 3513-3560 (len 48)
samtools view -b -h -e 'rname=="Lambda_3.6kb"' ${bam} > dcs.bam
samtools view -b -h -e 'rname!="Lambda_3.6kb"' ${bam} > ${name}.no_dcs.bam
bedtools intersect -wa -ubam -a dcs.bam -b ${dcs_ends_bed} > ${name}.true_dcs.bam
bedtools intersect -v -ubam -a dcs.bam -b ${dcs_ends_bed} > ${name}.false_dcs.bam
samtools view -b -h -e 'rname!="Lambda_3.6kb"' ${bam} > ${name}.no-dcs.bam
bedtools intersect -wa -ubam -a dcs.bam -b ${dcs_ends_bed} > ${name}.true-dcs.bam
bedtools intersect -v -ubam -a dcs.bam -b ${dcs_ends_bed} > ${name}.false-dcs.bam
samtools index dcs.bam
samtools index ${name}.no_dcs.bam
samtools index ${name}.true_dcs.bam
samtools index ${name}.false_dcs.bam
samtools index ${name}.no-dcs.bam
samtools index ${name}.true-dcs.bam
samtools index ${name}.false-dcs.bam
"""
stub:
"""
touch ${name}.no_dcs.bam ${name}.true_dcs.bam ${name}.false_dcs.bam ${name}.no_dcs.bam.bai ${name}.true_dcs.bam.bai ${name}.false_dcs.bam.bai
touch ${name}.no-dcs.bam ${name}.true-dcs.bam ${name}.false-dcs.bam ${name}.no-dcs.bam.bai ${name}.true-dcs.bam.bai ${name}.false-dcs.bam.bai
"""
}

process fastq_from_bam {
label 'minimap2'
publishDir "${params.output}/${params.tool}/${name}", mode: 'copy', pattern: "*.gz"

publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.gz",
overwrite: false,
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/${fn.replaceAll(~'^keep_', '').replaceAll(~'_merged', '')}" : "map-to-remove/${fn.replaceAll(~'_merged', '')}"
}
)

// When using --keep, cleaned fastq files are generated by the
// filter_fastq_by_name process and not here
if ( !params.keep ) {
publishDir (
path: params.output,
overwrite: false,
mode: params.publish_dir_mode,
pattern: "*.gz",
saveAs: { fn ->
fn.endsWith('.unmapped.fastq.gz') ? "clean/${fn}".replaceAll(~'.unmapped.fastq.gz$', '.fastq.gz') :
fn.endsWith('.mapped.fastq.gz') ? "removed/${fn}".replaceAll(~'.mapped.fastq.gz$', '.fastq.gz') :
fn.endsWith('.unmapped_merged.fastq.gz') ? "clean/${fn}".replaceAll(~'.unmapped_merged.fastq.gz$', '.fastq.gz') :
fn.endsWith('.mapped_merged.fastq.gz') ? "removed/${fn}".replaceAll(~'.mapped_merged.fastq.gz$', '.fastq.gz') :
fn.endsWith('.unmapped_merged_merged.fastq.gz') ? "clean/${fn}".replaceAll(~'.unmapped_merged_merged.fastq.gz$', '.fastq.gz') :
fn.endsWith('.soft-clipped_merged.fastq.gz') ? "clean/${fn}".replaceAll(~'.soft-clipped_merged.fastq.gz$', '.fastq.gz') :
fn
}
)
}

input:
tuple val(name), val(type), path(bam)
Expand Down Expand Up @@ -144,7 +189,15 @@ process fastq_from_bam {
process idxstats_from_bam {
label 'minimap2'

publishDir "${params.output}/minimap2/${name}", mode: 'copy', pattern: "${bam.baseName}.idxstats.tsv"
publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.sorted.idxstats.tsv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See flagstats_from_bam

overwrite: false,
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/${fn.replaceAll(~'^keep_', '')}" : "map-to-remove/${fn}"
}
)

input:
tuple val(name), val(type), path(bam), path(bai)
Expand All @@ -165,7 +218,15 @@ process idxstats_from_bam {
process flagstats_from_bam {
label 'minimap2'

publishDir "${params.output}/minimap2", mode: params.publish_dir_mode, pattern: "${bam.baseName}.flagstats.txt"
publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.sorted.flagstats.txt",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stumbled over the discrepancy between the output pattern and the publishDir pattern:

In output we have *.flagstats.txt, which is fine - whatever BAM comes in, the flagstats.txt is the output.

In publishDir it's *.sorted.flagstats.txt. Is there a reason behind it? Would it make sense to change to *.flagstats.txt?

I think, currently, only sorted BAMs go into the process so that it might have no effect 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, because as you said only sorted bams are published. I don't see a point in even having unsorted bams if they all need to be sorted in the next step, but I didn't want to change too much in this branch. Honest question: am I missing something? Is there a situation where the stats could be different between the sorted and unsorted bams?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that having unsorted bams published makes no sense.

So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.

Currently in the pipeline we don't have a situation where this is needed though, right? I think that when the day comes where we need that functionality then we can revisit how the pipeline decides what is published. I'm also open to suggestions though, if you have a good general solution.

overwrite: false,
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/${fn.replaceAll(~'^keep_', '')}" : "map-to-remove/${fn}"
}
)

input:
tuple val(name), val(type), path(bam), path(bai)
Expand Down Expand Up @@ -206,8 +267,15 @@ process sort_bam {
process index_bam {
label 'minimap2'

publishDir "${params.output}/${params.tool}", mode: params.publish_dir_mode, pattern: "*.bam"
publishDir "${params.output}/${params.tool}", mode: params.publish_dir_mode, pattern: "*.bam.bai"
publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.sorted.bam{,.bai}",
overwrite: false,
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/${fn.replaceAll(~'^keep_', '')}" : "map-to-remove/${fn}"
}
)

input:
tuple val(name), val(type), path(bam)
Expand Down
35 changes: 29 additions & 6 deletions modules/bbmap.nf
Original file line number Diff line number Diff line change
@@ -1,7 +1,30 @@
process bbduk {
label 'bbmap'

publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.gz",
overwrite: false,
saveAs: { fn ->
fn.startsWith("keep_") ? "map-to-keep/${fn.replaceAll(~'^keep_', '').replaceAll(~'_merged', '')}" : "map-to-remove/${fn.replaceAll(~'_merged', '')}"
}
)

publishDir "${params.output}/${params.tool}/${name}", mode: 'copy', pattern: "*.gz"
//if ( !params.keep ) {
if ( true ) {
publishDir (
path: params.output,
overwrite: false,
mode: params.publish_dir_mode,
pattern: "*.gz",
saveAs: { fn ->
fn.endsWith('.clean.fastq.gz') ? "clean/${fn}".replaceAll(~'.fastq.clean.fastq.gz$', '.fastq.gz') :
fn.endsWith('.contamination.fastq.gz') ? "removed/${fn}".replaceAll(~'.fastq.contamination.fastq.gz$', '.fastq.gz') :
fn
}
)
}

input:
tuple val(name), path(reads)
Expand All @@ -11,21 +34,21 @@ process bbduk {
val name, emit: name
tuple val(name), val('clean'), path('*clean.fastq.gz'), emit: cleaned_reads
tuple val(name), val('contamination'), path('*contamination.fastq.gz'), emit: contaminated_reads
tuple val(name), path("${name}_bbduk_stats.txt"), emit: stats
tuple val(name), path("${name}.bbduk_stats.txt"), emit: stats

script:
if ( params.lib_pairedness == 'paired' ) {
"""
MEM=\$(echo ${task.memory} | sed 's/ GB//g')
bbduk.sh -Xmx\${MEM}g ref=${db} threads=${task.cpus} stats=${name}_bbduk_stats.txt ordered=t k=${params.bbduk_kmer} in=${reads[0]} in2=${reads[1]} out=${reads[0].baseName}.clean.fastq out2=${reads[1].baseName}.clean.fastq outm=${reads[0].baseName}.contamination.fastq outm2=${reads[1].baseName}.contamination.fastq
bbduk.sh -Xmx\${MEM}g ref=${db} threads=${task.cpus} stats=${name}.bbduk_stats.txt ordered=t k=${params.bbduk_kmer} in=${reads[0]} in2=${reads[1]} out=${reads[0].baseName}.clean.fastq out2=${reads[1].baseName}.clean.fastq outm=${reads[0].baseName}.contamination.fastq outm2=${reads[1].baseName}.contamination.fastq

gzip --no-name *.clean.fastq
gzip --no-name *.contamination.fastq
"""
} else if ( params.lib_pairedness == 'single' ) {
"""
MEM=\$(echo ${task.memory} | sed 's/ GB//g')
bbduk.sh -Xmx\${MEM}g ref=${db} threads=${task.cpus} stats=${name}_bbduk_stats.txt ordered=t k=${params.bbduk_kmer} in=${reads} out=${name}.clean.fastq outm=${name}.contamination.fastq
bbduk.sh -Xmx\${MEM}g ref=${db} threads=${task.cpus} stats=${name}.bbduk_stats.txt ordered=t k=${params.bbduk_kmer} in=${reads} out=${name}.clean.fastq outm=${name}.contamination.fastq

gzip --no-name *.clean.fastq
gzip --no-name *.contamination.fastq
Expand All @@ -36,12 +59,12 @@ process bbduk {
stub:
if ( params.lib_pairedness == 'paired' ) {
"""
touch ${name}_bbduk_stats.txt
touch ${name}.bbduk_stats.txt
touch ${reads[0].baseName}.clean.fastq.gz ${reads[0].baseName}.contamination.fastq.gz ${reads[1].baseName}.clean.fastq.gz ${reads[1].baseName}.contamination.fastq.gz
"""
} else if ( params.lib_pairedness == 'single' ) {
"""
touch ${name}_bbduk_stats.txt
touch ${name}.bbduk_stats.txt
touch ${name}.contamination.fastq.gz ${name}.clean.fastq.gz
"""
}
Expand Down
5 changes: 1 addition & 4 deletions modules/prepare_contamination.nf
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ process download_host {
label 'minimap2'

if (params.cloudProcess) {
publishDir "${params.databases}/hosts", mode: params.publish_dir_mode, pattern: "*.fa.gz"
publishDir "${params.databases}/hosts", mode: params.publish_dir_mode, pattern: "*.fa.gz"
}
else {
storeDir "${params.databases}/hosts"
Expand Down Expand Up @@ -78,9 +78,6 @@ process check_own {
process concat_contamination {
label 'minimap2'

publishDir "${params.output}/", mode: params.publish_dir_mode, pattern: "db.fa.gz"
publishDir "${params.output}/", mode: params.publish_dir_mode, pattern: "db.fa.fai"
matthuska marked this conversation as resolved.
Show resolved Hide resolved

input:
path fastas

Expand Down
2 changes: 1 addition & 1 deletion modules/unused/alignment_processing.nf
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ process filter_un_mapped_alignments {
process make_mapped_bam {
label 'minimap2'

publishDir "${params.output}/${name}/${params.tool}", mode: params.publish_dir_mode, pattern: "*.mapped.bam*"
publishDir "${params.output}/${name}/${params.tool}", mode: params.publish_dir_mode, pattern: "*.mapped.bam*", enabled: false

input:
tuple val(name), path(sam), path(reads)
Expand Down
Loading