Reorganize the results directory #67

matthuska · 2023-11-15T15:35:11Z

Reorganize the results/ directory to make it easier to understand. The current goal is:

results/
	clean/
		<sample_name>.fastq.gz
	removed/
		<sample_name>.fastq.gz
	intermedate/
		to-remove/
			mapped.fastq.gz
			unmapped.fastq.gz
			mapped.bam
			unmapped.bam
			dcs-strict/
				no-dcs.bam
				true-dcs.bam
				false-dcs.bam
			soft-clipped/
				soft-clipped.bam
				passed-clipped.bam
		to-keep/
			mapped.fastq.gz
			unmapped.fastq.gz
			mapped.bam
			unmapped.bam
			dcs-strict/
				no-dcs.bam
				true-dcs.bam
				false-dcs.bam
			soft-clipped/
				soft-clipped.bam
				passed-clipped.bam
	logs/\*.html
	qc/multiqc_report.html

(intermediate files will be prefixed with the sample name as well)

The initial PR disables all publishDir directives and only adds back publishing of clean files when the --keep directive is used, renaming files as required. I'll implement the rest of the file publishing in later commits on this same branch/PR.

…output for now, to new dir with new name.

…wn/rrna 'remove' genomes in appropriate intermediate folders

matthuska · 2023-11-16T12:26:44Z

Current status:

results/
├── clean
│   ├── ERR10008447.fastq.gz
│   └── ERR10035961.fastq.gz
├── intermediate
│   ├── map-to-keep
│   │   ├── ERR10008447.mapped.fastq.gz
│   │   ├── ERR10008447.unmapped.fastq.gz
│   │   ├── ERR10035961.mapped.fastq.gz
│   │   └── ERR10035961.unmapped.fastq.gz
│   └── map-to-remove
│       ├── ERR10008447.mapped.fastq.gz
│       ├── ERR10008447.unmapped.fastq.gz
│       ├── ERR10035961.mapped.fastq.gz
│       └── ERR10035961.unmapped.fastq.gz
├── logs
│   ├── execution_report_2023-11-16_13-08-20.html
│   └── execution_timeline_2023-11-16_13-08-20.html
└── removed
    ├── ERR10008447.fastq.gz
    └── ERR10035961.fastq.gz

7 directories, 14 files

matthuska · 2023-11-17T13:48:34Z

Still need to check if Illumina data (single and paired end) is organized properly, but for now ONT data seems to be fine.

Current output:

results
├── clean
│   ├── ERR10008447.fastq.gz
│   └── ERR10035961.fastq.gz
├── intermediate
│   ├── map-to-keep
│   │   ├── ERR10008447.mapped.fastq.gz
│   │   ├── ERR10008447.sorted.bam
│   │   ├── ERR10008447.sorted.bam.bai
│   │   ├── ERR10008447.sorted.flagstats.txt
│   │   ├── ERR10008447.sorted.idxstats.tsv
│   │   ├── ERR10008447.unmapped.fastq.gz
│   │   ├── ERR10035961.mapped.fastq.gz
│   │   ├── ERR10035961.sorted.bam
│   │   ├── ERR10035961.sorted.bam.bai
│   │   ├── ERR10035961.sorted.flagstats.txt
│   │   ├── ERR10035961.sorted.idxstats.tsv
│   │   ├── ERR10035961.unmapped.fastq.gz
│   │   ├── soft-clipped
│   │   │   ├── ERR10008447.passed-clipped.bam
│   │   │   ├── ERR10008447.passed-clipped.bam.bai
│   │   │   ├── ERR10008447.soft-clipped.bam
│   │   │   ├── ERR10008447.soft-clipped.bam.bai
│   │   │   ├── ERR10035961.passed-clipped.bam
│   │   │   ├── ERR10035961.passed-clipped.bam.bai
│   │   │   ├── ERR10035961.soft-clipped.bam
│   │   │   └── ERR10035961.soft-clipped.bam.bai
│   │   └── strict-dcs
│   │       ├── ERR10035961.false-dcs.bam
│   │       ├── ERR10035961.false-dcs.bam.bai
│   │       ├── ERR10035961.no-dcs.bam
│   │       ├── ERR10035961.no-dcs.bam.bai
│   │       ├── ERR10035961.true-dcs.bam
│   │       └── ERR10035961.true-dcs.bam.bai
│   └── map-to-remove
│       ├── ERR10008447.mapped.fastq.gz
│       ├── ERR10008447.sorted.bam
│       ├── ERR10008447.sorted.bam.bai
│       ├── ERR10008447.sorted.flagstats.txt
│       ├── ERR10008447.sorted.idxstats.tsv
│       ├── ERR10008447.unmapped.fastq.gz
│       ├── ERR10035961.mapped.fastq.gz
│       ├── ERR10035961.sorted.bam
│       ├── ERR10035961.sorted.bam.bai
│       ├── ERR10035961.sorted.flagstats.txt
│       ├── ERR10035961.sorted.idxstats.tsv
│       ├── ERR10035961.unmapped.fastq.gz
│       ├── soft-clipped
│       │   ├── ERR10008447.passed-clipped.bam
│       │   ├── ERR10008447.passed-clipped.bam.bai
│       │   ├── ERR10008447.soft-clipped.bam
│       │   ├── ERR10008447.soft-clipped.bam.bai
│       │   ├── ERR10035961.passed-clipped.bam
│       │   ├── ERR10035961.passed-clipped.bam.bai
│       │   ├── ERR10035961.soft-clipped.bam
│       │   └── ERR10035961.soft-clipped.bam.bai
│       └── strict-dcs
│           ├── ERR10035961.false-dcs.bam
│           ├── ERR10035961.false-dcs.bam.bai
│           ├── ERR10035961.no-dcs.bam
│           ├── ERR10035961.no-dcs.bam.bai
│           ├── ERR10035961.true-dcs.bam
│           └── ERR10035961.true-dcs.bam.bai
├── logs
│   ├── execution_report_2023-11-17_14-40-24.html
│   └── execution_timeline_2023-11-17_14-40-24.html
├── qc
│   └── multiqc_report.html
└── removed
    ├── ERR10008447.fastq.gz
    └── ERR10035961.fastq.gz

12 directories, 59 files

modules/alignment_processing.nf

Also remove dead code, skip creation of uncompressed fast[aq] file by letting samtools directly write a gzipped file. Remove whitespace at end of some lines.

…to results/

…g to the same file in a pipe

- output is expecting the suffix .sorted.bam

… for ARA and EBI.

modules/prepare_contamination.nf

MarieLataretu · 2023-12-10T10:30:36Z

modules/alignment_processing.nf

+  publishDir (
+    path: "${params.output}/intermediate",
+    mode: params.publish_dir_mode,
+    pattern: "*.sorted.flagstats.txt",


I stumbled over the discrepancy between the output pattern and the publishDir pattern:

In output we have *.flagstats.txt, which is fine - whatever BAM comes in, the flagstats.txt is the output.

In publishDir it's *.sorted.flagstats.txt. Is there a reason behind it? Would it make sense to change to *.flagstats.txt?

I think, currently, only sorted BAMs go into the process so that it might have no effect 🤔

This is intentional, because as you said only sorted bams are published. I don't see a point in even having unsorted bams if they all need to be sorted in the next step, but I didn't want to change too much in this branch. Honest question: am I missing something? Is there a situation where the stats could be different between the sorted and unsorted bams?

I agree that having unsorted bams published makes no sense.

So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.

So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.

Currently in the pipeline we don't have a situation where this is needed though, right? I think that when the day comes where we need that functionality then we can revisit how the pipeline decides what is published. I'm also open to suggestions though, if you have a good general solution.

MarieLataretu · 2023-12-10T10:30:57Z

modules/alignment_processing.nf

+  publishDir (
+    path: "${params.output}/intermediate",
+    mode: params.publish_dir_mode,
+    pattern: "*.sorted.idxstats.tsv",


See flagstats_from_bam

hoelzer · 2023-12-10T16:39:28Z

I have a problem with --keep:

nextflow run clean.nf --input_type nano --input test-data/SARSCoV2-nanopore.fastq.gz --host hsa --control dcs -profile local,docker --keep test-data/wuhan.fasta

results in

ERROR ~ Error executing process > 'prepare_keep (1)'

Caused by:
  Process `prepare_keep (1)` terminated with an error exit status (4)

Command executed:

  # -L for following a symbolic link
  if ! ( file -L wuhan.fasta | grep -q 'BGZF; gzip compatible\|gzip compressed' ); then
    sed -i -e '$a\' wuhan.fasta
    bgzip -@ 8 < wuhan.fasta > wuhan.fasta.gz
    # now wuhan.fasta'.gz'
  else
    mv wuhan.fasta wuhan.fasta.tmp
    zcat wuhan.fasta.tmp | sed -e '$a\' | bgzip -@ 8 -c > wuhan.fasta.gz
  fi

Command exit status:
  4

Command output:
  (empty)

Command error:
  .command.sh: line 3: file: command not found
  sed: couldn't open temporary file ./sedUnb1iU: Permission denied

which I think is a problem with this line:

https://github.com/hoelzer/clean/pull/67/files#diff-4d4d28fbcc3862963d2c27d9710b1069e779fdeed9defb8b25d57d5bfe9e1acdL64

I am not exactly sure what's happening here? You are checking if the file is gz compressed and if not, you are doing an in-place sed on the linked file... which should not work, or? This would also change the original file?

Would it not be better to do a

    sed -e '\$a\\' ${fasta} > ${fasta}.tmp
    bgzip -@ ${task.cpus} < ${fasta}.tmp > ${fasta}.gz

instead of the current code?

    sed -i -e '\$a\\' ${fasta}
    bgzip -@ ${task.cpus} < ${fasta} > ${fasta}.gz

If I change that, this works for me.

Note: I am testing on a Macbook Air M2... where sed might do strange stuff. On the other hand, I am using the minimap2 Docker container which includes the Linux sed

hoelzer · 2023-12-10T21:28:35Z

Another thing when testing Illumina data:

nextflow run clean.nf --input_type illumina --input 'test-data/SARSCoV2-illumina.R{1,2}.fastq.gz' --host hsa --control phix  -profile local,docker

results in

ERROR ~ No such variable: nanoControlBedChannel

 -- Check script 'clean.nf' at line: 197 or see '.nextflow.log' file for more details

Ah maybe there is just a nanoControlBedChannel = [] missing in

// load control fasta sequence
if ( params.control ) {
  if ( 'phix' in params.control.split(',') ) {
    illuminaControlFastaChannel = Channel.fromPath( workflow.projectDir + '/data/controls/phix.fa.gz' , checkIfExists: true )
  } else { illuminaControlFastaChannel = Channel.empty() }
...

EDIT: yep. I commit the change.

hoelzer · 2023-12-10T21:34:15Z

Attention: I also submitted now my change to the sed command which was not working otherwise.

hoelzer · 2023-12-10T21:56:28Z

hm, but now there is another problem w/ Illumina after I fixed the empty BED channel thing:

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa --control phix  -profile local,docker

executor >  local (3)
[skipped  ] process > prepare_contamination:prepare_auto_host:download_host (1) [100%] 1 of 1, stored: 1 ✔
[21/73e43d] process > prepare_contamination:concat_contamination                [100%] 1 of 1 ✔
[9f/7bc61a] process > clean:minimap2 (1)                                        [100%] 1 of 1, failed: 1 ✘
[-        ] process > clean:sort_bam                                            -
[-        ] process > clean:index_bam                                           -
[-        ] process > clean:idxstats_from_bam                                   -
[-        ] process > clean:flagstats_from_bam                                  -
[-        ] process > clean:split_bam                                           -
[-        ] process > clean:index_bam2                                          -
[-        ] process > clean:fastq_from_bam                                      -
[18/01ff03] process > qc:fastqc (1)                                             [100%] 1 of 1 ✔
[-        ] process > qc:multiqc                                                [  0%] 0 of 1
ERROR ~ Error executing process > 'clean:minimap2 (1)'

Caused by:
  Process `clean:minimap2 (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::118.418*1.88] collected minimizers
  [main_samview] fail to read the header from "-".

Maybe we have to move some of these errors as issues to finalize this reorganization PR.

…ediate files are skipped.

matthuska · 2023-12-11T15:49:41Z

@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.

hoelzer · 2023-12-11T16:09:29Z

@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.

ok thx!

I am also not sure why

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa -profile local,docker

ERROR ~ Error executing process > 'clean:minimap2 (1)'

Caused by:
  Process `clean:minimap2 (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::117.876*1.85] collected minimizers
  [main_samview] fail to read the header from "-".

is happening. But looking into this right now

hoelzer · 2023-12-11T16:33:25Z

alright, forget this problem. I think this is a RAM issue. Everything fine when I am choosing a smaller --host such as eco

hoelzer · 2023-12-11T16:46:51Z

ok, I also checked the Illumina branch. It works for me, but the output still needs the same structuring as the ONT branch. Currently, I get for the command:

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker

❯ tree results
results
├── illumina.mapped_1.fastq.gz
├── illumina.mapped_2.fastq.gz
├── illumina.mapped_singleton.fastq.gz
├── illumina.unmapped_1.fastq.gz
├── illumina.unmapped_2.fastq.gz
├── illumina.unmapped_singleton.fastq.gz
├── intermediate
│   ├── host.fa.fai
│   ├── host.fa.gz
│   └── map-to-remove
│       ├── illumina.mapped_1.fastq.gz
│       ├── illumina.mapped_2.fastq.gz
│       ├── illumina.mapped_singleton.fastq.gz
│       ├── illumina.sorted.bam
│       ├── illumina.sorted.bam.bai
│       ├── illumina.sorted.flagstats.txt
│       ├── illumina.sorted.idxstats.tsv
│       ├── illumina.unmapped_1.fastq.gz
│       ├── illumina.unmapped_2.fastq.gz
│       └── illumina.unmapped_singleton.fastq.gz
├── logs
│   ├── execution_report_2023-12-11_17-33-43.html
│   └── execution_timeline_2023-12-11_17-33-43.html
└── qc
    └── multiqc_report.html

so the clean and removed folders are missing and the output still has the potentially misleading mapped/unmapped labeling. Can this be easily changed? thx!

matthuska · 2023-12-11T16:47:54Z

Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.

hoelzer · 2023-12-11T16:49:39Z

Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.

thx! Besides, all looks good from my side. (and I can then also test the seqkit replacement for the sed command, let me know if the container needs to be updated as well in that context... ah but probably you can just switch from minimap2 to the seqkit label using the container that is already in the config)

matthuska · 2023-12-11T16:53:05Z

I had to add samtools and tabix (for bgzip) to the seqkit conda env, since all of those tools are necessary to create the block gzipped and indexed reference genome you want. You'll see the changes in the env file, and then I guess you'll need to create a matching container.

hoelzer · 2023-12-11T19:21:14Z

Switch to seqkit for check_own and concat_contamination

Alright thx! Yes that looks much cleaner ;)

I generated a matching container for seqkit and the dependencies and also tested it.

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker

works for me. I push the updated config container file.

hoelzer · 2023-12-11T19:22:53Z

Now, from my pov, currently only the correct Illumina output (folders for clean and removed) is missing

…geting crazy, need to rethink this.

matthuska added 2 commits November 15, 2023 16:30

Disable all publishDir directives. Add back the publishing of 'keep' …

fb759f5

…output for now, to new dir with new name.

Place fastq files that map to 'keep' genome and to the host/control/o…

df2497e

…wn/rrna 'remove' genomes in appropriate intermediate folders

matthuska added 3 commits November 16, 2023 13:54

Re-publish multiqc report to the qc/ directory (by default)

f739f8b

Only publish sorted.bam{,.bai} intermediate bam files

b2ec355

Publish idxstats and flagstats so they are next to their bam files

d91f94f

matthuska changed the title ~~Disable all publishDir directives. Add back the publishing of 'keep' …~~ Reorganize the results directory Nov 16, 2023

Handle --strict_dcs and --min_clip output in results dir

82b60c0

matthuska added 2 commits November 17, 2023 17:41

First steps towards making illumina/bbduk results dir look right

fc1318b

Add bed_samtools container bump minimap2 container (from main)

f03c8a5

MarieLataretu reviewed Nov 20, 2023

View reviewed changes

modules/alignment_processing.nf Outdated Show resolved Hide resolved

matthuska added 9 commits November 21, 2023 14:03

--keep is now working with bbduk. results/ is properly populated.

619c205

Add seqkit conda environment file

4611647

Output fasta when input is fasta

e5e5659

Also remove dead code, skip creation of uncompressed fast[aq] file by letting samtools directly write a gzipped file. Remove whitespace at end of some lines.

Add --no_intermediate argument to skip publishing intermediate files …

753f97c

…to results/

Add singularity/ dir to .gitignore

bb499ac

Tweak filter_fastq_by_name to work properly, avoid reading and writin…

14840fb

…g to the same file in a pipe

Replace scary zcat/paste/grep/tr with seqkit grep

bb074a7

Add clarifying comments to the filter_fastq_by_name process

40276ea

Remove overwrite: false from publishDir directives

dc59d28

matthuska marked this pull request as ready for review November 27, 2023 09:50

MarieLataretu and others added 4 commits December 10, 2023 11:03

fix stub touch

39bd52e

- output is expecting the suffix .sorted.bam

added seqkit to citations

d29a5ef

Add -keep to help mssg. Reformated --help a bit. Removed old profiles…

d2ae118

… for ARA and EBI.

adjusted the README a bit

95ce2ee

MarieLataretu reviewed Dec 10, 2023

View reviewed changes

change sed command for --keep and add empty BED channel for phix control

975e05a

Publish host genome and index for loading into IGV. Skipped if interm…

814e37b

…ediate files are skipped.

Switch to seqkit for check_own and concat_contamination

5d87ac4

bump seqkit container

4228c4b

Update fastq_from_bam to publish illumina data properly. Regexes are …

26bd1d0

…geting crazy, need to rethink this.

MarieLataretu approved these changes Dec 14, 2023

View reviewed changes

matthuska merged commit c22c7c1 into MarieLataretu/issue55 Dec 14, 2023

MarieLataretu deleted the feature/reorganize-results branch January 4, 2024 11:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reorganize the results directory #67

Reorganize the results directory #67

matthuska commented Nov 15, 2023

matthuska commented Nov 16, 2023

matthuska commented Nov 17, 2023

MarieLataretu Dec 10, 2023

matthuska Dec 11, 2023

MarieLataretu Dec 13, 2023

matthuska Dec 13, 2023

MarieLataretu Dec 10, 2023

hoelzer commented Dec 10, 2023

hoelzer commented Dec 10, 2023 •

edited

Loading

hoelzer commented Dec 10, 2023

hoelzer commented Dec 10, 2023

matthuska commented Dec 11, 2023 •

edited

Loading

hoelzer commented Dec 11, 2023 •

edited

Loading

hoelzer commented Dec 11, 2023

hoelzer commented Dec 11, 2023

matthuska commented Dec 11, 2023

hoelzer commented Dec 11, 2023

matthuska commented Dec 11, 2023

hoelzer commented Dec 11, 2023

hoelzer commented Dec 11, 2023

Reorganize the results directory #67

Reorganize the results directory #67

Conversation

matthuska commented Nov 15, 2023

matthuska commented Nov 16, 2023

matthuska commented Nov 17, 2023

MarieLataretu Dec 10, 2023

Choose a reason for hiding this comment

matthuska Dec 11, 2023

Choose a reason for hiding this comment

MarieLataretu Dec 13, 2023

Choose a reason for hiding this comment

matthuska Dec 13, 2023

Choose a reason for hiding this comment

MarieLataretu Dec 10, 2023

Choose a reason for hiding this comment

hoelzer commented Dec 10, 2023

hoelzer commented Dec 10, 2023 • edited Loading

hoelzer commented Dec 10, 2023

hoelzer commented Dec 10, 2023

matthuska commented Dec 11, 2023 • edited Loading

hoelzer commented Dec 11, 2023 • edited Loading

hoelzer commented Dec 11, 2023

hoelzer commented Dec 11, 2023

matthuska commented Dec 11, 2023

hoelzer commented Dec 11, 2023

matthuska commented Dec 11, 2023

hoelzer commented Dec 11, 2023

hoelzer commented Dec 11, 2023

hoelzer commented Dec 10, 2023 •

edited

Loading

matthuska commented Dec 11, 2023 •

edited

Loading

hoelzer commented Dec 11, 2023 •

edited

Loading