Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize the results directory #67

Merged
merged 26 commits into from
Dec 14, 2023

Conversation

matthuska
Copy link
Collaborator

Reorganize the results/ directory to make it easier to understand. The current goal is:

results/
	clean/
		<sample_name>.fastq.gz
	removed/
		<sample_name>.fastq.gz
	intermedate/
		to-remove/
			mapped.fastq.gz
			unmapped.fastq.gz
			mapped.bam
			unmapped.bam
			dcs-strict/
				no-dcs.bam
				true-dcs.bam
				false-dcs.bam
			soft-clipped/
				soft-clipped.bam
				passed-clipped.bam
		to-keep/
			mapped.fastq.gz
			unmapped.fastq.gz
			mapped.bam
			unmapped.bam
			dcs-strict/
				no-dcs.bam
				true-dcs.bam
				false-dcs.bam
			soft-clipped/
				soft-clipped.bam
				passed-clipped.bam
	logs/\*.html
	qc/multiqc_report.html

(intermediate files will be prefixed with the sample name as well)

The initial PR disables all publishDir directives and only adds back publishing of clean files when the --keep directive is used, renaming files as required. I'll implement the rest of the file publishing in later commits on this same branch/PR.

…wn/rrna 'remove' genomes in appropriate intermediate folders
@matthuska
Copy link
Collaborator Author

Current status:

results/
├── clean
│   ├── ERR10008447.fastq.gz
│   └── ERR10035961.fastq.gz
├── intermediate
│   ├── map-to-keep
│   │   ├── ERR10008447.mapped.fastq.gz
│   │   ├── ERR10008447.unmapped.fastq.gz
│   │   ├── ERR10035961.mapped.fastq.gz
│   │   └── ERR10035961.unmapped.fastq.gz
│   └── map-to-remove
│       ├── ERR10008447.mapped.fastq.gz
│       ├── ERR10008447.unmapped.fastq.gz
│       ├── ERR10035961.mapped.fastq.gz
│       └── ERR10035961.unmapped.fastq.gz
├── logs
│   ├── execution_report_2023-11-16_13-08-20.html
│   └── execution_timeline_2023-11-16_13-08-20.html
└── removed
    ├── ERR10008447.fastq.gz
    └── ERR10035961.fastq.gz

7 directories, 14 files

@matthuska matthuska changed the title Disable all publishDir directives. Add back the publishing of 'keep' … Reorganize the results directory Nov 16, 2023
@matthuska
Copy link
Collaborator Author

Still need to check if Illumina data (single and paired end) is organized properly, but for now ONT data seems to be fine.

Current output:

results
├── clean
│   ├── ERR10008447.fastq.gz
│   └── ERR10035961.fastq.gz
├── intermediate
│   ├── map-to-keep
│   │   ├── ERR10008447.mapped.fastq.gz
│   │   ├── ERR10008447.sorted.bam
│   │   ├── ERR10008447.sorted.bam.bai
│   │   ├── ERR10008447.sorted.flagstats.txt
│   │   ├── ERR10008447.sorted.idxstats.tsv
│   │   ├── ERR10008447.unmapped.fastq.gz
│   │   ├── ERR10035961.mapped.fastq.gz
│   │   ├── ERR10035961.sorted.bam
│   │   ├── ERR10035961.sorted.bam.bai
│   │   ├── ERR10035961.sorted.flagstats.txt
│   │   ├── ERR10035961.sorted.idxstats.tsv
│   │   ├── ERR10035961.unmapped.fastq.gz
│   │   ├── soft-clipped
│   │   │   ├── ERR10008447.passed-clipped.bam
│   │   │   ├── ERR10008447.passed-clipped.bam.bai
│   │   │   ├── ERR10008447.soft-clipped.bam
│   │   │   ├── ERR10008447.soft-clipped.bam.bai
│   │   │   ├── ERR10035961.passed-clipped.bam
│   │   │   ├── ERR10035961.passed-clipped.bam.bai
│   │   │   ├── ERR10035961.soft-clipped.bam
│   │   │   └── ERR10035961.soft-clipped.bam.bai
│   │   └── strict-dcs
│   │       ├── ERR10035961.false-dcs.bam
│   │       ├── ERR10035961.false-dcs.bam.bai
│   │       ├── ERR10035961.no-dcs.bam
│   │       ├── ERR10035961.no-dcs.bam.bai
│   │       ├── ERR10035961.true-dcs.bam
│   │       └── ERR10035961.true-dcs.bam.bai
│   └── map-to-remove
│       ├── ERR10008447.mapped.fastq.gz
│       ├── ERR10008447.sorted.bam
│       ├── ERR10008447.sorted.bam.bai
│       ├── ERR10008447.sorted.flagstats.txt
│       ├── ERR10008447.sorted.idxstats.tsv
│       ├── ERR10008447.unmapped.fastq.gz
│       ├── ERR10035961.mapped.fastq.gz
│       ├── ERR10035961.sorted.bam
│       ├── ERR10035961.sorted.bam.bai
│       ├── ERR10035961.sorted.flagstats.txt
│       ├── ERR10035961.sorted.idxstats.tsv
│       ├── ERR10035961.unmapped.fastq.gz
│       ├── soft-clipped
│       │   ├── ERR10008447.passed-clipped.bam
│       │   ├── ERR10008447.passed-clipped.bam.bai
│       │   ├── ERR10008447.soft-clipped.bam
│       │   ├── ERR10008447.soft-clipped.bam.bai
│       │   ├── ERR10035961.passed-clipped.bam
│       │   ├── ERR10035961.passed-clipped.bam.bai
│       │   ├── ERR10035961.soft-clipped.bam
│       │   └── ERR10035961.soft-clipped.bam.bai
│       └── strict-dcs
│           ├── ERR10035961.false-dcs.bam
│           ├── ERR10035961.false-dcs.bam.bai
│           ├── ERR10035961.no-dcs.bam
│           ├── ERR10035961.no-dcs.bam.bai
│           ├── ERR10035961.true-dcs.bam
│           └── ERR10035961.true-dcs.bam.bai
├── logs
│   ├── execution_report_2023-11-17_14-40-24.html
│   └── execution_timeline_2023-11-17_14-40-24.html
├── qc
│   └── multiqc_report.html
└── removed
    ├── ERR10008447.fastq.gz
    └── ERR10035961.fastq.gz

12 directories, 59 files

@matthuska matthuska marked this pull request as ready for review November 27, 2023 09:50
modules/prepare_contamination.nf Show resolved Hide resolved
publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.sorted.flagstats.txt",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stumbled over the discrepancy between the output pattern and the publishDir pattern:

In output we have *.flagstats.txt, which is fine - whatever BAM comes in, the flagstats.txt is the output.

In publishDir it's *.sorted.flagstats.txt. Is there a reason behind it? Would it make sense to change to *.flagstats.txt?

I think, currently, only sorted BAMs go into the process so that it might have no effect 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, because as you said only sorted bams are published. I don't see a point in even having unsorted bams if they all need to be sorted in the next step, but I didn't want to change too much in this branch. Honest question: am I missing something? Is there a situation where the stats could be different between the sorted and unsorted bams?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that having unsorted bams published makes no sense.

So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.

Currently in the pipeline we don't have a situation where this is needed though, right? I think that when the day comes where we need that functionality then we can revisit how the pipeline decides what is published. I'm also open to suggestions though, if you have a good general solution.

publishDir (
path: "${params.output}/intermediate",
mode: params.publish_dir_mode,
pattern: "*.sorted.idxstats.tsv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See flagstats_from_bam

@hoelzer
Copy link
Member

hoelzer commented Dec 10, 2023

I have a problem with --keep:

nextflow run clean.nf --input_type nano --input test-data/SARSCoV2-nanopore.fastq.gz --host hsa --control dcs -profile local,docker --keep test-data/wuhan.fasta

results in

ERROR ~ Error executing process > 'prepare_keep (1)'

Caused by:
  Process `prepare_keep (1)` terminated with an error exit status (4)

Command executed:

  # -L for following a symbolic link
  if ! ( file -L wuhan.fasta | grep -q 'BGZF; gzip compatible\|gzip compressed' ); then
    sed -i -e '$a\' wuhan.fasta
    bgzip -@ 8 < wuhan.fasta > wuhan.fasta.gz
    # now wuhan.fasta'.gz'
  else
    mv wuhan.fasta wuhan.fasta.tmp
    zcat wuhan.fasta.tmp | sed -e '$a\' | bgzip -@ 8 -c > wuhan.fasta.gz
  fi

Command exit status:
  4

Command output:
  (empty)

Command error:
  .command.sh: line 3: file: command not found
  sed: couldn't open temporary file ./sedUnb1iU: Permission denied

which I think is a problem with this line:

https://github.com/hoelzer/clean/pull/67/files#diff-4d4d28fbcc3862963d2c27d9710b1069e779fdeed9defb8b25d57d5bfe9e1acdL64

I am not exactly sure what's happening here? You are checking if the file is gz compressed and if not, you are doing an in-place sed on the linked file... which should not work, or? This would also change the original file?

Would it not be better to do a

    sed -e '\$a\\' ${fasta} > ${fasta}.tmp
    bgzip -@ ${task.cpus} < ${fasta}.tmp > ${fasta}.gz

instead of the current code?

    sed -i -e '\$a\\' ${fasta}
    bgzip -@ ${task.cpus} < ${fasta} > ${fasta}.gz

If I change that, this works for me.

Note: I am testing on a Macbook Air M2... where sed might do strange stuff. On the other hand, I am using the minimap2 Docker container which includes the Linux sed

@hoelzer
Copy link
Member

hoelzer commented Dec 10, 2023

Another thing when testing Illumina data:

nextflow run clean.nf --input_type illumina --input 'test-data/SARSCoV2-illumina.R{1,2}.fastq.gz' --host hsa --control phix  -profile local,docker

results in

ERROR ~ No such variable: nanoControlBedChannel

 -- Check script 'clean.nf' at line: 197 or see '.nextflow.log' file for more details

Ah maybe there is just a nanoControlBedChannel = [] missing in

// load control fasta sequence
if ( params.control ) {
  if ( 'phix' in params.control.split(',') ) {
    illuminaControlFastaChannel = Channel.fromPath( workflow.projectDir + '/data/controls/phix.fa.gz' , checkIfExists: true )
  } else { illuminaControlFastaChannel = Channel.empty() }
...

EDIT: yep. I commit the change.

@hoelzer
Copy link
Member

hoelzer commented Dec 10, 2023

Attention: I also submitted now my change to the sed command which was not working otherwise.

@hoelzer
Copy link
Member

hoelzer commented Dec 10, 2023

hm, but now there is another problem w/ Illumina after I fixed the empty BED channel thing:

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa --control phix  -profile local,docker
executor >  local (3)
[skipped  ] process > prepare_contamination:prepare_auto_host:download_host (1) [100%] 1 of 1, stored: 1 ✔
[21/73e43d] process > prepare_contamination:concat_contamination                [100%] 1 of 1 ✔
[9f/7bc61a] process > clean:minimap2 (1)                                        [100%] 1 of 1, failed: 1 ✘
[-        ] process > clean:sort_bam                                            -
[-        ] process > clean:index_bam                                           -
[-        ] process > clean:idxstats_from_bam                                   -
[-        ] process > clean:flagstats_from_bam                                  -
[-        ] process > clean:split_bam                                           -
[-        ] process > clean:index_bam2                                          -
[-        ] process > clean:fastq_from_bam                                      -
[18/01ff03] process > qc:fastqc (1)                                             [100%] 1 of 1 ✔
[-        ] process > qc:multiqc                                                [  0%] 0 of 1
ERROR ~ Error executing process > 'clean:minimap2 (1)'

Caused by:
  Process `clean:minimap2 (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::118.418*1.88] collected minimizers
  [main_samview] fail to read the header from "-".

Maybe we have to move some of these errors as issues to finalize this reorganization PR.

@matthuska
Copy link
Collaborator Author

matthuska commented Dec 11, 2023

@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.

@hoelzer
Copy link
Member

hoelzer commented Dec 11, 2023

@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.

ok thx!

I am also not sure why

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa -profile local,docker

ERROR ~ Error executing process > 'clean:minimap2 (1)'

Caused by:
  Process `clean:minimap2 (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::117.876*1.85] collected minimizers
  [main_samview] fail to read the header from "-".

is happening. But looking into this right now

@hoelzer
Copy link
Member

hoelzer commented Dec 11, 2023

alright, forget this problem. I think this is a RAM issue. Everything fine when I am choosing a smaller --host such as eco

@hoelzer
Copy link
Member

hoelzer commented Dec 11, 2023

ok, I also checked the Illumina branch. It works for me, but the output still needs the same structuring as the ONT branch. Currently, I get for the command:

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker
❯ tree results
results
├── illumina.mapped_1.fastq.gz
├── illumina.mapped_2.fastq.gz
├── illumina.mapped_singleton.fastq.gz
├── illumina.unmapped_1.fastq.gz
├── illumina.unmapped_2.fastq.gz
├── illumina.unmapped_singleton.fastq.gz
├── intermediate
│   ├── host.fa.fai
│   ├── host.fa.gz
│   └── map-to-remove
│       ├── illumina.mapped_1.fastq.gz
│       ├── illumina.mapped_2.fastq.gz
│       ├── illumina.mapped_singleton.fastq.gz
│       ├── illumina.sorted.bam
│       ├── illumina.sorted.bam.bai
│       ├── illumina.sorted.flagstats.txt
│       ├── illumina.sorted.idxstats.tsv
│       ├── illumina.unmapped_1.fastq.gz
│       ├── illumina.unmapped_2.fastq.gz
│       └── illumina.unmapped_singleton.fastq.gz
├── logs
│   ├── execution_report_2023-12-11_17-33-43.html
│   └── execution_timeline_2023-12-11_17-33-43.html
└── qc
    └── multiqc_report.html

so the clean and removed folders are missing and the output still has the potentially misleading mapped/unmapped labeling. Can this be easily changed? thx!

@matthuska
Copy link
Collaborator Author

Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.

@hoelzer
Copy link
Member

hoelzer commented Dec 11, 2023

Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.

thx! Besides, all looks good from my side. (and I can then also test the seqkit replacement for the sed command, let me know if the container needs to be updated as well in that context... ah but probably you can just switch from minimap2 to the seqkit label using the container that is already in the config)

@matthuska
Copy link
Collaborator Author

I had to add samtools and tabix (for bgzip) to the seqkit conda env, since all of those tools are necessary to create the block gzipped and indexed reference genome you want. You'll see the changes in the env file, and then I guess you'll need to create a matching container.

@hoelzer
Copy link
Member

hoelzer commented Dec 11, 2023

Switch to seqkit for check_own and concat_contamination

Alright thx! Yes that looks much cleaner ;)

I generated a matching container for seqkit and the dependencies and also tested it.

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker

works for me. I push the updated config container file.

@hoelzer
Copy link
Member

hoelzer commented Dec 11, 2023

Now, from my pov, currently only the correct Illumina output (folders for clean and removed) is missing

@matthuska matthuska merged commit c22c7c1 into MarieLataretu/issue55 Dec 14, 2023
@MarieLataretu MarieLataretu deleted the feature/reorganize-results branch January 4, 2024 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants