-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reorganize the results directory #67
Conversation
…output for now, to new dir with new name.
…wn/rrna 'remove' genomes in appropriate intermediate folders
Current status:
|
Still need to check if Illumina data (single and paired end) is organized properly, but for now ONT data seems to be fine. Current output:
|
Also remove dead code, skip creation of uncompressed fast[aq] file by letting samtools directly write a gzipped file. Remove whitespace at end of some lines.
…g to the same file in a pipe
- output is expecting the suffix .sorted.bam
… for ARA and EBI.
publishDir ( | ||
path: "${params.output}/intermediate", | ||
mode: params.publish_dir_mode, | ||
pattern: "*.sorted.flagstats.txt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stumbled over the discrepancy between the output
pattern and the publishDir
pattern:
In output
we have *.flagstats.txt
, which is fine - whatever BAM comes in, the flagstats.txt is the output.
In publishDir
it's *.sorted.flagstats.txt
. Is there a reason behind it? Would it make sense to change to *.flagstats.txt
?
I think, currently, only sorted BAMs go into the process so that it might have no effect 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional, because as you said only sorted bams are published. I don't see a point in even having unsorted bams if they all need to be sorted in the next step, but I didn't want to change too much in this branch. Honest question: am I missing something? Is there a situation where the stats could be different between the sorted and unsorted bams?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that having unsorted bams published makes no sense.
So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted
. A bam might be sorted, but might not have the suffix .sorted
in its name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I'm thinking of the general usage of this process and it would be unexpected if the result is only published when it has the suffix .sorted. A bam might be sorted, but might not have the suffix .sorted in its name.
Currently in the pipeline we don't have a situation where this is needed though, right? I think that when the day comes where we need that functionality then we can revisit how the pipeline decides what is published. I'm also open to suggestions though, if you have a good general solution.
publishDir ( | ||
path: "${params.output}/intermediate", | ||
mode: params.publish_dir_mode, | ||
pattern: "*.sorted.idxstats.tsv", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See flagstats_from_bam
I have a problem with nextflow run clean.nf --input_type nano --input test-data/SARSCoV2-nanopore.fastq.gz --host hsa --control dcs -profile local,docker --keep test-data/wuhan.fasta results in
which I think is a problem with this line: I am not exactly sure what's happening here? You are checking if the file is gz compressed and if not, you are doing an in-place Would it not be better to do a sed -e '\$a\\' ${fasta} > ${fasta}.tmp
bgzip -@ ${task.cpus} < ${fasta}.tmp > ${fasta}.gz instead of the current code? sed -i -e '\$a\\' ${fasta}
bgzip -@ ${task.cpus} < ${fasta} > ${fasta}.gz If I change that, this works for me. Note: I am testing on a Macbook Air M2... where |
Another thing when testing Illumina data: nextflow run clean.nf --input_type illumina --input 'test-data/SARSCoV2-illumina.R{1,2}.fastq.gz' --host hsa --control phix -profile local,docker results in
Ah maybe there is just a // load control fasta sequence
if ( params.control ) {
if ( 'phix' in params.control.split(',') ) {
illuminaControlFastaChannel = Channel.fromPath( workflow.projectDir + '/data/controls/phix.fa.gz' , checkIfExists: true )
} else { illuminaControlFastaChannel = Channel.empty() }
... EDIT: yep. I commit the change. |
Attention: I also submitted now my change to the |
hm, but now there is another problem w/ Illumina after I fixed the empty BED channel thing: nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa --control phix -profile local,docker executor > local (3)
[skipped ] process > prepare_contamination:prepare_auto_host:download_host (1) [100%] 1 of 1, stored: 1 ✔
[21/73e43d] process > prepare_contamination:concat_contamination [100%] 1 of 1 ✔
[9f/7bc61a] process > clean:minimap2 (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > clean:sort_bam -
[- ] process > clean:index_bam -
[- ] process > clean:idxstats_from_bam -
[- ] process > clean:flagstats_from_bam -
[- ] process > clean:split_bam -
[- ] process > clean:index_bam2 -
[- ] process > clean:fastq_from_bam -
[18/01ff03] process > qc:fastqc (1) [100%] 1 of 1 ✔
[- ] process > qc:multiqc [ 0%] 0 of 1
ERROR ~ Error executing process > 'clean:minimap2 (1)'
Caused by:
Process `clean:minimap2 (1)` terminated with an error exit status (1)
Command executed:
minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam
Command exit status:
1
Command output:
(empty)
Command error:
[M::mm_idx_gen::118.418*1.88] collected minimizers
[main_samview] fail to read the header from "-". Maybe we have to move some of these errors as issues to finalize this reorganization PR. |
…ediate files are skipped.
@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here. |
ok thx! I am also not sure why nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa -profile local,docker
ERROR ~ Error executing process > 'clean:minimap2 (1)'
Caused by:
Process `clean:minimap2 (1)` terminated with an error exit status (1)
Command executed:
minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam
Command exit status:
1
Command output:
(empty)
Command error:
[M::mm_idx_gen::117.876*1.85] collected minimizers
[main_samview] fail to read the header from "-". is happening. But looking into this right now |
alright, forget this problem. I think this is a RAM issue. Everything fine when I am choosing a smaller --host such as eco |
ok, I also checked the Illumina branch. It works for me, but the output still needs the same structuring as the ONT branch. Currently, I get for the command: nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker ❯ tree results
results
├── illumina.mapped_1.fastq.gz
├── illumina.mapped_2.fastq.gz
├── illumina.mapped_singleton.fastq.gz
├── illumina.unmapped_1.fastq.gz
├── illumina.unmapped_2.fastq.gz
├── illumina.unmapped_singleton.fastq.gz
├── intermediate
│ ├── host.fa.fai
│ ├── host.fa.gz
│ └── map-to-remove
│ ├── illumina.mapped_1.fastq.gz
│ ├── illumina.mapped_2.fastq.gz
│ ├── illumina.mapped_singleton.fastq.gz
│ ├── illumina.sorted.bam
│ ├── illumina.sorted.bam.bai
│ ├── illumina.sorted.flagstats.txt
│ ├── illumina.sorted.idxstats.tsv
│ ├── illumina.unmapped_1.fastq.gz
│ ├── illumina.unmapped_2.fastq.gz
│ └── illumina.unmapped_singleton.fastq.gz
├── logs
│ ├── execution_report_2023-12-11_17-33-43.html
│ └── execution_timeline_2023-12-11_17-33-43.html
└── qc
└── multiqc_report.html so the |
Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow. |
thx! Besides, all looks good from my side. (and I can then also test the |
I had to add samtools and tabix (for bgzip) to the seqkit conda env, since all of those tools are necessary to create the block gzipped and indexed reference genome you want. You'll see the changes in the env file, and then I guess you'll need to create a matching container. |
Alright thx! Yes that looks much cleaner ;) I generated a matching container for seqkit and the dependencies and also tested it. nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker works for me. I push the updated config container file. |
Now, from my pov, currently only the correct Illumina output (folders for clean and removed) is missing |
…geting crazy, need to rethink this.
Reorganize the
results/
directory to make it easier to understand. The current goal is:(intermediate files will be prefixed with the sample name as well)
The initial PR disables all publishDir directives and only adds back publishing of clean files when the
--keep
directive is used, renaming files as required. I'll implement the rest of the file publishing in later commits on this same branch/PR.