Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fastq from unmapped reads #189

Merged
merged 12 commits into from
Apr 10, 2019
Merged

Fastq from unmapped reads #189

merged 12 commits into from
Apr 10, 2019

Conversation

maxibor
Copy link
Member

@maxibor maxibor commented Apr 9, 2019

To answer #187
Comes with a python script (using pysam) to recreate fastq files from the unmapped reads.

However this is using Python3 and the Python in the conda env is currently 2.7 :(

@apeltzer apeltzer added the WIP Work in progress label Apr 9, 2019
@maxibor maxibor requested a review from apeltzer April 9, 2019 13:06
@apeltzer
Copy link
Member

apeltzer commented Apr 9, 2019

But the current code already does this?

eager/main.nf

Lines 846 to 898 in bc55df3

process samtools_filter {
tag "$prefix"
publishDir "${params.outdir}/samtools/filter", mode: 'copy',
saveAs: {filename ->
if (filename.indexOf(".fq.gz") > 0) "unmapped/$filename"
else if (filename.indexOf(".unmapped.bam") > 0) "unmapped/$filename"
else if (filename.indexOf(".filtered.bam")) filename
else null
}
input:
file bam from ch_mapped_reads_filter.mix(ch_mapped_reads_filter_cm,ch_bwamem_mapped_reads_filter)
output:
file "*filtered.bam" into ch_bam_filtered_qualimap, ch_bam_filtered_dedup, ch_bam_filtered_markdup, ch_bam_filtered_pmdtools, ch_bam_filtered_angsd, ch_bam_filtered_gatk
file "*.fastq.gz" optional true
file "*.unmapped.bam" optional true
file "*.{bai,csi}"
script:
prefix="$bam" - ~/(\.bam)?/
size = "${params.large_ref}" ? '-c' : ''
if("${params.bam_discard_unmapped}" && "${params.bam_unmapped_type}" == "discard"){
"""
samtools view -h -b $bam -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.filtered.bam
samtools index "${size}" ${prefix}.filtered.bam
"""
} else if("${params.bam_discard_unmapped}" && "${params.bam_unmapped_type}" == "bam"){
"""
samtools view -h $bam | tee >(samtools view - -@ ${task.cpus} -f4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.unmapped.bam) >(samtools view - -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.filtered.bam)
samtools index "${size}" ${prefix}.filtered.bam
"""
} else if("${params.bam_discard_unmapped}" && "${params.bam_unmapped_type}" == "fastq"){
"""
samtools view -h $bam | tee >(samtools view - -@ ${task.cpus} -f4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.unmapped.bam) >(samtools view - -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.filtered.bam)
samtools index "${size}" ${prefix}.filtered.bam
samtools fastq -tn ${prefix}.unmapped.bam | pigz -p ${task.cpus} > ${prefix}.unmapped.fastq.gz
rm ${prefix}.unmapped.bam
"""
} else if("${params.bam_discard_unmapped}" && "${params.bam_unmapped_type}" == "both"){
"""
samtools view -h $bam | tee >(samtools view - -@ ${task.cpus} -f4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.unmapped.bam) >(samtools view - -@ ${task.cpus} -F4 -q ${params.bam_mapping_quality_threshold} -o ${prefix}.filtered.bam)
samtools index "${size}" ${prefix}.filtered.bam
samtools fastq -tn ${prefix}.unmapped.bam | pigz -p ${task.cpus} > ${prefix}.unmapped.fastq.gz
"""
} else { //Only apply quality filtering, default
"""
samtools view -h -b $bam -@ ${task.cpus} -q ${params.bam_mapping_quality_threshold} -o ${prefix}.filtered.bam
samtools index "${size}" ${prefix}.filtered.bam
"""
}
}

It even extracts the unmapped data to either BAM, FastQ depending on the users choice. I think we need something else :-(

@jfy133
Copy link
Member

jfy133 commented Apr 9, 2019

@maxibor are these fastq reads pre Trimming and merging, but without human reads?

@maxibor
Copy link
Member Author

maxibor commented Apr 9, 2019

@maxibor are these fastq reads pre Trimming and merging, but without human reads?

in this PR, yes

@maxibor
Copy link
Member Author

maxibor commented Apr 9, 2019

It even extracts the unmapped data to either BAM, FastQ depending on the users choice. I think we need something else :-(

But this step works on post AR fastq files

@jfy133 jfy133 closed this Apr 9, 2019
@jfy133 jfy133 reopened this Apr 9, 2019
@jfy133
Copy link
Member

jfy133 commented Apr 9, 2019

Sorry - that close was courtesy of Maia

@jfy133
Copy link
Member

jfy133 commented Apr 9, 2019

I think you are good to go for testing once python version fixed.

Minor thing: the help message/description is unspecific of what actually is being output. Thus maybe Alex' confusion

@maxibor
Copy link
Member Author

maxibor commented Apr 9, 2019

Updated Conda env so Pysam should now come. But the test should fail because the Docker container isn't rebuilt as of now with the new Pysam dependancy

@maxibor maxibor requested a review from jfy133 April 9, 2019 15:13
Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Official review)

Looking more in depth (please correct me if I misunderstand anything).

Major

I can't comment on the python section. However,

  1. The BAM file as input into this tool should be immediately after BWA (sorted.bam) as it should include all possible mapped reads, not just ones that mapped exactly. i.e. if you use the reads post-samtools filter -q 37, the discarded mapped reads would not be filtered by the new module, because they are removed from the mapped BAM file after the samtools filter.

  2. Is L921 flipped? Shouldn't it be if (params.singleEnd) { - as you only indicate a single output file in that conditional block?

Minor

For the process name (L906), flag itself (L363, L218) and help message (L108), I suggest the following extra precision instead of -unmap:

--strip_input_fastq                       Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)

publishDir should renamed to something like /samtools/stripped_fastq rather than unmapped_fastq, as the latter file already exists with the --extract_unmapped functionality.

Equally, the output FASTQ names should be e.g. stripped.fwd.fq.gz and stripped.rev.fq.gz or something. I think R1/R2 would be dangerous for novices re-analyzing the data with the same pipeline, if they forget to add underscores (i.e. there are two R1/R2s in the name). This may then lead to funky input regex for a EAGER2 re-run and subsequent errors.

The actual names/flags can be discussed here - you don't have to go exactly with my 'stripped' suggestion.

@maxibor
Copy link
Member Author

maxibor commented Apr 10, 2019

Followed the suggestions of @jfy133
Also user can now choose the strip mode: Either stripping completely the mapped reads (--strip_mode strip) or just replace the sequence of the mapped reads by N (--strip_mode replace)

@jfy133
Copy link
Member

jfy133 commented Apr 10, 2019

@ivelsko owes you a beer ;)

@maxibor
Copy link
Member Author

maxibor commented Apr 10, 2019

@apeltzer Can you have a look and merge if ok ?
The tests are working with the updated docker image using the new conda env. Travis should be fixed after the docker image is rebuilt.

Copy link
Member

@apeltzer apeltzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good 👍

@apeltzer
Copy link
Member

Just failing because of missing python3.6 and pysam - merging to get the dev image updated 👍

@apeltzer apeltzer merged commit 4d2f062 into nf-core:dev Apr 10, 2019
@maxibor maxibor mentioned this pull request Apr 13, 2019
2 tasks
apeltzer added a commit that referenced this pull request Apr 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants