Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use read pairing information in kneaddata_bowtie2_discordant_pairs #16

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

wbazant
Copy link

@wbazant wbazant commented Feb 17, 2021

I had a problem with kneaddata_bowtie2_discordant_pairs taking increasingly more memory based on the input size. I've tracked it down to this script having to re-order the alignment information, for which it was storing IDs in a set.

I have reimplemented the script to avoid this step, by letting it assume the two paired read files are indeed paired, and making bowtie2 runs separately for each mate with --reorder. The output is then more easily interpreted by iterating through the results.

The changed script no longer works for out of order paired reads. It accommodates one mate of the pair being truncated, and there's enough smartness in the script to check that its assumption is holding true, but not more. I'm not sure how controversial it is - if kneaddata ever worked for that case, or if the previous step of Trimmomatic will fix it - and if there is any reason for a bioinformatics tool to support for out of order paired reads, but I want to explicitly note it because it's I guess a drawback.

which avoids the problem of having to pair up the reads afterwards
so the memory usage doesn't depend on the input size

Add more logging, turned on after the --verbose option (if not verbose, print only the counts, as before)

Add --bypass-trf for unit tests that are meant to be bowtie2 only
@wbazant
Copy link
Author

wbazant commented Feb 17, 2021

How the script sorts inputs into outputs or really anything major was not intended to be changed. For the sake of being thorough, here are behaviour changes that come with the change:

  1. different order of reads in the outputs (i.e. it is no longer a pseudorandom order)
  2. different order of creation times in the outputs, when one runs ls -l it looks
  3. I've added --verbose to the API which turns on extra logging. The default is to log just the summary counts but I see now that I've changed the format, I'll happily change it back if you want:
# previous
02/08/2021 06:58:33 PM - kneaddata.utilities - DEBUG: b'13578393 reads; of these:\n  13578393 (100.00%) were unpaired; of these:\n    13469836 (99.20%) aligned 0 times\n    16602 (0.12%) aligned exactly 1 time\n    91955 (0.68%) aligned >1 times\n0.80% overall alignment rate\npair1_aligned : 48537\npair2_aligned : 48537\npair1_unaligned : 6363994\npair2_unaligned : 6363994\norphan1_aligned : 8875\norphan2_aligned : 2608\norphan1_unaligned : 581249\norphan2_unaligned : 160599\n'

# new
02/16/2021 04:46:29 PM - kneaddata.utilities - DEBUG: b'2021-02-16 16:45:57,331 kneaddata_bowtie2_discordant_pairs - Paired reads: 444231 both aligned, 5969951 both unaligned, 0 only pair1 aligned, 0 only pair2 aligned\n2021-02-16 16:46:24,551 kneaddata_bowtie2_discordant_pairs - Orphan reads: 59896 aligned, 690133 unaligned\n'

@wbazant wbazant changed the title Use read pairing information in kneaddata_discordant_pairs Use read pairing information in kneaddata_bowtie2_discordant_pairs Feb 18, 2021
@wbazant
Copy link
Author

wbazant commented Jul 27, 2021

A note about this fork: it errors out when given paired-end fastqs where the reads don't match between mates, with an error message like
ValueError: sam files do not match on line 5: found IDs C5D92ACXX141105:3:1101:10398:4098/1 and C5D92ACXX141105:3:1101:10374:27723/2.

I actually quite like this behaviour, and exclude those from my pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants