Use read pairing information in kneaddata_bowtie2_discordant_pairs #16
base: master
- Use read pairing information, which avoids having to pair up the reads afterwards, so memory usage no longer depends on the input size
- Add more logging, turned on by the --verbose option (if not verbose, print only the counts, as before)
- Add --bypass-trf for unit tests that are meant to be bowtie2 only
The change was not intended to alter how the script sorts inputs into outputs, or anything else major. For the sake of being thorough, here are the behaviour changes that come with it:
A note about this fork: it exits with an error message when given paired-end fastqs where the reads don't match between mates. I actually quite like this behaviour, and exclude those inputs from my pipeline.
I had a problem with kneaddata_bowtie2_discordant_pairs using more and more memory as the input size grew. I've tracked it down to the script having to re-order the alignment information, for which it was storing IDs in a set.
I have reimplemented the script to avoid this step, by letting it assume the two paired read files are indeed paired, and running bowtie2 separately for each mate with --reorder. The output is then easier to interpret: the script just iterates through the two sets of results together.
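To illustrate the idea, here is a minimal sketch of walking two per-mate SAM outputs in lockstep. It is not the code from this PR: the function names and the count labels are hypothetical, it assumes bowtie2 emitted exactly one SAM line per read in input order (as with --reorder and default reporting), and unlike the actual script it simply raises on any length or ID mismatch.

```python
from itertools import zip_longest

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def sam_records(lines):
    """Yield (read_id, is_aligned) per alignment line, skipping '@' headers."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        qname, flag = line.split("\t")[:2]
        yield qname, not (int(flag) & UNMAPPED)


def strip_mate_suffix(read_id):
    """Drop a trailing /1 or /2 so that mate IDs can be compared."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def classify_pairs(sam1_lines, sam2_lines):
    """Count pairs by which mates aligned, holding one record per mate in memory."""
    counts = {"both": 0, "mate1_only": 0, "mate2_only": 0, "neither": 0}
    pairs = zip_longest(sam_records(sam1_lines), sam_records(sam2_lines))
    for n, (rec1, rec2) in enumerate(pairs, start=1):
        if rec1 is None or rec2 is None:
            raise ValueError(f"record {n}: one mate file has more reads than the other")
        (id1, hit1), (id2, hit2) = rec1, rec2
        # Verify the pairing assumption instead of re-pairing by ID afterwards.
        if strip_mate_suffix(id1) != strip_mate_suffix(id2):
            raise ValueError(f"record {n}: {id1!r} and {id2!r} are not mates")
        if hit1 and hit2:
            counts["both"] += 1
        elif hit1:
            counts["mate1_only"] += 1
        elif hit2:
            counts["mate2_only"] += 1
        else:
            counts["neither"] += 1
    return counts
```

Because both inputs are consumed as iterators, memory use stays constant regardless of how many reads there are; no set of IDs is ever built.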
The changed script no longer works for out-of-order paired reads. It accommodates one mate of the pair being truncated, and the script is smart enough to check that its assumption holds, but no more. I'm not sure how controversial this is (whether kneaddata ever worked for that case, or whether the previous Trimmomatic step will fix it), or whether there is any reason for a bioinformatics tool to support out-of-order paired reads, but I want to note it explicitly because I guess it's a drawback.
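For anyone who wants to screen inputs before running the pipeline, a check along these lines might help. It is only a sketch under stated assumptions: the function names are hypothetical, the FASTQ files are assumed to use plain 4-line records (no wrapped sequences), and mate IDs are assumed to differ at most by a trailing /1 or /2.

```python
from itertools import zip_longest


def fastq_ids(lines):
    """Yield the read ID from the header of every 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # Header looks like "@id optional description"; keep only the id.
            parts = line[1:].split()
            if parts:
                yield parts[0]


def strip_mate_suffix(read_id):
    """Drop a trailing /1 or /2 so that mate IDs can be compared."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def mates_in_same_order(fq1_lines, fq2_lines):
    """Return True if both files list the same reads in the same order."""
    for id1, id2 in zip_longest(fastq_ids(fq1_lines), fastq_ids(fq2_lines)):
        if id1 is None or id2 is None:
            return False  # one file has more records than the other
        if strip_mate_suffix(id1) != strip_mate_suffix(id2):
            return False
    return True
```

Passing open file handles (for example `open("sample_1.fastq")`, with a hypothetical filename) streams the files line by line, so this check also runs in constant memory.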