
Faster filter #215

Merged
merged 19 commits from faster-filter into master on Mar 30, 2016

Conversation

averagehat
Contributor

Closes #204
This changes ngs_filter to:

  • Use generators rather than lists
  • Do the filtering only once, rather than filtering by index and then by Ns (which requires iterating twice)
  • Change all calls to print to log.info or log.debug
  • Have the call to multiprocessing.Pool use the thread count from config
  • If dropNs=false and indexqualitymin=0, just symlink the data

In theory this should be significantly faster, but it still looks at every sequence and index sequence, and Biopython still loads and converts quality scores, etc. Further optimization would be possible by skipping that work (a sketch of the single-pass approach follows below).
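
For illustration, a minimal sketch of the single-pass, generator-based filter described above; the function name, arguments, and the exact N/index-quality checks are hypothetical stand-ins, not the actual ngs_filter code:

    import logging
    from itertools import repeat

    log = logging.getLogger(__name__)

    def filter_reads(reads, drop_ns=True, index_quality_min=0, index_quals=None):
        # reads: iterable of Biopython SeqRecord objects (e.g. from SeqIO.parse)
        # index_quals: optional parallel iterable of per-read index quality lists
        if index_quals is None:
            index_quals = repeat(None)
        total = kept = 0
        for read, idx_qual in zip(reads, index_quals):
            total += 1
            has_n = drop_ns and 'N' in str(read.seq).upper()
            index_is_bad = (index_quality_min > 0 and idx_qual is not None
                            and min(idx_qual) < index_quality_min)
            if has_n or index_is_bad:
                log.debug("dropping %s", read.id)
                continue
            kept += 1
            yield read
        log.info("kept %s of %s reads", kept, total)

Fed from Bio.SeqIO.parse and written back out with Bio.SeqIO.write, the whole pipeline stays lazy and each record is examined exactly once.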

@necrolyte2
Member

So I don't get how it does the diff, but this branch is 3 commits ahead and 61 commits behind master.
This scares me a bit because GitHub is not showing us that inside the PR.

If you switch to the branch on the code page on GitHub, it shows that.

@averagehat
Contributor Author

Yeah, weird. I'll look into this tomorrow.

@averagehat
Contributor Author

Apparently my faster code makes the tests run too slow? lol https://travis-ci.org/VDBWRAIR/ngs_mapper/builds/118045722#L2242

edit: Indeed, my "faster" code was written in a delusional state, apparently. It's excessively slow, and not lazy at all because of the use of reduce.

dropRead = hasN or indexIsBad
total += 1
if not dropRead:
    keptReads = chain([read], keptReads)
Contributor Author

It's possible this use of itertools.chain is slow.

edit: it's very slow.
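
A plausible explanation, for context: each kept read wraps the previous iterator in a new chain object, so the final iterator ends up n layers deep and every element has to climb back through all of them, roughly O(n^2) work overall (and it yields reads in reverse order). A flat generator avoids that. Toy comparison, where should_drop is a hypothetical stand-in for the hasN/indexIsBad checks:

    from itertools import chain

    def should_drop(read):
        return 'N' in read  # hypothetical stand-in for the real checks

    reads = ['ACGT', 'ACNT', 'GGGG'] * 1000  # toy data

    # Pattern from the diff above: builds a tower of nested chain objects.
    kept_reads = iter([])
    for read in reads:
        if not should_drop(read):
            kept_reads = chain([read], kept_reads)

    # Flat alternative: one generator, constant overhead per read, still lazy.
    def kept(reads):
        for read in reads:
            if not should_drop(read):
                yield read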

@averagehat
Contributor Author

Assuming the second build passes, this is ready for review. If not, I'll fix it up Monday.

@necrolyte2
Member

  • Change all calls to print to log.info or log.debug
  • The call to multiprocessing.Pool should utilize threads from config
  • How hard would it be to detect that no filter arguments were specified and just symlink and return? (see the sketch below)
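
A rough sketch of what that early exit could look like; readsdir, outdir, and the function name are hypothetical, not the actual ngs_filter interface:

    import os

    def symlink_if_no_filtering(readsdir, outdir, drop_ns, index_quality_min):
        # If neither filter is requested, mirror the inputs with symlinks
        # and skip reading/parsing the files entirely.
        if drop_ns or index_quality_min > 0:
            return False  # caller should run the real filter
        for name in os.listdir(readsdir):
            os.symlink(os.path.join(os.path.abspath(readsdir), name),
                       os.path.join(outdir, name))
        return True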

@averagehat
Contributor Author

Weird unrelated test failure:

https://travis-ci.org/VDBWRAIR/ngs_mapper/builds/119042228#L2233-L2242

Note that slow_tests pass anyway

@averagehat
Contributor Author

Wow that failure was literally a fluke: https://travis-ci.org/VDBWRAIR/ngs_mapper/builds/119050624#L2233-L2235

@@ -104,14 +104,25 @@ def is_valid(fn):
     msg= "Skipped files %s that were not within chosen platforms %s" % ( plat_files, platforms)
     if not files:
         raise ValueError("No fastq or sff files found in directory %s" % readsdir + '\n' + msg)
-    if parallel:
+    if threads > 1:
Member

This looks like it would work fine, but I'm wondering why you took this approach instead of just pool = multiprocessing.Pool(threads)

Contributor Author

I didn't know about that option
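
For reference, a minimal sketch of the simpler form suggested above; filter_one_file and read_files are placeholders, not the actual ngs_filter names:

    import multiprocessing

    def filter_one_file(path):
        return path  # placeholder worker; the real job would filter `path`

    def filter_all(read_files, threads):
        # Size the pool directly from the configured thread count;
        # Pool(1) still works, it just runs everything in one worker process.
        pool = multiprocessing.Pool(threads)
        try:
            return pool.map(filter_one_file, read_files)
        finally:
            pool.close()
            pool.join()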

@necrolyte2 necrolyte2 merged commit 3677d30 into master Mar 30, 2016
This was referenced Apr 11, 2016
@necrolyte2 necrolyte2 deleted the faster-filter branch April 14, 2016 17:18