
Faster filter #215

Merged
merged 19 commits from faster-filter into master on Mar 30, 2016

Conversation

averagehat
Contributor

Closes #204
This changes ngs_filter to:

  • Use generators rather than lists
  • Do the filtering only once, rather than filtering by index and then by Ns (which requires iterating twice)
  • Change all calls to print to log.info or log.debug
  • Have the call to multiprocessing.Pool use the thread count from config
  • If dropNs=false and indexqualitymin=0, just symlink the data

In theory this should be significantly faster, but it still looks at every sequence and index sequence, and Biopython still loads and converts quality scores, etc. Further optimization would be possible by skipping that work (a sketch of the single-pass approach follows below).
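
For illustration, a minimal sketch of the single-pass, generator-based filter described above; the function name, arguments, and the exact N/index-quality checks are hypothetical stand-ins, not the actual ngs_filter code:

    import logging
    from itertools import repeat

    log = logging.getLogger(__name__)

    def filter_reads(reads, drop_ns=True, index_quality_min=0, index_quals=None):
        # reads: iterable of Biopython SeqRecord objects (e.g. from SeqIO.parse)
        # index_quals: optional parallel iterable of per-read index quality lists
        if index_quals is None:
            index_quals = repeat(None)
        total = kept = 0
        for read, idx_qual in zip(reads, index_quals):
            total += 1
            has_n = drop_ns and 'N' in str(read.seq).upper()
            index_is_bad = (index_quality_min > 0 and idx_qual is not None
                            and min(idx_qual) < index_quality_min)
            if has_n or index_is_bad:
                log.debug("dropping %s", read.id)
                continue
            kept += 1
            yield read
        log.info("kept %s of %s reads", kept, total)

Fed from Bio.SeqIO.parse and written back out with Bio.SeqIO.write, the whole pipeline stays lazy and each record is examined exactly once.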

@necrolyte2
Member

So I don't get how it does the diff, but this branch is 3 commits ahead and 61 commits behind master.
This scares me a bit because GitHub is not showing us that inside the PR.

If you switch to the branch on the code page on GitHub, it shows that.

@averagehat
Contributor Author

Yeah, weird. I'll look into this tomorrow.

@averagehat
Contributor Author

Apparently my faster code makes the tests run too slow? lol https://travis-ci.org/VDBWRAIR/ngs_mapper/builds/118045722#L2242

edit: Indeed, my "faster" code was written in a delusional state, apparently. It's excessively slow, and not lazy at all because of the use of reduce.

dropRead = hasN or indexIsBad
total += 1
if not dropRead:
    keptReads = chain([read], keptReads)
Contributor Author

It's possible this use of itertools.chain is slow.

edit: it's very slow.
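
A plausible explanation, for context: each kept read wraps the previous iterator in a new chain object, so the final iterator ends up n layers deep and every element has to climb back through all of them, roughly O(n^2) work overall (and it yields reads in reverse order). A flat generator avoids that. Toy comparison, where should_drop is a hypothetical stand-in for the hasN/indexIsBad checks:

    from itertools import chain

    def should_drop(read):
        return 'N' in read  # hypothetical stand-in for the real checks

    reads = ['ACGT', 'ACNT', 'GGGG'] * 1000  # toy data

    # Pattern from the diff above: builds a tower of nested chain objects.
    kept_reads = iter([])
    for read in reads:
        if not should_drop(read):
            kept_reads = chain([read], kept_reads)

    # Flat alternative: one generator, constant overhead per read, still lazy.
    def kept(reads):
        for read in reads:
            if not should_drop(read):
                yield read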

@averagehat
Contributor Author

Assuming the second build passes, this is ready for review. If not, I'll fix it up Monday.

@necrolyte2
Member

  • Change all calls to print to log.info or log.debug
  • The call to multiprocessing.Pool should utilize threads from config
  • How hard would it be to detect that no filter arguments were specified and just symlink and return? (see the sketch below)
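
A rough sketch of what that early exit could look like; readsdir, outdir, and the function name are hypothetical, not the actual ngs_filter interface:

    import os

    def symlink_if_no_filtering(readsdir, outdir, drop_ns, index_quality_min):
        # If neither filter is requested, mirror the inputs with symlinks
        # and skip reading/parsing the files entirely.
        if drop_ns or index_quality_min > 0:
            return False  # caller should run the real filter
        for name in os.listdir(readsdir):
            os.symlink(os.path.join(os.path.abspath(readsdir), name),
                       os.path.join(outdir, name))
        return True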

@averagehat
Contributor Author

Weird unrelated test failure:

https://travis-ci.org/VDBWRAIR/ngs_mapper/builds/119042228#L2233-L2242

Note that slow_tests pass anyway

@averagehat
Contributor Author

Wow that failure was literally a fluke: https://travis-ci.org/VDBWRAIR/ngs_mapper/builds/119050624#L2233-L2235

@@ -104,14 +104,25 @@ def is_valid(fn):
     msg= "Skipped files %s that were not within chosen platforms %s" % ( plat_files, platforms)
     if not files:
         raise ValueError("No fastq or sff files found in directory %s" % readsdir + '\n' + msg)
-    if parallel:
+    if threads > 1:
Member

This looks like it would work fine, but I'm wondering why you took this approach instead of just pool = multiprocessing.Pool(threads)

Contributor Author

I didn't know about that option
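
For reference, a minimal sketch of the simpler form suggested above; filter_one_file and read_files are placeholders, not the actual ngs_filter names:

    import multiprocessing

    def filter_one_file(path):
        return path  # placeholder worker; the real job would filter `path`

    def filter_all(read_files, threads):
        # Size the pool directly from the configured thread count;
        # Pool(1) still works, it just runs everything in one worker process.
        pool = multiprocessing.Pool(threads)
        try:
            return pool.map(filter_one_file, read_files)
        finally:
            pool.close()
            pool.join()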

@necrolyte2 necrolyte2 merged commit 3677d30 into master Mar 30, 2016
This was referenced Apr 11, 2016
@necrolyte2 necrolyte2 deleted the faster-filter branch April 14, 2016 17:18