Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter & trim #16

Merged
merged 38 commits into from
Feb 11, 2016
Merged

filter & trim #16

merged 38 commits into from
Feb 11, 2016

Conversation

averagehat
Copy link
Contributor

No description provided.

@averagehat averagehat changed the title added jip template filter & trim Feb 3, 2016
@averagehat
Copy link
Contributor Author

"By default, the filtering options discard or redirect the read pair if any of the two reads fulfill the criteria."
http://cutadapt.readthedocs.org/en/stable/guide.html#filtering-reads

@necrolyte2
Copy link
Member

so essentially that is saying that if any read is filtered that both reads will be taken out and put into the "filtered" file
That is, you will always have properly paired forward and reverse files for output

So then you have to somehow fetch the filtered out reads from some file?

@necrolyte2
Copy link
Member

I'm also trying to figure out why jip is not parsing the docopt correctly. it seems like it should work to do something like
You can see here that at least the library is working as expected

from jip.vendor.docopt import docopt
d=docopt('''\nUsage:\n  test (-j <j> | -x <x>)\n\nOptions:\n   -j <j>\n   -x <x>\n''', ['-j', '5', '-x', '6'])
print d
Usage:
  test (-j <j> | -x <x>)
d=docopt('''\nUsage:\n  test (-j <j> | -x <x>)\n\nOptions:\n   -j <j>\n   -x <x>\n''', ['-j', '5'])
print d
{'-j': '5',
 '-x': None}

@averagehat
Copy link
Contributor Author

from jinja2 import Template
print Template(open("templates/jip.jinja").read()).render(json.loads(open("templates/filter.json").read()))

@necrolyte2
Copy link
Member

So here you can see he does some crazy code where he parses the docstring
https://github.com/thasso/pyjip/blob/master/jip/options.py#L1384
I think he does all of this so special tags like stdin and stdout are created correctly and such

The downfall is that you lose the ability to use all the normal docopt features like mutually exclusive groups

@necrolyte2
Copy link
Member

thasso/pyjip#62

@necrolyte2
Copy link
Member

Well, we could investigate running jobs straight from template + json

python -c 'from jip.tools import ScriptTool; import json; from jinja2 import Template; tool_src=Template(open("templates/jip.jinja").read()).render(json.loads(open("templates/filter.json").read())); import sys; t=ScriptTool.from_string(tool_src); t.init(); import jip; jip.jobs.create_jobs(t, args=sys.argv[1:])
' -f f -r r -u u

@averagehat
Copy link
Contributor Author

On a related note, I would reccomend doing validation using either
schematic
or with json schemas [example(https://github.com/VDBWRAIR/bioframework/blob/8d3f95aa2567830492e8e997838ae36b53828b93/sim/schema.json)

json schemas have the advantage of being very clear and being just json but schematic is more flexible because you can use any python function (not unlike python contracts)

@necrolyte2
Copy link
Member

I just realized that in our template the Usage line is not printing the options from json file.

@necrolyte2
Copy link
Member

Discussion:

  • All stages accepting paired reads should accept interleave
  • The input arguments should be -s for "single read" and -f -r for "forward" and "reverse
  • Stages outputting fastq should support outputting interleave as well
  • All stages should output command they are running?

@necrolyte2
Copy link
Member

Just to document discussion:

  • Stages accepting reads should accept interleave only(unpaired as well)
    [<input>] [-p] would be the docopt string to allow for standard input
  • Output will be interleave as well

@necrolyte2
Copy link
Member

marcelm/cutadapt#174

@averagehat
Copy link
Contributor Author

I'm comfortable merging this

@necrolyte2
Copy link
Member

The jip tools don't conform to the
[<input>] [-p] schema yet

  • index_filter_quality.jip
  • interleaved_index_filter.jip

Should this PR represent the completion of filter_reads and trim_reads ?

@averagehat
Copy link
Contributor Author

This is ready for merging
NB: this is the io matrix, and something like that might be more generalized in Biotest

@necrolyte2
Copy link
Member

What is that iomatrix doing? Hard to tell by just reading it

@averagehat
Copy link
Contributor Author

yeah I will have to re-write it sometime. Basically it defines a matrix of choices like
[stream, file](where stream is stdin/stdout) as a hypothesis strategy. So in the test that uses it, that test will get a random combination of stream/real file for input/output.

Basically it does this by baking the options in to sh (equivalent to partial application) and ensuring the actual fastq strings are in whatever stream/file gets baked in for input. Unfortunately that means the make_io_matrix needs the sequences that at are actually generated by hypothesis in order to put them there. So make_io_matrix is actually flatmapped onto reads_and_indices. The strategy returns functions so that the actual I/O can be delayed (note that if you assume a condition that fails then these will get thrown away and the I/O time would've been wasted). The I/O functions are used in the test which receives that strategy to get the results.

This is not super simple but it is fairly generalized as it only replaces the input/output. The point is you can put whatever you want into the body of your test and have your test test whatever you want to test.

necrolyte2 added a commit that referenced this pull request Feb 11, 2016
@necrolyte2 necrolyte2 merged commit bec2f45 into dev Feb 11, 2016
@averagehat averagehat mentioned this pull request Feb 11, 2016
4 tasks
@necrolyte2 necrolyte2 deleted the filter-trim branch February 12, 2016 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants