-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add krakenuniq summary kmer filter #998
base: master
Are you sure you want to change the base?
Conversation
…ort_filter update read counts and percentages in metagenomics.py::krakenuniq_report_filter to reflect nodes removed based on filtering
Fun, you did it! I think the whole time I was thinking of this little task, I always figured that the only way I'd really understand how it worked and whether it was working properly was to stare at how it handled a small suite of unit tests on really simple/stripped down inputs demonstrating the various edge cases. Something like:
Not sure if that's quite the right set of test cases so feel free to rethink it. But can you add unit tests? Also: likely for a separate PR, but it'd be nice to add an optional param to provide a specific exclusion list of taxids (which would always remove those nodes including any lower ranking nodes beneath it.. and then recompute/sum upwards). |
Yup, I'll add unit tests—just wanted to get this open to avoid duplication of effort in case this was on anyone else's agenda. |
add new test type: assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0
add test class TestKrakenUniqSummaryFilter, with first test case: test_unchanged_report
self._test_report("should_have_leaves_trimmed.txt") | ||
|
||
#def test_higher_node_trimmed_after_resumming(self): | ||
# self._test_report("should_have_higher_node_trimmed_after_resumming.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything wrong with these tests at the moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing wrong--they're just not implemented yet (have to make the synthetic input/output).
test/__init__.py
Outdated
@@ -111,6 +114,47 @@ def inputs(self, *fnames): | |||
'''Return the full filenames for files in the test input directory for this test class''' | |||
return [self.input(fname) for fname in fnames] | |||
|
|||
def assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh man -- this test function will definitely come in handy
extend assertApproxEqualValuesInDelimitedFiles() test to use dicts in the case where a header line of field names is available so value comparisons can be made independent of column order. This is only used if use_first_processed_line_for_fieldnames=True. is_number() moved to util.misc; zip_dicts() added to util.misc
Don't mind if I pull this into viral-classify? |
@yesimon please do -- but on a separate branch/pr from the refactor one (which will take longer to vet) |
Looks like this never got moved over to viral-classify; shall we? |
To address broadinstitute/viral-classify#1, this adds a new command,
krakenuniq_report_filter
, tometagenomics.py
:The behavior of this command is such that using a depth-first traversal, the lowest rows in the report have the value in the field specified by
--fieldToFilterOn
(default:uniq_kmers
) zeroed out if their value is below the threshold given by--keepAboveN
(default:100
). Under the assumption that higher taxonomic levels have cumulative values including contributions from the zeroed-out rows, the values of the selected field in higher levels are reduced by the amount contained within lower-levels that were below the specified threshold (the subtraction is propagated up the tree to the root node). Since the traversal is depth-first, the higher levels are eventually re-evaluated to see if they no longer meet the threshold after being subjected to subtraction of their lower levels.The hierarchy of rows is read based on the indentation of the
taxName
column since many rows do not have a formal taxonomic rank assigned (i.e. their rank is "no rank
").Secondarily, the fields specified by
--fieldToAdjust
(default:num_reads
) are similarly adjusted using the conditional threshold established by--keepAboveN
and--fieldToFilterOn
: their values are subtracted similarly, with propagation up the tree.After adjustment to these counts across the entire tree, the part-of-whole percentages for the rows are reflected to reflect the new read counts. The resulting tree is then written out to a new KrakenUniq-format report, filtered to include only those rows meeting the initial threshold criterion.