Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Round redistributed read floats #59

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

scwatts
Copy link

@scwatts scwatts commented Nov 20, 2018

Hi Jennifer and Florian,

Thank you for developing Bracken, I've found it very useful in my work and I think it's a great addition to the Kraken classifier. I've discovered what may be a small bug and I thought I'd contribute back to your project by making a minor pull request.

I noticed that Bracken represents redistributed reads as floating point numbers and during output are cast to integers without rounding. This can cause off-by-one errors for the reported counts of a taxa. The affect of this is amplified in the Kraken-format report whereby these off-by-one inaccuracies are summed through the taxonomic tree and can result in larger discrepancies between the number of reads assigned to a clade (inclusive of all children nodes) and the sum of read counts assigned to individual nodes of that clade. Below is an excerpt from a report generated with Bracken which is affected by the described issue:

7.24    3183985 0       G       838                       Prevotella
2.64    1162821 1162821 S       52227                       Prevotella dentalis
1.84    808570  808570  S       76123                       Prevotella enoeca
1.19    523928  523928  S       28129                       Prevotella denticola
0.49    216241  216241  S       652716                      Prevotella sp. oral taxon 299
0.37    164928  164928  S       839                         Prevotella ruminicola
0.48    208994  208994  S       28131                       Prevotella intermedia
0.11    49136   49136   S       589436                      Prevotella fusca
0.07    30195   30195   S       1177574                     Prevotella jejuni
0.02    9006    9006    S       28132                       Prevotella melaninogenica
0.02    10160   10160   S       589437                      Prevotella scopos

Here the read count assigned to the Prevotella clade (3,183,985) does not equal the sum of reads assigned at each of the children nodes (3,183,979).

This pull request patches src/est_abundance.py to fix this behaviour by rounding the redistributed read counts prior to writing out the default Bracken report and the Kraken-style output format.

Rather than casting redistributed read counts to integers during output
and ignoring the decimal, it seems preferable to round these floats
instead
@scwatts
Copy link
Author

scwatts commented Nov 21, 2018

I thought about this further over lunch I don't think this is a complete solution for the Kraken-style output format. Applying the suggested rounding will mitigate the accumulative off-by-one errors but there may still be discrepancies of -1/+1 between read count of a clade and the sum of reads assigned at each child node.

An alternative may be to perform rounding during estimation of true reads, prior to propagating these counts through the tree (i.e. rounding here https://github.com/jenniferlu717/Bracken/blob/master/src/est_abundance.py#L313). While this strictly preserves the child node counts and forces the parent clade to have consistent counts, the total assigned reads may be altered from the input. Another thought here could be to adjust the children node counts such that the clade and input total is preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant