Round redistributed read floats #59

scwatts · 2018-11-20T23:37:23Z

Hi Jennifer and Florian,

Thank you for developing Bracken, I've found it very useful in my work and I think it's a great addition to the Kraken classifier. I've discovered what may be a small bug and I thought I'd contribute back to your project by making a minor pull request.

I noticed that Bracken represents redistributed reads as floating point numbers and during output are cast to integers without rounding. This can cause off-by-one errors for the reported counts of a taxa. The affect of this is amplified in the Kraken-format report whereby these off-by-one inaccuracies are summed through the taxonomic tree and can result in larger discrepancies between the number of reads assigned to a clade (inclusive of all children nodes) and the sum of read counts assigned to individual nodes of that clade. Below is an excerpt from a report generated with Bracken which is affected by the described issue:

7.24    3183985 0       G       838                       Prevotella
2.64    1162821 1162821 S       52227                       Prevotella dentalis
1.84    808570  808570  S       76123                       Prevotella enoeca
1.19    523928  523928  S       28129                       Prevotella denticola
0.49    216241  216241  S       652716                      Prevotella sp. oral taxon 299
0.37    164928  164928  S       839                         Prevotella ruminicola
0.48    208994  208994  S       28131                       Prevotella intermedia
0.11    49136   49136   S       589436                      Prevotella fusca
0.07    30195   30195   S       1177574                     Prevotella jejuni
0.02    9006    9006    S       28132                       Prevotella melaninogenica
0.02    10160   10160   S       589437                      Prevotella scopos

Here the read count assigned to the Prevotella clade (3,183,985) does not equal the sum of reads assigned at each of the children nodes (3,183,979).

This pull request patches src/est_abundance.py to fix this behaviour by rounding the redistributed read counts prior to writing out the default Bracken report and the Kraken-style output format.

Rather than casting redistributed read counts to integers during output and ignoring the decimal, it seems preferable to round these floats instead

scwatts · 2018-11-21T03:22:19Z

I thought about this further over lunch I don't think this is a complete solution for the Kraken-style output format. Applying the suggested rounding will mitigate the accumulative off-by-one errors but there may still be discrepancies of -1/+1 between read count of a clade and the sum of reads assigned at each child node.

An alternative may be to perform rounding during estimation of true reads, prior to propagating these counts through the tree (i.e. rounding here https://github.com/jenniferlu717/Bracken/blob/master/src/est_abundance.py#L313). While this strictly preserves the child node counts and forces the parent clade to have consistent counts, the total assigned reads may be altered from the input. Another thought here could be to adjust the children node counts such that the clade and input total is preserved.

Round redistributed read floats

a144b40

Rather than casting redistributed read counts to integers during output and ignoring the decimal, it seems preferable to round these floats instead

robmaz mentioned this pull request Apr 8, 2020

How does the fraction_total_reads calculated in bracken report? #66

Closed

sconlan mentioned this pull request Sep 7, 2023

Counts at species level don't sum to root #237

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Round redistributed read floats #59

Round redistributed read floats #59

scwatts commented Nov 20, 2018

scwatts commented Nov 21, 2018

Round redistributed read floats #59

Are you sure you want to change the base?

Round redistributed read floats #59

Conversation

scwatts commented Nov 20, 2018

scwatts commented Nov 21, 2018