Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where bin number could overflow when looking for max_off #1595

Merged
merged 1 commit into from
Apr 5, 2023

Conversation

daviesrob
Copy link
Member

When searching for max_off, hts_itr_query() and hts_itr_multi_bam() look for a bin to the right of the end of the region. For whole chromosomes, this would be HTS_POS_MAX, which is far beyond the maximum bin position supported. The bin calculation overflowed leading to either a negative bin number or an incorrect positive one, depending on the number of levels in the index. Negative bin numbers simply caused time to be wasted as the search loop eventually counted up to zero, but incorrect positive ones could cause the iterator to finish too early.

Fix by catching the out-of-bounds case and setting max_off to UINT64_MAX, which should be used for bins beyond the end of the indexable range.

Luckily in practice this didn't cause too much harm, at least for the default min_shift value. Indexes with up to six levels overflowed to negative bin numbers. For seven you got one referencing a region starting at about 17Gbases, and 257Gbases for eight, so it's unlikely searches on real data were affected. The fix is trivial though and avoids some negative value shifts so worth doing.

When searching for `max_off`, hts_itr_query() and
hts_itr_multi_bam() look for a bin to the right of the end of
the region.  For whole chromosomes, this would be HTS_POS_MAX,
which is far beyond the maximum bin position supported.  The `bin`
calculation overflowed leading to either a negative bin number
or an incorrect positive one, depending on the number of levels
in the index.  Negative bin numbers simply caused time to be
wasted as the search loop eventually counted up to zero, but
incorrect positive ones could cause the iterator to finish too
early.

Fix by catching the out-of-bounds case and setting max_off
to UINT64_MAX, whch should be used for bins beyond the end
of the indexable range.
bin++;
// First check if end lies within the range of the index (it won't
// if it's HTS_POS_MAX)
if (end < 1LL << (idx->min_shift + 3 * idx->n_lvls)) {
Copy link
Contributor

@jkbonfield jkbonfield Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a useful shortcut I think, but it looks like using uint64_t for bin would have made the old code work too, as hts_bin_first is then huge and bin gets reset to 0, falling out of the while loop with max_off = UINT64_MAX.

Otherwise we're still using a negative bin below, eg:

            for (i = n_off = 0; i < iter->bins.n; ++i)
                if ((k = kh_get(bin, bidx, iter->bins.a[i])) != kh_end(bidx))
                    n_off += kh_value(bidx, k).n;

Edit: argh! That blasted obscure templating code... ignore me. "bin" isn't a variable here, but a namespace shadowing a variable name. Sigh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I now see how to get a positive legal bin and still cause mayhem, so agreed your solution of checking for the maximum value is definitely the best way of handling it.

$ tabix -m10 -f /tmp/_.bed.gz;./tabix /tmp/_.bed.gz a

That takes 10 sec to query my 5 entry bed file. (Admittedly with debugging on)

@jkbonfield jkbonfield merged commit 93434e0 into samtools:develop Apr 5, 2023
@daviesrob daviesrob deleted the bin_overflow branch April 18, 2023 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants