Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency in samtools stats output - cram index issue? #1639

Closed
jts opened this issue Jul 5, 2023 · 3 comments · Fixed by #1640
Closed

Inconsistency in samtools stats output - cram index issue? #1639

jts opened this issue Jul 5, 2023 · 3 comments · Fixed by #1640

Comments

@jts
Copy link

jts commented Jul 5, 2023

Hi,

I recently ran into a strange problem where the output of samtools stats differed between two invocations that I expected to be equivalent. I was trying to generate the stats for a single chromosome and initially ran:

samtools stats example.cram chrM

The results reported far fewer reads than expected, so I re-ran the command providing the full range for this chromosome and got the expected number of reads:

samtools stats example.cram chrM:1-19154

By definition these two range specifiers should be the same so I spent a bit of time looking into this and I think I tracked it down to a difference in what cram_index_last and cram_index_last_query return. I've attached a minimal example (sim1.cram) to reproduce the problem. This is simply 500 full-length copies of a mtDNA sequence, with 1% errors introduced.

With this .cram file samtools stats sim1.cram chrM reports 257 total reads for this chromosome but samtools stats sim1.cram chrM:1-19154 reports 486.

Here's a snippet of code to dump the cram_index* returned by the two methods:

#include "htslib/cram/cram.h"
#include "htslib/cram/cram_index.h"
#include "htslib/hts.h"

int main(int argc, char** argv)
{
    if(argc < 2) {
        return 1;
    }

    cram_fd* cram = cram_open(argv[1], "r");
    cram_index_load(cram, argv[1], NULL);

    cram_index* last = cram_index_last(cram, 5 /* tid of chrM */, NULL);
    cram_index* last_query = cram_index_query_last(cram, 5, 19155);

    printf("index from cram_index_last       span: [%d %d] slice: %d len: %d\n", last->start, last->end, last->slice, last->len);
    printf("index from cram_index_query_last span: [%d %d] slice: %d len: %d\n", last_query->start, last_query->end, last_query->slice, last_query->len);

}

On sim1.cram this code prints:

index from cram_index_last       span: [1 19524] slice: 396 len: 5291
index from cram_index_query_last span: [1 19524] slice: 396 len: 5176

which correspond to the first and second (last) entry in the .crai file for this chromosome (tid 5):

5       1       19524   16677   396     5291
5       1       19524   22388   396     5176
-1      0       1       27985   226     245

I don't know the cram format well enough to debug this any further but I hope this points in the right direction.

Jared
sim1.zip

@jkbonfield
Copy link
Contributor

Thank you for the very clear bug report. I can reproduce it. That's disappointing as I thought we'd recently bug fixed that function! I wonder if the bug fix itself broke something else.

Yes, this will most definitely point us in the right direction.

@jkbonfield
Copy link
Contributor

FWIW the bug I fixed before wasn't this, but a related issue elsewhere. The cause and fix are both trivial.

Many thanks for the clear bug report and data to reproduce it.

@jts
Copy link
Author

jts commented Jul 6, 2023

Thanks for the quick fix (and shoutout on twitter!), I tested the fix on the large dataset where I first noticed the problem and can confirm I get the expected result now.

vasudeva8 pushed a commit to vasudeva8/htslib that referenced this issue Aug 17, 2023
The index is a loaded into a nested containment list, so the last
entry in the index array is not necessarily the last slice, as the
last slice may be entirely contained within a previous one.

Fixes samtools#1639
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants