Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam_itr_regarray not reporting reads #1569

Closed
Mesh89 opened this issue Feb 23, 2023 · 1 comment · Fixed by #1574
Closed

sam_itr_regarray not reporting reads #1569

Mesh89 opened this issue Feb 23, 2023 · 1 comment · Fixed by #1574
Assignees

Comments

@Mesh89
Copy link

Mesh89 commented Feb 23, 2023

In the file s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3988780/HG00512.final.cram the region chr7:106092666-106093543 contains 286 reads.

The following code, correctly outputs all of the starting positions of those reads when argv[1] is the path of HG00512.final.cram:

samFile* fp = sam_open(argv[1], "r");
bam_hdr_t* hdr = sam_hdr_read(fp);
hts_idx_t* idx = sam_index_load(fp, argv[1]);

bam1_t* read = bam_init1();
hts_itr_t* iter = sam_itr_querys(idx, hdr, "chr7:106092666-106093543");
while (sam_itr_next(fp, iter, read) >= 0) {
    std::cout << read->core.pos << std::endl;
}

However, the following code, which I expect to be equivalent, only outputs one read for me

samFile* fp = sam_open(argv[1], "r");
bam_hdr_t* hdr = sam_hdr_read(fp);
hts_idx_t* idx = sam_index_load(fp, argv[1]);

bam1_t* read = bam_init1();
char* region[] = { "chr7:106092666-106093543" };
hts_itr_t* iter = sam_itr_regarray(idx, hdr, region, 1);
while (sam_itr_next(fp, iter, read) >= 0) {
    std::cout << read->core.pos << std::endl;
}
@daviesrob
Copy link
Member

It looks like this is a bug in the index look-up code. In particular, cram_index_query_last() returns a location before the start of the region you want, which makes the iterator finish too early. It may be related to the fact that this file has cram containers which maps to a genomic region that's completely contained by an earlier one.

The exact cause isn't known yet, but it should be tracked down soon.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Feb 28, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes samtools#1569
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Feb 28, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes samtools#1569
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Mar 1, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes samtools#1569
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Mar 1, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes samtools#1569
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Mar 1, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes samtools#1569
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Mar 1, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes samtools#1569
daviesrob pushed a commit that referenced this issue Mar 6, 2023
The cram_index_query_last used by sam_itr_regarray had problems when
dealing with slices whose region is contained within another region.

Fundamentally this is due to the cram_index arrays being contiguous in
memory until a containment is found, at which point the pointers will
be to an entirely different array.  This breaks naive pointer
comparisons.

The cram_index struct already had a "next" field holding the file
offset of the next container.  This has been replaced by e_next
pointing to the next cram_entry struct in file ordering, and
e_next->offset is equivalent to the old "next".  This allows
consumption of the index either as the original nested containment
list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was
incorrect before.  We never used this function and it's not public,
but we now use it within the rewrite of cram_index_query_last.

Fixes #1569
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants