Allow repeated calls of bcf_sr_set_regions #1624
and make repeated bcf_sr_seek()+next_line() calls consistent. Resolves samtools#1623 and samtools/bcftools#1918
Awaits the merge of samtools/htslib#1624 Resolves #1918
Your test program works better after this change, but still shows a difference between VCF and BCF:
/tmp/test_synced_reader /tmp/test.bcf
seek: 0
1
seek: 0
1
seek: 0
1
/tmp/test_synced_reader /tmp/test.vcf.gz
seek: -1
1
seek: -1
1
seek: 0
1
So when the target chromosome doesn't appear in the input file, bcf_sr_seek() returns -1 for vcf and 0 for bcf. I guess this comes from differences between tbx_name2id() and bcf_hdr_name2id() and also how the underlying indexes work. Would it be possible to make them work in the same way?
Ideally, if you ask for a chromosome that's in the header but not present in the file, I think it should return 0 but put the reader into an EOF-like state so that the caller will get no data for it and move quickly on to the next region. For chromosomes not in the header, arguably it should return -1 instead. But can we do this in VCF, where listing chromosomes in the header is essentially optional...?
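The return-value policy proposed above can be sketched with a toy model. This is only an illustration under stated assumptions: `toy_reader`, `toy_seek` and the `SEEK_*` codes are hypothetical names invented here, not htslib API, and the model assumes the header lists every contig, which VCF does not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; NOT part of htslib. */
enum toy_seek_result {
    SEEK_FAIL = -1, /* contig is not declared in the header */
    SEEK_OK   = 0   /* contig is declared; reader is positioned or parked at EOF */
};

/* Toy stand-in for a reader: the header's contig names plus a flag per
 * contig saying whether the file body actually holds records for it. */
struct toy_reader {
    const char **hdr_contigs;
    const int   *has_records;
    size_t       n_contigs;
    int          at_eof;      /* 1 = next read yields no data */
    int          cur_contig;  /* index into hdr_contigs, -1 = none */
};

/* Seek to the start of `seq` following the proposed policy. */
static int toy_seek(struct toy_reader *r, const char *seq)
{
    assert(r != NULL && seq != NULL);
    for (size_t i = 0; i < r->n_contigs; i++) {
        if (strcmp(r->hdr_contigs[i], seq) != 0) continue;
        r->cur_contig = (int)i;
        /* Declared but absent from the body: succeed, but park at EOF so
         * the caller gets no data and can move on to the next region. */
        r->at_eof = !r->has_records[i];
        return SEEK_OK;
    }
    /* Not declared in the header at all: report failure. */
    r->cur_contig = -1;
    r->at_eof = 1;
    return SEEK_FAIL;
}
```

With header contigs 17, 18 and 20, where 18 has no records, seeking to 18 returns 0 with `at_eof` set, while seeking to an undeclared name returns -1.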
htslib/synced_bcf_reader.h
Outdated
 * first call to bcf_sr_add_reader().
 * API notes:
 * - bcf_sr_set_targets MUST be called before the first call to bcf_sr_add_reader()
 * - bcf_sr_set_regions AFTER readers where initialized will reposition the readers
Suggested change:
-  * - bcf_sr_set_regions AFTER readers where initialized will reposition the readers
+  * - calling bcf_sr_set_regions AFTER readers have been initialized will reposition the readers
In the past we tried to use the indexes to print some stats and realized they don't carry the same information. I don't remember the details, but I think it had something to do with contig names not being part of the tbi index. For this issue I therefore concluded that the return status cannot be made fully equivalent for both, and looked for a different solution.
If you are proposing that it is correct that BCF and VCF have different return values for certain situations, then they should be documented. I can see no information on the expected return values in this PR. Personally though, I think they should be massaged to be equal as far as is possible, with the potential (documented) caveat for VCF files that have no header. They aren't the norm from what I can see.
I thought that might be the case. Although it may be possible to get more consistency by adding this change:
So as long as the header has Here's the test program output with that change:
I expanded the original test a bit, which showed up another issue. The updated VCF file is:
and the test program now looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include "htslib/synced_bcf_reader.h"
void error(const char *format, ...)
{
va_list ap;
va_start(ap, format);
vfprintf(stderr, format, ap);
va_end(ap);
exit(-1);
}
void print_next_line(bcf_srs_t *sr)
{
int count = bcf_sr_next_line(sr);
bcf1_t *line = count > 0 ? bcf_sr_get_line(sr, 0) : NULL;
fprintf(stderr, "count %d\n", count);
if (line) {
fprintf(stderr, " tid %d pos %"PRIhts_pos"\n", line->rid, line->pos);
} else {
fprintf(stderr, " NULL\n");
}
}
int main(int argc, char *argv[])
{
bcf_srs_t *sr = bcf_sr_init();
sr->require_index = 1;
if ( bcf_sr_add_reader(sr,argv[1])!=1 ) error("Unable to read %s\n",argv[1]);
fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"17",0));
print_next_line(sr);
fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"18",0));
print_next_line(sr);
fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"19",0));
print_next_line(sr);
fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"20",0));
print_next_line(sr);
bcf_sr_destroy(sr);
return 0;
}
So it tries to access a missing chromosome 18, which is no longer at the start of the file, and it also simulates fetching the next line and printing out the tid and position for the record returned. Running it on the original (
So when
diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index 702f260e..a6f901a8 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -872,7 +872,12 @@ int bcf_sr_seek(bcf_srs_t *readers, const char *seq, hts_pos_t pos)
// - find the requested iseq (stored in the seq_hash)
// - position regions to the requested position (bcf_sr_regions_overlap)
bcf_sr_seek_start(readers);
- if ( khash_str2int_get(readers->regions->seq_hash, seq, &i)>=0 ) readers->regions->iseq = i;
+ if ( khash_str2int_get(readers->regions->seq_hash, seq, &i)>=0 ) {
+ readers->regions->iseq = i;
+ } else {
+ // Ensure we don't try to continue on from a non-existent reference seq
+ readers->regions->iseq = -1;
+ }
_bcf_sr_regions_overlap(readers->regions, seq, pos, pos, 0);
for (i=0; i<readers->nreaders; i++)
The result is then:
So when |
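The effect of resetting the current sequence index on a failed lookup can be illustrated with a toy model. This is a sketch only: `toy_regions` and the `toy_*` functions are hypothetical and greatly simplified, and they do not reproduce the real regions structure or the htslib call sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a regions list: known sequence names plus the index
 * of the sequence currently being iterated (-1 = none). */
struct toy_regions {
    const char **seq_names;
    int nseqs;
    int iseq;
};

/* Linear lookup standing in for the hash lookup; returns -1 on a miss. */
static int toy_lookup(const struct toy_regions *reg, const char *seq)
{
    for (int i = 0; i < reg->nseqs; i++)
        if (strcmp(reg->seq_names[i], seq) == 0) return i;
    return -1;
}

/* Behaviour before the change: a miss leaves the previous iseq in place,
 * so iteration can carry on from whichever sequence was seen last. */
static void toy_seek_stale(struct toy_regions *reg, const char *seq)
{
    int i = toy_lookup(reg, seq);
    if (i >= 0) reg->iseq = i;
}

/* Behaviour after the change: a miss resets iseq to -1, so nothing is
 * returned for a sequence that does not exist. */
static void toy_seek_reset(struct toy_regions *reg, const char *seq)
{
    assert(reg != NULL && seq != NULL);
    reg->iseq = toy_lookup(reg, seq);
}

/* Name of the sequence iteration would continue on, or NULL if none. */
static const char *toy_current_seq(const struct toy_regions *reg)
{
    return reg->iseq >= 0 ? reg->seq_names[reg->iseq] : NULL;
}
```

In this model, seeking to 17 and then to the missing 18 leaves the stale variant still pointing at 17, whereas the reset variant reports no current sequence.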
Following up on our discussion offline, the There I don't immediately see if it can be replaced with the updated |
Thanks. I'll see if I can come up with something that exercises the missing chromosome case a bit better. Given the perfect overlap requirement for the
When two chunks overlap, i.e. share a genomic region, within that region all sites found in one file must be present also in the other or else they are dropped. I believe the program makes the assumption that the VCFs come in the correct order. However, it shouldn't be difficult to reorder to make it robust against swapping of two input files.
The file ordering certainly makes a difference for vcf. Here's a couple of test inputs, derived from concat.4.a.vcf and [concat.4.b.vcf](https://github.com/samtools/bcftools/blob/develop/test/concat.4.b.vcf):
concat.6.a.vcf:
concat6.b.vcf:
Trying
So the second is incorrectly ordered compared to the header, and also includes I think this comes from how the references are added to the synced reader regions list. Switching to
While if you try with bcf, you'll find that the ordering is the same as the first case for both ways round. The code that adds new entries to the regions list is in bcf_sr_add_reader(). For the vcf files, it's using I think it would help to get a more consistent output if the reader always used (Unfortunately even with this, In other news, I've confirmed that this fix, with or without my tweak, makes the example in samtools/bcftools#1918 work with |
Just a quick note before analyzing the entire comment: in VCF the order of contigs in the header does not have to match the order of contigs in the body, so this bit is perfectly fine.
I did wonder if that might be the case. However, I think there's something to be said for using the ordering in the header if available as it makes vcf and bcf inputs behave in a more similar way. And it may also help the aim of getting |
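The idea of preferring the header's contig order when it is available can be sketched as follows. This is a hypothetical helper written for illustration, not code from htslib or from this PR; it assumes contig names are unique within each list and that the output array has room for every index name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` occurs in `list`, 0 otherwise. */
static int in_list(const char **list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i], name) == 0) return 1;
    return 0;
}

/* Order the contigs found in the index (`idx`) by their position in the
 * header (`hdr`). Contigs missing from the header keep their index order
 * and are appended at the end. `out` must hold at least n_idx entries.
 * Returns the number of names written. */
static size_t order_by_header(const char **hdr, size_t n_hdr,
                              const char **idx, size_t n_idx,
                              const char **out)
{
    assert(out != NULL);
    size_t n = 0;
    /* First pass: header order, restricted to contigs the index knows. */
    for (size_t i = 0; i < n_hdr; i++)
        if (in_list(idx, n_idx, hdr[i])) out[n++] = hdr[i];
    /* Second pass: contigs that the header does not declare. */
    for (size_t i = 0; i < n_idx; i++)
        if (!in_list(hdr, n_hdr, idx[i])) out[n++] = idx[i];
    return n;
}
```

For a header listing 1, 2, 3 and an index that saw 3, 1, X, this yields 1, 3, X: header order where the header has an opinion, index order for the rest.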
I've looked more at Other than that, I don't think it's possible for As the |