Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow repeated calls of bcf_sr_set_regions #1624

Merged
merged 2 commits into from
Jun 21, 2023
Merged

Conversation

pd3
Copy link
Member

@pd3 pd3 commented May 25, 2023

and make repeated bcf_sr_seek()+next_line() calls consistent.

Resolves #1623 and samtools/bcftools#1918

and make repeated bcf_sr_seek()+next_line() calls consistent.

Resolves samtools#1623 and samtools/bcftools#1918
pd3 added a commit to samtools/bcftools that referenced this pull request May 25, 2023
Awaits the merge of samtools/htslib#1624

Resolves #1918
@daviesrob daviesrob self-assigned this May 30, 2023
Copy link
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your test program works better after this change, but still shows a difference between VCF and BCF:

/tmp/test_synced_reader /tmp/test.bcf
seek: 0
      1
seek: 0
      1
seek: 0
      1

/tmp/test_synced_reader /tmp/test.vcf.gz
seek: -1
      1
seek: -1
      1
seek: 0
      1

So when the target chromosome doesn't appear in the input file, bcf_sr_seek() returns -1 for vcf and 0 for bcf. I guess this comes from differences between tbx_name2id() and bcf_hdr_name2id() and also how the underlying indexes work. Would it be possible the make them work in the same way?

Ideally, if you ask for a chromosome that's in the header but not present in the file, I think it should return 0 but put the reader into an EOF-like state so that the caller will get no data for it and move quickly on to the next region. For chromosomes not in the header, arguably it should return -1 instead. But can we do this in VCF, where listing chromosomes in the header is essentially optional...?

* first call to bcf_sr_add_reader().
* API notes:
* - bcf_sr_set_targets MUST be called before the first call to bcf_sr_add_reader()
* - bcf_sr_set_regions AFTER readers where initialized will reposition the readers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - bcf_sr_set_regions AFTER readers where initialized will reposition the readers
* - calling bcf_sr_set_regions AFTER readers have been initialized will reposition the readers

@pd3
Copy link
Member Author

pd3 commented Jun 6, 2023

In past we tried to use the indexes to print some stats and realized they don't carry the same information. I don't remember the details, but it had something to do with contig names not being part of the tbi index, I think.

Therefore for this issue I concluded in the end that the return status cannot be made fully equivalent for both and looked for a different solution.

@jkbonfield
Copy link
Contributor

If you are proposing that it is correct that BCF and VCF have different return values for certain situations, then they should be documented. I can see no information on the expected return values in this PR.

Personally though, I think they should be massaged to be equal as far as is possible, with the potential (documented) caveat for VCF files that have no header. They aren't the norm from what I can see.

@daviesrob
Copy link
Member

I thought that might be the case. Although it may be possible to get more consistency by adding this change:

diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index 702f260..fc48c03 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -521,7 +521,8 @@ static int _reader_seek(bcf_sr_t *reader, const char *seq, hts_pos_t start, hts_
     if ( reader->tbx_idx )
     {
         int tid = tbx_name2id(reader->tbx_idx, seq);
-        if ( tid==-1 ) return -1;    // the sequence not present in this file
+        if ( tid==-1 )               // the sequence not present in this index
+            return reader->header && bcf_hdr_name2id(reader->header, seq) >= 0 ? 0 : -1;
         reader->itr = tbx_itr_queryi(reader->tbx_idx,tid,start,end+1);
     }
     else

So as long as the header has ##contig entries it will behave the same as with BCF, and if it doesn't have any then it'll be no worse off than before. If it returns at this point, reader->itr will be NULL, but that should be OK.

Here's the test program output with that change:

/tmp/test_synced_reader /tmp/test.vcf.gz
seek: 0
      1
seek: 0
      1
seek: 0
      1

@daviesrob
Copy link
Member

I expanded the original test a bit, which showed up another issue. The updated VCF file is:

##fileformat=VCFv4.2
##reference=file://some/path/human_g1k_v37.fasta
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	a
17	1	.	T	G	.	.	.	GT	0/0
19	2	.	C	A	.	.	.	GT	0/0
19	3	.	G	C	.	.	.	GT	0/0
19	4	.	T	C	.	.	.	GT	0/1
19	5	.	A	C	.	.	.	GT	1/1

and the test program now looks like this:

#include <stdio.h> 
#include <stdarg.h> 
#include "htslib/synced_bcf_reader.h"

void error(const char *format, ...) 
{
    va_list ap;
    va_start(ap, format);
    vfprintf(stderr, format, ap);
    va_end(ap);
    exit(-1);
}

void print_next_line(bcf_srs_t *sr)
{
    int count = bcf_sr_next_line(sr);
    bcf1_t *line = count > 0 ? bcf_sr_get_line(sr, 0) : NULL;
    fprintf(stderr, "count %d\n", count);
    if (line) {
      fprintf(stderr, "  tid %d pos %"PRIhts_pos"\n", line->rid, line->pos);
    } else {
      fprintf(stderr, "  NULL\n");
    }
}

int main(int argc, char *argv[])
{   
    bcf_srs_t *sr = bcf_sr_init();
    sr->require_index = 1;
    if ( bcf_sr_add_reader(sr,argv[1])!=1 ) error("Unable to read %s\n",argv[1]);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"17",0));
    print_next_line(sr);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"18",0));
    print_next_line(sr);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"19",0));
    print_next_line(sr);
    fprintf(stderr,"seek: %d\n",bcf_sr_seek(sr,"20",0));
    print_next_line(sr);
    bcf_sr_destroy(sr);
    return 0;
}

So it tries to access a missing chromosome 18, which is no longer at the start of the file, and it also simulates fetching the next line and printing out the tid and position for the record returned. Running it on the original (test.vcf.gz, test.bcf) and new (test2.vcf.gz, test2.bcf) files gives the following:

./test_synced_reader test.vcf.gz
seek: -1
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 2 pos 1

./test_synced_reader test.bcf
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 2 pos 1

./test_synced_reader test2.vcf.gz
seek: 0
count 1
 tid 0 pos 0
seek: -1
count 1
 tid 0 pos 0
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 1
 tid 0 pos 0

./test_synced_reader test2.bcf
seek: 0
count 1
 tid 0 pos 0
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 0
  NULL

So when bcf_sr_seek() can't find any data for the requested seq, on bcf it starts reading from the next reference, but on vcf.gz it goes back to the start of the file, which probably isn't ideal. I don't think it's actually easily possible to make them behave in exactly the same way due to differences in the indexes. It is possible to make the vcf.gz one return no data for absent sequences though with the following change:

diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index 702f260e..a6f901a8 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -872,7 +872,12 @@ int bcf_sr_seek(bcf_srs_t *readers, const char *seq, hts_pos_t pos)
     //  - find the requested iseq (stored in the seq_hash)
     //  - position regions to the requested position (bcf_sr_regions_overlap)
     bcf_sr_seek_start(readers);
-    if ( khash_str2int_get(readers->regions->seq_hash, seq, &i)>=0 ) readers->regions->iseq = i;
+    if ( khash_str2int_get(readers->regions->seq_hash, seq, &i)>=0 ) {
+        readers->regions->iseq = i;
+    } else {
+        // Ensure we don't try to continue on from a non-existent reference seq
+        readers->regions->iseq = -1;
+    }
     _bcf_sr_regions_overlap(readers->regions, seq, pos, pos, 0);
 
     for (i=0; i<readers->nreaders; i++)

The result is then:

./test_synced_reader test.vcf.gz
seek: -1
count 0
  NULL
seek: -1
count 0
  NULL
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 0
  NULL

./test_synced_reader test.bcf
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 0
  NULL

./test_synced_reader test2.vcf.gz
seek: 0
count 1
 tid 0 pos 0
seek: -1
count 0
  NULL
seek: 0
count 1
 tid 2 pos 1
seek: -1
count 0
  NULL

./test_synced_reader test2.bcf
seek: 0
count 1
 tid 0 pos 0
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 1
 tid 2 pos 1
seek: 0
count 0
  NULL

So when bcf_sr_seek() fails, bcf_sr_next_line() more reliably returns no data. I decided that trying to get exactly the same results from BCF and VCF isn't going to be easy, so I haven't included the other patch above. Both HTSlib and BCFtools are happy when running make check with this additional change. @pd3 Do you think this looks reasonable?

@pd3
Copy link
Member Author

pd3 commented Jun 15, 2023

Following up on our discussion offline, the bcf_sr_seek() call is currently used only in consensus.c, its use now deprecated by samtools/bcftools#1937. And in vcfconcat.c https://github.com/samtools/bcftools/blob/4d0d262b020386b940ec7f60f636097fa1d4cc3a/vcfconcat.c#L501.

There I don't immediately see if it can be replaced with the updated bcf_sr_set_regions() function or not. It is used only in the phased concatenation mode, see the tests with the -l, --ligate option https://github.com/samtools/bcftools/blob/4d0d262b020386b940ec7f60f636097fa1d4cc3a/test/test.pl#L712-L719

@daviesrob
Copy link
Member

Thanks. I'll see if I can come up with something that exercises the missing chromosome case a bit better. Given the prefect overlap requirement for the --ligate option, would it be correct that you'd drop parts where one of the files didn't have data for one of the chromosomes, unless using --ligate-force? Also, should you be able to swap the two input files, and get the same result?

@pd3
Copy link
Member Author

pd3 commented Jun 15, 2023

When two chunks overlap, i.e. share a genomic region, within that region all sites found in one file must be present also in the other or else they are dropped. I believe the program makes the assumption that the VCFs come in the correct order. However, it shouldn't be difficult to reorder to make it robust against swapping of two input files.

@daviesrob
Copy link
Member

The file ordering certainly makes a difference for vcf. Here's a couple of test inputs, derived from concat.4.a.vcf and concat.4.b.vcf](https://github.com/samtools/bcftools/blob/develop/test/concat.4.b.vcf):

concat.6.a.vcf:

##fileformat=VCFv4.2
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	2	.	C	G	.	.	.	GT	0/1
chr2	1	.	A	C	.	.	.	GT	0|1
chr2	2	.	C	G	.	.	.	GT	0/1
chr3	1	.	A	C	.	.	.	GT	0|1
chr3	2	.	C	G	.	.	.	GT	0/1

concat6.b.vcf:

##fileformat=VCFv4.2
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1	.	A	C	.	.	.	GT	1|0
chr1	2	.	C	G	.	.	.	GT	0/1
chr1	3	.	G	T	.	.	.	GT	0|1
chr3	1	.	A	C	.	.	.	GT	1|0
chr3	2	.	C	G	.	.	.	GT	0/1
chr3	3	.	G	T	.	.	.	GT	0|1

Trying bcftools concat -l both ways round, I get:

./bcftools concat -l /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.b.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.17-53-g4d0d262b+htslib-1.17-39-g7f69840c
##bcftools_viewCommand=view -Oz -o /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.a.vcf; Date=Thu Jun 15 14:52:15 2023
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing Quality (bigger is better)">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##bcftools_concatVersion=1.17-53-g4d0d262b+htslib-1.17-30-g3e2190c9-dirty
##bcftools_concatCommand=concat -l /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.b.vcf.gz; Date=Fri Jun 16 11:59:00 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1	.	A	C	.	.	.	GT:PS	0|1:1
chr1	2	.	C	G	.	.	.	GT:PQ:PS	0/1:99:1
chr1	3	.	G	T	.	.	.	GT:PS	1|0:1
chr3	1	.	A	C	.	.	.	GT:PS	0|1:1
chr3	2	.	C	G	.	.	.	GT:PS	0/1:1
chr3	3	.	G	T	.	.	.	GT:PS	1|0:1

./bcftools concat -l /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.a.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.17-53-g4d0d262b+htslib-1.17-39-g7f69840c
##bcftools_viewCommand=view -Oz -o /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.b.vcf; Date=Thu Jun 15 14:52:21 2023
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing Quality (bigger is better)">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##bcftools_concatVersion=1.17-53-g4d0d262b+htslib-1.17-30-g3e2190c9-dirty
##bcftools_concatCommand=concat -l /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.a.vcf.gz; Date=Fri Jun 16 11:59:28 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1	.	A	C	.	.	.	GT:PS	1|0:1
chr1	2	.	C	G	.	.	.	GT:PS	0/1:1
chr1	3	.	G	T	.	.	.	GT:PS	1|0:1
chr3	1	.	A	C	.	.	.	GT:PS	0|1:1
chr3	2	.	C	G	.	.	.	GT:PS	0/1:1
chr3	3	.	G	T	.	.	.	GT:PS	0|1:1
chr2	1	.	A	C	.	.	.	GT:PS	0|1:1
chr2	2	.	C	G	.	.	.	GT:PS	0/1:1

So the second is incorrectly ordered compared to the header, and also includes chr2 which the first left out.

I think this comes from how the references are added to the synced reader regions list. Switching to bcftools concat -a, which basically just pulls reads from the synced reader, gives (with a few headers removed to make it easier to read):

./bcftools concat -a /tmp/concat/concat.6.a.vcf.gz /tmp/concat/concat.6.b.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	1	.	A	C	.	.	.	GT	1|0
chr1	2	.	C	G	.	.	.	GT	0/1
chr1	2	.	C	G	.	.	.	GT	0/1
chr1	3	.	G	T	.	.	.	GT	0|1
chr2	1	.	A	C	.	.	.	GT	0|1
chr2	2	.	C	G	.	.	.	GT	0/1
chr3	1	.	A	C	.	.	.	GT	0|1
chr3	1	.	A	C	.	.	.	GT	1|0
chr3	2	.	C	G	.	.	.	GT	0/1
chr3	2	.	C	G	.	.	.	GT	0/1
chr3	3	.	G	T	.	.	.	GT	0|1

./bcftools concat -a /tmp/concat/concat.6.b.vcf.gz /tmp/concat/concat.6.a.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	1	.	A	C	.	.	.	GT	1|0
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	2	.	C	G	.	.	.	GT	0/1
chr1	2	.	C	G	.	.	.	GT	0/1
chr1	3	.	G	T	.	.	.	GT	0|1
chr3	1	.	A	C	.	.	.	GT	1|0
chr3	1	.	A	C	.	.	.	GT	0|1
chr3	2	.	C	G	.	.	.	GT	0/1
chr3	2	.	C	G	.	.	.	GT	0/1
chr3	3	.	G	T	.	.	.	GT	0|1
chr2	1	.	A	C	.	.	.	GT	0|1
chr2	2	.	C	G	.	.	.	GT	0/1

While if you try with bcf, you'll find that the ordering is the same as the first case for both ways round. The code that adds new entries to the regions list is in bcf_sr_add_reader(). For the vcf files, it's using tbx_seqnames() to get a list of names out of the index and merging them in with any it already has. The tabix style index only lists sequences mentioned in the variant records, which explains the difference. For the a,b ordering the first file lists all of the names and it works as you'd expect. For the b,a ordering, concat.6.b.vcf.gz only adds chr1 and chr3; then when concat.6.a.vcf.gz is added to the reader it finds then chr1 and chr3 are already present, but chr2 is new so it gets added at the end.

I think it would help to get a more consistent output if the reader always used bcf_hdr_seqnames() to add names from the header, and then supplemented them with ones found with tbx_seqnames() that might have sneaked through without a ##contig line. It should also add any new ones it finds in the index to the file header that's going to be output in the same way as the VCF reader does it in add_missing_contig_hrec(). Hopefully the result would then both preserve the global ordering as much as possible, and also output a header with the ##contig lines in an order that matches the data lines that come later.

(Unfortunately even with this, bcftools concat -l still gives a different answer for the two orderings. But I'm not sure exactly what it should do, or if these are even strictly valid inputs.)

In other news, I've confirmed that this fix, with or without my tweak, makes the example in samtools/bcftools#1918 work with bcftools consensus even without having merged samtools/bcftools#1937.

@pd3
Copy link
Member Author

pd3 commented Jun 16, 2023

So the second is incorrectly ordered compared to the header, and also includes chr2 which the first left out.

Just a quick note before analyzing the entire comment: in VCF the order of contigs in the header does not have to match the order of contigs in the body, so this bit is perfectly fine.

@daviesrob
Copy link
Member

I did wonder if that might be the case. However, I think there's something to be said for using the ordering in the header if available as it makes vcf and bcf inputs behave in a more similar way. And it may also help the aim of getting bcf_sr_seek() to work better when some chromosomes are absent from some of the input files.

@daviesrob
Copy link
Member

I've looked more at bcftools concat -l. I think this check to see if it should add a new reader may be a bit broken because it doesn't check that the chromosome for line matches the one it found when setting args->start_pos[args->ifname]. However if it did come across such an input then it would have already missed all the overlaps on the first chromosome in the new file. I also think this can only happen if you've got at least three input files, so it might be best to give up if it happens as you're no longer properly synchronised.

Other than that, I don't think it's possible for bcftools concat -l to ask bcf_sr_seek() to go anywhere other than deliberately to the start of the file, or to the first chromosome (which it already knows exists). So inconsistencies in its handling of chromosomes that don't appear in any variant records probably aren't going to cause too much trouble, at least in bcftools.

As the bcf_sr_set_regions() change does work, I'll merge this with my comment fix, and sort out other synced reader issues elsewhere.

@daviesrob daviesrob merged commit c11aebe into samtools:develop Jun 21, 2023
pd3 added a commit to samtools/bcftools that referenced this pull request Jun 22, 2023
Awaits the merge of samtools/htslib#1624

Resolves #1918
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this pull request Aug 17, 2023
and make repeated bcf_sr_seek()+next_line() calls consistent.

Resolves samtools#1623 and samtools/bcftools#1918
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bcf_sr_seek behaves differently for VCF.gz and BCF
3 participants