Skip to content

All About Genomic Ranges

Mike Lin edited this page Sep 1, 2015 · 14 revisions

GLnexus represents genomic ranges using the range structure in types.h. This is essentially a triplet of integers (rid,beg,end), where:

  • rid is the ID of the reference contig (a zero-based index into the contigs listed in the VCF header)
  • beg is the beginning position of the range
  • end is the end position.

The positions are zero-based and the interval is half-open, that is, beg is inclusive but end is exclusive. Here's an illustrative figure (which uses lo and hi instead of beg and end):

The size of a range is end-beg with no adding or subtracting 1. Note also that the ranges [x,y) and [y,z) abut but do not overlap.

This convention has been thoroughly litigated throughout the history of bioinformatics; here's some background reading for those especially interested: 1 2 3 4 5 6 7.

[g]VCF

In spite of that, bioinformatics file formats and user interfaces frequently use one-based positions, because these are supposedly more "biologist-friendly." The text [g]VCF format in particular uses one-based, closed intervals, where both the beginning and end positions are inclusive. Let us consider this excerpt of a gVCF file:

21	10009462	.	T	<NON_REF>	.	.	END=10009463	GT:DP:GQ:MIN_DP:PL	0/0:14:36:13:0,36,540
21	10009464	.	TA	T,<NON_REF>	0.03	.	BaseQRankSum=-2.041;ClippingRankSum=-0.322;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=31.36;MQ0=0;MQRankSum=-0.752;ReadPosRankSum=0.322	GT:AD:DP:GQ:PL:SB	0/1:10,2,0:12:16:16,0,240,46,246,292:1,9,1,1
21	10009466	.	A	<NON_REF>	.	.	END=10009466	GT:DP:GQ:MIN_DP:PL	0/0:13:0:13:0,0,342

Here we have three abutting ranges with sizes 2, 2, 1. In detail:

The first range begins at and includes one-based position 10009462 (the second column), and ends at and includes one-based position 10009463 (the END INFO field). This is GLnexus::range(20,10009461,10009463) if we assume chromosome 21 has rid=20. But why is the reference allele (fourth column) only one base and not two? This is the gVCF convention for records giving reference coverage with no alternate alleles. Only the first reference base is used, to avoid the need to write long stretches of exactly the reference genome into the file.

In the second record, the absence of the END field means we should consider the size of the range to be the length of the reference allele (a so-called "precise variant"). So this is GLnexus::range(20,10009463,10009465).

The third is another gVCF reference record which is only one base long, GLnexus::range(20,10009465,10009466).

Aside: the VCF 4.2 specification implies the inclusivity of the [g]VCF END field only indirectly, with the following statement: "For precise variants, END is POS + length of REF allele - 1". Hopefully this will be more explicit in the upcoming 4.3 spec.

BCF and bcf1_t

Now unlike the text VCF format, the binary BCF equivalent does use zero-based positions! If you use htslib functions to read a VCF record into a bcf1_t, you'll find the beginning position integer bcf_record->pos to be one less than the position written in the text VCF file. Conversely, if you write a bcf1_t out to text VCF, the text position will be one greater than the in-memory integer.

However, htslib does not manipulate the END field if it's present. The integer you'll find in the bcf1_t, bcf_get_info(bcf_header, bcf_record, "END")->v1.i, is equal to the END position written in the VCF. This actually turns out to be "convenient" in an embarrassing way, because we can pretend the one-based, inclusive text position is the zero-based, exclusive position we want.

As a result, to construct a GLnexus::range from a bcf1_t, we'd copy bcf_record->rid, set beg to bcf_record->pos, and:

  • if the END INFO field is present, set end to its contents.
  • otherwise set end to beg+strlen(bcf_record->d.allele[0])

Fortunately htslib implements similar logic already, in bcf_record->rlen.