-
Notifications
You must be signed in to change notification settings - Fork 38
All About Genomic Ranges
GLnexus represents genomic ranges using the range
structure in types.h. This is essentially a triplet of integers (rid,beg,end)
, where:
-
rid
is the ID of the reference contig (a zero-based index into the contigs listed in the VCF header) -
beg
is the beginning position of the range -
end
is the end position.
The positions are zero-based and the interval is half-open, that is, beg
is inclusive but end
is exclusive. Here's an illustrative figure (which uses lo
and hi
instead of beg
and end
):
The size of a range is end-beg
with no adding or subtracting 1. Note also that the ranges [x,y)
and [y,z)
abut but do not overlap.
This convention has been thoroughly litigated throughout the history of bioinformatics; here's some background reading for those especially interested: 1 2 3 4 5 6 7.
In spite of that, bioinformatics file formats and user interfaces frequently use one-based positions, because these are supposedly more "biologist-friendly." The text [g]VCF format in particular uses one-based, closed intervals, where both the beginning and end positions are inclusive. Let us consider this excerpt of a gVCF file:
21 10009462 . T <NON_REF> . . END=10009463 GT:DP:GQ:MIN_DP:PL 0/0:14:36:13:0,36,540
21 10009464 . TA T,<NON_REF> 0.03 . BaseQRankSum=-2.041;ClippingRankSum=-0.322;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=31.36;MQ0=0;MQRankSum=-0.752;ReadPosRankSum=0.322 GT:AD:DP:GQ:PL:SB 0/1:10,2,0:12:16:16,0,240,46,246,292:1,9,1,1
21 10009466 . A <NON_REF> . . END=10009466 GT:DP:GQ:MIN_DP:PL 0/0:13:0:13:0,0,342
Here we have three abutting ranges with sizes 2, 2, 1. In detail:
The first range begins at and includes one-based position 10009462 (the second column), and ends at and includes one-based position 10009463 (the END
INFO field). This is GLnexus::range(20,10009461,10009463)
if we assume chromosome 21 has rid=20
. But why is the reference allele (fourth column) only one base and not two? This is the gVCF convention for records giving reference coverage with no alternate alleles. Only the first reference base is used, to avoid the need to write long stretches of exactly the reference genome into the file.
In the second record, the absence of the END field means we should consider the size of the range to be the length of the reference allele (a so-called "precise variant"). So this is GLnexus::range(20,10009463,10009465)
.
The third is another gVCF reference record which is only one base long, GLnexus::range(20,10009465,10009466)
.
Aside: the VCF 4.2 specification implies the inclusivity of the [g]VCF END field only indirectly, with the following statement: "For precise variants, END is POS + length of REF allele - 1". Hopefully this will be more explicit in the upcoming 4.3 spec.
Now unlike the text VCF format, the binary BCF equivalent does use zero-based positions! If you use htslib functions to read a VCF record into a bcf1_t
, you'll find the beginning position integer bcf_record->pos
to be one less than the position written in the text VCF file. Conversely, if you write a bcf1_t
out to text VCF, the text position will be one greater than the in-memory integer.
However, htslib does not manipulate the END field if it's present. The integer you'll find in the bcf1_t
, bcf_get_info(bcf_header, bcf_record, "END")->v1.i
, is equal to the END position written in the VCF. This actually turns out to be "convenient" in an embarrassing way, because we can pretend the one-based, inclusive text position is the zero-based, exclusive position we want.
As a result, to construct a GLnexus::range
from a bcf1_t
, we'd copy bcf_record->rid
, set beg
to bcf_record->pos
, and:
- if the END INFO field is present, set
end
to its contents. - otherwise set
end
tobeg+strlen(bcf_record->d.allele[0])
Fortunately htslib implements similar logic already, in bcf_record->rlen
.