Skip to content

Commit

Permalink
Give control over creation of vectors with mixed known and missing va…
Browse files Browse the repository at this point in the history
…lues

When input files have different alternate alleles, vector fields pertaining to
unobserved alleles are set to missing by default. This creates vectors with
mixed known and unknown values and some programs refuse to work with such.

This commit changes the default behavior of --gvcf merging:

- whenever the uknown allele is present (<*> or <NON_REF>) its values are
  used instead of the missing value
- when it is not present, the default rule `-M PL:max,AD:0` is used

A new `-M, --missing-rules` option is added which allows to override the
default `-M PL:max,AD:0` rule implied by `--gvcf`. Note that the use of
the unknown allele (<*> or <NON_REF>) given explicitly cannot be overriden,
which I believe is the correct behavior.

Resolves #1888.
  • Loading branch information
pd3 committed Mar 22, 2023
1 parent 4fc33e0 commit f2d2fdf
Show file tree
Hide file tree
Showing 9 changed files with 261 additions and 17 deletions.
8 changes: 8 additions & 0 deletions NEWS
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,14 @@ Changes affecting specific commands:

- Fix a bug where the `-m` function did not respect the `--min-overlap` option (#1869)

* bcftools merge

- New `-M, --missing-rules` option to control the behavior of merging of vector tags
to prevent mixtures of known and missing values in tags when desired

- Use values pertaining to the unknown allele (<*> or <NON_REF>) when available
to prevent mixtures of known and missing values (#1888)

* bcftools +split-vep

- Prevent a segfault when `-i/-e` use a VEP subfield not included in `-f` or `-c` (#1877)
Expand Down
12 changes: 11 additions & 1 deletion doc/bcftools.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1919,7 +1919,8 @@ For "vertical" merge take a look at *<<concat,bcftools concat>>* or *<<norm,bcft
file 'FILE' is not given and the dash ('-') is given, unknown reference
bases generated at gVCF block splits will be substituted with N's.
The *--gvcf* option uses the following default INFO rules:
*-i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max*.
*-i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max* and the following missing
rules: *-M PL:max,AD:0*.

*-i, --info-rules* '-'|'TAG:METHOD'[,...]::
Rules for merging INFO fields (scalars or vectors) or '-' to disable the
Expand Down Expand Up @@ -1955,6 +1956,15 @@ For "vertical" merge take a look at *<<concat,bcftools concat>>* or *<<norm,bcft
-m id .. merge by ID
----

*-M, --missing-rules* '-'|'TAG:METHOD'[,...]::
Rules for merging vector tags at multiallelic sites. When input files have different alternate
alleles, vector fields pertaining to unobserved alleles are set to missing ('.') by default.
The 'METHOD' is one of '.' (the default, use missing values), 'NUMBER' (use a constant value, e.g. 0),
'max' (the maximum value observed for other alleles in the sample). When *--gvcf* option is set,
the rule *-M PL:max,AD:0* is implied. This can be overriden with providing *-M -* or *-M PL:.,AD:.*.
Note that if the unobserved allele is explicitly present as '<*>' or '<NON_REF>', then its corresponding
value will be used regardless of *-M* settings.

*--no-index*::
the option allows to merge files without indexing them first. In order for this
option to work, the user must ensure that the input files have chromosomes in
Expand Down
10 changes: 5 additions & 5 deletions test/merge.gvcf.2.out
Original file line number Diff line number Diff line change
Expand Up @@ -18,16 +18,16 @@
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AAA BBB CCC
2 21444416 . G <*> . . END=21444427;MinDP=5;QS=1,0 PL:DP 0,15,125:5 .:. .:.
2 21444428 . C <*> . . END=21444429;MinDP=2;QS=2,0 PL:DP 0,15,125:5 0,6,51:2 .:.
2 21444430 . TCAA T,TAA,<*> 0 . MinDP=2;QS=1.60366,0.304878,0.0914634,0 PL:DP:DV 37,0,79,35,73,113,.,.,.,.:5:2 0,.,.,.,.,.,6,.,.,51:2:. .:.:.
2 21444430 . TCAA T,TAA,<*> 0 . MinDP=2;QS=1.60366,0.304878,0.0914634,0 PL:DP:DV 37,0,79,35,73,113,113,113,113,113:5:2 0,51,51,51,51,51,6,51,51,51:2:. .:.:.
2 21444431 . C <*> . . MinDP=4;QS=1,0 PL:DP 0,12,110:4 .:. .:.
2 21444431 . CA C,CAAACAAAAAA 0 . QS=0.75,0.25,1 PL:DP:DV 0,4,10,.,.,.:4:1 28,.,.,3,.,0:1:1 .:.:.
2 21444431 . CA C,CAAACAAAAAA 0 . QS=0.75,0.25,1 PL:DP:DV 0,4,10,10,10,10:4:1 28,28,28,3,28,0:1:1 .:.:.
2 21444433 . C <*> 0 . END=21444444;QS=0.75,0.25 PL:DP:DV 0,4,10:4:1 .:.:. .:.:.
3 1 . C <*> 0 . END=4;MinDP=33;QS=0.75,0.25 PL:DP:DV 0,4,10:4:1 .:.:. .:.:.
3 5 . C <*>,T 0 . MinDP=33;QS=1.5,0.25,0.25 PL:DP:DV 0,4,10,.,.,.:4:1 0,.,.,4,.,10:4:1 .:.:.
3 5 . C <*>,T 0 . MinDP=33;QS=1.5,0.25,0.25 PL:DP:DV 0,4,10,10,10,10:4:1 0,10,10,4,10,10:4:1 .:.:.
3 6 . N <*> 0 . END=10;MinDP=33;QS=0.75,0.25 PL:DP:DV 0,4,10:4:1 .:.:. .:.:.
1 1619670 . C <*> 0 . END=1619782;MinDP=33;QS=0.75,0.25 PL:DP:DV 0,4,10:4:1 .:.:. .:.:.
1 1619783 . C <*> 0 . END=1619787;MinDP=33;QS=0.75,1.25 PL:DP:DV 0,4,10:4:1 28,3,0:1:1 .:.:.
1 1619788 . G <*>,GAAAAAAA 0 . MinDP=33;QS=0.75,0.25,1 PL:DP:DV 0,4,10,.,.,.:4:1 28,.,.,3,.,0:1:1 .:.:.
1 1619788 . G <*>,GAAAAAAA 0 . MinDP=33;QS=0.75,0.25,1 PL:DP:DV 0,4,10,10,10,10:4:1 28,28,28,3,28,0:1:1 .:.:.
1 1619789 . N <*> 0 . END=1619877;MinDP=33;QS=0.75,0.25 PL:DP:DV 0,4,10:4:1 .:.:. .:.:.
4 20000975 . C <*> 0 . END=20001021;MinDP=33;QS=0.75,0.25 PL:DP:DV 0,4,10:4:1 .:.:. .:.:.
4 20001022 . C <*> 0 . END=20001070;MinDP=33;QS=1.5,0.5 PL:DP:DV 0,4,10:4:1 0,4,10:4:1 .:.:.
Expand All @@ -48,5 +48,5 @@
7 703 . T <*> . . . PL 77,1,2 . .
7 704 . N <*> . . END=777 PL 77,1,2 . .
8 1 . T <*> . . END=2 PL 88,1,1 . .
8 3 . T <*>,A . . . PL 88,1,1,.,.,. 88,.,.,2,.,1 88,.,.,3,.,1
8 3 . T <*>,A . . . PL 88,1,1,1,1,1 88,88,88,2,88,1 88,88,88,3,88,1
8 4 . N <*> . . END=10 PL 88,1,1 . .
19 changes: 19 additions & 0 deletions test/merge.mrules.1.1.out
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##reference=file://hs38DH.fa
##contig=<ID=chr1>
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Minimum per-sample depth in this gVCF block">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SampleA SampleB
chr1 1769963 . A <NON_REF> . . END=1769967 GT:PL 0/0:0,3,45 ./.:.
chr1 1769968 . T <NON_REF> . . . GT:PL 0/0:0,3,45 0/0:0,18,270
chr1 1769969 . CAAAACAAAAACA CAAAACA,<NON_REF>,C . . . GT:AD:PL 1/1:0,9,0,0:405,27,0,405,27,405,405,405,405,405 3/3:0,0,0,4:181,181,181,181,181,181,12,181,12,0
chr1 1769976 . A <NON_REF> . . . GT:PL 0/0:0,0,0 ./.:.
chr1 1769982 . A <NON_REF> . . . GT:PL ./.:. 0/0:0,0,0
chr1 1769983 . C T,A . . . GT:AD:PL 1/1:0,9,0:405,27,0,405,405,405 2/2:0,0,4:181,181,181,12,181,0
chr1 1769990 . CAAAACAAAAACA CAAAACA,<NON_REF>,C . . . GT:AD:PL 1:0,9,0,0:405,27,0,0 3:0,0,0,4:181,0,0,12
chr1 1769991 . C T,A . . . GT:AD:PL 1:0,9,0:405,0,405 2:0,0,4:181,181,0
19 changes: 19 additions & 0 deletions test/merge.mrules.1.2.out
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##reference=file://hs38DH.fa
##contig=<ID=chr1>
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Minimum per-sample depth in this gVCF block">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SampleA SampleB
chr1 1769963 . A <NON_REF> . . END=1769967 GT:PL 0/0:0,3,45 ./.:.
chr1 1769968 . T <NON_REF> . . . GT:PL 0/0:0,3,45 0/0:0,18,270
chr1 1769969 . CAAAACAAAAACA CAAAACA,<NON_REF>,C . . . GT:AD:PL 1/1:0,9,0,0:405,27,0,405,27,405,405,405,405,405 3/3:0,0,0,4:181,181,181,181,181,181,12,181,12,0
chr1 1769976 . A <NON_REF> . . . GT:PL 0/0:0,0,0 ./.:.
chr1 1769982 . A <NON_REF> . . . GT:PL ./.:. 0/0:0,0,0
chr1 1769983 . C T,A . . . GT:AD:PL 1/1:0,9,.:405,27,0,.,.,. 2/2:0,.,4:181,.,.,12,.,0
chr1 1769990 . CAAAACAAAAACA CAAAACA,<NON_REF>,C . . . GT:AD:PL 1:0,9,0,0:405,27,0,0 3:0,0,0,4:181,0,0,12
chr1 1769991 . C T,A . . . GT:AD:PL 1:0,9,.:405,0,. 2:0,.,4:181,.,0
16 changes: 16 additions & 0 deletions test/merge.mrules.1.a.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
##fileformat=VCFv4.2
##reference=file://hs38DH.fa
##contig=<ID=chr1>
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Minimum per-sample depth in this gVCF block">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SampleA
chr1 1769963 . A <NON_REF> . . END=1769968 GT:PL 0/0:0,3,45
chr1 1769969 . CAAAACA C,<NON_REF> . . . GT:AD:PL 1/1:0,9,0:405,27,0,405,27,405
chr1 1769976 . A <NON_REF> . . END=1769976 GT:PL 0/0:0,0,0
chr1 1769983 . C T . . . GT:AD:PL 1/1:0,9:405,27,0
chr1 1769990 . CAAAACA C,<NON_REF> . . . GT:AD:PL 1:0,9,0:405,27,0
chr1 1769991 . C T . . . GT:AD:PL 1:0,9:405,0
16 changes: 16 additions & 0 deletions test/merge.mrules.1.b.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
##fileformat=VCFv4.2
##reference=file://hs38DH.fa
##contig=<ID=chr1>
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Minimum per-sample depth in this gVCF block">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SampleB
chr1 1769968 . T <NON_REF> . . END=1769968 GT:PL 0/0:0,18,270
chr1 1769969 . CAAAACAAAAACA C,<NON_REF> . . . GT:AD:PL 1/1:0,4,0:181,12,0,181,12,181
chr1 1769982 . A <NON_REF> . . END=1769982 GT:PL 0/0:0,0,0
chr1 1769983 . C A . . . GT:AD:PL 1/1:0,4:181,12,0
chr1 1769990 . CAAAACAAAAACA C,<NON_REF> . . . GT:AD:PL 1:0,4,0:181,12,0
chr1 1769991 . C A . . . GT:AD:PL 1:0,4:181,0
2 changes: 2 additions & 0 deletions test/test.pl
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,8 @@
run_test(\&test_vcf_merge,$opts,in=>['merge.10.a','merge.10.b'],out=>'merge.10.1.out',args=>'-m none');
run_test(\&test_vcf_merge,$opts,in=>['merge.10.a','merge.10.b'],out=>'merge.10.2.out',args=>'-m both');
run_test(\&test_vcf_merge,$opts,in=>['merge.10.a','merge.10.b'],out=>'merge.10.3.out',args=>'-m snp-ins-del');
run_test(\&test_vcf_merge,$opts,in=>['merge.mrules.1.a','merge.mrules.1.b'],out=>'merge.mrules.1.1.out',args=>'--gvcf -');
run_test(\&test_vcf_merge,$opts,in=>['merge.mrules.1.a','merge.mrules.1.b'],out=>'merge.mrules.1.2.out',args=>'--gvcf - -M AD:.,PL:.');
# run_test(\&test_vcf_merge_big,$opts,in=>'merge_big.1',out=>'merge_big.1.1',nsmpl=>79000,nfiles=>79,nalts=>486,args=>''); # commented out for speed
run_test(\&test_vcf_query,$opts,in=>'query.string',out=>'query.string.1.out',args=>q[-f '%CHROM\\t%POS\\t%CLNREVSTAT\\n' -i'CLNREVSTAT="criteria_provided,_conflicting_interpretations"']);
run_test(\&test_vcf_query,$opts,in=>'query.string',out=>'query.string.1.out',args=>q[-f '%CHROM\\t%POS\\t%CLNREVSTAT\\n' -i'CLNREVSTAT="criteria_provided" || CLNREVSTAT="_conflicting_interpretations"']);
Expand Down
Loading

0 comments on commit f2d2fdf

Please sign in to comment.