AdaptiveGenotyper 5.2.10 Design Document

Overview

The Adaptive Genotyper is a command line tool used for adaptive germline calling of genotypes on VCFs. It takes in a somatic, uncrushed VCF, using either the option, -MinVQ 0, or -gvcf true, and reads in all the allele depths and total depths, and fits a three-component binomial mixture model to the population of allele frequencies. Each of the three components corresponds to the homozygous reference, heterozygous and homozygous alternate calls. After fitting the model, each locus is categorized into one of the three categories based on allele support and total depth, and genotype posterior probabilities and Q scores are calculated.

Phred-scaled genotype posterior probabilities are reported on the VCF as GP in the genotype field per VCF 4.1 specifications. Q scores are calculated as the Phred-scaled probability that the genotype is incorrect (i.e. 1 - posterior of that genotype). Note that this is a different definition from the usual likelihood based methods, but still fits within the VCF specifications. AdaptiveGT generates models files that can also be used across different data sets. If a model file is provided, no fitting will be performed–instead categorization into the existing model and calculation of GP and Q scores will be done and a new output VCF will be written.

Glossary

Pisces Glossary

Configuration

AdaptiveGT supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

Format: dotnet AdaptiveGenotyper.dll [-options]

Example: dotnet AdaptiveGenotyper.dll -vcf input.genome.gvcf -out /my/outfolder

SDS ID	Specification
SDS-1	AdaptiveGT shall accept command line arguments as a whitespace-separated list of name and value pairs. For example:
SDS-2	If an invalid command is given, AdaptiveGT shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3	AdaptiveGT command line shall be capitalization invariant.

SDS ID	Specification
SDS-4	AdaptiveGT shall require the command line arguments listed below:

Argument Name	Type	Default value	Description
vcf	string	none	File path for input vcf

SDS ID	Specification
SDS-5	AdaptiveGT shall optionally support the command line arguments listed below.

Argument Name	Type	Default value	Description
-log	string	AdaptiveGTLog.txt	Log file name
-o / -out / -outfolder	string	VCF directory	File path for output VCF
-model / -models	string	none	Models file path. Specifies the file that models are read from. Each model should have 4 lines: Line 1: 3+ doubles that specifies the means of each binomial population for SNVs Line 2: Prior distribution that corresponds to each population. Must have same number of elements as Line 1 Line 3: Means for INDELs, specified in the same format as Line 1 Line 4: Prior distribution of indels, specified in the same format as Line 2

SDS ID	Specification
SDS-6	If an unknown command line argument is encountered, AdaptiveGT shall exit with an error message describing the unknown argument.

Input

AdaptiveGT requires as input one somatic, uncrushed VCF file. The VCF file should be formatted such that each variant allele has its own line in the VCF file. Pisces output has this format by default. In addition, the input needs to either be a GVCF, with all loci reported, or the option -MinVQ 0 needs to be used such that most variants down to 1% are reported.

SDS ID	Specification
SDS-7	AdaptiveGT shall require one VCF file as input.

Output

AdaptiveGT outputs one VCF file, with the same convention and structure as the input file except with crushed loci (each locus will be one line).

SDS ID	Specification
SDS-8	AdaptiveGT shall produce output files in the same directory as input VCF file unless otherwise specified with the optional argument.
SDS-9	AdaptiveGT shall output a gVCF as described in the VCF 4.1 specifications document.
SDS-10	AdaptiveGT the output file name shall be the input file name with ".recal" appended to the file name.

Design

AdaptiveGT reads in the allele depths and total depths at each locus, and fits a binomial mixture model to the population. A histogram of the population of allele frequencies looks something like this (n = 270k):

As can be seen from the distribution, there are roughly three binomial populations with p = 0.01, 0.5, and 1. The exact means of each population of found via the expectation-maximization (EM) algorithm, optimizing for the binomial mixture model. Then, using each mean and prior distribution (which is the number of loci belonging to each component divided by the total number of loci), we can calculate a posterior probability for each locus belonging to each distribution. The posterior probability is calculated from the following equation:

P(x|theta) is the likelihood of observing that locus and P(theta) is the prior probability (estimated in the way specified above). P(x), or the normalizing factor, is calculated by summing the numerator of all possible categories (3, or 4, as explained below).

The highest posterior probability is taken as that locus's true category, and the posteriors are reported as GP in the resulting VCF. The Q score is calculated by taking 1 - posterior of its actual category (the highest posterior). Both numbers are Phred scaled and the Q score is rounded to the nearest integer per VCF 4.1 specifications.

Separate mixture models are used for indels and SNVs because aligners tend to produce different biases for each. Empirically, with a large enough data set such as Halo, they converge to different means, indicating that they are indeed different distributions.

Ad hoc parameters, or why the theory doesn't always work out

There are a few ad hoc parameters that users should be aware of. The reason that these extra parameters are necessary is because there is over-dispersion of the binomial models. Since the binomial model only accounts for variation due to sampling, it doesn't account for all sources of variation. Variation can be due to technical errors and ploidy variation. Thus, a few parameters were designed to deal with the over-dispersion to get the binomial models.

The first parameter used is a variant frequency filter. Much of the over-dispersion arises in the homozygous reference group, probably due to non-constant error rates, so in order to deal with that, alleles with a variant frequency of under 0.02 are not considered for the fitting of the model. This does have an impact on the categorization of loci, and has been tested on data to ensure that the cluster means are a good representation of truth sets.

The second heuristic that we use is the down-sampling of alleles with the total depth > 1000. This is further explained in the limitations section.

In order to try to account for the over-dispersion in the data, there is an adjustment when making the Q-score estimation. AdaptiveGT estimates the variance in the sample population and down-samples total depth at that locus to be roughly equal in variance with the observed variance. Typical depth values that AdaptiveGT downsamples to are around 20. Note that this downsampling step is only used for Q-score estimationa and not genotype determination.

Results

Categorization of genotypes on the Halo dataset seems achieve similar accuracy as before. Here is a comparison of the Happy results from the original and the calls made using AdaptiveGT.

Original

Type	Filter	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score
INDEL	ALL	1875	1568	307	2012	202	237	65	46	0.836267	0.886197	0.117793	0.860508
INDEL	PASS	1875	1498	377	1788	97	190	37	17	0.798933	0.939299	0.106264	0.863449
SNP	ALL	36659	35529	1130	37531	625	1376	55	88	0.969175	0.982713	0.036663	0.975897
SNP	PASS	36659	35485	1174	37138	376	1277	55	39	0.967975	0.989515	0.034385	0.978627

AdaptiveGT

Type	Filter	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score
INDEL	ALL	1875	1579	296	2048	210	250	71	56	0.842133	0.883204	0.12207	0.86218
INDEL	PASS	1875	1498	377	1818	109	204	44	26	0.798933	0.932466	0.112211	0.86055
SNP	ALL	36659	35524	1135	37471	568	1378	55	94	0.969039	0.984263	0.036775	0.976592
SNP	PASS	36659	35481	1178	37105	346	1277	55	41	0.967866	0.990343	0.034416	0.978975

Limitations

The current model only accounts for variation from sampling. One of the reasons for that is that other sources of variation may differ between samples. For example, variation may come from library preparation method, sequencing platform, different tissues, etc. Currently, for samples with total read depth > 1000, it is down-sampled to 1000 so that we can have finite probabilities (within machine precision) using the binomial or multinomial distribution. This brings the variance expected to be more in line with the actual variance.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AdaptiveGenotyper 5.2.10 Design Document

Overview

Glossary

Configuration

Input

Output

Design

Ad hoc parameters, or why the theory doesn't always work out

Results

Limitations

General

5.3.0

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally