Skip to content

AdaptiveGenotyper 5.2.10 Design Document

tamsen edited this page Oct 27, 2020 · 1 revision

Overview

The Adaptive Genotyper is a command line tool used for adaptive germline calling of genotypes on VCFs. It takes in a somatic, uncrushed VCF, using either the option, -MinVQ 0, or -gvcf true, and reads in all the allele depths and total depths, and fits a three-component binomial mixture model to the population of allele frequencies. Each of the three components corresponds to the homozygous reference, heterozygous and homozygous alternate calls. After fitting the model, each locus is categorized into one of the three categories based on allele support and total depth, and genotype posterior probabilities and Q scores are calculated.

Phred-scaled genotype posterior probabilities are reported on the VCF as GP in the genotype field per VCF 4.1 specifications. Q scores are calculated as the Phred-scaled probability that the genotype is incorrect (i.e. 1 - posterior of that genotype). Note that this is a different definition from the usual likelihood based methods, but still fits within the VCF specifications. AdaptiveGT generates models files that can also be used across different data sets. If a model file is provided, no fitting will be performed–instead categorization into the existing model and calculation of GP and Q scores will be done and a new output VCF will be written.

Glossary

Pisces Glossary

Configuration

AdaptiveGT supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

Format: dotnet AdaptiveGenotyper.dll [-options]

Example: dotnet AdaptiveGenotyper.dll -vcf input.genome.gvcf -out /my/outfolder

SDS ID Specification
SDS-1 AdaptiveGT shall accept command line arguments as a whitespace-separated list of name and value pairs. For example:
SDS-2 If an invalid command is given, AdaptiveGT shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3 AdaptiveGT command line shall be capitalization invariant.
SDS ID Specification
SDS-4 AdaptiveGT shall require the command line arguments listed below:
Argument Name Type Default value Description
vcf string none File path for input vcf
SDS ID Specification
SDS-5 AdaptiveGT shall optionally support the command line arguments listed below.
Argument Name Type Default value Description
-log string AdaptiveGTLog.txt Log file name
-o / -out / -outfolder string VCF directory File path for output VCF
-model / -models string none Models file path. Specifies the file that models are read from. Each model should have 4 lines:
Line 1: 3+ doubles that specifies the means of each binomial population for SNVs
Line 2: Prior distribution that corresponds to each population. Must have same number of elements as Line 1
Line 3: Means for INDELs, specified in the same format as Line 1
Line 4: Prior distribution of indels, specified in the same format as Line 2
SDS ID Specification
SDS-6 If an unknown command line argument is encountered, AdaptiveGT shall exit with an error message describing the unknown argument.

Input

AdaptiveGT requires as input one somatic, uncrushed VCF file. The VCF file should be formatted such that each variant allele has its own line in the VCF file. Pisces output has this format by default. In addition, the input needs to either be a GVCF, with all loci reported, or the option -MinVQ 0 needs to be used such that most variants down to 1% are reported.

SDS ID Specification
SDS-7 AdaptiveGT shall require one VCF file as input.

Output

AdaptiveGT outputs one VCF file, with the same convention and structure as the input file except with crushed loci (each locus will be one line).

SDS ID Specification
SDS-8 AdaptiveGT shall produce output files in the same directory as input VCF file unless otherwise specified with the optional argument.
SDS-9 AdaptiveGT shall output a gVCF as described in the VCF 4.1 specifications document.
SDS-10 AdaptiveGT the output file name shall be the input file name with ".recal" appended to the file name.

Design

AdaptiveGT reads in the allele depths and total depths at each locus, and fits a binomial mixture model to the population. A histogram of the population of allele frequencies looks something like this (n = 270k):

As can be seen from the distribution, there are roughly three binomial populations with p = 0.01, 0.5, and 1. The exact means of each population of found via the expectation-maximization (EM) algorithm, optimizing for the binomial mixture model. Then, using each mean and prior distribution (which is the number of loci belonging to each component divided by the total number of loci), we can calculate a posterior probability for each locus belonging to each distribution. The posterior probability is calculated from the following equation:

P(x|theta) is the likelihood of observing that locus and P(theta) is the prior probability (estimated in the way specified above). P(x), or the normalizing factor, is calculated by summing the numerator of all possible categories (3, or 4, as explained below).

The highest posterior probability is taken as that locus's true category, and the posteriors are reported as GP in the resulting VCF. The Q score is calculated by taking 1 - posterior of its actual category (the highest posterior). Both numbers are Phred scaled and the Q score is rounded to the nearest integer per VCF 4.1 specifications.

Separate mixture models are used for indels and SNVs because aligners tend to produce different biases for each. Empirically, with a large enough data set such as Halo, they converge to different means, indicating that they are indeed different distributions.

Ad hoc parameters, or why the theory doesn't always work out

There are a few ad hoc parameters that users should be aware of. The reason that these extra parameters are necessary is because there is over-dispersion of the binomial models. Since the binomial model only accounts for variation due to sampling, it doesn't account for all sources of variation. Variation can be due to technical errors and ploidy variation. Thus, a few parameters were designed to deal with the over-dispersion to get the binomial models.

The first parameter used is a variant frequency filter. Much of the over-dispersion arises in the homozygous reference group, probably due to non-constant error rates, so in order to deal with that, alleles with a variant frequency of under 0.02 are not considered for the fitting of the model. This does have an impact on the categorization of loci, and has been tested on data to ensure that the cluster means are a good representation of truth sets.

The second heuristic that we use is the down-sampling of alleles with the total depth > 1000. This is further explained in the limitations section.

In order to try to account for the over-dispersion in the data, there is an adjustment when making the Q-score estimation. AdaptiveGT estimates the variance in the sample population and down-samples total depth at that locus to be roughly equal in variance with the observed variance. Typical depth values that AdaptiveGT downsamples to are around 20. Note that this downsampling step is only used for Q-score estimationa and not genotype determination.

Results

Categorization of genotypes on the Halo dataset seems achieve similar accuracy as before. Here is a comparison of the Happy results from the original and the calls made using AdaptiveGT.

Original

Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score
INDEL ALL 1875 1568 307 2012 202 237 65 46 0.836267 0.886197 0.117793 0.860508
INDEL PASS 1875 1498 377 1788 97 190 37 17 0.798933 0.939299 0.106264 0.863449
SNP ALL 36659 35529 1130 37531 625 1376 55 88 0.969175 0.982713 0.036663 0.975897
SNP PASS 36659 35485 1174 37138 376 1277 55 39 0.967975 0.989515 0.034385 0.978627

AdaptiveGT

Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score
INDEL ALL 1875 1579 296 2048 210 250 71 56 0.842133 0.883204 0.12207 0.86218
INDEL PASS 1875 1498 377 1818 109 204 44 26 0.798933 0.932466 0.112211 0.86055
SNP ALL 36659 35524 1135 37471 568 1378 55 94 0.969039 0.984263 0.036775 0.976592
SNP PASS 36659 35481 1178 37105 346 1277 55 41 0.967866 0.990343 0.034416 0.978975

Limitations

The current model only accounts for variation from sampling. One of the reasons for that is that other sources of variation may differ between samples. For example, variation may come from library preparation method, sequencing platform, different tissues, etc. Currently, for samples with total read depth > 1000, it is down-sampled to 1000 so that we can have finite probabilities (within machine precision) using the binomial or multinomial distribution. This brings the variance expected to be more in line with the actual variance.

This technique is adaptable and extensible for future work.

General

5.3.0

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally