-
Notifications
You must be signed in to change notification settings - Fork 0
AdaptiveGenotyper 5.2.10 Design Document
The Adaptive Genotyper is a command line tool used for adaptive germline calling of genotypes on VCFs. It takes in a somatic, uncrushed VCF, using either the option, -MinVQ 0, or -gvcf true, and reads in all the allele depths and total depths, and fits a three-component binomial mixture model to the population of allele frequencies. Each of the three components corresponds to the homozygous reference, heterozygous and homozygous alternate calls. After fitting the model, each locus is categorized into one of the three categories based on allele support and total depth, and genotype posterior probabilities and Q scores are calculated.
Phred-scaled genotype posterior probabilities are reported on the VCF as GP in the genotype field per VCF 4.1 specifications. Q scores are calculated as the Phred-scaled probability that the genotype is incorrect (i.e. 1 - posterior of that genotype). Note that this is a different definition from the usual likelihood based methods, but still fits within the VCF specifications. AdaptiveGT generates models files that can also be used across different data sets. If a model file is provided, no fitting will be performed–instead categorization into the existing model and calculation of GP and Q scores will be done and a new output VCF will be written.
AdaptiveGT supports configuration of parameters so that its behavior can be fine tuned depending on the application context.
Format: dotnet AdaptiveGenotyper.dll [-options]
Example: dotnet AdaptiveGenotyper.dll -vcf input.genome.gvcf -out /my/outfolder
SDS ID | Specification |
---|---|
SDS-1 | AdaptiveGT shall accept command line arguments as a whitespace-separated list of name and value pairs. For example: |
SDS-2 | If an invalid command is given, AdaptiveGT shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands. |
SDS-3 | AdaptiveGT command line shall be capitalization invariant. |
SDS ID | Specification |
---|---|
SDS-4 | AdaptiveGT shall require the command line arguments listed below: |
Argument Name | Type | Default value | Description |
---|---|---|---|
vcf | string | none | File path for input vcf |
SDS ID | Specification |
---|---|
SDS-5 | AdaptiveGT shall optionally support the command line arguments listed below. |
Argument Name | Type | Default value | Description |
---|---|---|---|
-log | string | AdaptiveGTLog.txt | Log file name |
-o / -out / -outfolder | string | VCF directory | File path for output VCF |
-model / -models | string | none | Models file path. Specifies the file that models are read from. Each model should have 4 lines: Line 1: 3+ doubles that specifies the means of each binomial population for SNVs Line 2: Prior distribution that corresponds to each population. Must have same number of elements as Line 1 Line 3: Means for INDELs, specified in the same format as Line 1 Line 4: Prior distribution of indels, specified in the same format as Line 2 |
SDS ID | Specification |
---|---|
SDS-6 | If an unknown command line argument is encountered, AdaptiveGT shall exit with an error message describing the unknown argument. |
AdaptiveGT requires as input one somatic, uncrushed VCF file. The VCF file should be formatted such that each variant allele has its own line in the VCF file. Pisces output has this format by default. In addition, the input needs to either be a GVCF, with all loci reported, or the option -MinVQ 0 needs to be used such that most variants down to 1% are reported.
SDS ID | Specification |
---|---|
SDS-7 | AdaptiveGT shall require one VCF file as input. |
AdaptiveGT outputs one VCF file, with the same convention and structure as the input file except with crushed loci (each locus will be one line).
SDS ID | Specification |
---|---|
SDS-8 | AdaptiveGT shall produce output files in the same directory as input VCF file unless otherwise specified with the optional argument. |
SDS-9 | AdaptiveGT shall output a gVCF as described in the VCF 4.1 specifications document. |
SDS-10 | AdaptiveGT the output file name shall be the input file name with ".recal" appended to the file name. |
AdaptiveGT reads in the allele depths and total depths at each locus, and fits a binomial mixture model to the population. A histogram of the population of allele frequencies looks something like this (n = 270k):
As can be seen from the distribution, there are roughly three binomial populations with p = 0.01, 0.5, and 1. The exact means of each population of found via the expectation-maximization (EM) algorithm, optimizing for the binomial mixture model. Then, using each mean and prior distribution (which is the number of loci belonging to each component divided by the total number of loci), we can calculate a posterior probability for each locus belonging to each distribution. The posterior probability is calculated from the following equation:
P(x|theta) is the likelihood of observing that locus and P(theta) is the prior probability (estimated in the way specified above). P(x), or the normalizing factor, is calculated by summing the numerator of all possible categories (3, or 4, as explained below).
The highest posterior probability is taken as that locus's true category, and the posteriors are reported as GP in the resulting VCF. The Q score is calculated by taking 1 - posterior of its actual category (the highest posterior). Both numbers are Phred scaled and the Q score is rounded to the nearest integer per VCF 4.1 specifications.
Separate mixture models are used for indels and SNVs because aligners tend to produce different biases for each. Empirically, with a large enough data set such as Halo, they converge to different means, indicating that they are indeed different distributions.
There are a few ad hoc parameters that users should be aware of. The reason that these extra parameters are necessary is because there is over-dispersion of the binomial models. Since the binomial model only accounts for variation due to sampling, it doesn't account for all sources of variation. Variation can be due to technical errors and ploidy variation. Thus, a few parameters were designed to deal with the over-dispersion to get the binomial models.
The first parameter used is a variant frequency filter. Much of the over-dispersion arises in the homozygous reference group, probably due to non-constant error rates, so in order to deal with that, alleles with a variant frequency of under 0.02 are not considered for the fitting of the model. This does have an impact on the categorization of loci, and has been tested on data to ensure that the cluster means are a good representation of truth sets.
The second heuristic that we use is the down-sampling of alleles with the total depth > 1000. This is further explained in the limitations section.
In order to try to account for the over-dispersion in the data, there is an adjustment when making the Q-score estimation. AdaptiveGT estimates the variance in the sample population and down-samples total depth at that locus to be roughly equal in variance with the observed variance. Typical depth values that AdaptiveGT downsamples to are around 20. Note that this downsampling step is only used for Q-score estimationa and not genotype determination.
Categorization of genotypes on the Halo dataset seems achieve similar accuracy as before. Here is a comparison of the Happy results from the original and the calls made using AdaptiveGT.
Original
Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INDEL | ALL | 1875 | 1568 | 307 | 2012 | 202 | 237 | 65 | 46 | 0.836267 | 0.886197 | 0.117793 | 0.860508 |
INDEL | PASS | 1875 | 1498 | 377 | 1788 | 97 | 190 | 37 | 17 | 0.798933 | 0.939299 | 0.106264 | 0.863449 |
SNP | ALL | 36659 | 35529 | 1130 | 37531 | 625 | 1376 | 55 | 88 | 0.969175 | 0.982713 | 0.036663 | 0.975897 |
SNP | PASS | 36659 | 35485 | 1174 | 37138 | 376 | 1277 | 55 | 39 | 0.967975 | 0.989515 | 0.034385 | 0.978627 |
AdaptiveGT
Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INDEL | ALL | 1875 | 1579 | 296 | 2048 | 210 | 250 | 71 | 56 | 0.842133 | 0.883204 | 0.12207 | 0.86218 |
INDEL | PASS | 1875 | 1498 | 377 | 1818 | 109 | 204 | 44 | 26 | 0.798933 | 0.932466 | 0.112211 | 0.86055 |
SNP | ALL | 36659 | 35524 | 1135 | 37471 | 568 | 1378 | 55 | 94 | 0.969039 | 0.984263 | 0.036775 | 0.976592 |
SNP | PASS | 36659 | 35481 | 1178 | 37105 | 346 | 1277 | 55 | 41 | 0.967866 | 0.990343 | 0.034416 | 0.978975 |
The current model only accounts for variation from sampling. One of the reasons for that is that other sources of variation may differ between samples. For example, variation may come from library preparation method, sequencing platform, different tissues, etc. Currently, for samples with total read depth > 1000, it is down-sampled to 1000 so that we can have finite probabilities (within machine precision) using the binomial or multinomial distribution. This brings the variance expected to be more in line with the actual variance.
This technique is adaptable and extensible for future work.
- Pisces 5.3.0 Quick Start
- Pisces 5.3.0 Design Document
- Pisces 5.3.0 Supported Options
- Scylla 5.3.0 Design Document
- Stitcher 5.3.0 Design Document
- VQR 5.3.0 Design Document
- VennVcf 5.3.0 Design Document
- Gemini 5.3.0 Design Document
- AdaptiveGenotyper 5.3.0 Design Document
- Pisces Tools 5.3.0
- Suggested Pipeline Configuration 5.3.0
- Pisces 5.2.10 Design Document
- Pisces 5.2.10 Supported Options
- Scylla 5.2.10 Design Document
- Stitcher 5.2.10 Design Document
- VQR 5.2.10 Design Document
- VennVcf 5.2.10 Design Document
- Gemini 5.2.10 Design Document
- AdaptiveGenotyper 5.2.10 Design Document
- Pisces Tools 5.2.10
- Suggested Pipeline Configuration 5.2.10
- Pisces 5.2.9 Quick Start
- Pisces 5.2.9 Design Document
- Pisces 5.2.9 Supported Options
- Scylla 5.2.9 Design Document
- Stitcher 5.2.9 Design Document
- VQR 5.2.9 Design Document
- VennVcf 5.2.9 Design Document
- Pisces Tools 5.2.9
- Suggested Pipeline Configuration 5.2.9
- Pisces 5.2.7 Quick Start
- Pisces 5.2.7 Design Document
- Pisces 5.2.7 Supported Options
- Scylla 5.2.7 Design Document
- Stitcher 5.2.7 Design Document
- VQR 5.2.7 Design Document
- VennVcf 5.2.7 Design Document
- Pisces Tools 5.2.7
- Suggested Pipeline Configuration 5.2.7
- Pisces 5.2.5 Design Document
- Pisces 5.2.5 Supported Options
- Scylla 5.2.5 Design Document
- Stitcher 5.2.5 Design Document
- VQR 5.2.5 Design Document
- Suggested Pipeline Configuration 5.2.5
- Pisces 5.2.0 Design Document
- Pisces 5.2.0 Supported Options
- Scylla 5.2.0 Design Document
- Stitcher 5.2.0 Design Document
- VQR 5.2.0 Design Document
- Suggested Pipeline Configuration 5.2.0
- Pisces Suite 5.2.0 Known Issues and Limitations