Validation Methods

Mitch Bekritsky edited this page Oct 17, 2018 · 7 revisions

Summary

Calls provided by Polaris undergo population genetic and Mendelian validation before being released. This page contains a summary of those methods.

Population validation methods

Population validation methods leverage the various cohorts sequenced as part of Polaris to validate variant calls. These methods assess overall genotyping accuracy, but might still be insensitive to occasional miscalls in a handful of individuals.

Population validation methods can assess the presence and accuracy of common variants only; they cannot do the same for rare or private variants.

Hardy Weinberg equilibrium

Common variants are assessed for Hardy-Weinberg equilibrium (HWE) using Fisher's exact test [2] or a chi-squared test. We assume that common variants are not subject to the typical factors expected to perturb HWE, including:

  • Selection: we assume common variants are not under selective pressure.
  • Genetic drift: we similarly assume that sufficiently common mutations are unaffected by genetic drift on short timescales.
  • Non-random mating: we assume that, with respect to the variants being evaluated, assortative mating is not a factor.
  • Mutation: we assume that variants are generally not due to recurrent mutation. This can be independently assessed using linkage disequilibrium.

When HWE is violated, our initial assumption is always that it is due to genotyping error rather than any other factor. In some cases, HWE may be violated due to recurrent mutation. In these situations, we would rely on other methods for variant call validation.
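As an illustration of the chi-squared variant of the test described above, here is a minimal sketch for a single biallelic locus. It is not the Polaris implementation: the function name and the use of raw genotype counts as input are assumptions, and the p-value uses the closed form for a chi-squared distribution with one degree of freedom. For loci with low minor allele counts, an exact test would be the better choice.

```python
import math


def hwe_chi_squared(n_ref, n_het, n_alt):
    """Return (chi2, p_value) for one biallelic locus.

    n_ref, n_het, n_alt are counts of hom-ref, het, and hom-alt samples.
    Hypothetical helper for illustration only.
    """
    n = n_ref + n_het + n_alt
    if n == 0:
        raise ValueError("no genotyped samples")

    p = (2 * n_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                        # alternate allele frequency
    if p == 0.0 or q == 0.0:
        # Monomorphic locus: no deviation from HWE can be measured.
        return 0.0, 1.0

    # Genotype counts expected under HWE: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref, n_het, n_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A locus with genotype counts (25, 50, 25) matches the HWE expectation exactly, while a locus where all 100 samples are heterozygous deviates strongly.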

Populations for HWE assessment

When evaluating HWE for our common variants, we are somewhat limited by the sample size and diversity of the Polaris cohorts. Although many events are in HWE within the Polaris cohort, we can also assess HWE on a larger, less diverse European cohort of several thousand samples. While we are able to report HWE p-values from this population, we are unable to release any of its variant calls or sequencing data.

Hardy Weinberg performance graphics

When assessing HWE, we typically use two types of plots: p-value histograms and ternary plots. While a p-value histogram is a useful means of communicating the uniformity of our distribution, the ternary plot gives richer information about why loci might be out of HWE. Here is a sample ternary plot:

[Figure: HWE ternary plot]

In this example, we show HWE p-values for approximately 10,000 SV deletions assessed in our Polaris Diversity cohort (specifically, from our PG-pop candidate set). Typically, loci in HWE (in blue) fall along an arc through the center of the plot. Loci out of HWE (in red) can be out of equilibrium for different reasons.

For instance, near the very top of the plot are loci where nearly all samples are heterozygous. This is very unlikely: it would imply that everyone had one parent homozygous for the reference allele and one homozygous for the alternate allele!

On the lower part of the plot (below the blue arc) are loci with fewer heterozygotes than expected. At the far lower right are loci that are primarily homref. These might be loci where our Polaris cohorts are underpowered to assess HWE due to low minor allele frequencies, or loci with population structure that would require a less diverse population for proper HWE assessment.
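To make the geometry of the ternary plot concrete, here is a small sketch that maps genotype counts to plot coordinates. The orientation is an assumption chosen to match the description above (heterozygotes at the top vertex, homref at the lower right, homalt at the lower left), and both function names are hypothetical. With this layout, the horizontal coordinate equals the reference allele frequency, and the HWE arc is a parabola.

```python
import math


def ternary_xy(n_ref, n_het, n_alt):
    """Map genotype counts to (x, y) in a unit-sided ternary plot.

    Assumed vertices: homalt at (0, 0), homref at (1, 0),
    het at (0.5, sqrt(3) / 2).
    """
    n = n_ref + n_het + n_alt
    if n == 0:
        raise ValueError("no genotyped samples")
    f_ref = n_ref / n
    f_het = n_het / n
    x = f_ref + f_het / 2.0          # equals the reference allele frequency
    y = f_het * math.sqrt(3) / 2.0   # height grows with heterozygosity
    return x, y


def hwe_arc_y(x):
    """Height of the HWE arc at reference allele frequency x.

    Under HWE the het frequency is 2 * x * (1 - x), so the arc is
    y = sqrt(3) * x * (1 - x).
    """
    return math.sqrt(3) * x * (1.0 - x)
```

A locus sits above the arc when it has more heterozygotes than expected for its allele frequency, and below the arc when it has fewer.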

Mendelian validation methods

Mendelian validation methods use the pedigrees and, eventually, the trios sequenced as part of the Platinum Genomes and Polaris projects to validate variant calls. These methods are capable of validating variants independent of minor allele frequency, which makes them a powerful complement to population genetic validation methods.

Mendelian validation methods provide limited resources for variants that occur infrequently within any given individual, such as large CNVs and SVs. For these types of variants, identifying common examples that can be validated using population genetic methods is an especially powerful way of providing a broader dataset for benchmarking and annotation.

Mendelian validation methods are also very sensitive to genotyping error: a miscall in a single individual in a pedigree could lead to a candidate being invalidated.

Pedigree consistency

We currently use the Platinum Genomes to assess pedigree consistency, in a manner that's generally consistent with the methods described in the Platinum Genomes manuscript for evaluating small variants [1]. Briefly, in a pedigree where we have determined which two parental haplotypes have been transmitted to each child, we can assess whether there is a unique assignment of observed alleles to parental haplotypes that is consistent with the genotypes observed in the pedigree at a locus. Invalid variants either have no assignments that produce a pedigree-consistent outcome or have multiple valid assignments. Larger pedigrees constrain the possible assignments of alleles to haplotypes and are therefore more likely to produce unique and valid configurations.

Here are examples of pedigree-consistent and pedigree-inconsistent transmission on a toy pedigree of two parents and two children.

[Figure: pedigree consistency examples]

We would consider a variant out of HWE to be valid if it is pedigree consistent, although efforts should be undertaken to understand the conflicting results.
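The unique-assignment check described in this section can be sketched by brute force for a toy pedigree of two parents and two children. This is a simplified reading of the method, not the Platinum Genomes implementation: the haplotype labels (`F0`, `F1`, `M0`, `M1`), the function names, and the encoding of genotypes as alternate-allele counts are all assumptions, and the sketch does not separately flag loci that cannot be assessed.

```python
from itertools import product

# Hypothetical labels for the father's and mother's two haplotypes.
HAPLOTYPES = ("F0", "F1", "M0", "M1")


def consistent_assignments(observed, transmitted):
    """List allele-to-haplotype assignments that reproduce the genotypes.

    observed:    sample name -> alt-allele count (0, 1, or 2); must include
                 "father", "mother", and every child in `transmitted`.
    transmitted: child name -> (paternal haplotype, maternal haplotype)
                 inherited by that child.
    """
    matches = []
    # Try every way of placing a ref (0) or alt (1) allele on each haplotype.
    for alleles in product((0, 1), repeat=len(HAPLOTYPES)):
        hap = dict(zip(HAPLOTYPES, alleles))
        implied = {
            "father": hap["F0"] + hap["F1"],
            "mother": hap["M0"] + hap["M1"],
        }
        for child, (pat, mat) in transmitted.items():
            implied[child] = hap[pat] + hap[mat]
        if all(implied[s] == observed[s] for s in implied):
            matches.append(hap)
    return matches


def is_pedigree_consistent(observed, transmitted):
    """A locus passes only if exactly one assignment explains the data."""
    return len(consistent_assignments(observed, transmitted)) == 1
```

For example, if child1 inherits (`F0`, `M0`) and child2 inherits (`F1`, `M0`), a heterozygous father, a homref mother, a heterozygous child1, and a homref child2 admit exactly one assignment (the alternate allele on `F0`). If both parents are homref and child1 is heterozygous, no assignment exists and the locus is inconsistent.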

Un-assessable loci

There are two types of loci that we are incapable of assessing for pedigree consistency:

  1. Uniformly homozygous variants: any combination of allele to haplotype assignments produces a valid transmission.
  2. Variants on un-transmitted parental haplotypes: there is no means of determining Mendelian inheritance patterns for these variants.

Special considerations for CNVs

For SVs, where genotypes are given using the GT FORMAT tag in a VCF, pedigree consistency assessment methods are identical to those used for small variants. However, when assessing CNVs, where genotypes are given by the CN FORMAT tag, there is no per-allele variant assignment, so we must consider all possible allelic configurations that could generate a given copy number state. This leads to two important outcomes:

  1. For copy numbers greater than 2, we must consider all copy number allele configurations. For instance, if CN=3, the possible configurations are (0, 3), (1, 2), (2, 1), and (3, 0). In a sufficiently large pedigree, we should be able to resolve the copy number for each haplotype.
  2. Unless otherwise indicated, CN=2 does not necessarily imply one copy on each haplotype, particularly in small pedigrees. It is still possible that one allele has 2 copies of a region, while the other has 0. As with CN > 2, larger pedigrees can let us determine with greater confidence whether a locus is truly diploid.
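The enumeration of allelic configurations described in the list above is easy to sketch. The function name is hypothetical, and the sketch treats the two haplotypes as ordered, which is why (1, 2) and (2, 1) are listed separately.

```python
def cn_configurations(total_cn):
    """Enumerate ordered per-haplotype copy numbers that sum to total_cn."""
    if total_cn < 0:
        raise ValueError("copy number must be non-negative")
    return [(a, total_cn - a) for a in range(total_cn + 1)]
```

For CN=3 this yields the four configurations listed above. For CN=2 it yields (0, 2), (1, 1), and (2, 0), which shows why a CN=2 call alone cannot distinguish one copy per haplotype from a duplication paired with a deletion.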

References

  1. Eberle et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research 27:157-164. doi:10.1101/gr.210500.116
  2. Wigginton et al. (2005) A note on exact tests of Hardy-Weinberg equilibrium. AJHG 76:887-893. doi:10.1086/429864